Statistics board - [Data Science Project Case] Bias Correction - second try (repost)
c***z
Posts: 6348
1
[The following text is reposted from the DataSciences board]
Author: chaoz (facing the sea, eating a bowl of liangpi), Board: DataSciences
Subject: [Data Science Project Case] Bias Correction - second try
Posted from: BBS 未名空间站 (Fri Jan 24 18:08:30 2014, US Eastern)
Hi all,
First, thank you all so much for your input! It was extremely helpful!
Here is what we are doing as a second try (actually maybe the 5th try, but we
only count major overhauls here).
Again, any input is extremely welcome! Thanks!
Situation Brief
Project name: Bias correction
Business objective: We have a panel of 25M users' shopping cart information,
and we want to infer national online sales by brand and channel. We do so by
finding and applying a multiplier to each shopping cart item, based on our
panel size and its selection bias towards particular populations (e.g. if our
panel is skewed more towards low income people than the IBP is, then their
shopping records should have a smaller multiplier than those of high income
people).
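The multiplier idea above can be sketched roughly as follows. This is only a
minimal illustration, assuming each user can be placed in a demographic cell
and that a trusted population count exists for each cell. The cell names, all
the counts, and the helper names cell_multipliers and projected_sales are
made up for the example and are not from our actual pipeline.

```python
def cell_multipliers(panel_counts, population_counts):
    """Multiplier per demographic cell = population count / panel count.

    A cell that is over-represented in the panel gets a smaller multiplier.
    """
    multipliers = {}
    for cell, population in population_counts.items():
        panel = panel_counts.get(cell, 0)
        if panel == 0:
            # No panel users to scale up; this cell needs separate handling.
            raise ValueError(f"no panel users in cell {cell!r}")
        multipliers[cell] = population / panel
    return multipliers


def projected_sales(cart_items, multipliers):
    """Sum multiplier-weighted cart amounts by brand."""
    totals = {}
    for item in cart_items:
        weighted = item["amount"] * multipliers[item["cell"]]
        totals[item["brand"]] = totals.get(item["brand"], 0.0) + weighted
    return totals


# Hypothetical numbers: the panel is skewed towards low income users.
panel_counts = {"low_income": 15_000_000, "high_income": 10_000_000}
population_counts = {"low_income": 100_000_000, "high_income": 100_000_000}
multipliers = cell_multipliers(panel_counts, population_counts)

# Hypothetical shopping cart items, each tagged with the buyer's cell.
cart_items = [
    {"brand": "A", "cell": "low_income", "amount": 20.0},
    {"brand": "A", "cell": "high_income", "amount": 50.0},
    {"brand": "B", "cell": "high_income", "amount": 30.0},
]
sales = projected_sales(cart_items, multipliers)
```

With these made-up counts the low income cell gets the smaller multiplier,
which matches the example in the brief. In practice the hard part is the
population counts and the cell labels, not the arithmetic.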
Challenge:
1. Our panel is perceived to be skewed in many ways, such as age, gender,
income, tech and financial expertise, etc., due to the ways we acquire
users and data.
2. Our data is incomplete: other than shopping cart data, only a small
percentage of our panel has third party demographic data.
3. We cannot completely trust the third party data, even though we try to
get close to comScore data as a benchmark.
4. What is a good metric to measure "closeness"?
5. How do the other biases, for which we have no data, interact with the
bias in demographics? Also, can new bias be introduced when taking samples
with particular information?
Technical logic:
1. First we need to decide the level of analysis: individual level, site/
brand level or panel level.
a. Individual level: first cluster users by similarity in search and click
behavior (natural language processing, see SO technical brief), then label
each user using their nearest neighbor.
b. Site/brand level: a direct attempt at the final product. First join the
inferred or third party individual gender labels with our own page visit
dataset to obtain site-person-gender triples, then aggregate at the site
level for the gender decomposition, and compare with the comScore data to
obtain a multiplier for each site (and later each brand or site-brand pair).
c. Panel level: this approach serves more as a test. As in the site/brand
approach, generate the site decomposition, but adjust it for bias using a
panel level multiplier (the quotient of the IBP ratio and the panel ratio,
for the available users), then compare with the comScore data.
2. Second we need to build a testing method: compare data from different
sources for confidence.
a. Benchmark: we need data we can trust as a benchmark (anchor). We chose
comScore; see the panel level approach above for details.
b. Error metric: we need a metric to measure the performance of inferred or
third party data. We chose the K-S test.
3. Third we need presentable results.
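
The panel level approach and the error metric above can be sketched as below.
This is a rough illustration under stated assumptions, not our production
code: the triples, the 0.6 panel share, the 0.5 IBP share and the benchmark
values are all invented, the function names are hypothetical, and the
benchmark dict only stands in for comScore-style data. One possible reading
of "panel level multiplier" is used here: each gender is weighted by (IBP
share / panel share) and the site shares are renormalized. The K-S part
computes the plain two-sample statistic over the per-site shares, which is
one way the test could be applied.

```python
from bisect import bisect_right


def site_female_share(triples):
    """Aggregate site-person-gender triples into a female share per site."""
    counts = {}
    for site, _person, gender in triples:
        female, total = counts.get(site, (0, 0))
        counts[site] = (female + (gender == "F"), total + 1)
    return {site: female / total for site, (female, total) in counts.items()}


def adjust_share(site_share, ibp_share, panel_share):
    """Reweight a site's female share with panel level multipliers.

    Assumed interpretation: multiplier per gender = IBP share / panel share.
    """
    weight_f = site_share * (ibp_share / panel_share)
    weight_m = (1 - site_share) * ((1 - ibp_share) / (1 - panel_share))
    return weight_f / (weight_f + weight_m)


def ks_statistic(xs, ys):
    """Two-sample K-S statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    gap = 0.0
    for value in xs + ys:
        cdf_x = bisect_right(xs, value) / len(xs)
        cdf_y = bisect_right(ys, value) / len(ys)
        gap = max(gap, abs(cdf_x - cdf_y))
    return gap


# Hypothetical site-person-gender triples (labels inferred or third party).
triples = [
    ("siteA", "u1", "F"), ("siteA", "u2", "F"),
    ("siteA", "u3", "F"), ("siteA", "u4", "M"),
    ("siteB", "u1", "F"), ("siteB", "u5", "M"),
    ("siteC", "u2", "F"), ("siteC", "u4", "M"),
    ("siteC", "u5", "M"), ("siteC", "u6", "M"),
]
raw = site_female_share(triples)

# Hypothetical female shares: 0.6 among labeled panel users, 0.5 in the IBP.
adjusted = {site: adjust_share(share, 0.5, 0.6) for site, share in raw.items()}

# Hypothetical benchmark shares standing in for the comScore data.
benchmark = {"siteA": 0.65, "siteB": 0.42, "siteC": 0.20}

sites = sorted(benchmark)
d_raw = ks_statistic([raw[s] for s in sites], [benchmark[s] for s in sites])
d_adj = ks_statistic([adjusted[s] for s in sites], [benchmark[s] for s in sites])
```

Note that the K-S statistic compares the two distributions of site shares as
a whole and ignores which site is which, so with only a handful of sites it
is very coarse. A per-site error alongside it may be worth considering.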
s******0
Posts: 1269
2
Pulling up a stool and waiting for the replies.
By the way, where are you from? So you like liangpi too?
c***z
Posts: 6348
3
Hunan, my wife is from Shanxi, she likes it :)