Statistics board - [Data Science Project Case] Bias Correction - third try (repost)
c***z
Posts: 6348
[The following text is reposted from the DataSciences board]
From: chaoz (面朝大海,吃碗凉皮), Board: DataSciences
Subject: [Data Science Project Case] Bias Correction - third try
Posted at: BBS 未名空间站 (Tue Feb 11 18:26:40 2014, US Eastern)
Dear all, thank you so much for your earlier input! I am now able to put my
thoughts together and understand the project better.
Let me write the project up again. Any comments are extremely welcome!
Project name: Bias correction
Business objective: We have a panel of 25M users’ shopping cart information;
we want to infer national online sales by brand and channel. We do so by
finding and applying a multiplier to each shopping cart item, based on our
panel size and its selection bias towards particular populations (e.g. if our
panel is more skewed towards low income people than the IBP is, then their
shopping records should have a smaller multiplier than those of high
income people).
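
To make the multiplier idea concrete, here is a minimal sketch with a single
demographic variable. The group names and counts are made up for illustration,
and `cell_multipliers` is a hypothetical helper, not something from the
project: each group's multiplier is simply its population count divided by its
panel count, so an over-represented group gets a smaller multiplier.

```python
def cell_multipliers(population_counts, panel_counts):
    """Return the population-to-panel count ratio for each group."""
    multipliers = {}
    for group, pop_n in population_counts.items():
        panel_n = panel_counts.get(group, 0)
        if panel_n == 0:
            # A group with no panel members cannot be scaled up.
            raise ValueError(f"no panel members in group {group!r}")
        multipliers[group] = pop_n / panel_n
    return multipliers


# Made-up numbers: a population of 200 people and a panel of 20 that is
# skewed towards low income people.
population = {"low_income": 100, "high_income": 100}
panel = {"low_income": 16, "high_income": 4}

multipliers = cell_multipliers(population, panel)
# low_income: 100 / 16 = 6.25, high_income: 100 / 4 = 25.0
```

Multiplying each panel member by their group's multiplier recovers the
population total (16 * 6.25 + 4 * 25.0 = 200).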

Technical logic: We have a biased sample of the IBP, of which only a
subset has third party demographic labels. Hence there are three
subproblems:
1. Bias correction (from panel to IBP): this is a special kind of missing
data problem, where the population stats are known. We compute and assign
a weight to each subgroup (defined by demographics and brand/site). The
method here is RIM weighting. Another classical method is regression. The
weights can be obtained from, and applied to, three levels: guids, sites and
panel; hence there are nine combinations overall. We are particularly
interested in:
a. From panel, to guids (current approach);
b. From panel, to sites;
c. From sites, to guids;
d. From sites, to sites;
e. From sites, to panel.
2. Missing data (inside panel): the majority of the demographic data for our
panel is missing, and the panel stats are unknown. This is the typical
missing data problem. There are several ways to handle it:
a. Drop the incomplete records;
b. Use the mean/median or other sensible stat from the known data;
c. Reconstruct the sample using bootstrapping, to fit the IBP stats;
d. Infer the missing data with supervised learning (e.g. decision trees);
e. Infer the missing data with unsupervised learning (e.g. clustering);
f. RIM weighting also helps with missing data, with some assumptions.
3. Data quality (subset of panel): we use eXelate/Lotame demographic data
as the seed for the above tasks; however, we cannot completely trust the
third party data. We have designed several ways to test its quality, using
the K-S stat and ROC as error metrics:
a. Use the subset of data where E and L agree;
b. Use independent data to compare with E and L (e.g. the naïve Bayes one);
c. Aggregate from guid level to site level and compare with comScore.
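
As a rough illustration of the K-S stat used in 3, here is a minimal sketch of
the two-sample K-S statistic (the largest gap between two empirical CDFs). It
assumes we are comparing a numeric attribute from two sources; `ks_statistic`
and the toy samples are hypothetical. In practice a library routine such as
`scipy.stats.ks_2samp` would be the usual choice.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: max |ECDF_a(x) - ECDF_b(x)| over all x."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The gap between two step functions can only change at observed values.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)  # share of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Toy samples (made up): identical, fully separated, and partly overlapping.
same = ks_statistic([1, 2, 3], [1, 2, 3])
disjoint = ks_statistic([1, 2, 3], [4, 5, 6])
shifted = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

A value near 0 means the two sources give similar distributions; a value near
1 means they barely overlap.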

I am currently focused on RIM weights, a simplified propensity score
matching method, even though I have some reservations about the assumptions
we make with RIM weights.
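
For readers who have not seen it, RIM weighting (raking, also known as
iterative proportional fitting) can be sketched as below. This is a toy
version under stated assumptions, not the project's implementation: two
demographic variables, known target shares for each, and every target category
present in the sample. The names `rake` and `weighted_share`, the records, and
the target shares are all made up for illustration.

```python
def rake(records, targets, max_iter=100, tol=1e-9):
    """Rake unit weights so each variable's weighted shares match its targets.

    records: list of dicts, one per panel member.
    targets: {variable: {category: target_share}}, shares sum to 1 per variable.
    Returns one weight per record; the weights sum to len(records).
    """
    weights = [1.0] * len(records)
    for _ in range(max_iter):
        max_change = 0.0
        for var, shares in targets.items():
            total = sum(weights)
            # Current weighted total in each category of this variable.
            current = {}
            for rec, w in zip(records, weights):
                current[rec[var]] = current.get(rec[var], 0.0) + w
            # Rescale so this variable's margin hits its target exactly.
            for i, rec in enumerate(records):
                factor = shares[rec[var]] * total / current[rec[var]]
                max_change = max(max_change, abs(factor - 1.0))
                weights[i] *= factor
        if max_change < tol:  # all margins already match: converged
            break
    return weights


def weighted_share(records, weights, var, category):
    """Weighted share of records whose `var` equals `category`."""
    hit = sum(w for rec, w in zip(records, weights) if rec[var] == category)
    return hit / sum(weights)


# Toy panel of 10 people, skewed towards low income and towards women.
records = (
    [{"income": "low", "gender": "f"}] * 4
    + [{"income": "low", "gender": "m"}] * 3
    + [{"income": "high", "gender": "f"}] * 1
    + [{"income": "high", "gender": "m"}] * 2
)
targets = {
    "income": {"low": 0.5, "high": 0.5},
    "gender": {"f": 0.5, "m": 0.5},
}
weights = rake(records, targets)
```

Raking only matches the one-way margins, not the joint distribution of the
variables, which is one of the assumptions worth keeping in mind.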
What do you think? Thanks a lot!