【 Reposted from the DataSciences board 】
From: chaoz (facing the sea, eating a bowl of liangpi), Board: DataSciences
Title: [Data Science Project Case] Bias Correction - third try
Site: BBS 未名空间站 (Tue Feb 11 18:26:40 2014, US Eastern)
Dear all, thank you so much for your earlier inputs! I am now able to put my
thoughts together and understand the project better.
Let me write it down again. Any comments are extremely welcome!
Project name: Bias correction
Business objective: We have a panel of 25M users' shopping cart information,
and we want to infer national online sales by brand and channel. We do so by
finding and applying multipliers to each shopping cart item, based on our
panel size and its selection bias toward particular populations (e.g. if our
panel is more skewed toward low-income people than the IBP, then their
shopping records should have a smaller multiplier than those of high-income
people).
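As a toy illustration of the multiplier idea (the income shares below are made up, not real panel numbers): each group's multiplier is just its population share divided by its panel share.

```python
# Toy multiplier calculation: weight each income group so the panel's
# composition matches the reference population's (hypothetical shares).
panel_share = {"low_income": 0.60, "high_income": 0.40}  # panel skews low-income
pop_share   = {"low_income": 0.45, "high_income": 0.55}  # reference population

multipliers = {g: pop_share[g] / panel_share[g] for g in panel_share}
print(multipliers)  # low-income carts down-weighted, high-income up-weighted
```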
Technical logic: We have a biased sample of the IBP, among which only a
subset has third-party demographic labels. Hence there are three
subproblems:
1. Bias correction (from panel to IBP): this is a special kind of missing
data problem, where the population stats are known. We compute and assign
weights to each subgroup (defined by demographics and brand/site). The
method here is rim weighting; another classical method is regression. The
weights can be obtained from, and applied to, three levels: guids, sites and
panel; hence overall there are nine combinations. We are particularly
interested in:
a. From panel, to guids (current approach);
b. From panel, to sites;
c. From sites, to guids;
d. From sites, to sites;
e. From sites, to panel.
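A minimal sketch of rim weighting (raking, i.e. iterative proportional fitting), which iteratively rescales unit weights until the weighted marginals match known targets on each dimension. The units, categories, and target shares below are all made up for illustration:

```python
# Rim-weighting (raking) sketch in pure Python: iteratively rescale unit
# weights so the weighted marginals match target shares on each dimension.

units = [  # (gender, income) for each hypothetical panelist
    ("m", "low"), ("m", "low"), ("m", "high"),
    ("f", "low"), ("f", "high"), ("f", "high"), ("f", "high"),
]
targets = {          # target weighted share per category, per dimension
    0: {"m": 0.5, "f": 0.5},       # gender margin
    1: {"low": 0.4, "high": 0.6},  # income margin
}

w = [1.0] * len(units)
for _ in range(100):                       # iterate until margins converge
    for dim, margin in targets.items():
        total = sum(w)
        factor = {}
        for cat, share in margin.items():  # one scaling factor per category
            cur = sum(wi for wi, u in zip(w, units) if u[dim] == cat)
            factor[cat] = share * total / cur
        w = [wi * factor[u[dim]] for wi, u in zip(w, units)]

share_m = sum(wi for wi, u in zip(w, units) if u[0] == "m") / sum(w)
print(round(share_m, 4))  # the gender margin now sits at its 0.5 target
```

Each pass matches one margin exactly and slightly disturbs the others; with all cells populated, the procedure converges quickly.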
2. Missing data (inside panel): we are missing demographic data for the
majority of our panel, and the panel stats are unknown. This is the
typical missing data problem. There are several approaches:
a. Drop the incomplete records;
b. Use the mean/median or other sensible stat from the known data;
c. Reconstruct the sample using bootstrapping, to fit the IBP stats;
d. Infer the missing data with supervised learning (e.g. decision trees);
e. Infer the missing data with unsupervised learning (e.g. clustering);
f. Rim weighting also helps with missing data, under some assumptions.
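A quick sketch of option (b) for a categorical field: fill missing labels with the mode of the observed values. The records here are made up; `None` marks a guid with no third-party label.

```python
# Mode imputation sketch for a categorical demographic field.
from collections import Counter

records = [
    {"guid": 1, "income": "low"},
    {"guid": 2, "income": "high"},
    {"guid": 3, "income": None},   # missing third-party label
    {"guid": 4, "income": "high"},
]

observed = [r["income"] for r in records if r["income"] is not None]
mode = Counter(observed).most_common(1)[0][0]
for r in records:
    if r["income"] is None:
        r["income"] = mode         # fill gaps with the most common value
print([r["income"] for r in records])
```

This is the crudest option on the list; it shrinks the variance of the imputed field, which is one reason the model-based options (d) and (e) are worth comparing against it.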
3. Data quality (subset of panel): we use eXelate/Lotame demographic data
as the seed for the above tasks; however, we cannot completely trust the
third-party data. We have designed several ways to test its quality, using
the K-S statistic and ROC as error metrics:
a. Use the subset of data where E and L agree;
b. Use independent data to compare with E and L (e.g. the naïve
Bayes one);
c. Aggregate from guid level to site level and compare with comScore.
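The two-sample K-S statistic used above is just the largest gap between two empirical CDFs; a self-contained sketch on toy data:

```python
# Two-sample K-S statistic: max gap between empirical CDFs, usable as an
# error metric when comparing two demographic distributions (toy data).
def ks_stat(a, b):
    xs = sorted(set(a) | set(b))           # evaluation points

    def ecdf(sample, x):                   # empirical CDF at x
        return sum(v <= x for v in sample) / len(sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in xs)

print(ks_stat([1, 2, 3, 4], [1, 2, 3, 4]))  # 0.0 for identical samples
print(ks_stat([1, 1, 1, 1], [2, 2, 2, 2]))  # 1.0 for disjoint samples
```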
I am currently focused on RIM weighting, which can be seen as a simplified
propensity-score matching method, even though I have some reservations
about the assumptions RIM weighting makes.
What do you think? Thanks a lot!