l***a posts: 12410 | 1 For example: 2,000 obs, 50 good and 1,950 bad.
If you run a logistic model directly on the data, its predictive ability for the good obs is very weak.
Besides taking only a small portion of the bad obs to combine with the good ones when building the model, are there any other good ideas? |
j*****e posts: 182 | 2 What do you mean? Why not analyze the whole data set?
Do you want to ask how to analyze data with rare events?
Or are you asking how to collect data when the event rate is low? |
l***a posts: 12410 | 3 I already have the data and am trying to model good/bad using logistic regression. If I simply regress on the raw data, the model has an AUC of 0.75 and a correct rate of 90%, which doesn't look bad. But when I break the correct rate down by good and bad, the performance looks like:
good: 1798 correct and 152 incorrect
bad: 2 correct and 48 incorrect
The overall correct rate is 90%, but for the bad part it's only 4%. I think that's because the data is sparse. How do I deal with it?
[Quoting j*****e:] : What do you mean? Why not analyze the whole data? : Do you want to ask how to analyze data with rare event? : Or, are you asking how to collect data when the event rate is low?
|
o****o posts: 8077 | 4 The OP probably means how to classify rare events.
[Quoting j*****e:] : What do you mean? Why not analyze the whole data? : Do you want to ask how to analyze data with rare event? : Or, are you asking how to collect data when the event rate is low?
|
D******n posts: 2836 | 5 It is called skewed, unbalanced, or imbalanced data, I guess.
[Quoting l***a:] : i already have the data and trying to model the good/bad using logistic regression. if I simply regress on the raw data, the model has a AUC of 0.75 and correct rate of 90% which looks not bad. but when I look at the correct rate for good and bad the performance will look like : good: 1798 correct and 152 incorrect : bad: 2 correct and 48 incorrect : overall correct rate is 90% but for the bad part it's only 4%. I think that's due to the data is sparse. how do I work on it?
|
l***a posts: 12410 | 6 It's probably what you understood... I may not have expressed it precisely.
http://mitbbs.com/article/Statistics/31211779_3.html
Any ideas?
[Quoting o****o:] : The OP probably means how to classify rare events.
|
l***a posts: 12410 | 7 Thanks. Then how do I deal with it in logistic regression modeling?
[Quoting D******n:] : it is called skewed, unbalanced or imbalanced data. i guess
|
D******n posts: 2836 | 8 http://zyxo.wordpress.com/2009/03/28/mining-highy-imbalanced-data-sets-with-logistic-regressions/
[Quoting l***a:] : thanks, then how to deal with it in logistic regression modeling?
|
o****o posts: 8077 | 9 In rare-events classification, the MLE of the logistic model is biased.
One quick remedy is biased sampling of the events, then adjusting back to the actual probability using an offset on the estimated intercept.
If it is a very-rare-event problem (say 0.01%), there is hot debate and no commonly agreed method.
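The offset correction described here can be sketched in a few lines of pure Python (an illustration only, not any package's API; the helper names and the 5%/50% rates are hypothetical). If the true event rate is tau but events were oversampled to rate ybar in the modeling data, subtract ln[((1-tau)/tau) * (ybar/(1-ybar))] from the estimated intercept:

```python
import math

def intercept_offset(true_rate, sample_rate):
    # Amount to subtract from the intercept fitted on the biased
    # (oversampled) data so predictions are on the population scale.
    return math.log(((1 - true_rate) / true_rate)
                    * (sample_rate / (1 - sample_rate)))

def corrected_probability(linear_predictor, true_rate, sample_rate):
    # Logistic probability after applying the intercept offset.
    z = linear_predictor - intercept_offset(true_rate, sample_rate)
    return 1.0 / (1.0 + math.exp(-z))

# Events oversampled to 50% in the modeling data; true rate is 5%.
# A case the sample-scale model scores at z = 0 (i.e. p = 0.5)
# maps back to exactly the 5% population rate:
print(round(corrected_probability(0.0, 0.05, 0.50), 6))  # 0.05
```

This is the same prior-correction idea formalized in King and Zeng's "Logistic Regression in Rare Events Data" (linked later in the thread).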
[Quoting l***a:] : probably it's what you understand... i may not express it precisly : http://mitbbs.com/article/Statistics/31211779_3.html : any idea?
|
l***a posts: 12410 | 10 It's around 5% overall, so I should take a biased sample from my data.
What I did was take all 30 good events and about 50-100 bad events out of the 1950, then build the model; the total is 80-130 in this case.
Since I just did this randomly and by intuition, is there any theory about how to do the biased sampling?
Also, can you shed some light on how to "adjust the actual probability using an offset on the estimated intercept"? (I did realize this problem when doing the previous sampling approach because
[Quoting o****o:] : in rare events classification, MLE of logistic model is biased : one quick remedy is biased sampling of events, and adjust the actual probability using offset on the estimated intercept : if it is very rare event problem (say 0.01%), hot debate and no commonly-agreed methods
|
l***a posts: 12410 | 11 I'll have to read it when I get back home; my office computer cannot access this link :(
BTW, is my case called "imbalanced data"? I thought that was about the balance of treatments, in other words about the independent variables. My case is about a very rare event, which concerns the dependent variable. Maybe a silly question.
[Quoting D******n:] : http://zyxo.wordpress.com/2009/03/28/mining-highy-imbalanced-data-sets-with-logistic-regressions/
|
D******n posts: 2836 | 12 Yes, it is called imbalanced data: a very scarce number of 1s and a lot of 0s, or vice versa.
[Quoting l***a:] : have to read it when I get back home. my office computer cannot access this link :( : btw, is my case called "imbalanced data"? I thought that's about the balance of treatments, in other words about independent variables. my case is about the very rare event, which is about the dependent variable. maybe a silly question
|
n*****s posts: 10232 | 13 The author of that post suggests a bootstrap approach; one drawback is that you don't end up with a single model. I found one of the replies below it more interesting:
"I've built models with 90/10, 95/5 or worse without resampling with good success whether using logistic regression, neural networks, or some kinds of trees. The key is thresholding the posterior probability estimate from the model at the level of the a priori probability (if you want to compute classification accuracy or use confusion matrices)."
I don't think I've used this prior/posterior probability method (is it Bayesian?). The classification accuracy and confusion
[Quoting D******n:] : yes, it is called imbalanced data, with very scarce amount of 1s and a lot of 0s or vise versa.
|
j*****e posts: 182 | 14 Suppose your data is a random sample.
Then the marginal probability of good is 5%.
Given a set of predictor values x, P(good|x) will be low even if you did observe good at x. If you compare P(good|x) with 0.5 (the default cutoff), of course the correct rate for predicting good is low. This is how it should be.
AUC is not the best way to check model fit, and neither is the correct rate.
Change the cutoff to a lower value, say 5%, and check the correct rate.
I don't think resampling will make any difference.
Howev |
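The cutoff change can be illustrated with a tiny sketch (pure Python; the scores and the `classify` helper are hypothetical, with "good" as the ~5% rare class as in this post):

```python
def classify(prob_good, cutoff):
    # Call an observation "good" (the rare class here) when the
    # model's P(good|x) exceeds the chosen cutoff.
    return "good" if prob_good > cutoff else "bad"

scores = [0.03, 0.08, 0.30]  # hypothetical P(good|x) for three obs

print([classify(p, 0.5) for p in scores])   # ['bad', 'bad', 'bad']
print([classify(p, 0.05) for p in scores])  # ['bad', 'good', 'good']
```

With the default 0.5 cutoff the model almost never predicts the rare class; lowering the cutoff to the marginal event rate recovers sensible predictions without refitting anything.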
l***a posts: 12410 | 15 Thanks. I just found a paper, "Logistic Regression in Rare Events Data":
http://gking.harvard.edu/files/0s.pdf
Still reading it; hopefully it provides some useful guidelines :)
[Quoting j*****e:] : Suppose your data is a random sample. : Then, the marginal prob of good is 5%. : Given a set of predictor value x, P(good|x) will be low even though you did observe good at x. If you compare P(good|x) with 0.5(the default cutoff), of course, the correct rate of predict good is low. This is how it should be. : AUC is not the best to check model fit, neither does correct rate. : Change the cutoff to a lower value, say 5%, and check the correct rate. : I don't think resampling will make any difference. : Howev
|
s*r posts: 2757 | 16 If 90% of your input data are good, you can get a 90% correct rate even if you predict every instance as good.
A better way to evaluate the utility or predictive power of your model would be to compare the PPV and NPV with the overall percentage of good cases.
I think the problem is just that you do not have good explanatory variables.
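To make this concrete, here is a back-of-the-envelope sketch using the counts from the earlier post (treating "bad" as the event of interest; pure Python, variable names mine):

```python
# Confusion-matrix counts from the earlier post:
tp, fn = 2, 48       # bad obs predicted bad / predicted good
tn, fp = 1798, 152   # good obs predicted good / predicted bad
n = tp + fn + tn + fp                 # 2000

accuracy = (tp + tn) / n              # 0.90 -- looks fine on its own
ppv = tp / (tp + fp)                  # ~0.013: a "bad" call is almost never right
npv = tn / (tn + fn)                  # ~0.974

# Baseline: labeling every obs "good" already beats the model on accuracy.
baseline = (tn + fp) / n              # 1950/2000 = 0.975
print(accuracy, round(ppv, 3), round(npv, 3), baseline)
```

So the headline 90% accuracy is actually below the do-nothing baseline, which is exactly why PPV/NPV are the more honest lens here.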
[Quoting l***a:] : i already have the data and trying to model the good/bad using logistic regression. if I simply regress on the raw data, the model has a AUC of 0.75 and correct rate of 90% which looks not bad. but when I look at the correct rate for good and bad the performance will look like : good: 1798 correct and 152 incorrect : bad: 2 correct and 48 incorrect : overall correct rate is 90% but for the bad part it's only 4%. I think that's due to the data is sparse. how do I work on it?
|
t********y posts: 469 | 17 Sparse data.
You should give the different outcomes different weights and then run proc logistic.
For example, if there is 1 "yes" obs and 100 "no" obs, you can first group the "no" outcomes, keeping only 10 obs, then set weight = 1 for the "yes" obs and weight = 10 for each "no" obs. |
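For what it's worth, the grouping-plus-weights trick leaves the Bernoulli likelihood, and hence the fitted model, unchanged as long as the grouped rows share identical covariates. A pure-Python sanity check (the helper and the numbers are illustrative, not SAS output):

```python
import math

def weighted_ll(obs, p):
    # Weighted Bernoulli log-likelihood: sum of weight * log-density.
    return sum(w * (math.log(p) if y == 1 else math.log(1 - p))
               for y, w in obs)

# Ungrouped: 1 "yes" and 100 identical "no" observations, weight 1 each.
ungrouped = [(1, 1)] + [(0, 1)] * 100

# Grouped as described above: keep 10 "no" rows, each with weight 10.
grouped = [(1, 1)] + [(0, 10)] * 10

for p in (0.01, 0.2, 0.5):
    assert abs(weighted_ll(ungrouped, p) - weighted_ll(grouped, p)) < 1e-9
```

In SAS terms this behaves like a frequency weight: it compresses the data without changing the MLE, so by itself it is not a fix for the rare-event problem discussed above.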
j*****e posts: 182 | 18 To turnanyway: it is totally wrong to do what you said.
Since when does statistics throw away most of the data, analyze part of it, and hope everything will be OK?
Please understand how prediction is made for binary outcomes before you comment. |
t********y posts: 469 | 19 I'm quite sure you must add a weight to sparse data in order to run proc logistic, but you can choose different methods for constructing the weight; it's quite flexible. For example, for the x values you can take the group mean, median, and so on. I don't remember the exact steps; I saw people in another group do this before.
With proc genmod you can first group the data and then run your model:
model y/n = x /dist=bin link=logit;
[Quoting j*****e:] : To turnanyway, it is totally wrong to do what you said. : Since when statistics throws most of the data, analyzes part of it and hope everything will be OK? : Please understand how predition is made for binary outcome before you comment.
|
j*****e posts: 182 | 20 To turnanyway,
in proc genmod/logistic there is a WEIGHT statement. It is there to handle different formats of data input.
Also, if the data are case-control data and you know the true marginal probability of the outcome, you can adjust for it in proc logistic.
Neither situation is true here. |
t********y posts: 469 | 21 Sparse data.
Creating a weight variable for proc logistic is a method another modeling group at my company uses.
That grouped data gives a more reliable deviance than ungrouped data in proc genmod is something a professor taught at school.
Believe it or not.
And stop arguing with me; if you think what I said is wrong, just give the correct answer.
[Quoting j*****e:] : to turnanyway, : in proc genmod/logistic, there is a weight statement. This is to handle different format of data input. : Also, if the data is a case-control data and you do know the true marginal probability of the outcome, you can adjust it in proc logistic. : Neither situation is true here.
|
j*****e posts: 182 | 22 Turnanyway,
you are so rude and ignorant.
You can use weighting in logistic regression, but that weighting is used to accommodate non-iid sampling (such as case-control sampling).
Grouping the data will not change the value of the deviance. The purpose is to control the degrees of freedom, so that the deviance can be better approximated by a chi-square distribution for the model goodness-of-fit test.
I have been teaching categorical data analysis to graduate students for a couple of years, and analysis of cas |
t********y posts: 469 | 23 Clearly you are the one who is rude and ignorant. Posting under an alias doesn't make you an authority.
Whatever else I may be, at least I don't pose as an expert and shut other people down.
I kindly shared my experience for free, and all you do is declare it completely wrong and completely unusable.
What I said is backed by facts and experience; you don't have to read it if you don't like it. My binary-outcome model won a major award at my company, with very good model performance. Setting a weight variable in proc logistic is a method a senior lead at our company required us to use, validated by years of practice; on projects involving that much money, if the method were as untrustworthy and unworkable as you claim, the losses would have been unbearable.
You think being a TA is impressive? Basically every graduate student has been a TA.
[Quoting j*****e:] : Turnanyway, : you are so rude and ignorant. : You can use weighting in logistic regression. But this weighting is used to accomodate non-iid sampling(such as case-control sampling). : Grouping the data will not change the value of deviance. The purpose is to control the degree of freedom, so that the deviance can be better approximated by Chi-square distribution for model goodness of fit test. : I have been teaching Categorical data analysis to graduate students for a couple of years and analysis of cas
|
t********y posts: 469 | 24 jsdagre, I casually looked through some of your posting history, and I was absolutely floored!
So I'm not the only "ignorant" poster in your eyes; apparently only you are professional, only you know enough, and only you are qualified to lecture people here.
Now I understand, and I'm not angry anymore, haha~~~
From: jsdagre (na), Board: Statistics
Subject: Re: Is this a paired or independent samples test?
Posted: BBS 未名空间站 (Mon Mar 16 20:17:45 2009)
To red leaves,
Please read other people's responses carefully before you make your comments. Just because your method is easy, that doesn't make it right.
I got really irritated when some non-majors take a couple of intro-level stat courses, turn around, and start |