l***a posts: 12410 | 1 For example: 2,000 obs, 50 good and 1,950 bad.
If you run a logistic model directly on the data, its predictive ability for the good obs is very weak.
Besides taking only a small portion of the bad obs to combine with the good ones when building the model, are there any other good ideas? |
j*****e posts: 182 | 2 What do you mean? Why not analyze the whole data set?
Do you want to ask how to analyze data with rare events?
Or are you asking how to collect data when the event rate is low? |
l***a posts: 12410 | 3 I already have the data and am trying to model good/bad using logistic regression. If I simply regress on the raw data, the model has an AUC of 0.75 and a correct rate of 90%, which doesn't look bad. But when I break the correct rate down by good and bad, the performance looks like:
good: 1798 correct and 152 incorrect
bad: 2 correct and 48 incorrect
The overall correct rate is 90%, but for the bad part it's only 4%. I think that's because the data is sparse. How do I deal with it?
[Quoting j*****e:] : What do you mean? Why not analyze the whole data? : Do you want to ask how to analyze data with rare event? : Or, are you asking how to collect data when the event rate is low?
|
o****o posts: 8077 | 4 The OP probably means how to classify rare events.
[Quoting j*****e:] : What do you mean? Why not analyze the whole data? : Do you want to ask how to analyze data with rare event? : Or, are you asking how to collect data when the event rate is low?
|
D******n posts: 2836 | 5 It is called skewed, unbalanced, or imbalanced data, I guess.
[Quoting l***a:] : i already have the data and trying to model the good/bad using logistic regression. if I simply regress on the raw data, the model has a AUC of 0.75 and correct rate of 90% which looks not bad. but when I look at the correct rate for good and bad the performance will look like : good: 1798 correct and 152 incorrect : bad: 2 correct and 48 incorrect : overall correct rate is 90% but for the bad part it's only 4%. I think that's due to the data is sparse. how do I work on it?
|
l***a posts: 12410 | 6 It's probably what you understood... I may not have expressed it precisely.
http://mitbbs.com/article/Statistics/31211779_3.html
Any ideas?
[Quoting o****o:] : The OP probably means how to classify rare events.
|
l***a posts: 12410 | 7 Thanks. Then how do I deal with it in logistic regression modeling?
[Quoting D******n:] : it is called skewed, unbalanced or imbalanced data. i guess
|
D******n posts: 2836 | 8 http://zyxo.wordpress.com/2009/03/28/mining-highy-imbalanced-data-sets-with-logistic-regressions/
[Quoting l***a:] : thanks, then how to deal with it in logistic regression modeling?
|
o****o posts: 8077 | 9 In rare-events classification, the MLE of the logistic model is biased.
One quick remedy is biased sampling of the events, then adjusting back to the actual probability using an offset on the estimated intercept.
If it is a very-rare-event problem (say 0.01%), there is hot debate and no commonly agreed method.
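The offset correction described here can be sketched in a few lines of pure Python (an illustration only, not any package's API; the helper names and the 5%/50% rates are hypothetical). If the true event rate is tau but events were oversampled to rate ybar in the modeling data, subtract ln[((1-tau)/tau) * (ybar/(1-ybar))] from the estimated intercept:

```python
import math

def intercept_offset(true_rate, sample_rate):
    # Amount to subtract from the intercept fitted on the biased
    # (oversampled) data so predictions are on the population scale.
    return math.log(((1 - true_rate) / true_rate)
                    * (sample_rate / (1 - sample_rate)))

def corrected_probability(linear_predictor, true_rate, sample_rate):
    # Logistic probability after applying the intercept offset.
    z = linear_predictor - intercept_offset(true_rate, sample_rate)
    return 1.0 / (1.0 + math.exp(-z))

# Events oversampled to 50% in the modeling data; true rate is 5%.
# A case the sample-scale model scores at z = 0 (i.e. p = 0.5)
# maps back to exactly the 5% population rate:
print(round(corrected_probability(0.0, 0.05, 0.50), 6))  # 0.05
```

This is the same prior-correction idea formalized in King and Zeng's "Logistic Regression in Rare Events Data" (linked later in the thread).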
[Quoting l***a:] : probably it's what you understand... i may not express it precisly : http://mitbbs.com/article/Statistics/31211779_3.html : any idea?
|
l***a posts: 12410 | 10 It's around 5% overall, so I should take a biased sample from my data.
What I did was take all 30 good events and about 50-100 bad events out of the 1950, then build the model; the total is 80-130 in this case.
Since I just did this randomly and by intuition, is there any theory about how to do the biased sampling?
Also, can you shed some light on how to "adjust the actual probability using an offset on the estimated intercept"? (I did realize this problem when doing the previous sampling approach because
[Quoting o****o:] : in rare events classification, MLE of logistic model is biased : one quick remedy is biased sampling of events, and adjust the actual probability using offset on the estimated intercept : if it is very rare event problem (say 0.01%), hot debate and no commonly-agreed methods
|
l***a posts: 12410 | 11 I'll have to read it when I get back home; my office computer cannot access this link :(
BTW, is my case called "imbalanced data"? I thought that was about the balance of treatments, in other words about the independent variables. My case is about a very rare event, which concerns the dependent variable. Maybe a silly question.
[Quoting D******n:] : http://zyxo.wordpress.com/2009/03/28/mining-highy-imbalanced-data-sets-with-logistic-regressions/
|
D******n posts: 2836 | 12 Yes, it is called imbalanced data: a very scarce number of 1s and a lot of 0s, or vice versa.
[Quoting l***a:] : have to read it when I get back home. my office computer cannot access this link :( : btw, is my case called "imbalanced data"? I thought that's about the balance of treatments, in other words about independent variables. my case is about the very rare event, which is about the dependent variable. maybe a silly question
|
n*****s posts: 10232 | 13 The author of that post suggests a bootstrap approach; one drawback is that you don't end up with a single model. I found one of the replies below it more interesting:
"I've built models with 90/10, 95/5 or worse without resampling with good success whether using logistic regression, neural networks, or some kinds of trees. The key is thresholding the posterior probability estimate from the model at the level of the a priori probability (if you want to compute classification accuracy or use confusion matrices)."
I don't think I've used this prior/posterior probability method (is it Bayesian?). The classification accuracy and confusion
[Quoting D******n:] : yes, it is called imbalanced data, with very scarce amount of 1s and a lot of 0s or vise versa.
|
j*****e posts: 182 | 14 Suppose your data is a random sample.
Then the marginal probability of good is 5%.
Given a set of predictor values x, P(good|x) will be low even if you did observe good at x. If you compare P(good|x) with 0.5 (the default cutoff), of course the correct rate for predicting good is low. This is how it should be.
AUC is not the best way to check model fit, and neither is the correct rate.
Change the cutoff to a lower value, say 5%, and check the correct rate.
I don't think resampling will make any difference.
Howev |
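The cutoff change can be illustrated with a tiny sketch (pure Python; the scores and the `classify` helper are hypothetical, with "good" as the ~5% rare class as in this post):

```python
def classify(prob_good, cutoff):
    # Call an observation "good" (the rare class here) when the
    # model's P(good|x) exceeds the chosen cutoff.
    return "good" if prob_good > cutoff else "bad"

scores = [0.03, 0.08, 0.30]  # hypothetical P(good|x) for three obs

print([classify(p, 0.5) for p in scores])   # ['bad', 'bad', 'bad']
print([classify(p, 0.05) for p in scores])  # ['bad', 'good', 'good']
```

With the default 0.5 cutoff the model almost never predicts the rare class; lowering the cutoff to the marginal event rate recovers sensible predictions without refitting anything.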
l***a posts: 12410 | 15 Thanks. I just found a paper, "Logistic Regression in Rare Events Data":
http://gking.harvard.edu/files/0s.pdf
Still reading it; hopefully it provides some useful guidelines :)
[Quoting j*****e:] : Suppose your data is a random sample. : Then, the marginal prob of good is 5%. : Given a set of predictor value x, P(good|x) will be low even though you did observe good at x. If you compare P(good|x) with 0.5(the default cutoff), of course, the correct rate of predict good is low. This is how it should be. : AUC is not the best to check model fit, neither does correct rate. : Change the cutoff to a lower value, say 5%, and check the correct rate. : I don't think resampling will make any difference. : Howev
|
s*r posts: 2757 | 16 If 90% of your input data are good, you can get a 90% correct rate even if you predict every instance as good.
A better way to evaluate the utility or predictive power of your model would be to compare the PPV and NPV with the overall percentage of good cases.
I think the problem is just that you do not have good explanatory variables.
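To make this concrete, here is a back-of-the-envelope sketch using the counts from the earlier post (treating "bad" as the event of interest; pure Python, variable names mine):

```python
# Confusion-matrix counts from the earlier post:
tp, fn = 2, 48       # bad obs predicted bad / predicted good
tn, fp = 1798, 152   # good obs predicted good / predicted bad
n = tp + fn + tn + fp                 # 2000

accuracy = (tp + tn) / n              # 0.90 -- looks fine on its own
ppv = tp / (tp + fp)                  # ~0.013: a "bad" call is almost never right
npv = tn / (tn + fn)                  # ~0.974

# Baseline: labeling every obs "good" already beats the model on accuracy.
baseline = (tn + fp) / n              # 1950/2000 = 0.975
print(accuracy, round(ppv, 3), round(npv, 3), baseline)
```

So the headline 90% accuracy is actually below the do-nothing baseline, which is exactly why PPV/NPV are the more honest lens here.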
[Quoting l***a:] : i already have the data and trying to model the good/bad using logistic regression. if I simply regress on the raw data, the model has a AUC of 0.75 and correct rate of 90% which looks not bad. but when I look at the correct rate for good and bad the performance will look like : good: 1798 correct and 152 incorrect : bad: 2 correct and 48 incorrect : overall correct rate is 90% but for the bad part it's only 4%. I think that's due to the data is sparse. how do I work on it?
|
t********y posts: 469 | 17 Sparse data.
You should give the different outcomes different weights and then run proc logistic.
For example, if there is 1 "yes" obs and 100 "no" obs, you can first group the "no" outcomes, keeping only 10 obs, then set weight = 1 for the "yes" obs and weight = 10 for each "no" obs. |
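For what it's worth, the grouping-plus-weights trick leaves the Bernoulli likelihood, and hence the fitted model, unchanged as long as the grouped rows share identical covariates. A pure-Python sanity check (the helper and the numbers are illustrative, not SAS output):

```python
import math

def weighted_ll(obs, p):
    # Weighted Bernoulli log-likelihood: sum of weight * log-density.
    return sum(w * (math.log(p) if y == 1 else math.log(1 - p))
               for y, w in obs)

# Ungrouped: 1 "yes" and 100 identical "no" observations, weight 1 each.
ungrouped = [(1, 1)] + [(0, 1)] * 100

# Grouped as described above: keep 10 "no" rows, each with weight 10.
grouped = [(1, 1)] + [(0, 10)] * 10

for p in (0.01, 0.2, 0.5):
    assert abs(weighted_ll(ungrouped, p) - weighted_ll(grouped, p)) < 1e-9
```

In SAS terms this behaves like a frequency weight: it compresses the data without changing the MLE, so by itself it is not a fix for the rare-event problem discussed above.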
j*****e posts: 182 | 18 To turnanyway: it is totally wrong to do what you said.
Since when does statistics throw away most of the data, analyze part of it, and hope everything will be OK?
Please understand how prediction is made for binary outcomes before you comment. |
t********y posts: 469 | 19 I'm quite sure you must add a weight to sparse data in order to run proc logistic, but you can choose different methods for constructing the weight; it's quite flexible. For example, for the x values you can take the group mean, median, and so on. I don't remember the exact steps; I saw people in another group do this before.
With proc genmod you can first group the data and then run your model:
model y/n = x /dist=bin link=logit;
[Quoting j*****e:] : To turnanyway, it is totally wrong to do what you said. : Since when statistics throws most of the data, analyzes part of it and hope everything will be OK? : Please understand how predition is made for binary outcome before you comment.
|
j*****e posts: 182 | 20 To turnanyway,
in proc genmod/logistic there is a WEIGHT statement. It is there to handle different formats of data input.
Also, if the data are case-control data and you know the true marginal probability of the outcome, you can adjust for it in proc logistic.
Neither situation is true here. |
t********y posts: 469 | 21 Sparse data.
Creating a weight variable for proc logistic is a method another modeling group at my company uses.
That grouped data gives a more reliable deviance than ungrouped data in proc genmod is something a professor taught at school.
Believe it or not.
And stop arguing with me; if you think what I said is wrong, just give the correct answer.
[Quoting j*****e:] : to turnanyway, : in proc genmod/logistic, there is a weight statement. This is to handle different format of data input. : Also, if the data is a case-control data and you do know the true marginal probability of the outcome, you can adjust it in proc logistic. : Neither situation is true here.
|
j*****e posts: 182 | 22 Turnanyway,
you are so rude and ignorant.
You can use weighting in logistic regression, but that weighting is used to accommodate non-iid sampling (such as case-control sampling).
Grouping the data will not change the value of the deviance. The purpose is to control the degrees of freedom, so that the deviance can be better approximated by a chi-square distribution for the model goodness-of-fit test.
I have been teaching categorical data analysis to graduate students for a couple of years, and analysis of cas |
t********y posts: 469 | 23 Clearly you are the one who is rude and ignorant. Posting under an alias doesn't make you an authority.
Whatever else I may be, at least I don't pose as an expert and shut other people down.
I kindly shared my experience for free, and all you do is declare it completely wrong and completely unusable.
What I said is backed by facts and experience; you don't have to read it if you don't like it. My binary-outcome model won a major award at my company, with very good model performance. Setting a weight variable in proc logistic is a method a senior lead at our company required us to use, validated by years of practice; on projects involving that much money, if the method were as untrustworthy and unworkable as you claim, the losses would have been unbearable.
You think being a TA is impressive? Basically every graduate student has been a TA.
[Quoting j*****e:] : Turnanyway, : you are so rude and ignorant. : You can use weighting in logistic regression. But this weighting is used to accomodate non-iid sampling(such as case-control sampling). : Grouping the data will not change the value of deviance. The purpose is to control the degree of freedom, so that the deviance can be better approximated by Chi-square distribution for model goodness of fit test. : I have been teaching Categorical data analysis to graduate students for a couple of years and analysis of cas
|
t********y posts: 469 | 24 jsdagre, I casually looked through some of your posting history, and I was absolutely floored!
So I'm not the only "ignorant" poster in your eyes; apparently only you are professional, only you know enough, and only you are qualified to lecture people here.
Now I understand, and I'm not angry anymore, haha~~~
From: jsdagre (na), Board: Statistics
Subject: Re: Is this a paired or independent samples test?
Posted: BBS 未名空间站 (Mon Mar 16 20:17:45 2009)
To red leaves,
Please read other people's responses carefully before you make your comments. Just because your method is easy, that doesn't make it right.
I got really irritated when some non-majors take a couple of intro-level stat courses, turn around, and start |