How to transform predictor variable? - Statistics board


r********e
Posts: 33

I was asked: how do you transform your predictor variables?
I said that if there is a long tail on the right side, I take the log to
make the variable's distribution more normal.
Then I was asked: what about a predictor variable with both positive
and negative values? How do you transform it?
I choked...

D******n
Posts: 2836

u don't transform it in the first place.

[In reply to r********e]

F****r
Posts: 151

If it has both negative and positive values, you can do a location shift
first to make them all positive or all negative... However, why was the
predictor variable transformation needed in the first place?

q******d
Posts: 158

in my experience, i would first want to know the business meaning of the
predictor variables, then take log, sqr, etc. depending on the variable.

[In reply to r********e]

A*******s
Posts: 3942

i have the same question too. can any NIUREN (expert) recommend some reviews
on predictor transformations?

a**x
Posts: 215

from a pure data perspective
u can take the log of the absolute value
then put the sign back
but of course, knowing the data is always crucial before u do any
transformation

[In reply to r********e]
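A minimal Python sketch of the sign-preserving log suggested above. The function names are hypothetical. Note that the literal form sign(x) * log(|x|) is undefined at zero and reverses the ordering for |x| < 1, so a commonly used variant, sign(x) * log(1 + |x|), is shown next to it as an assumption rather than as what the poster meant.

```python
import math

def signed_log(x: float) -> float:
    """Literal reading of the post: sign(x) * log(|x|). Undefined at x = 0."""
    if x == 0:
        raise ValueError("log(|x|) is undefined at x = 0")
    sign = 1.0 if x > 0 else -1.0
    return sign * math.log(abs(x))

def signed_log1p(x: float) -> float:
    """Assumed variant: sign(x) * log(1 + |x|). Defined at 0 and monotone."""
    return math.copysign(math.log1p(abs(x)), x)

values = [-100.0, -0.5, 0.0, 0.5, 100.0]
transformed = [signed_log1p(v) for v in values]
```

With the literal form, 0.5 maps to a negative number and -0.5 to a positive one, which is usually not what is wanted for a predictor; the log1p variant keeps the original ordering.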

p******r
Posts: 1279

Just shift all the values by adding a constant, and then take the log.
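A short Python sketch of the shift-then-log idea, under the assumption that the constant is chosen from the development sample so that the smallest value maps to a positive `margin`. The helper names and the default `margin=1.0` are hypothetical choices, not something the thread specifies.

```python
import math

def shifted_log(values, margin=1.0):
    """Shift so that min(values) maps to `margin` (> 0), then take the log.

    Returns the transformed values and the shift, which has to be stored
    and reused when scoring new data.
    """
    if margin <= 0:
        raise ValueError("margin must be positive")
    shift = margin - min(values)
    return [math.log(v + shift) for v in values], shift

def apply_shifted_log(value, shift):
    """Apply a previously fitted shift; fails if the new value falls too low."""
    if value + shift <= 0:
        raise ValueError("value is below the range seen when the shift was chosen")
    return math.log(value + shift)

out, shift = shifted_log([-5.0, 0.0, 10.0])
```

One design consequence: the shift depends on the sample minimum, so a new observation below that minimum can break the transform at scoring time, which is why the second helper checks for it.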

s********p
Posts: 637

For variables with both positive and negative values, you can transform the variable into a binned variable, which is very useful for capturing non-linear relationships between inputs and target. Something like interval->ordinal.
Log is not always optimal. Sometimes sqrt/sqr/cube etc. are better.
You can list the 20 to 30 commonly used transformation formulas, put each
transformed variable into a regression against the target, and pick the
transformation with the largest chi-square. It can be done easily with a
macro.

[In reply to r********e]

l***a
Posts: 12410

do you decide the best transformation by chi-square or r-square?
I normally use the largest r-square. a regression plot can also help
visually in judging how linear the relationship between iv and dv is

[In reply to s********p]
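A minimal Python sketch of the "pick the transformation with the largest r-square" screening described above, for a continuous target and a strictly positive predictor. The candidate list and function names are illustrative assumptions; a real screen would use whatever 20 to 30 formulas the modeler prefers.

```python
import math

def r_squared(xs, ys):
    """R-square of a simple linear regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

# Hypothetical candidate set; all of these assume x > 0.
CANDIDATES = {
    "identity": lambda x: x,
    "log": math.log,
    "sqrt": math.sqrt,
    "square": lambda x: x * x,
    "inverse": lambda x: 1.0 / x,
}

def rank_transforms(xs, ys, candidates=CANDIDATES):
    """Return (name, r_square) pairs sorted from best to worst."""
    scores = {name: r_squared([f(x) for x in xs], ys)
              for name, f in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

xs = [float(i) for i in range(1, 21)]
ys = [2.0 * math.log(x) + 1.0 for x in xs]   # target built from log(x)
ranking = rank_transforms(xs, ys)
```

The ranking only says which candidate is relatively best, so a quick look at the winning score (or a plot, as suggested above) is still needed to see whether the fit is actually good.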

y*****n
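Posts: 5016

1. When building models in industry, variable selection has to consider
business sense: every variable must have a reasonably explainable business
meaning. So not every variable can be transformed at will. For example, you
may find that the log form of FICO is more significant than FICO itself in a
regression, but can you use log(FICO) as a model variable? What does
log(FICO) mean in business terms? So don't waste your time; your boss won't
accept it.
2. You may be thinking about how things should work in theory, but the
reality of modeling in industry is: if a variable is not significant in the
regression in its own form, no transformation will make it very significant.
3. If you start with several hundred variables, trying every transformation
on each one is simply not feasible, for you or for the machine.
4. So what I do is: first run variable selection with every variable in its
original form (those with eminer can use interactive grouping to convert to
woe). Once the number of variables is down to under 20, consider whether any
of them needs a transformation. Again, in my experience, the change in
significance before and after a transformation is never dramatic.
5. Some bosses have occasionally asked whether I tried this or that
transformation of a variable. I confidently tell them the original variable
is so insignificant that it is not worth the time to try its transformations.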


k*****u
Posts: 1688

Isn't this what the Box-Cox transformation is for?
I vaguely remember doing this transformation in analysis of variance.

s********p
Posts: 637

I use WaldChiSq, but I think r-square is also fine.
The former is more intuitively tied to the p-value. Plus, if a variable is
itself very significant, WaldChiSq has more range to reflect the change
before and after the transformation.
A visual check is very intuitive and effective but not efficient. My macro
can find optimal transformations for 50~100 variables in about 15 mins,
which is very helpful in the early screening stage.

[In reply to l***a]
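A rough Python sketch of screening transformations by Wald chi-square in a one-predictor logistic regression. This is not the poster's macro; it is a plain Newton-Raphson fit written from scratch, it assumes a 0/1 target with no perfect separation, and all names are hypothetical.

```python
import math

def _score_and_info(xs, ys, b0, b1):
    """Gradient and information-matrix entries of the logistic log-likelihood."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        w = p * (1.0 - p)
        g0 += y - p
        g1 += (y - p) * x
        h00 += w
        h01 += w * x
        h11 += w * x * x
    return g0, g1, h00, h01, h11

def wald_chi_square(xs, ys, max_iter=50, tol=1e-10):
    """Wald chi-square (b1^2 / var(b1)) for the slope of logit(p) = b0 + b1*x."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0, g1, h00, h01, h11 = _score_and_info(xs, ys, b0, b1)
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det      # Newton step for the intercept
        d1 = (h00 * g1 - h01 * g0) / det      # Newton step for the slope
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    _, _, h00, h01, h11 = _score_and_info(xs, ys, b0, b1)
    var_b1 = h00 / (h00 * h11 - h01 * h01)    # slope entry of the inverse information
    return b1 * b1 / var_b1

def screen_transforms(xs, ys, transforms):
    """Wald chi-square of the slope for each candidate transformation."""
    return {name: wald_chi_square([f(x) for x in xs], ys)
            for name, f in transforms.items()}

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 1, 0, 1, 1, 0, 1]
results = screen_transforms(
    xs, ys, {"identity": lambda x: x, "log": math.log, "sqrt": math.sqrt})
```

In practice the same number would come from the regression output of a statistics package; the hand-rolled fit is only there to keep the example self-contained.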

s********p
Posts: 637

Very informative! However, I doubt 2).
If the relationship between iv and dv is non-linear, sometimes it will help.

[In reply to y*****n]

l***a
Posts: 12410

makes sense :)
the reason I don't make a fully automatic macro for this job is that I want
to factor some subjective judgment into the decision process. e.g., if the
transformation brings some improvement but not a big one (like from 0.85 to
0.87), sometimes I will just use the original var to keep things simple. it
depends on the nature of the var and the complexity of the transformation,
kind of a trade-off.
so the whole process needs to be reviewed manually. as a result, inserting
those plots won't add much work, given that I will browse all the
transformations anyway :)

[In reply to s********p]

l***a
Posts: 12410

in most of my cases, I don't think a transformation of a var will make more
business sense than the original var itself. -__-//... correct me if I am
wrong here
about woe, I don't have the software that you mentioned, so what I do is get
some help from decision tree software (spss answer tree in my case) to find
the categories for the var first, then calculate the woe scores for each
category. pretty cumbersome and time-consuming, but it works, and I only do
it after reducing the number of vars.

[In reply to y*****n]

y*****n
Posts: 5016

of course it helps, but it will not bring a variable from very insignificant
to very significant.
or, to put it another way: if a variable becomes very significant in a
regression after a non-linear transformation, then its original form is most
likely not too bad in the regression either.
you can try it on some big dataset and tell me if i am wrong.

[In reply to s********p]

s********p
Posts: 637

My understanding is that the transformation only concerns the data, while
the interpretation concerns the variable. Take revenue: after taking the
log, the model is simply much better. In business terms, higher revenue
means a higher likelihood to buy; whether it enters in original or log form
is a matter of scale, though of course there is some nonlinearity in there
too.
correct me if I am wrong here

[In reply to l***a]

s********p
Posts: 637

You are basically right; I just felt your earlier statement was a bit too
absolute. In the extreme case where iv and dv have a very strongly
non-linear relationship, transformation is useful. It is just that we often
ignore it, or the data itself offers other correlated variables that can
serve as substitutes.

[In reply to y*****n]

A*******s
Posts: 3942

theoretically, pure nonlinearity cannot be captured by a linear model, say
y=sin(x). however, in reality, 1) such cases rarely happen; 2) if there are
many variables, it is not necessary to introduce nonlinearity.

[In reply to s********p]

s********p
Posts: 637

yes, you are right, that is why I mentioned that "the data itself offers
other correlated variables that can serve as substitutes", which basically
agrees with your 2).
Also, even if "theoretically pure nonlinearity cannot be captured by a linear model", if we know the exact nonlinearity (basically impossible) or we have a large set of functions to fit, transformation is still very helpful.

[In reply to A*******s]


y*****n
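Posts: 5016

Hehe, I know I would get spanked if I said those things to a professor. But
when talking to a boss in a company you sometimes need to be a bit absolute,
especially with a boss who is not very technical; you have to keep
reassuring him with your own hands-on experience. If he senses that your
opinion is not confident and decisive enough, he will keep making you try
this and that, doing a lot of useless work.
Speaking of non-linear transformations, I do have an example: the effect of
age (from 20 to 100) on risk (caution: it is illegal to use age for modeling
in some cases). Obviously sqrt(age) should be more significant than age
itself. But sometimes age is eliminated in the first few rounds of variable
prescreening and selection, so there is no need to bother trying sqrt(age).

[In reply to s********p]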

s********p
Posts: 637

I agree with you; I just don't put things that absolutely, hehe.
Sometimes the model variables are not screened out but dictated by the
marketing folks, who feel that certain variables must go in. Then all you
can do is try this and try that.

[In reply to y*****n]

A*******s
Posts: 3942

there are some other reasons i think, like the danger of extrapolation and
some properties of high dimensions.

[In reply to s********p]

l***a
Posts: 12410

i agree

[In reply to s********p]

s********p
Posts: 637

So we have to do model validation

[In reply to A*******s]

n******r
Posts: 1247

If what you said is true, your company's stat team sucks

[In reply to y*****n]

n******r
Posts: 1247

y=x^2+e

[In reply to y*****n]
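A small Python illustration of the counterexample y = x^2 + e. To keep it deterministic, the noise term e is left out and x is placed on a grid symmetric around zero; both choices are assumptions made for the sketch. Under them, x has essentially zero linear correlation with y while x^2 is perfectly correlated with it.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5

xs = [i / 10 for i in range(-50, 51)]      # symmetric around 0
ys = [x * x for x in xs]                   # y = x^2, noise omitted

r_linear = pearson(xs, ys)                 # raw predictor
r_square = pearson([x * x for x in xs], ys)  # transformed predictor
```

The effect depends on x straddling the turning point of the curve; if x only covered positive values, the raw variable would already correlate strongly with y, which is closer to the situation the other side of the argument describes.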

a****m
Posts: 693

Here is a simple way to understand it:
when you do linear regression, it assumes that the variable or residual has
to be normally distributed if you want to apply a linear regression model.
Box-Cox is an ideal way to achieve that, especially since you can select the
transformation parameter according to some criterion.
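A minimal Python sketch of the Box-Cox transformation and of one common criterion for choosing its parameter, the profile log-likelihood of a normal model for the transformed values. The grid of lambda values and the function names are assumptions. Box-Cox requires strictly positive data, and it is most often applied to the response, since the normality assumption in linear regression concerns the residuals rather than the predictors.

```python
import math

def boxcox(x, lam):
    """Box-Cox transform: (x**lam - 1) / lam, or log(x) when lam == 0."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def profile_loglik(xs, lam):
    """Normal profile log-likelihood of lam, up to an additive constant."""
    n = len(xs)
    ys = [boxcox(x, lam) for x in xs]
    mean = sum(ys) / n
    var = sum((y - mean) ** 2 for y in ys) / n
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(x) for x in xs)

def best_lambda(xs, grid=None):
    """Grid search for the lambda with the highest profile log-likelihood."""
    if grid is None:
        grid = [i / 10 for i in range(-20, 21)]   # -2.0 ... 2.0
    return max(grid, key=lambda lam: profile_loglik(xs, lam))

# Right-skewed sample whose logarithm is symmetric, so lambda near 0 is expected.
sample = [math.exp(z / 4) for z in range(-8, 9)]
lam_hat = best_lambda(sample)
```

A coarse grid is usually enough here, because nearby lambda values give nearly identical fits and a round value such as 0 or 0.5 is easier to explain.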

s********p
Posts: 637

Textbook Ph.D.

[In reply to a****m]

y*****n
Posts: 5016

really? tell me how your superior stat team does things then. I am humble
enough to learn.

[In reply to n******r]


p******y
Posts: 2252

Saving a spot to learn.
Everyone, please don't fight!!!!

b*****n
Posts: 685

I agree with the first half: just make it positive, which only changes the
intercept, and then take the log.

[In reply to F****r]

y*****n
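Posts: 5016

Hehe, I'm not fighting. I just want to know whether this Nebula fellow is
already working in the financial statistics industry. If he is, I really do
want to know how they do things; maybe some companies do work differently,
and then I'll be prepared when I change jobs. If he is still job hunting and
just speaking from imagination, I'll laugh it off.

[In reply to p******y]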

y*****n
Posts: 5016

Hehe, you were giving me a typical textbook answer.
Can you find some real, large data, run a multivariate logistic regression,
and show me some evidence that the p value for x is >0.1 but the p value for
x^2 is <0.001? (I always get plenty of variables with p value <0.001 before
the final variable selection.)
If not, then can I safely ignore all variables with p value >0.1 and not
bother to try x^2?

[In reply to n******r]

m******s
Posts: 98

How do you define "significant"?
Is it based on the model results after the regression, or on the correlation
between iv and dv before the regression?

[In reply to y*****n]

n******r
Posts: 1247

FICO is in no way more explainable than ln(FICO).
Do you know how FICO is calculated and scaled? How do you know FICO is in a
linear relationship with your target variable?
Fair Isaac could one day apply a sqrt transformation to FICO and still give
it out as FICO. Would you still use it the same way, or would FICO^2 all of
a sudden make more sense to your boss?
If FICO is not in a good linear relationship with the target, your
prediction can be very off when you look at the performance by different
FICO bands, e.g. 580-620, 621-660, 661-720, 721-850, even though FICO is
significant in the model with a very, very small p-value. A proper
transformation that makes FICO linear with the target variable will fix this
problem.
Looking at the model performance for different FICO bands makes more
business sense, as you do expect different behaviors for prime and subprime
populations, and a good prediction model should be robust across different
dimensions.
Only looking at p just shows you don't know statistics.
BTW: I just happened to learn today which company you are working for, and
it confirmed what I said yesterday, that your company's stat team sucks.
Actually, your company doesn't even have a stat team, and again you truly
don't know statistics. If you are as humble as you said, please take my
advice to look at model performance by different predictor bands and apply a
proper transformation when necessary, as this will help your daily job in
collections.
Sorry if I sound offensive, but what you said in your original post was BS
and your company doesn't represent the industry.

[In reply to y*****n]
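A minimal Python sketch of the band-level check recommended above: compare the average predicted rate with the observed rate inside each score band. The band edges are the ones quoted in the post; the function name, the tuple layout of the result and the toy data are assumptions for illustration.

```python
def performance_by_band(scores, predicted, actual, bands):
    """For each (low, high) band return (low, high, count, mean_pred, mean_actual)."""
    rows = []
    for low, high in bands:
        idx = [i for i, s in enumerate(scores) if low <= s <= high]
        if not idx:
            rows.append((low, high, 0, None, None))
            continue
        mean_pred = sum(predicted[i] for i in idx) / len(idx)
        mean_actual = sum(actual[i] for i in idx) / len(idx)
        rows.append((low, high, len(idx), mean_pred, mean_actual))
    return rows

FICO_BANDS = [(580, 620), (621, 660), (661, 720), (721, 850)]

# Toy data, made up purely to exercise the function.
scores = [600, 610, 640, 700, 800, 810]
predicted = [0.3, 0.3, 0.2, 0.1, 0.05, 0.05]
actual = [1, 0, 0, 0, 0, 0]
rows = performance_by_band(scores, predicted, actual, FICO_BANDS)
```

A band where the mean prediction sits well away from the observed rate is the symptom described in the post: the variable can be highly significant overall and still be misfit in part of its range.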

A*******s
Posts: 3942

chill out man. both of u are Big Bulls and make sense.
I think yuxinyun believes his model is good enough. Introducing nonlinear
terms may help to increase AUC from .7138 to just .7142, for example. In
such a case, it's not worth rocking the boat if the boss does not 100% have
ur back.
My company has a very picky model review group. If u just tell them to look
at the P-values, they will suggest that ur boss fire u. :( But if u try some
new tricks to improve the model, they will ask ur boss to do more, so ur
boss still blames u. :) hard to get the balance... haha

[In reply to n******r]

y*****n
Posts: 5016

Did I tell you that I didn't use FICO bands? You are not a good listener in
the first place, and you always misinterpret what people have said. How can
a person with such an attitude be successful in a company?
Communicating with you is just a waste of time.
Keep your head high and good luck with your career development.

[In reply to n******r]

y*****n
Posts: 5016

if you have eminer, then you can use the "interactive grouping node". it can
not only bin each variable into woe, but also calculate the information
value and Gini for each variable. you can prescreen variables there. some
may argue that variables with low IV may still be picked up in the stepwise
regression. However, your bosses may want to see the "stand alone"
relationship between each model attribute and the target. therefore, you
want to make sure that each candidate variable has a high IV before the
stepwise regression (of course it is your call to decide the threshold).
However, if you don't have eminer, then yes, you may want to check the iv-dv
correlation AS WELL AS the distribution of the iv itself (i think these two
combined are something like an imperfect proxy of IV), and you have to code
the binning manually.
then in the regression (i tend to use eguide rather than eminer because it
lets me intervene more for all kinds of business concerns), you may use
stepwise with a critical value (0.05, or 0.01, or... all up to you), and
meanwhile you may want to check for multicollinearity by using proc reg from
time to time. after all of this, if you still end up with a model with more
variables than you prefer, you will manually drop some; you may drop those
with the lowest Wald Chi-square and/or those that will bring you trouble
(like business/management concerns, implementation concerns, etc).
anyway, we are serving the business rather than doing an academic study. if
your model cannot be accepted by the business folks, your effort is in vain.

[In reply to m******s]
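A minimal Python sketch of weight of evidence (WoE) and information value (IV) computed from per-bin counts of goods and bads, for readers without the grouping node described above. It uses the convention WoE = ln(share of goods / share of bads); some shops flip the sign, and the function name is hypothetical. Every bin is assumed to contain at least one good and one bad.

```python
import math

def woe_iv(bins):
    """bins: list of (n_good, n_bad) per bin. Returns (list of WoE, IV)."""
    total_good = sum(g for g, _ in bins)
    total_bad = sum(b for _, b in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        if good == 0 or bad == 0:
            raise ValueError("empty cell: merge bins or smooth the counts first")
        dist_good = good / total_good
        dist_bad = bad / total_bad
        woe = math.log(dist_good / dist_bad)
        woes.append(woe)
        iv += (dist_good - dist_bad) * woe
    return woes, iv

# Two bins with opposite good/bad mixes: strong separation, high IV.
woes, iv = woe_iv([(80, 20), (20, 80)])
# Two bins with identical mixes: no separation, IV of zero.
flat_woes, flat_iv = woe_iv([(50, 50), (50, 50)])
```

Because IV is computed one variable at a time, it gives exactly the "stand alone" view of each candidate that the post says management wants to see before stepwise selection.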

s********p
Posts: 637

I cannot find the "interactive grouping node" in EM, only the "interactive
binning node". Is that the node you mentioned? thanks!

[In reply to y*****n]


A*******s
Posts: 3942

to make the terms clear: IV here means information value, not independent
variable.

[In reply to y*****n]

y*****n
Posts: 5016

the "interactive grouping node" is in the "credit scoring" module; maybe
your version doesn't have that module. however, the "interactive binning
node" is similar, except that it won't calculate the woe (as far as i
remember).

[In reply to s********p]

y*****n
Posts: 5016

thank you. ^_^

[In reply to A*******s]

n******r
Posts: 1247

Dude, your company doesn't even have a stat team, and you started your post
with "When building models in industry, ..." and "You may be thinking about
how things should work in theory, but the reality of modeling in industry
is:" and followed it with BS? Since when does your company represent the
"industry"? You'd better stay in your company, where your boss won't approve
the use of things like log(FICO), as it is probably the only place where you
can be successful.

[In reply to y*****n]

n******r
Posts: 1247

Thanks man. I do need to chill out. If he hadn't used those "industry"
words, I wouldn't have bothered to argue with him. What he said about
transformation was very wrong and very misleading when he opened with
"industry". That's what pissed me off.
I have no problem with a boss asking what the business meaning of this
3-way interaction or 2-way interaction is. But the fact that a boss would
disapprove of a monotone transformation of a predictor in the name of
business sense just shows how miserable his team is.

[In reply to A*******s]

y*****n
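Posts: 5016

I really feel sorry for you. It looks like your self-confidence rests only
on which company you are sitting in (even if you are just serving tea
there). You think others couldn't get in? Whatever company you are at now,
the stat team at my previous company was no smaller than yours (we once had
a team of 20 statisticians across the company working on just a single
project. How about yours?). And back when I changed jobs I got plenty of
offers from the companies you worship, including those few near you in
Chicago (maybe even yours); one of them even got my H1B transferred. So
what? I still chose where I am now, because I don't need to squeeze into
those companies to find confidence. To this day several of those Chicago
firms keep contacting me. Also, my boss's resume would scare you to death;
what kind of company hasn't he worked at? He made director long ago, and he
still came over here three years ago. We all chose to come here. Do you
understand? Obviously you don't. You are like the old man sweeping the floor
in Zhongnanhai: apart from mentioning at every turn that he is from
Zhongnanhai and mocking other people's places for being small, he has no
other confidence.
Also, your saying that your boss would approve the use of things like
log(fico) just proves that you are only in the floor sweeper's role, because
the only level you interact with is the technical one. I am not. It is not
that my boss lacks statistical knowledge; it is that both of us have to deal
with higher-level management as well as front-line operations, and what both
of those want is something intuitive, simple and effective.
In fact you are just arguing for the sake of arguing. When the more
intuitive approach of binning/grouping is available, why use log(fico)?
I have worked for a few leading firms/banks in the financial industry.
The projects I have worked on cover mortgage, credit card, consumer
credit... I have built models in marketing, credit scoring, and collections
divisions... if I can't say "industry", can you?
If you really have the ability and the experience, then help the people on
this board find jobs, help them revise their resumes, and pass on some
interview experience. Attacking your peers only shows you have no class.

[In reply to n******r]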

n******r
Posts: 1247

Since when did a manager at Sallie Mae become so attractive? Your industry
experience seems enough to make you a director there. Does it give you
confidence to work for a company that does not have a stat team? Learning
some statistics and stopping the bullshitting is the right way to get
yourself some confidence.

[In reply to y*****n]

y*****n
Posts: 5016

Sigh... Suppose you are entirely right and I am entirely wrong. So what?
Since you have time to go on about bullshit at me, why not answer the
original poster's question comprehensively and in plain terms, and explain
to everyone in detail how you transform a hundred or even several hundred
predictor variables across the board and select among them efficiently? I
believe that if you did, the people on this board would thank you with
baozi. I will listen respectfully and promise not to interrupt.

[In reply to n******r]

D******n
Posts: 2836

Actually, however you do it, it comes out about the same.

[In reply to y*****n]

n******r
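Posts: 1247

Agreed, which is why I didn't describe specific methods: there are many, and
even within one company the methods keep improving.
On current projects we use treenet to go from nearly a thousand variables to
the top 50, then do empirical analysis, transformation, regression-based
variable selection, and add interactions, iterating over these steps.
In the past some started with variable clustering or principal components
for quick selection. The most complex and slowest but most effective
approach was selecting variables one at a time in logistic regression by
sensitivity analysis on Somers' D. The advantage of treenet is that data
preparation is easy and it finds both linear and nonlinear relationships;
the top 50 variables plus interactions generally give a very good model.
I won't get more specific; this is something you build up through
day-to-day practical experience, not something you pick up just by
listening.
As for being asked in an interview how to select from over a thousand
variables, my personal view is that unless you have hands-on experience, you
can only repeat superficial hearsay; don't count on holding an in-depth
discussion with an experienced interviewer.
As for the transformation question the original poster was asked, it is a
fair question even for a fresh graduate with no large-scale data experience.
The purpose of transformation is to make the predictor as linearly related
to the target as possible. Which transformation you use has nothing to do
with interpretability; which variable you use does.
In practice, I don't recommend what someone above suggested: writing a macro
that runs all the common transformations and automatically picks the one
with the largest Wald. The problem is that all the transformations may be
bad and the chosen one is only relatively better; the p value here cannot
fully summarize how good the linear relationship is, and the result may
differ little from doing nothing. In my experience, the variables that
matter most in meaning should be handcrafted, and for those where common
functions all work poorly, such as tub-shaped ones, fitting with 2-3 linear
splines may be the fastest and most effective.

[In reply to D******n]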
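A small Python sketch of the linear-spline idea for a tub-shaped relationship. The predictor is expanded into hinge terms max(0, x - knot), and a regression on those terms is piecewise linear. The knot at 50 and the coefficients below are made up for illustration; in practice the knots would be chosen from an empirical plot and the coefficients estimated by the regression.

```python
def linear_spline_basis(x, knots):
    """Return [x, max(0, x - k1), max(0, x - k2), ...] for the given knots."""
    return [x] + [max(0.0, x - k) for k in knots]

def spline_value(x, knots, intercept, coefs):
    """Evaluate intercept + sum(coef_j * basis_j(x))."""
    basis = linear_spline_basis(x, knots)
    return intercept + sum(c * b for c, b in zip(coefs, basis))

# A V-shaped ("tub") curve |x - 50| written with one knot:
# slope -1 below the knot, and -1 + 2 = +1 above it.
KNOTS = [50.0]
INTERCEPT = 50.0
COEFS = [-1.0, 2.0]

curve = [spline_value(float(x), KNOTS, INTERCEPT, COEFS) for x in range(0, 101, 10)]
```

Each hinge term enters the logistic or linear regression as an ordinary column, so the model stays linear in its parameters while the fitted shape bends at the knots.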


A*******s
Posts: 3942

good one. marked.
i do see people using splines in my company. i am curious, is anyone using
MARS in industry? it seems very promising to me: it automatically selects
variables and models nonlinearity and interactions. Wondering how it works
in reality.

[In reply to n******r]

D******n
Posts: 2836

As someone mentioned earlier, you can be as fancy as you like in
development, but once it gets to production, people may not be able to
implement it, or the cost is too high and the product simply won't sell,
unless your performance beats the usual methods by a lot. But for a lot of
modeling, logistic just seems to work that well.

[In reply to A*******s]

n******r
Posts: 1247

I don't have experience with MARS.

[In reply to A*******s]

n******r
Posts: 1247

logistic regression is an amazing thing.

[In reply to D******n]

s********p
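Posts: 637

Thanks, I learned a lot; very enlightening.
I'll respond to just one point. Since you brought up what I said above
("writing a macro that runs all the common transformations and automatically
picks the one with the largest Wald. The problem is that all the
transformations may be bad"), let me explain. I agree with what you said: if
all the transformations are bad, that basically settles that the variable is
not much use or can be replaced by other variables, and there is no need to
agonize over it. I usually do the transformation screening when I am down to
about 50 variables. In practice, for the more important variables, about 5
times out of 10 I can find a fairly good transformation, which helps the
modeling a lot.
I don't think my method is great; it isn't fancy and has no theory behind
it. It is simple, but it works in real work. Think about it: if 30 to 40
common simple transformations can't give a decent fit, what is the point of
that hypothetical transformation? Although I argued with yuxuxin, my view is
basically the same as his: if the variable itself is very insignificant,
basically don't agonize over it, unless for some reason you have to use it.
The hundred-plus models I have built over the past few years have basically
all gone this way.
I once read a book, called something like database marketing, whose author
used a genetic algorithm to control the parameters and pick the optimal ones
to find the best transformation; the simplest case was the box-cox
transformation. He claimed that whatever method others used, the lift of his
model would be 10% higher. I believe it; once you bring in optimization,
what is 10%? As for the linear splines you mentioned, I haven't tried them.
You say they may be the fastest and most effective, and I believe it; there
is nothing a spline can't fit perfectly. But I wouldn't try your method (nor
the one in that book). Why? Overfit!!!
Things built with simple methods are a bit rough but stable; the more finely
crafted something is, the less reliable it tends to be in use.

[In reply to n******r]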

s********p
Posts: 637

That simple sentence can't be said by someone who hasn't been through it.

[In reply to D******n]

s*********e
Posts: 1051

never expected this topic to become so hot.
i am wondering whether it comes down to monotonic versus non-monotonic
transformations. for monotonic ones it is extremely simple: WoE should
govern all monotonic transformations.
a simple test is to simulate as many monotonic transformations as you can
and then calculate WoE for each. I am sure you will get the same result.
for non-monotonic transformations, it might be a little tricky. however, you
can always use a spline to approximate it and then approximate the spline by
any functional form or piecewise.
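A short Python sketch of the "simple test" described above, under the assumption that bins are formed by rank (quantiles) and that the values have no ties. A strictly increasing transformation such as log or sqrt keeps the ordering, so every observation lands in the same bin, and any bin-level statistic such as WoE is therefore unchanged. The function name and sample values are made up.

```python
import math

def quantile_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 by rank (ties not handled)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

xs = [3.0, 120.0, 7.5, 0.4, 55.0, 18.0, 1.2, 300.0]

raw_bins = quantile_bins(xs, 4)
log_bins = quantile_bins([math.log(x) for x in xs], 4)
sqrt_bins = quantile_bins([math.sqrt(x) for x in xs], 4)
```

The invariance holds for rank-based bins only; with equal-width bins the cut points move under a transformation and the WoE values can differ.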

b*****n
Posts: 685

I feel that another purpose is to avoid outliers caused by a long tail? Is
my understanding wrong?

[In reply to n******r's point that the purpose of transformation is to make
the predictor as linearly related to the target as possible]
