DataSciences board - [Data Science Project Case] Data Monitoring
c***z
Posts: 6348
1
This is a project I worked on in the past. Still, it would be great to
hear what people might have done differently. :)
Say your team has a model with 300 input variables. As the market
changes, you need to check whether the model will still work before you run
it, since a run takes hours.
One way is to take a small sample and test-run the model. However, as
Simpson's paradox shows, samples can be misleading.
The other way is to compare the current data directly with historical data,
to check whether:
1) the current data is very different from the historical data;
2) the current data resembles a bad day on which the market did badly;
3) the current data resembles a bad day on which the model performed poorly.
What would you do?
Thanks!
l******n
Posts: 9344
2
You need to define what "your model will still work" means. The easy checks
do not make sense, because as long as the model's default assumptions
are not violated, it is always valid to say the model works. It is just a
matter of how good it is.

c***z
Posts: 6348
3
Sure, by "still work" I meant "work well". Precisely, the C-stat should be high
or the RMSE should be low.

l******n
Posts: 9344
4
C-stat and RMSE correspond to very different problems...
In general, look at the 10 most important factors in your old model, then
compare their effects with those on the new data. If they are not far off, you can
proceed to check more, or keep just one variable in the model and compare the
C-stat or RMSE. This goes beyond checking whether the model works, because it
also considers the contribution of each variable, at least the important ones.

c***z
Posts: 6348
5
Yeah, C-stat is for a binary response and RMSE is for a continuous response.
Your method (looking at the 10 most important factors) should work, as would
PCA, for medium-sized data. However, what if the data has billions of
rows? It still takes a lot of time to run the MapReduce job...
The next step of your method is like stepwise feature selection, if I
understood it right? That is time-consuming as well...
We are not doing model validation; sorry if I confused you. We just want to
see whether the data has changed.
Model validation is a separate issue. :)

l******n
Posts: 9344
6
You confuse me. You said you wanted to know whether the model works for the new
data; now you say you want to see whether the data has changed. These
are two completely different issues. For the first you have to involve the model
anyway; for the second you only need to check the data characteristics.

c***z
Posts: 6348
7
You are right. I need to improve my communication skills. :)
We only need to check the data characteristics.

s*********e
Posts: 1051
8
Population stability index (PSI).
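
For anyone who has not used it: PSI bins a baseline sample, then sums (q - p) * ln(q / p) over the bins, where p and q are the baseline and current shares in each bin. Below is a minimal sketch in plain Python. The function name `psi`, the choice of 10 quantile bins, and the `eps` floor for empty bins are my assumptions, not something specified in this thread.

```python
import bisect
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index of `actual` against the baseline `expected`."""
    # Bin edges are quantiles of the baseline sample (an assumed binning choice).
    ordered = sorted(expected)
    edges = [ordered[int(len(ordered) * i / n_bins)] for i in range(1, n_bins)]

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            # Bin index = number of edges less than or equal to v.
            counts[bisect.bisect_right(edges, v)] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = shares(expected), shares(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

baseline = list(range(1000))
shifted = [v + 500 for v in baseline]
print(psi(baseline, baseline))  # identical samples give zero
print(psi(baseline, shifted))   # a large shift gives a large PSI
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a major shift, but those cutoffs are conventions rather than anything derived, so tune them to your own false-alarm tolerance.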

D******n
Posts: 2836
9
The model is y|x. If the model is good, then even if the distribution of x changes, y|x is still
good.
Also, if your model output is used only for rank ordering, it might be more
resilient to changes in x.

c***z
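Posts: 6348
10
At the time I used the K-S test and a permutation test, comparing against the 30-day MA.
Later my boss wanted something simpler, so I just normalized the historical data to 0~1 and
raised an alert whenever the current data fell below 0 or above 1. Then I built two logit
regressions, with market performance and model performance as the responses.
To be honest, the bar for this project was not very high, and false alarms had no serious consequences. :)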
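
A rough sketch of the two checks described above, for a single variable, in plain Python. This is not the original project code. The function names are made up for illustration, the permutation test here reuses the K-S statistic as its test statistic (the post does not say which statistic was permuted), and `history` stands in for whatever baseline window you compare against, such as the last 30 days.

```python
import bisect
import random

def ks_statistic(a, b):
    """Two-sample K-S statistic: the largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for v in a_sorted + b_sorted:
        cdf_a = bisect.bisect_right(a_sorted, v) / len(a_sorted)
        cdf_b = bisect.bisect_right(b_sorted, v) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

def permutation_p_value(a, b, n_perm=500, seed=0):
    """Share of random relabelings whose K-S statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = ks_statistic(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if ks_statistic(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)

def out_of_range(history, current):
    """Min-max scale `current` by the historical range; return values outside [0, 1].

    Assumes `history` is not constant, otherwise the range is zero.
    """
    lo, hi = min(history), max(history)
    scaled = [(v - lo) / (hi - lo) for v in current]
    return [s for s in scaled if s < 0 or s > 1]

history = list(range(100))
today = [v + 100 for v in history]
print(ks_statistic(history, today))
print(permutation_p_value(history, today, n_perm=200))
print(out_of_range(history, today)[:3])
```

The range alert is much cheaper than the tests, since it only needs the stored min and max per variable, which fits the "simpler" requirement. The trade-off is that it only catches values outside the historical range and misses shifts inside it.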
M***e
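Posts: 531
11
Here is how I think about it:
If you have some idea of each feature's importance in the existing model, then when new data arrives,
if some feature's distribution is very different from the historical data and that feature is also important
to the existing model, you may need to rebuild the model. Otherwise you can use the existing model to produce an estimate.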
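
That rule can be written down as a simple filter. The sketch below assumes you already have a per-feature importance score and a per-feature drift score (PSI, a K-S statistic, or anything similar). The function name and both thresholds are hypothetical placeholders, not values from this thread.

```python
def features_needing_rebuild(importance, drift, importance_min=0.05, drift_min=0.25):
    """Return features that are both important to the model and strongly drifted."""
    return sorted(
        name
        for name, score in drift.items()
        if score >= drift_min and importance.get(name, 0.0) >= importance_min
    )

# Made-up scores purely for illustration.
importance = {"income": 0.40, "zip_code": 0.01, "age": 0.20}
drift = {"income": 0.30, "zip_code": 0.90, "age": 0.05}

flagged = features_needing_rebuild(importance, drift)
print(flagged)  # only features passing both thresholds
```

If the list is empty, keep scoring with the existing model. If it is not, the flagged features are where to start looking before a rebuild.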
w*********y
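Posts: 7895
12
Speaking from my own background: if you have a large enough sample, you can use special
sampling techniques to draw a subsample that represents the large sample.
This approach is widely used in education research... The point is that the result from your subsample
should be less biased...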
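
The post does not name a technique. Proportional stratified sampling is one common option, and it also keeps subgroup proportions fixed, which helps with the Simpson's paradox worry from the first post. A minimal sketch, with a hypothetical function name and a made-up `segment` field:

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction from every stratum so group proportions are preserved."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        # Keep at least one row per stratum so small groups are not dropped.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Illustrative data: 90 rows in segment A, 10 rows in segment B.
rows = [{"segment": "A", "x": i} for i in range(90)]
rows += [{"segment": "B", "x": i} for i in range(10)]

sample = stratified_sample(rows, key=lambda r: r["segment"], fraction=0.1)
print(len(sample))
```

Which variable to stratify on is the real design decision. It should be whatever grouping you suspect drives the model's behavior, such as market segment or date.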

w*********y
Posts: 7895
13
Thanks for sharing a project you worked on... very interesting!
