c***z 发帖数: 6348 | 1 This is a project I worked on in the past. Still, it would be great to
hear what people might have done differently. :)
Say your team has a model with 300 input variables. As the market
changes, you need to check whether the model will still work before you run
it, which takes hours.
One way is to take a small sample and test-run the model. However, we all
know Simpson's paradox, and samples can be misleading.
The other way is to compare the current data directly with historical data,
to check whether:
1) the current data is very different from the historical data;
2) the current data resembles a past day on which the market was bad;
3) the current data resembles a past day on which the model performed poorly.
What would you do?
Thanks! | l******n 发帖数: 9344 | 2 You need to define what "your model will still work" means. The easy checkups
do not make sense, because as long as the default assumptions of the model
are not violated, it is always valid to say the model works. It is just a
matter of how good it is.
| c***z 发帖数: 6348 | 3 Sure, by "still work" I meant work well; precisely, the C-stat should be high,
or the RMSE should be low.
| l******n 发帖数: 9344 | 4 C-stat and RMSE correspond to very different problems...
In general, look at the top 10 most important factors in your old model, then
compare their effects on the new data. If they are not far off, you can
proceed to check more, or just keep one variable in the model and compare the
C-stat or RMSE. This is more than checking whether the model works or not,
because it also considers the contribution of each variable, at least the important ones.
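A minimal sketch of that comparison, assuming a roughly linear model (the function and variable names here are hypothetical, not from the thread): estimate a one-variable effect for each important factor on the old and the new data, and flag factors whose effect has moved.

```python
import numpy as np

def univariate_effect(x, y):
    """Least-squares slope of y on a single centered feature x."""
    x = x - x.mean()
    return float(np.dot(x, y - y.mean()) / np.dot(x, x))

def drifted_factors(old_X, old_y, new_X, new_y, top_idx, tol=0.5):
    """Compare the marginal effect of each important factor on old vs.
    new data; return the factors whose effect moved by more than
    `tol` relative to its old magnitude."""
    drifted = []
    for j in top_idx:
        e_old = univariate_effect(old_X[:, j], old_y)
        e_new = univariate_effect(new_X[:, j], new_y)
        if abs(e_new - e_old) > tol * max(abs(e_old), 1e-12):
            drifted.append(j)
    return drifted
```

Factors that survive this cheap check could then go into the fuller C-stat/RMSE comparison.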
| c***z 发帖数: 6348 | 5 Yeah, those are for binary and continuous responses, respectively.
Your method (looking at the top 10 most important factors) should work, as
should PCA, for medium-sized data. However, what if the data has billions of
rows? It still takes a lot of time to run the MapReduce job...
The next step of your method is like stepwise feature selection, if I
understood it right? That is time-consuming as well...
We are not doing model validation, sorry if I confused you. We just want to
see whether the data has changed or not.
Model validation is a separate issue. :)
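For billions of rows, one cheap alternative (a sketch, not what the thread's project actually did) is a single streaming pass that accumulates per-column count, sum, and sum of squares, then compares current means against the historical spread:

```python
import numpy as np

def chunked_mean_std(chunks):
    """One streaming pass over chunks of a huge feature matrix:
    accumulate count, column sums, and column sums of squares."""
    n, s, ss = 0, None, None
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=float)
        n += chunk.shape[0]
        s = chunk.sum(axis=0) if s is None else s + chunk.sum(axis=0)
        ss = (chunk ** 2).sum(axis=0) if ss is None else ss + (chunk ** 2).sum(axis=0)
    mean = s / n
    std = np.sqrt(ss / n - mean ** 2)
    return mean, std

def mean_shift_alert(hist_mean, hist_std, cur_mean, z=3.0):
    """Flag columns whose current mean moved more than z historical sds."""
    return np.where(np.abs(cur_mean - hist_mean) > z * hist_std)[0]
```

Each chunk could be one mapper's output, so the whole check costs a single pass over the data rather than a model run.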
| l******n 发帖数: 9344 | 6 You confuse me. You said you wanted to know whether the model works on the new
data; now you say you want to see whether the data has changed or not. These
are two completely different issues. For the first you have to involve the model
anyway; for the second you only need to check the data characteristics.
| c***z 发帖数: 6348 | 7 You are right. I need to improve my communication skills. :)
We only need to check the data characteristics.
| s*********e 发帖数: 1051 | 8 population stability index
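For reference, a minimal PSI sketch over one score or feature, with bins taken from baseline quantiles (the common rule of thumb reads PSI < 0.1 as stable and > 0.25 as a significant shift):

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population stability index between a baseline sample (expected)
    and a current sample (actual)."""
    # Bin edges from baseline quantiles, so each bin holds ~1/bins of baseline.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    a_frac = np.histogram(actual, edges)[0] / len(actual)
    # Small floor avoids log(0) for empty bins.
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```

Because it only needs binned counts, PSI is easy to compute per variable even on very large data.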
| D******n 发帖数: 2836 | 9 The model is y|x. If the model is good, then even if the distribution of x changes, y|x is still
good.
Also, if your model's output is only used for rank ordering, it might be more
resilient to changes in x.
| c***z 发帖数: 6348 | 10 At the time I used the K-S test and a permutation test, comparing against a 30-day moving average.
Later my boss wanted something simpler, so we just normalized the historical data to 0~1 and
raised an alert whenever the current data fell below 0 or above 1; then we built two logit
regressions, with market performance and model performance as the responses.
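Those two checks could be sketched as follows (a minimal NumPy-only version; the actual project presumably used library K-S and permutation tests):

```python
import numpy as np

def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the two empirical CDFs."""
    pooled = np.sort(np.concatenate([x, y]))
    cdf_x = np.searchsorted(np.sort(x), pooled, side="right") / len(x)
    cdf_y = np.searchsorted(np.sort(y), pooled, side="right") / len(y)
    return float(np.max(np.abs(cdf_x - cdf_y)))

def range_alert(hist, current):
    """Scale by the historical min/max; alert on any column whose current
    values fall outside [0, 1], i.e., outside the historical range."""
    lo, hi = hist.min(axis=0), hist.max(axis=0)
    scaled = (current - lo) / (hi - lo)
    return np.any((scaled < 0) | (scaled > 1), axis=0)
```

The range alert is crude but costs one pass per column, which fits the "keep it simple, false alarms are cheap" setting described here.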
To be honest, the requirements for this project were not particularly high, and false alarms carried no serious consequences. :) | M***e 发帖数: 531 | 11 Here is how I think about it:
If you have some sense of each feature's importance in the original model, then when new data arrives:
if some feature's distribution is very different from the historical data, and that feature is important
to the original model, you probably need to rebuild the model; otherwise, you can use the original model
to produce an estimate. | w*********y 发帖数: 7895 | 12 From my background: if you have a large enough sample, you can use special
sampling techniques to draw a sample that represents the large one.
This approach is widely used in education research... that is to say, the results from
your sample should be less biased....
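A sketch of what such a sampling scheme might look like (proportionate stratified sampling; the strata variable is a hypothetical segmentation, e.g., market regime):

```python
import numpy as np

def stratified_sample(X, strata, frac, rng=None):
    """Proportionate stratified sampling: draw `frac` of the rows from
    each stratum, so every stratum appears at its population share."""
    if rng is None:
        rng = np.random.default_rng()
    picked = []
    for s in np.unique(strata):
        idx = np.where(strata == s)[0]
        k = max(1, int(round(frac * len(idx))))  # at least one row per stratum
        picked.append(rng.choice(idx, size=k, replace=False))
    return X[np.concatenate(picked)]
```

Compared with a simple random sample, this guarantees that small but important strata are not missed in the test run.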
| w*********y 发帖数: 7895 | 13 Thanks for sharing the project you worked on... very interesting!