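c****s Posts: 395 | 1 I was thinking back to an interview with a company a while ago.
They asked how many records I sample when building a model.
I said I had taken 10,000 out of 2.5 million.
The interviewer clearly looked down on that answer.
Afterwards I was a bit puzzled.
How many records to sample hardly counts as a technically deep question;
the sample size basically won't affect your model results.
Or have I been away from academia so long that this question is actually tricky? |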
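l***a Posts: 12410 | 2 Use as many as you have, and hold some out for validation.
[Quoting c****s's post] : I was thinking back to an interview with a company a while ago. : They asked how many records I sample when building a model. : I said I had taken 10,000 out of 2.5 million. : The interviewer clearly looked down on that answer. : Afterwards I was a bit puzzled. : How many records to sample hardly counts as a technically deep question; : the sample size basically won't affect your model results. : Or have I been away from academia so long that this question is actually tricky?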
|
c****s Posts: 395 | 3 It's not feasible to build the model on the whole dataset, or even half of it; it is huge.
[Quoting l***a's post] : Use as many as you have, and hold some out for validation.
|
s*r Posts: 2757 | 4 It depends on how many predictor variables you have and how complex the
models are. |
A*******s Posts: 3942 | 5 What kind of model did you use? For regression, millions of records are
practical in SAS.
[Quoting c****s's post] : It's not feasible to build the model on the whole dataset, or even half of it; it is huge.
|
c****s Posts: 395 | 6 Logistic regression.
I tried fitting on both the whole dataset and a sample.
The interesting thing is that with the whole dataset, some levels that are not
significant in the small sample become highly significant.
[Quoting A*******s's post] : What kind of model did you use? For regression, millions of records are practical : in SAS.
|
A*******s Posts: 3942 | 7 The most intuitive explanation is that you happened to sample a subset in
which those variables are not significant.
My further guess is that the smaller sample size introduces collinearity, which
then changes the coefficient estimates.
[Quoting c****s's post] : Logistic regression. : I tried fitting on both the whole dataset and a sample. : The interesting thing is that with the whole dataset, some levels that are not : significant in the small sample become highly significant.
|
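One more angle on the observation in post 6: even with no sampling luck, the standard error of a coefficient shrinks roughly like 1/sqrt(n), so the same modest effect gives a much larger Wald z-statistic on 2.5 million records than on 10,000. The sketch below is only an illustration, not the poster's model. It uses the textbook 2x2-table approximation for the standard error of a log odds ratio on expected cell counts instead of fitting a logistic regression, and every number in it (a level covering 5% of records, a 10% base event rate, an odds ratio of 1.2) is an assumption made up for the example. The function names are hypothetical.

```python
import math


def wald_z_for_level(n, level_share, base_rate, odds_ratio):
    """Approximate Wald z for one dummy-coded level vs. all other records.

    Uses expected 2x2 cell counts and the standard approximation
    SE(log OR) = sqrt(1/a + 1/b + 1/c + 1/d).
    """
    # Event rate inside the level implied by the assumed odds ratio.
    base_odds = base_rate / (1 - base_rate)
    level_odds = base_odds * odds_ratio
    level_rate = level_odds / (1 + level_odds)

    n_level = n * level_share
    n_rest = n - n_level

    a = n_level * level_rate        # events in the level
    b = n_level * (1 - level_rate)  # non-events in the level
    c = n_rest * base_rate          # events elsewhere
    d = n_rest * (1 - base_rate)    # non-events elsewhere

    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.log(odds_ratio) / se


def two_sided_p(z):
    """Two-sided p-value for a standard normal z-statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


# Assumed illustration values, not taken from the thread's data.
for n in (10_000, 2_500_000):
    z = wald_z_for_level(n, level_share=0.05, base_rate=0.10, odds_ratio=1.2)
    print(f"n={n:>9,}  z={z:6.2f}  p={two_sided_p(z):.3g}")
```

Under these assumed numbers the level falls short of the usual 5% threshold at n = 10,000 and clears it easily at n = 2,500,000, and the ratio of the two z-values is sqrt(250), the square root of the ratio of the sample sizes. Whether that is a reason to prefer the full dataset depends on whether an effect that small matters in practice.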