This page is an excerpt and archive of the corresponding thread on 未名空间.
Statistics board - Urgent: two conceptual questions about resampling
r******y
Posts: 39
1
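I'm a complete beginner in statistics and would be very grateful for help
with the two questions below.
Question 1:
I have a small population of size N = 150. I draw a sample of size n = 5
from it by sampling without replacement, and I repeated this 1000 times,
ending up with 1000 small samples of size 5 each. Does this resampling
method have a formal statistical name? Does it count as simulation?
Question 2:
The population consists of 150 people, and the variable I care about is
height. My goal in drawing these 1000 resamples is to find out: when the
sample size is n = 5, how likely is it that the mean height of a randomly
drawn sample (n = 5) is similar to, or different from, the mean height of
the population (N = 150)?
My question: I chose 1000 repetitions subjectively, because it felt like
about enough. Is there a formula that would let me show that 1000 draws
is sufficient?
Many thanks!!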
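A minimal sketch of the procedure described in this post. Repeatedly drawing small random samples from a fixed population to approximate the sampling distribution of the mean would usually be described as a Monte Carlo simulation. The thread does not include the actual height data, so the population below is made up, and the 2 cm cutoff for "different" is a hypothetical choice, as are the function and variable names.

```python
import random
import statistics

def simulate_sample_means(population, n, runs, seed=0):
    """Draw `runs` simple random samples of size n (without replacement
    within each sample) and return the mean of each sample."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(population, n)) for _ in range(runs)]

# Hypothetical population: 150 heights in cm (illustrative values only).
pop_rng = random.Random(42)
population = [pop_rng.gauss(170.0, 8.0) for _ in range(150)]
pop_mean = statistics.fmean(population)

# 1000 samples of size 5, as in the post.
means = simulate_sample_means(population, n=5, runs=1000)

# Assumed definition of "different": more than 2 cm from the population mean.
threshold = 2.0
p_hat = sum(abs(m - pop_mean) > threshold for m in means) / len(means)

print(f"population mean = {pop_mean:.2f}")
print(f"estimated P(|sample mean - population mean| > {threshold}) = {p_hat:.3f}")
```

Each call to `rng.sample` draws without replacement inside one sample, but the same subset of five people can appear again in a later run.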
w*******9
Posts: 1433
2
looks like you want to get the distribution of the mean of a size-5 random
sample. Since it is fairly quick to do such simple resampling why not use a
number as big as possible? (You can keep in mind the sqrt(size) rule
to gauge the accuracy of your approximation; or you can try different runs
and assess the variations between runs to determine if 1000 is enough; since
the total different number of combinations is C(150, 5), any number beyond
that is not helpful).
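A rough sketch of two ideas in this reply. It assumes the "sqrt(size) rule" refers to the usual Monte Carlo behaviour in which the error of a simulation estimate shrinks roughly in proportion to 1/sqrt(number of runs); the post does not spell this out. The population, the 2 cm threshold and the helper name `estimate_p` are hypothetical. The sketch repeats the whole simulation several times at each run count and reports how much the estimate varies between replicates.

```python
import math
import random
import statistics

# Number of distinct size-5 subsets of a 150-element population.
n_subsets = math.comb(150, 5)
print(f"C(150, 5) = {n_subsets:,}")

# Hypothetical population and threshold (illustrative values only).
pop_rng = random.Random(1)
population = [pop_rng.gauss(170.0, 8.0) for _ in range(150)]
pop_mean = statistics.fmean(population)
THRESHOLD = 2.0

def estimate_p(runs, seed):
    """One complete simulation: fraction of sample means farther than
    THRESHOLD from the population mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        sample_mean = statistics.fmean(rng.sample(population, 5))
        hits += abs(sample_mean - pop_mean) > THRESHOLD
    return hits / runs

# Repeat the whole simulation 20 times per run count and measure how much
# the estimate moves from one replicate to the next.
spread = {}
for runs in (250, 1000, 4000):
    estimates = [estimate_p(runs, seed) for seed in range(20)]
    spread[runs] = statistics.stdev(estimates)
    print(f"runs={runs:5d}  between-replicate SD of p_hat = {spread[runs]:.4f}")
```

If the 1/sqrt(runs) reading is right, quadrupling the number of runs should roughly halve the between-replicate spread, apart from noise in the 20-replicate estimate.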
r******y
Posts: 39
3
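Thank you very much for the reply!
Since it is fairly quick to do such simple resampling why not use a number
as big as possible?
=> A very good question, and it hits exactly the point that troubles me most.
1. The experiment was done in the past. I now have to justify why 1000 was
chosen as the number of draws, and I cannot go back and change it.
2. If I redid the experiment, I would probably choose a larger number of
draws, say 5000 or 10000. Even then, I would still need to explain why 5000
or 10000 is an appropriate number of draws.
since the total different number of combinations is C(150, 5), any number
beyond that is not helpful).
=> Good insight, thanks for the pointer!
or you can try different runs and assess the variations between runs to
determine if 1000 is enough
=> I have already done this step: I compared 1000, 2000 and 5000 draws and
the results changed little, but this way of checking feels rather crude. A
more clear-cut way to validate it would be even better.
You can keep in mind the sqrt(size) rule to gauge the accuracy of your
approximation
=> I'm not familiar with the sqrt(size) rule you mention in the first point.
Could you give some explanation or a reference? Thanks.
Also, I came across a paper: Determining the number of simulation runs:
Treating simulations as theories by not sampling their behavior
Paper link: http://acs.ist.psu.edu/papers/ritterSQKip.pdf
The paper makes several points: 1) the more draws, the better; 2) with many
draws the sample mean distribution looks normal, while with few draws it
does not; 3) the required number of draws can be calculated from the mean
effect size and a predefined power.
I can't go back and draw more samples now, but I wonder whether I can borrow
points 2 and 3 from the paper to solve my current problem. For example, I
could look at the sample mean distribution of my 1000 draws. If it looks
normal, can I then claim that the number of draws was sufficient? I am also
considering computing a required number of draws, but I could not work out
how the mean effect size in the paper is calculated.
I hope the poster above will keep replying. Thanks!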

w*******9
Posts: 1433
4
I was too quick to say any number beyond c(150, 5) is not helpful -- it's
wrong.
Method 3) in the paper you refer to is the most rigorous approach: it
calculates the minimum sample size needed to achieve a given confidence
interval width or, equivalently, a given power in a hypothesis test.
See for example http://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_power/BS704_Power4.html
If I understand your question correctly, you're trying to estimate the
probability (say p) that the mean of the sub-sample differs from the
whole-sample (population) mean by more than a given threshold.
Let's say you got an estimate p_hat from your previous simulation of 1000
runs. We can obtain an asymptotic 95% confidence interval for p as p_hat +-
1.96*sqrt(p_hat * (1 - p_hat) / 1000). See if you can argue that the width
of this interval is shorter than a pre-specified length that serves your
study purpose.
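A short sketch of this interval, assuming the runs are independent and that the standard error of p_hat is sqrt(p_hat * (1 - p_hat) / runs), which is the standard normal-approximation (Wald) form. The value p_hat = 0.30 and the function names are hypothetical. Solving the half-width expression for the number of runs gives a rough way to justify a run count: runs >= (1.96 / half_width)^2 * p * (1 - p), where p = 0.5 is the most conservative guess.

```python
import math

Z95 = 1.96  # normal quantile for a two-sided 95% interval

def wald_interval(p_hat, runs, z=Z95):
    """Asymptotic confidence interval for a proportion estimated from
    `runs` independent simulation runs."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / runs)
    return p_hat - half_width, p_hat + half_width

def runs_needed(half_width, p_guess=0.5, z=Z95):
    """Smallest number of runs whose interval half-width is at most
    `half_width`; p_guess = 0.5 is the worst case."""
    return math.ceil((z / half_width) ** 2 * p_guess * (1.0 - p_guess))

p_hat = 0.30  # hypothetical estimate from 1000 runs
low, high = wald_interval(p_hat, runs=1000)
print(f"95% CI for p with 1000 runs: ({low:.4f}, {high:.4f})")

for target in (0.03, 0.015):
    print(f"half-width <= {target}: at least {runs_needed(target)} runs")
```

Under these assumptions, 1000 runs pin p down to within about 3 percentage points in the worst case, which may or may not be tight enough depending on the purpose of the study.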
r******y
Posts: 39
5
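Thanks to '那山那水' for your time and reply. Impressive! Yes, your
understanding of my research question is correct :)
Although my research question is about comparing the sample mean with the
population mean, the problem I currently face is explaining why my
experiment used 1000 repeated draws. So what I need is not how to calculate
a sample size, but the number of simulation runs.
I also think the third method in that paper is the most accurate, but its
formula contains a variable called mean effect size, and nowhere does the
paper explain how this mean effect size is computed. It looks like Cohen's
d, yet that doesn't quite make sense, because intuitively a larger Cohen's d
should require more simulation runs, while the paper concludes the opposite
(see p. 28, Table 5). So I have been wanting to know what this mean effect
size actually represents.
I have one more question for you:
I was too quick to say any number beyond c(150, 5) is not helpful -- it's
wrong.
=> Why is it wrong?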
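The thread does not give the reason for the retraction. One plausible reading, offered here as an editorial assumption and not as the poster's own explanation: random draws repeat subsets, so C(N, n) Monte Carlo runs are not the same as enumerating all C(N, n) subsets once, and the simulation error keeps shrinking as runs are added. The sketch below uses a tiny made-up population (N = 12, n = 3) so that exhaustive enumeration is cheap; all values and names are hypothetical.

```python
import itertools
import math
import random
import statistics

rng = random.Random(7)
# Hypothetical tiny population so that every subset can be enumerated.
population = [rng.gauss(170.0, 8.0) for _ in range(12)]
pop_mean = statistics.fmean(population)
THRESHOLD = 4.0          # assumed cutoff for "different"
N_SUB = math.comb(12, 3)  # number of distinct size-3 subsets

# Exact answer: visit every distinct subset exactly once.
exact_hits = sum(
    abs(statistics.fmean(subset) - pop_mean) > THRESHOLD
    for subset in itertools.combinations(population, 3)
)
exact_p = exact_hits / N_SUB

# Monte Carlo with exactly as many runs as there are subsets.
seen = set()
mc_hits = 0
for _ in range(N_SUB):
    idx = tuple(sorted(rng.sample(range(12), 3)))
    seen.add(idx)
    sample_mean = statistics.fmean([population[i] for i in idx])
    mc_hits += abs(sample_mean - pop_mean) > THRESHOLD
mc_p = mc_hits / N_SUB

print(f"distinct subsets: {N_SUB}, visited by {N_SUB} random runs: {len(seen)}")
print(f"exact p = {exact_p:.4f}, Monte Carlo p = {mc_p:.4f}")
```

Because some subsets are drawn more than once and others never, the Monte Carlo estimate still carries sampling error at C(N, n) runs, and more runs continue to reduce it.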

b*****s
Posts: 11267
6
To call in the statistician after the experiment is done may be no more than
asking him to perform a post-mortem examination: he may be able to say what
the experiment died of.
Ronald Fisher

r******y
Posts: 39
7
:(
haha... points for bluntness...
Thanks for the reply. I happily accept this comment :(
