问个面试问题 - Statistics版

本页内容为未名空间相应帖子的节选和存档，一周内的贴子最多显示50字，超过一周显示500字访问原贴

Statistics版 - 问个面试问题

相关主题
● 求助 R sample in matrix form	● 请教一个Sample Size的问题在Categorical Data Analysis
● quant analyst 一道概率的面试题	● 一个Sample Size计算的问题
● [R]how to sample all possible continuous subset from ordered data	● 请教各位统计大拿一个categorical data的sample size 够不够大的问题
● 大牛们，帮忙解决个很小的统计问题	● 样本数量问题求助
● 弱问一个概念	● 请问dip sample （sampling）中文是什么？
● 如何重复运行sas程序100次，并把100次的结果全部output?	● 紧急请教两个关于resampling的概念问题
● 请问SAS里面有没有类似R里面的sample之类的抽样函数？如果没有怎样实现呢	● 菌斑有统计大拿吗？我问个问题 (转载)
● 海量SAS data的处理	● Help on understanding how to Creating a Random Sample without Replacement

相关话题的讨论汇总
话题: p2话题: p1话题: iter话题: ind话题: bernolli

进入Statistics版参与讨论

(共1页)

p*********g
发帖数: 226

给了 n 个独立的 Bernolli 随机变量X_i，p(X_i = 1) = p_i. 现要采样（X_1, ...,
X_n) 使得它们的和为 m (m <=n). 如何高效地采样？
当然规定了和之后，它们就不再独立了。
简单的方法：独立采样所有 X_i, 如果和不为m，扔掉，重新采。当然效率太低了。

a****m
发帖数: 693

p_1+p_2+...=m/n,
multiple normal distribution

p*********g
发帖数: 226

对不起，能具体点吗？

【在 a****m 的大作中提到】

: p_1+p_2+...=m/n,
: multiple normal distribution

k*****u
发帖数: 1688

怎么觉得像sequential的问题？
美丑一次就检查一下，小雨m，继续抽样，直到到了m为止。 sequential test是不是就是这么做的？以前学的东西都忘了
你说不是ｍ就扔掉，这样太浪费了吧

f***a
发帖数: 329

Here you are. coded in R. It is easy to read and get the idea directly from
the code. Have fun!
###############
n <- 10 # how many Bernolli variables
m <- 6 # how many "1"s you want
nout <- 100 # how many such samples you want
p <- runif(n) # pre-specified probability of Bernolli variables
pp <- p/sum(p)
out <- matrix(0,nout,n)
for(irun in 1:nout)
{
ind <- numeric(m)
iter <- 1
while(iter<=m)
{
res <- sum(runif(1) > cumsum(pp))+1
if(!(res %in% ind)) {ind[iter] <- res; iter <- iter+1}
}
out[irun,ind] <- 1
}
out # the sample!

p*********g
发帖数: 226

多谢。不过这里有个典型的陷阱，那就是这些随机变量在给定和之后就不再独立了。而
且每采完一个 X_i 之后，剩下的（条件）概率要随之调整。
举个反例吧。取 p1 = 1/2, p2 = 1/3, p3 = 1/4. 现要和 s = 2.
一方面
（*） p(x2 = 1 | s = 2) = p(x2 = 1, s = 2) / p(s = 2) = 2/3
（列举一下所有的8种可能，然后算这两个概率：
x1 x2 x3 prob
0 0 0 1/4
0 0 1 1/12
0 1 0 1/8
0 1 1 1/24
1 0 0 1/4
1 0 1 1/12
1 1 0 1/8
1 1 1 1/24
故 p(s = 2) = 1/24 + 1/12 + 1/8 = 1/4
p(x2 = 1, s = 2) = 1/24 + 1/8 = 1/6.)
另一方面，依照你的采样方法，列举一下采样到的概率：
先1 后2：6/13 * 4/7
先1 后3：6/13 * 3/7
先2 后1：4/13 * 2/3
先2 后3：4/13 * 1/3
先3 后1：3/13 * 3/5
先3 后2：3/13 * 2/5
故 p(x2 = 1 | s = 2)
= 6/13 * 4/7 + 4/13 * 2/3 + 4/13 * 1/3 + 3/13 * 2/5
= 302 / 455，
而非（*）里的 2/3。

from

【在 f***a 的大作中提到】

: Here you are. coded in R. It is easy to read and get the idea directly from
: the code. Have fun!
: ###############
: n <- 10 # how many Bernolli variables
: m <- 6 # how many "1"s you want
: nout <- 100 # how many such samples you want
: p <- runif(n) # pre-specified probability of Bernolli variables
: pp <- p/sum(p)
: out <- matrix(0,nout,n)
: for(irun in 1:nout)

p*********g
发帖数: 226

俺先确认对你的算法理解正确：
S = empty
loop until |S| = m
sample i with prob pi / sum_{j=1}^n p_j
if i is not in S
add i to S
end
end
return S

n <- 10 # how many Bernolli variables
m <- 6 # how many "1"s you want
nout <- 100 # how many such samples you want
p <- runif(n) # pre-specified probability of Bernolli variables
pp <- p/sum(p)
out <- matrix(0,nout,n)
for(irun in 1:nout)
{
ind <- numeric(m)
iter <- 1
while(iter<=m)
{
res <- sum(runif(1) > cumsum(pp))+1
if(!(res %in% ind)) {ind[iter] <- res; iter <- iter+1}
}
out[irun,ind] <- 1
}
out # the sample!

【在 f***a 的大作中提到】

f***a
发帖数: 329

你理解是对的，不过我的算法是错的。。。。
譬如p_i=1的时候
有空再想，要走了要走了

【在 p*********g 的大作中提到】

: 俺先确认对你的算法理解正确：
: S = empty
: loop until |S| = m
: sample i with prob pi / sum_{j=1}^n p_j
: if i is not in S
: add i to S
: end
: end
: return S
:

h***i
发帖数: 3844

去哪里？

【在 f***a 的大作中提到】

: 你理解是对的，不过我的算法是错的。。。。
: 譬如p_i=1的时候
: 有空再想，要走了要走了

k*******a
发帖数: 772

没怎么看懂题目。。。可不可以这样，
第一步，X_i中找和m/n 最近的那个p，抽样
第二步，如果第一个样品是1，那么剩下的找和 m-1/n-1最近的p的抽样
反之，找和 m/n-1最近的p抽样
。。。。直到最后一个

相关主题
● 如何重复运行sas程序100次，并把100次的结果全部output?	● 请教一个Sample Size的问题在Categorical Data Analysis
● 请问SAS里面有没有类似R里面的sample之类的抽样函数？如果没有怎样实现呢	● 一个Sample Size计算的问题
● 海量SAS data的处理	● 请教各位统计大拿一个categorical data的sample size 够不够大的问题
进入Statistics版参与讨论

a****m
发帖数: 693

不好意思，可能自己的理解有偏差：
m/(m1*m2*m3...(m-m1-m2-m3-...))*(p1)^m1*p2^m2*p3^m3...(m/n-p1-p2-p3-...)
感觉好像是求能得到这样的概率，而不是怎么有效的抽样。

【在 p*********g 的大作中提到】

: 对不起，能具体点吗？

p*********g
发帖数: 226

其实就是先定义一个(X1,...,Xn) 的联合分布，
p(X1,...,Xn) = \prod_i p(X_i)
p(X_i = 1) = pi.
然后按
p(X1, ..., Xn | \sum_i X_i = m)
抽样。

【在 k*******a 的大作中提到】

: 没怎么看懂题目。。。可不可以这样，
: 第一步，X_i中找和m/n 最近的那个p，抽样
: 第二步，如果第一个样品是1，那么剩下的找和 m-1/n-1最近的p的抽样
: 反之，找和 m/n-1最近的p抽样
: 。。。。直到最后一个

p*********g
发帖数: 226

能给点 intuition 为什么这样能 work 吗？多谢。

【在 k*******a 的大作中提到】

f***a
发帖数: 329

回来了回来了。
重新想了下，这个其实就是在一堆iid variables之间加了一个constraint。貌似
sample起来不难。
以最简单的n=2,m=1为例：
Without constraint, outputs space is {(0,0),(0,1),(1,0),(1,1)}.
The corresponding probability space is {(1-p1)*(1-p2), ..., p1*p2}.
With constraint, outputs space is O={(0,1),(1,0)}.
The corresponding probability space is P={(1-p1)*(p2), p1*(1-p2)}.
Under the constrain, standardize the probability space into
P.std={P1/(P1+P2),P2/(P1+P2)}.
Then under constrain, output (0,1) has the probability P1/(P1+P2) to be
sampled and (1,0) has the probability P2/(P1+P2) to be sampled.
Generalized this logic into arbitrary numbers for n and m. And the R code is
listed below.
##############
n <- 5 # how many Bernolli variables
m <- 3 # how many "1"s you want
set.seed(123)
p <- round(runif(n),2) # pre-specified probabilities for Bernolli variables
library(wle) #try: binary(15,dim=4)$binary
tt <- t(matrix(unlist(apply(data.frame(x=0:(2^n-1)),1,
function(t)binary(t,dim=n)$binary)),n,))
ind <- which(apply(tt,1,sum)==m)
sample.l <- tt[ind,]#this is the list of all possible outputs!
p.s <- matrix(rep(p,times=nrow(sample.l)),nrow(sample.l),,byrow=TRUE)
p.s1 <- p.s; p.s2 <- 1-p.s;
p.s1[sample.l==0] <- 0
p.s2[sample.l==1] <- 0
p.ss <- p.s1+p.s2
p.out <- apply(p.ss,1,prod)
p.out.std <- p.out/sum(p.out) #this is the list of probabilities

#w.r.t. all possbile outputs!
N <- 100 #how many such samples you want
ind.s <- sample(1:nrow(sample.l),N,replace =TRUE, prob =p.out.std )
res <- sample.l[ind.s,]# Here you are!

p*********g
发帖数: 226

a****m
发帖数: 693

p_1+p_2+...=m/n,
multiple normal distribution

p*********g
发帖数: 226

对不起，能具体点吗？

【在 a****m 的大作中提到】

: p_1+p_2+...=m/n,
: multiple normal distribution

k*****u
发帖数: 1688

f***a
发帖数: 329

p*********g
发帖数: 226

相关主题
● 样本数量问题求助	● 菌斑有统计大拿吗？我问个问题 (转载)
● 请问dip sample （sampling）中文是什么？	● Help on understanding how to Creating a Random Sample without Replacement
● 紧急请教两个关于resampling的概念问题	● 如何解释OUTPUT
进入Statistics版参与讨论

p*********g
发帖数: 226

f***a
发帖数: 329

你理解是对的，不过我的算法是错的。。。。
譬如p_i=1的时候
有空再想，要走了要走了

【在 p*********g 的大作中提到】

: 俺先确认对你的算法理解正确：
: S = empty
: loop until |S| = m
: sample i with prob pi / sum_{j=1}^n p_j
: if i is not in S
: add i to S
: end
: end
: return S
:

h***i
发帖数: 3844

去哪里？

【在 f***a 的大作中提到】

: 你理解是对的，不过我的算法是错的。。。。
: 譬如p_i=1的时候
: 有空再想，要走了要走了

k*******a
发帖数: 772

a****m
发帖数: 693

: 对不起，能具体点吗？

p*********g
发帖数: 226

其实就是先定义一个(X1,...,Xn) 的联合分布，
p(X1,...,Xn) = \prod_i p(X_i)
p(X_i = 1) = pi.
然后按
p(X1, ..., Xn | \sum_i X_i = m)
抽样。

【在 k*******a 的大作中提到】

p*********g
发帖数: 226

能给点 intuition 为什么这样能 work 吗？多谢。

【在 k*******a 的大作中提到】

f***a
发帖数: 329

I*****a
发帖数: 5425

sample(1:n, size = m, prob = c(p1, p2, ..., pn) )

.,

【在 p*********g 的大作中提到】

: 给了 n 个独立的 Bernolli 随机变量X_i，p(X_i = 1) = p_i. 现要采样（X_1, ...,
: X_n) 使得它们的和为 m (m <=n). 如何高效地采样？
: 当然规定了和之后，它们就不再独立了。
: 简单的方法：独立采样所有 X_i, 如果和不为m，扔掉，重新采。当然效率太低了。

t*******2
发帖数: 384

Re

【在 I*****a 的大作中提到】

: sample(1:n, size = m, prob = c(p1, p2, ..., pn) )
:
: .,

(共1页)

进入Statistics版参与讨论

相关主题
● Help on understanding how to Creating a Random Sample without Replacement	● 弱问一个概念
● 如何解释OUTPUT	● 如何重复运行sas程序100次，并把100次的结果全部output?
● 请教怎样可以得到这样的一个dataset?	● 请问SAS里面有没有类似R里面的sample之类的抽样函数？如果没有怎样实现呢
● 请大家帮我看看这个R的问题吧？	● 海量SAS data的处理
● 求助 R sample in matrix form	● 请教一个Sample Size的问题在Categorical Data Analysis
● quant analyst 一道概率的面试题	● 一个Sample Size计算的问题
● [R]how to sample all possible continuous subset from ordered data	● 请教各位统计大拿一个categorical data的sample size 够不够大的问题
● 大牛们，帮忙解决个很小的统计问题	● 样本数量问题求助

相关话题的讨论汇总
话题: p2话题: p1话题: iter话题: ind话题: bernolli

#	版面	帖数(主题数)
-	全站	4871 (796)
1	Military	3777 (569)
2	Stock	341 (51)
3	Joke	117 (17)
4	History	116 (3)
5	Automobile	100 (9)
6	USANews	55 (9)
7	Midlife	45 (1)
8	Headline	41 (41)
9	Dreamer	33 (13)
10	FleaMarket	32 (20)
11	Living	30 (7)

boards

未名新帖统计// 7月16日

历史上的今天