DataSciences board - Worked solutions to 40 classic DS/ML interview questions, seeking guidance
e*******n
Posts: 872
1
Original questions:
http://www.mitbbs.com/article_t/DataSciences/10029.html
Starting a dedicated thread to try answering them one by one. I'm a beginner;
corrections from the experts are welcome.
1. Given a coin, you don't know whether it's fair or unfair. Throw it 6 times
and get 1 tail and 5 heads. Determine whether it's fair or not. What's your
confidence value?
My answer:
H0: the coin is fair
Ha: the coin is unfair
X is the number of heads
Rejection region: |X - 3| > 2, i.e., X = 0, 1, 5, or 6
Significance level alpha:
alpha = P(reject H0 | H0 is true)
      = P(X = 0, 1, 5, 6 | H0 is true)
      = (choose(6,0) + choose(6,1) + choose(6,5) + choose(6,6)) * (1/2)^6
      = (1 + 6 + 6 + 1) * 0.5^6 = 0.21875
Because alpha > 0.05, we do not have enough evidence to reject H0; failing to
reject H0, we treat the coin as fair.
I don't know how to compute the confidence value; please advise.
c*******2
Posts: 8
2
This is a classic hypothesis-testing problem; what the OP describes is the
exact binomial test.
Another common choice is the test built on the Central Limit Theorem, i.e.,
the one-sample proportion test taught in intro stat courses. The test
statistic is Z = (p_hat - p) / sqrt(p(1-p)/N), compared against the standard
normal distribution.
A confidence interval can likewise be derived from the CLT.
There is also a harder version of this problem: when p_hat is very small,
using the CLT is problematic. A concrete example is ad click rates, where
10,000 impressions may yield only 1 or 2 clicks; estimating a confidence
interval with the CLT is inappropriate there. This is reportedly one of
Google's frequent interview questions. For improved methods, see the Wilson
estimator, or take a Bayesian approach; the OP can read the Wikipedia article:
http://en.wikipedia.org/wiki/Binomial_proportion_confidence_int
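To make the above concrete, here is a minimal Java sketch (mine, not the
poster's) of the one-sample proportion z statistic and the Wilson score
interval; all names are made up for the illustration:

public class ProportionTest {
    // z = (pHat - p0) / sqrt(p0 (1 - p0) / n), compared against N(0, 1)
    static double zStatistic(double pHat, double p0, int n) {
        return (pHat - p0) / Math.sqrt(p0 * (1 - p0) / n);
    }

    // Wilson score interval for a binomial proportion; z is the normal
    // quantile for the desired confidence level (e.g. 1.96 for 95%).
    static double[] wilsonInterval(int successes, int n, double z) {
        double pHat = (double) successes / n;
        double denom = 1 + z * z / n;
        double center = (pHat + z * z / (2 * n)) / denom;
        double half = z * Math.sqrt(pHat * (1 - pHat) / n
                + z * z / (4.0 * n * n)) / denom;
        return new double[]{center - half, center + half};
    }

    public static void main(String[] args) {
        // 5 heads in 6 throws, testing p0 = 0.5
        System.out.println(zStatistic(5.0 / 6, 0.5, 6));  // ~1.63
        double[] ci = wilsonInterval(5, 6, 1.96);
        System.out.printf("Wilson 95%% CI: [%.3f, %.3f]%n", ci[0], ci[1]);
    }
}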

【Quoting e*******n】

l*****c
Posts: 31
3
Supporting this thread; waiting to read more ~~
t********6
Posts: 43
4
Question 1 can also be done with Bayes' rule:
Prior: Beta(a, b), where the choice of a and b depends on how strong you want
the prior to be
Likelihood: Y ~ iid Bernoulli(p), so H ~ Binomial(n, p), where H is the total
number of heads and T is the total number of tails
Posterior: Beta(a + H, b + T)
For how to get a confidence interval for Beta(a + H, b + T), see
http://stats.stackexchange.com/questions/82475/calculate-the-confidence-interval-for-the-mean-of-a-beta-distribution
1. dbeta() (requires R)
2. Bootstrap from rbeta() simulation (requires R)
3. Asymptotic confidence intervals, as described above
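For reference, the conjugacy step behind that posterior, written out
(standard Beta-Binomial algebra, not spelled out in the post):

p(\theta \mid H, T)
  \propto \underbrace{\theta^{H}(1-\theta)^{T}}_{\text{Binomial likelihood}}
  \cdot \underbrace{\theta^{a-1}(1-\theta)^{b-1}}_{\mathrm{Beta}(a,b)\ \text{prior}}
  = \theta^{a+H-1}(1-\theta)^{b+T-1}
  \;\propto\; \mathrm{Beta}(a+H,\; b+T)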
t********6
Posts: 43
5
Let me piggyback a question about one of the 40, on rare events:
26. If I want to build a classifier, but the data is very unbalanced (I have
a few positive samples but a lot of negative samples), what should I do?
This question seems to be about how to train on very-low-probability events
such as click-through rate or credit fraud. My thoughts so far; corrections
from the experts are welcome:
1. Resampling reduces noise, but resampling does not reduce bias.
2. Case-control matching, as in clinical trials. Downsides: slow, and the
choice of control subjects is rather arbitrary.
3. Empirical Bayes, with the empirical rate as the prior. This is the most
plausible idea I have so far, but I have not used it. Are there good packages
for it in R or Python? If you have done this, please share.
B********4
Posts: 7156
6
I was confused by this for a long time too, and looked up quite a few
references, but I am still not 100% sure. I hope someone with statistics
training can correct me.
The p-value is the probability, under the null hypothesis (H0), of seeing the
observed outcome or anything more extreme. Here, that is the probability of 5
or 6 heads in 6 throws assuming the coin is fair; since the test is
two-sided, by symmetry we also include the cases of 1 head or 0 heads. The
OP's p-value computation is correct.
The significance level α is just a pre-set threshold for judging the p-value
(i.e., how small a probability counts as a rare event). If p <= α, we
conclude the null hypothesis is wrong (observing a supposedly rare event
under H0 suggests H0 is at fault); conversely, if p > α, we do not reject H0.
According to Wikipedia, "given the availability of a hypothesis testing
procedure that can test the null hypothesis θ = θ0 against the alternative
that θ ≠ θ0 for any value of θ0, then a confidence interval with confidence
level γ = 1 − α can be defined as containing any number θ0 for which the
corresponding null hypothesis is not rejected at significance level α." So if
we set α to 5%, we can say the confidence level is 95%.
But I don't think that is the answer the interviewer really wants, because α
is set in advance (it could be 10%, 5%, or 1%) and has nothing to do with the
observed data; it is only used to accept or reject the hypothesis. Here the
computed p-value is fairly large, so we should not reject the hypothesis that
the coin is fair. But how likely is that conclusion to be correct? I think
Hoeffding's inequality can be used to get this confidence value (confidence
level).
Code tail = 0 and head = 1. The sample mean u is 5/6, the mean v under H0 is
1/2, and the difference is 1/3. By Hoeffding's inequality,
Pr(|u - v| > 1/3) <= 2 exp(-2 * (1/3)^2 * 6) = 2 e^{-4/3} ≈ 53%
so the confidence value is 47%.
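For reference, the general inequality being applied here, for n i.i.d.
observations bounded in [0, 1] with sample mean \bar{X}_n:

\Pr\!\left(\lvert \bar{X}_n - \mathbb{E}[X] \rvert \ge t\right) \le 2\exp(-2 n t^2),
\qquad n = 6,\ t = \tfrac{1}{3}:\quad 2e^{-2 \cdot 6 \cdot (1/3)^2} = 2e^{-4/3} \approx 0.527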

【Quoting e*******n】

B********4
Posts: 7156
7
One more approach: we need not stick to the conventional significance levels
of 5% or 1%. Since we computed p-value = 0.22, we can set α = 0.22, which
makes the confidence level 0.78. This does not contradict the 47% I gave
earlier, because Hoeffding's inequality only yields a fairly loose upper
bound.

【Quoting B********4】

l******8
Posts: 1691
8
Regarding your permutation algorithm: maybe I missed it, but I don't see how
you handle repeated letters producing duplicate strings?

【Quoting e*******n】

n*****n
Posts: 100
9
My understanding of the confidence value is that it expresses that the
observed value/difference is not due purely to sampling randomness. For this
specific problem, if the coin were fair, P(T=1, H=5 | fair) = choose(6,5) *
(1/2)^6 = 3/32. In other words, a fair coin would produce this outcome with
probability only 3/32, so we may conclude the coin is not fair, and the
confidence level for "not fair" is 1 - 3/32.

【Quoting B********4】

e*******n
Posts: 872
10
Original questions:
http://www.mitbbs.com/article_t/DataSciences/10029.html
Starting a dedicated thread to try answering them one by one. I'm a beginner;
corrections from the experts are welcome.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
1. Given a coin, you don't know whether it's fair or unfair. Throw it 6 times
and get 1 tail and 5 heads. Determine whether it's fair or not. What's your
confidence value?
My answer:
H0: the coin is fair
Ha: the coin is unfair
X is the number of heads
Rejection region: |X - 3| > 2, i.e., X = 0, 1, 5, or 6
Significance level alpha:
alpha = P(reject H0 | H0 is true)
      = P(X = 0, 1, 5, 6 | H0 is true)
      = (choose(6,0) + choose(6,1) + choose(6,5) + choose(6,6)) * (1/2)^6
      = (1 + 6 + 6 + 1) * 0.5^6 = 0.21875
Because alpha > 0.05, we do not have enough evidence to reject H0; failing to
reject H0, we treat the coin as fair.
I don't know how to compute the confidence value; please advise.
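A quick numerical check of the computation above (a sketch added for
illustration, not part of the original answer):

// Two-sided exact binomial p-value: P(X in {0, 1, 5, 6}), X ~ Binomial(6, 0.5).
public class ExactBinomial {
    // Iterative binomial coefficient; the product stays integral at each step.
    static long choose(int n, int k) {
        long c = 1;
        for (int i = 1; i <= k; i++) c = c * (n - i + 1) / i;
        return c;
    }

    public static void main(String[] args) {
        double p = (choose(6, 0) + choose(6, 1) + choose(6, 5) + choose(6, 6))
                * Math.pow(0.5, 6);
        System.out.println(p);  // 0.21875
    }
}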
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
3. Which regression methods are you familiar with? How do you evaluate
regression results?
Answer:
I'm familiar with the Lasso and Ridge methods.
They are both linear models, and the prediction takes the form
f(x) = beta_0 + sum_{j=1}^p beta_j x_j
We can evaluate the regression results using the mean squared error (MSE):
1/n sum_{i=1}^n (y_i - beta_0 - sum_{j=1}^p beta_j x_{ij})^2
To learn the coefficients, we solve
- Ridge:
min sum_{i=1}^n (y_i - beta_0 - sum_{j=1}^p beta_j x_{ij})^2
    + lambda sum_{j=1}^p beta_j^2
- Lasso:
min sum_{i=1}^n (y_i - beta_0 - sum_{j=1}^p beta_j x_{ij})^2
    + lambda sum_{j=1}^p |beta_j|
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
4. Write down the formula for logistic regression. How to determine the
coefficients given the data?
Answer:
* Formula: assume a binary classification problem with y in {1, 0}:
Pr(y=1|x) = exp(beta' x) / (1 + exp(beta' x))
Pr(y=0|x) = 1 / (1 + exp(beta' x))
where beta is the coefficient vector.
Predict y = 1 if Pr(y=1|x) >= Pr(y=0|x), and y = 0 otherwise.
* Determining the coefficients from data: suppose we have n data points
{(xi, yi)}, i = 1, ..., n, where yi in {1, 0}.
We find beta by maximizing the likelihood:
max_beta g(beta),
where the objective g(beta) is the log-likelihood
g(beta) = sum_i log [ Pr(y=yi|x=xi) ]
        = sum_i [ yi beta'xi - log(1 + exp(beta'xi)) ]
We optimize this objective with the Newton-Raphson update; in each iteration
beta^new = beta^old - [(g''(beta))^-1 g'(beta)]|_(beta=beta^old)
where
g'(beta) = sum_i xi (yi - Pr(y=1|x=xi)),
g''(beta) = - sum_i xi xi' Pr(y=1|x=xi) (1 - Pr(y=1|x=xi)).
Defining z = [y1, ..., yn]',
p = [Pr(y=1|x=x1), ..., Pr(y=1|x=xn)]',
W = diag(Pr(y=1|x=x1)(1 - Pr(y=1|x=x1)), ..., Pr(y=1|x=xn)(1 - Pr(y=1|x=xn))),
X = [x1; ...; xn],
we have g'(beta) = X'(z - p) and g''(beta) = - X' W X.
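As a concrete illustration of fitting the coefficients, here is a minimal
Java sketch that maximizes g(beta) by plain gradient ascent (a simpler
iteration than the Newton-Raphson update above, using the same gradient
g'(beta)); the data and names are made up:

// Logistic regression fitted by gradient ascent on the log-likelihood
// g(beta) = sum_i [ yi beta'xi - log(1 + exp(beta'xi)) ].
public class LogisticGradientAscent {
    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    // X: n-by-p design matrix (first column of 1s for the intercept),
    // y: labels in {0, 1}. Returns the fitted coefficient vector beta.
    static double[] fit(double[][] X, int[] y, double lr, int iters) {
        int n = X.length, p = X[0].length;
        double[] beta = new double[p];
        for (int t = 0; t < iters; t++) {
            double[] grad = new double[p];
            for (int i = 0; i < n; i++) {
                double dot = 0;
                for (int j = 0; j < p; j++) dot += beta[j] * X[i][j];
                double resid = y[i] - sigmoid(dot);  // yi - Pr(y=1|x=xi)
                for (int j = 0; j < p; j++) grad[j] += X[i][j] * resid;
            }
            for (int j = 0; j < p; j++) beta[j] += lr * grad[j];  // ascent step
        }
        return beta;
    }

    public static void main(String[] args) {
        double[][] X = {{1, 0.5}, {1, 1.5}, {1, 3.0}, {1, 4.5}};
        int[] y = {0, 0, 1, 1};
        double[] beta = fit(X, y, 0.1, 2000);
        System.out.println("beta0 = " + beta[0] + ", beta1 = " + beta[1]);
    }
}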
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
5. How do you evaluate regression?
For example, in this particular case:
item   click-through-rate   predicted rate
1      0.04                 0.06
2      0.68                 0.78
3      0.27                 0.19
4      0.52                 0.57

Answer:
Use the mean squared error (MSE):
1/n sum_i (click_through_rate_i - predicted_rate_i)^2
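A quick sketch of that metric in code, run on the table above (mine, added
for illustration):

// Mean squared error between observed and predicted rates.
public class MseDemo {
    static double mse(double[] observed, double[] predicted) {
        double sum = 0;
        for (int i = 0; i < observed.length; i++) {
            double d = observed[i] - predicted[i];
            sum += d * d;
        }
        return sum / observed.length;
    }

    public static void main(String[] args) {
        double[] ctr  = {0.04, 0.68, 0.27, 0.52};
        double[] pred = {0.06, 0.78, 0.19, 0.57};
        System.out.println(mse(ctr, pred));  // 0.004825
    }
}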
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
6. What’s the formula for SVM? What is the decision boundary?
Answer:
The (soft-margin linear) SVM classifier is
f(x) = w'x
with w found by solving
min_{w, xi_i} 1/2 ||w||_2^2 + C sum_i xi_i
s.t. for every i:
1 - y_i w'x_i <= xi_i, 0 <= xi_i.
Decision boundary:
In a statistical classification problem with two classes, a decision
boundary or decision surface is a hypersurface that partitions the
underlying vector space into two sets, one for each class. The classifier
assigns all points on one side of the decision boundary to one class and all
points on the other side to the other class. For this SVM the decision
boundary is the set {x : f(x) = 0}.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
7. A field has an unknown number of rabbits. Catch 100 rabbits and put a
label on each of them. A few days later, catch 300 rabbits and find 60 with
labels. Estimate how many rabbits there are.
This is a point-estimation (mark and recapture) problem. The answer is
N = 100 * (300/60) = 500, and this number is the maximum likelihood estimator.
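A sketch of why this ratio is the MLE (the standard mark-recapture argument,
not spelled out above): with n1 = 100 marked, a recapture of n2 = 300
containing m = 60 marked, the likelihood in N is hypergeometric, and it
increases exactly while N <= n1*n2/m:

L(N) = \frac{\binom{n_1}{m}\binom{N - n_1}{\,n_2 - m\,}}{\binom{N}{n_2}},
\qquad
\frac{L(N)}{L(N-1)} \ge 1 \iff N \le \frac{n_1 n_2}{m}
\quad\Rightarrow\quad
\hat{N} = \frac{100 \times 300}{60} = 500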
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
8. Given 10 coins with 1 unfair coin and 9 fair coins. The unfair coin has a
2/3 probability of heads. Now randomly select 1 coin and throw it 3 times.
You observe head, head, tail. What's the probability that the selected coin
is the unfair one?
This question tests Bayes' rule.
Answer:
P(unfair coin | observe head, head, tail)
= 1 - P(fair coin | observe head, head, tail)
By Bayes' rule:
P(fair coin | observe head, head, tail)
= P(fair coin) * P(observe head, head, tail | fair coin) /
  [P(fair coin) * P(observe head, head, tail | fair coin) +
   P(unfair coin) * P(observe head, head, tail | unfair coin)]
where
P(fair coin) = 9/10
P(unfair coin) = 1/10
P(observe head, head, tail | fair coin) = (1/2)^3 = 1/8
P(observe head, head, tail | unfair coin) = (2/3)^2 * (1/3) = 4/27
Plugging in:
P(fair coin | observe head, head, tail)
= (9/10 * 1/8) / (9/10 * 1/8 + 1/10 * 4/27)
≈ 0.8836
so P(unfair coin | observe head, head, tail)
≈ 1 - 0.8836
= 0.1164
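The same computation as a quick numerical check (a minimal sketch, not part
of the original post):

// Posterior probability the selected coin is the unfair one, given H, H, T.
public class UnfairCoin {
    public static void main(String[] args) {
        double priorFair = 0.9, priorUnfair = 0.1;
        double likFair = Math.pow(0.5, 3);                    // (1/2)^3 = 1/8
        double likUnfair = Math.pow(2.0 / 3, 2) * (1.0 / 3);  // (2/3)^2 * (1/3) = 4/27
        double posteriorUnfair = priorUnfair * likUnfair
                / (priorFair * likFair + priorUnfair * likUnfair);
        System.out.println(posteriorUnfair);  // ~0.1164
    }
}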
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
12. Generate all the permutations of a string.
For example, abc, acb, cba, ...
Answer:
import java.util.ArrayList;

public class permuteSolution {
    public ArrayList<ArrayList<Character>> permute(String str1) {
        char[] num = str1.toCharArray();
        ArrayList<ArrayList<Character>> result =
                new ArrayList<ArrayList<Character>>();
        // start from an empty list
        result.add(new ArrayList<Character>());
        for (int i = 0; i < num.length; i++) {
            // lists built in the current iteration over the array num
            ArrayList<ArrayList<Character>> current =
                    new ArrayList<ArrayList<Character>>();
            for (ArrayList<Character> l : result) {
                // number of insertion positions is largest index + 1
                for (int j = 0; j < l.size() + 1; j++) {
                    // insert num[i] at position j
                    l.add(j, num[i]);
                    ArrayList<Character> temp = new ArrayList<Character>(l);
                    current.add(temp);
                    // undo the insertion of num[i]
                    l.remove(j);
                }
            }
            result = new ArrayList<ArrayList<Character>>(current);
        }
        return result;
    }
}
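A quick usage sketch (mine, not the poster's). As pointed out downthread,
repeated letters make this routine emit duplicate permutations; collecting
the results into a Set is one simple way to deduplicate:

import java.util.ArrayList;
import java.util.LinkedHashSet;

public class PermuteDemo {
    public static void main(String[] args) {
        ArrayList<ArrayList<Character>> perms =
                new permuteSolution().permute("aba");
        // 3! = 6 lists, but only 3 distinct strings due to the repeated 'a'
        LinkedHashSet<String> distinct = new LinkedHashSet<String>();
        for (ArrayList<Character> p : perms) {
            StringBuilder sb = new StringBuilder();
            for (char c : p) sb.append(c);
            distinct.add(sb.toString());
        }
        System.out.println(perms.size() + " lists, " + distinct.size()
                + " distinct: " + distinct);
    }
}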
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
17. What’s the difference between classification and regression?
Answer:
Classification tries to separate the dataset into classes of the response
variable. Usually the response variable has two classes, Yes or No (1 or 0),
but the target variable can also have more than two categories.
Regression tries to predict a numeric or continuous response variable, for
example, the predicted price of a consumer good.
The main difference between classification and regression lies in the
response they try to predict: a continuous response for regression versus a
discrete class label for classification.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
18. Can you explain how a decision tree works? How to build a decision tree
from data?
Answer:
A decision tree has decision blocks and terminating blocks where some
conclusion has been reached. Each decision block tests one feature/variable/
predictor. Making a decision at a decision block leads us to a left or right
branch, which is either another decision block or a terminating block. To
build the tree from data, each decision block is chosen as the feature split
that best separates the training data (for example, by information gain),
recursing until a stopping criterion is reached.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
19. What is regularization in regression? Why do regularization? How to do
regularization?
Answer:
Regularization is a method to improve a linear regression model by shrinking
the coefficients toward zero. The reason to do this is to keep the variables
relevant to the response and remove the irrelevant ones, so that both the
prediction accuracy and the interpretability of the model improve. The way to
do regularization is to add a regularization term to the objective of the
regression problem and optimize it. This term can be the l1 norm (lasso) or
the squared l2 norm (ridge) of the coefficient vector.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
20. What is gradient descent? What is stochastic gradient descent?
Answer:
Gradient descent is a first-order optimization algorithm. To find a local
minimum of a function using gradient descent, one takes steps proportional
to the negative of the gradient (or of the approximate gradient) of the
function at the current point. If instead one takes steps proportional to
the positive of the gradient, one approaches a local maximum of that
function; the procedure is then known as gradient ascent.
In both gradient descent (GD) and stochastic gradient descent (SGD), you
update a set of parameters in an iterative manner to minimize an error
function.
While in GD, you have to run through ALL the samples in your training set to
do a single update for a parameter in a particular iteration, in SGD, on
the other hand, you use ONLY ONE training sample from your training set to
do the update for a parameter in a particular iteration.
Thus, if the number of training samples is large, in fact very large, then
using gradient descent may take too long, because in every iteration you run
through the complete training set just to update the parameter values once.
On the other hand, SGD is faster because it uses only one training sample and
starts improving right away from the first sample.
SGD often converges much faster than GD, but the error function is not as
well minimized as in the case of GD. In most cases, though, the close
approximation that SGD gives for the parameter values is good enough.
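To make the contrast concrete, here is a minimal sketch of both update rules
on a toy 1-D least-squares problem (an illustration I added; the data is made
up):

import java.util.Random;

// Batch GD vs. SGD for the model yHat = w * x, squared-error loss.
public class GdVsSgd {
    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {2.1, 3.9, 6.2, 7.8};  // roughly y = 2x
        double lr = 0.01;

        // Batch gradient descent: every update runs through ALL samples.
        double wGd = 0;
        for (int t = 0; t < 1000; t++) {
            double grad = 0;
            for (int i = 0; i < x.length; i++)
                grad += (wGd * x[i] - y[i]) * x[i];  // d/dw of 1/2 (w x - y)^2
            wGd -= lr * grad / x.length;
        }

        // Stochastic gradient descent: every update uses ONE random sample.
        double wSgd = 0;
        Random rng = new Random(42);
        for (int t = 0; t < 1000; t++) {
            int i = rng.nextInt(x.length);
            wSgd -= lr * (wSgd * x[i] - y[i]) * x[i];
        }

        System.out.println("GD:  w = " + wGd);   // both approach w ~ 2
        System.out.println("SGD: w = " + wSgd);
    }
}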
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
29. Write a function to compute sqrt(x). Write a function to compute
pow(x, n) (square root and power).
Answer:
// Bisection search for sqrt(x), for x >= 0.
public double sqrt1(double x) {
    double result = x / 2;
    double low = 0;
    double high = Math.max(x, 1);  // covers 0 <= x < 1, where sqrt(x) > x
    while (Math.abs(x / result - result) > 0.00001) {
        if (x > result * result) {
            low = result;
        } else {
            high = result;
        }
        result = (high + low) / 2;
    }
    return result;
}
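The pow(x, n) half of the question has no answer in the post; here is a
minimal sketch using fast (binary) exponentiation, written to match the style
above:

// Computes x^n in O(log |n|) multiplications by repeated squaring.
public double pow(double x, int n) {
    if (n < 0) {
        // -(n + 1) avoids overflow when n == Integer.MIN_VALUE
        return 1.0 / (pow(x, -(n + 1)) * x);
    }
    double result = 1.0;
    double base = x;
    while (n > 0) {
        if ((n & 1) == 1) result *= base;  // fold in the current bit of n
        base *= base;                      // square for the next bit
        n >>= 1;
    }
    return result;
}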
b*****s
Posts: 11267
19
The sample here is too small to use the CLT.
We usually check the success/failure condition (both counts >= 10) to decide
whether the CLT applies.
N)

【在 c*******2 的大作中提到】
: 这是一个经典的Hypothesis testing问题吧,LZ说的是exact binomial test.
: 比较常用的,还有利用Central Limit Theorem构造的test, 也就是intro stat课教的
: one sample proportion test, test statistics 就是 Z=(p_hat-p)/ sqrt(p(1-p)/N)
: , 比较standard normal distribution就可以。
: confidence interval也是可以根据CLT推得。
: 通常这个问题还有更难的一个版本,就是 p_hat 很小的时候,用CLT是有问题的,具体
: 的例子比如,ads click rate, 可能10000次里面只有1~2次click,这时候,要估计
: confidence interval用CLT是不合适的,这好像也是Google常考面试题之一,具体的改
: 进方法,可以参考Wilson estimator,或者用Bayes的思路,LZ可以读读Wiki的解释:
: http://en.wikipedia.org/wiki/Binomial_proportion_confidence_int

b*****s
Posts: 11267
20
Ha is two-sided, so you have to account for the symmetry.
As for that inequality you cited at the end, it only gives an upper bound,
which is less useful than α; with the type I error controlled at 5%, the
error rate is already fairly low.

【Quoting B********4】
