Data scientist--Zillow电面 - DataSciences版 - 未名存档

本页内容为未名空间相应帖子的节选和存档，一周内的贴子最多显示50字，超过一周显示500字访问原贴

DataSciences版 - Data scientist--Zillow电面

相关主题
● 求Google 的 Data Science 有关的位置内推 (转载)	● Tumblr HQ NYC refer (转载)
● Hive的表里面的timestamp类型数据，怎么显示？	● Career opportunities-Data Scientist (Permanent positions)
● Impala v Hive	● 求解一个水塘抽样题 (转载)
● 请问大家有没有直接用java全程写mapreduce的程序的？	● Big data是下一个大坑吗
● 如何学习Hadoop?	● Re: 请问大数据问题和以前的数据挖掘有什么区别？ (转载)
● Mapreduce Java	● 妹纸物理phd转data science求建议
● data scientist的五个方面	● F家DS,analytics电面面经，贡献一个sql相关 (转载)
● san bruno ds position	● Twitter Data Scientist 电面题目

相关话题的讨论汇总
话题: int话题: attack话题: itr话题: data话题: count

进入DataSciences版参与讨论

1

(共1页)

e*********5 发帖数: 137	1 前几天刚面过Zillow data scientist，校友内推拿到面试。动作很快，简历刚投一天 technical recruiter就联系，过了两天第一个电面，就是了解一些背景，问都用什么编程，聊了大概20分钟。我只用C++, matlab, R.但他们似乎对懂python的有兴趣，我说正开始学。当天就发了一道题过来，让设计算法，选择自己喜欢的语言实现。我用 C++写，第二天发了答案，很快就给回复，说继续下一轮，跟hiring manager聊。又是聊背景，问dissertation. 不过发现大多时候是我在说，hiring manager只是问了几个简单问题。我也不知道我的回答她是否满意。本来recruiter说可能有一道coding，但是没有。到现在还没收到回复，反正是第一个面试，我觉得更大可能性是没戏了。对了，发现内推拿面试的机会大很多呀。海投了20份，基本没什么消息呀，偶尔有两个是直接拒，面试都没！ Anyway, 上题目！觉得不难，我用C++实现。 Suppose there is a website tracking user activities to prevent robotic attack on the Internet. Here are field definition and example: Anonymous User ID TimeStamp Activity Count 123 9:45am 10 234 9:46am 12 234 9:50am 20 456 9:53am 100 123 9:55am 33 456 9:56am 312 123 10:03am 110 123 10:16am 312 234 10:20am 201 456 10:23am 180 123 10:25am 393 456 10:27am 312 1. Please design an algorithm to identify user IDs that have more than 500 activities within any given 10 minutes. Explain why you chose your algorithms and how you would like to implement it on a very large data set. 2. Please code the algorithm taking the attached data as input with your choice of programming language. Please send the code with comments and output.
T*****u 发帖数: 7103	2 这个是标准的map reduce吗？
e*********5 发帖数: 137	3 我具体实现只是针对fit in memeory的数据。对大规模数据的处理就是再额外解释了。我没具体实现过map reduce，只是了解一些相关知识。
a****k 发帖数: 117	4 For this task, it is best to design a hive database to store those structured data. Hive is designed based on MapReduce and all hive queries are implemented using MapReduce. After that, we can just write SQL/Hive queries to extract the relevant information. For this question, we can use a self-join SQL query: SELECT distinct a.id FROM (SELECT a.id, a.time FROM table a JOIN table b ON a.id= b.id WHERE TIMEDIFF(a.time, b.time)<=10 AND TIMEDIFF(a.time, b.time)>=0 GROUP BY a.id, a.time HAVING SUM(b.count)>500); In this way, it is easier to maintain and update the data. We also make the data processing and analysis automatic and easy. Of course, if we need to rush for some codes to solve the problem where the data is only stored in excel or txt file, we could just follow how SQL (for small dataset) and Hive (for big dataset) implement the above query.
H*H 发帖数: 472	5 这个似乎只考虑了join两个的情况，要是单个timestamp就超过了500 activities,就不用join了。还有三个timestamp在10分钟内出现，就应该sum三个activities之和；依此类推，还要考虑更多的timestamp都在10分钟内出现的情况，就应该sum更多activities之和。 a ON the 【在 a****k 的大作中提到】 : For this task, it is best to design a hive database to store those : structured data. Hive is designed based on MapReduce and all hive queries : are implemented using MapReduce. After that, we can just write SQL/Hive : queries to extract the relevant information. For this question, we can use a : self-join SQL query: : SELECT distinct a.id FROM (SELECT a.id, a.time FROM table a JOIN table b ON : a.id= : b.id WHERE TIMEDIFF(a.time, b.time)<=10 AND TIMEDIFF(a.time, b.time)>=0 : GROUP BY a.id, a.time HAVING SUM(b.count)>500); : In this way, it is easier to maintain and update the data. We also make the
a****k 发帖数: 117	6 GROUP BY a.id, a.time will include all relevant timestamps (either 1 (itself ), 2 or more) for each event. 【在 H*H 的大作中提到】 : 这个似乎只考虑了join两个的情况，要是单个timestamp就超过了500 activities,就不 : 用join了。 : 还有三个timestamp在10分钟内出现，就应该sum三个activities之和；依此类推，还要 : 考虑更多的timestamp都在10分钟内出现的情况，就应该sum更多activities之和。 : : a : ON : the
H*H 发帖数: 472	7 Sorry, I missed your code. It looks that the trick is to group by a, while count with b. itself 【在 a****k 的大作中提到】 : GROUP BY a.id, a.time will include all relevant timestamps (either 1 (itself : ), 2 or more) for each event.
B***i 发帖数: 724	8 吐槽一下，一个简单的coding 题就被 data scientist 们玩成这样了？ map-r ? hive? you must be kidding me.
f*****y 发帖数: 822	9 请问楼主如何用C++实现的。我对CS不了解，如果是数据库的话，这个就是一个self join的hive code。多谢。【在 e*********5 的大作中提到】 : 前几天刚面过Zillow data scientist，校友内推拿到面试。动作很快，简历刚投一天 : technical recruiter就联系，过了两天第一个电面，就是了解一些背景，问都用什么 : 编程，聊了大概20分钟。我只用C++, matlab, R.但他们似乎对懂python的有兴趣，我 : 说正开始学。当天就发了一道题过来，让设计算法，选择自己喜欢的语言实现。我用 : C++写，第二天发了答案，很快就给回复，说继续下一轮，跟hiring manager聊。又是 : 聊背景，问dissertation. 不过发现大多时候是我在说，hiring manager只是问了几个 : 简单问题。我也不知道我的回答她是否满意。本来recruiter说可能有一道coding，但 : 是没有。 : 到现在还没收到回复，反正是第一个面试，我觉得更大可能性是没戏了。对了，发现内 : 推拿面试的机会大很多呀。海投了20份，基本没什么消息呀，偶尔有两个是直接拒，面
l*********o 发帖数: 3091	10 #include #include #include #include using namespace std; class Element { public: int time; int count; Element(int t, int c) { time = t; count = c; } }; class IPRecord { public: queue q; int total_attack; void add_to_queue(int t, int c) { Element ele = Element(t, c); q.push(ele); total_attack += c; } void normalize_queue() { while (q.back().time - q.front().time > 10) { total_attack -= q.front().count; q.pop(); } } }; int main() { map dict; ifstream infile("C:\mfg\thefile.txt"); int ip; int time; int count; while (infile >> ip >> time >> count) { map::iterator itr = dict.find(ip); if (itr != dict.end()) { itr->second.add_to_queue(time, count); itr->second.normalize_queue(); if (itr->second.total_attack > 500) { cout << "find ip " << ip; cout << " total attack " << itr->second.total_attack << endl; } } else { IPRecord rec = IPRecord(); rec.add_to_queue(time, count); dict.insert(pair(ip, rec)); if (rec.total_attack > 500) { cout << "find ip " << ip; cout << " total attack " << itr->second.total_attack << endl; } } } return 0; } output: find ip 123 total attack 705 【在 f*****y 的大作中提到】 : 请问楼主如何用C++实现的。我对CS不了解，如果是数据库的话，这个就是一个self : join的hive code。多谢。
e*********5 发帖数: 137	11 我不懂Hive。对我来说就是一个算法题。用一个队列queqe保存最近10分钟的的记录，一个map保存最近十分钟内的id跟activity count 。依次扫描记录。把比当前记录早于10分钟的记录从队列中删除，把当前记录加入队列，更新map中的计数，如果计数>500,那就是一个潜在attacker。
f*******3 发帖数: 206	12 Why not implement a map reduce job where mapper is to aggregate on customer id, and the aggregation function in the reducer is to loop through a day with ten minute moving window, signal 1 if the sum of counts greater than 500 and break, else return 0. activity 【在 e*********5 的大作中提到】 : 我不懂Hive。对我来说就是一个算法题。 : 用一个队列queqe保存最近10分钟的的记录，一个map保存最近十分钟内的id跟activity : count 。依次扫描记录。把比当前记录早于10分钟的记录从队列中删除，把当前记录 : 加入队列，更新map中的计数，如果计数>500,那就是一个潜在attacker。

1

(共1页)

进入DataSciences版参与讨论

相关主题
● Twitter Data Scientist 电面题目	● 如何学习Hadoop?
● How to prepare for the DS interview?	● Mapreduce Java
● 都用了spark了吗？	● data scientist的五个方面
● DS对数据库需要了解多少？	● san bruno ds position
● 求Google 的 Data Science 有关的位置内推 (转载)	● Tumblr HQ NYC refer (转载)
● Hive的表里面的timestamp类型数据，怎么显示？	● Career opportunities-Data Scientist (Permanent positions)
● Impala v Hive	● 求解一个水塘抽样题 (转载)
● 请问大家有没有直接用java全程写mapreduce的程序的？	● Big data是下一个大坑吗

相关话题的讨论汇总
话题: int话题: attack话题: itr话题: data话题: count

未名新帖统计// 7月16日

#	版面	帖数(主题数)
-	全站	4871 (796)
1	Military	3777 (569)
2	Stock	341 (51)
3	Joke	117 (17)
4	History	116 (3)
5	Automobile	100 (9)
6	USANews	55 (9)
7	Midlife	45 (1)
8	Headline	41 (41)
9	Dreamer	33 (13)
10	FleaMarket	32 (20)
11	Living	30 (7)

* 这里只显示发帖超过25的版面，努力灌水吧:-)