JobHunting board - median of N^2 numbers across N machines
a**********k
Posts: 1953
1
Each of N machines has an array of N numbers. How do we calculate the median of the N^2 numbers efficiently?
c*****e
Posts: 74
2
I was asked this by an Indian interviewer at Google Mountain View 2.5 years ago. I remember we discussed the communication cost between machines and the available memory and disk on each machine, but I couldn't give a convincing answer then.

【Quoting a**********k】
: Each of the N machines has an array of N numbers, how to calculate
: the median of the N^2 numbers efficiently?

a**********k
Posts: 1953
3
The more general form would be: the median of M*N numbers across M machines, each having N numbers. When M=2 there is a well-known solution (see the sketch after this post). How do we generalize it to any integer M?

【Quoting c*****e】
: I was asked by an Indian interviewer at Google Mountain View 2.5 years ago.
: I remembered we discussed communication cost between machines, available
: memory on each machine and available disk on each machine, but I couldn't
: give a convincing answer then.
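
For reference, the well-known M=2 case: the median of two sorted arrays can be found without merging them, by discarding about half of the remaining candidates at each step. A minimal sketch, assuming both arrays are already sorted (the function names are mine, and the list slicing trades extra copying for readability):

def kth_of_two(a, b, k):
    """Return the k-th smallest (0-indexed) element of two sorted lists."""
    if not a:
        return b[k]
    if not b:
        return a[k]
    if k == 0:
        return min(a[0], b[0])
    # Probe about k/2 elements into each list; the side with the smaller
    # probe cannot hold the k-th element in its discarded prefix.
    i = min(len(a), (k + 1) // 2)
    j = min(len(b), (k + 1) // 2)
    if a[i - 1] <= b[j - 1]:
        return kth_of_two(a[i:], b, k - i)
    return kth_of_two(a, b[j:], k - j)

def median_of_two(a, b):
    """Median of the union of two sorted lists."""
    n = len(a) + len(b)
    if n % 2:
        return kth_of_two(a, b, n // 2)
    return (kth_of_two(a, b, n // 2 - 1) + kth_of_two(a, b, n // 2)) / 2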

b********h
Posts: 119
4
If you can use MapReduce, you can partition the numbers into several buckets (in the mappers) and count how many numbers fall into each bucket. Then you know which bucket to look in for the median. If that bucket is small enough, we can pull it into memory and find the median there; otherwise, run another MapReduce job on it.
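
A single-process sketch of that idea; the bucket boundaries, names, and the split between the two phases are illustrative, not any particular MapReduce API. The map side turns raw numbers into per-bucket counts, and the reduce side walks the merged counts to locate the median's bucket:

from bisect import bisect_right
from collections import Counter

def map_bucket_counts(numbers, boundaries):
    """Map side: bucket each number and emit per-bucket counts.
    boundaries is sorted; bucket b holds x with boundaries[b-1] <= x < boundaries[b]."""
    return Counter(bisect_right(boundaries, x) for x in numbers)

def median_bucket(merged_counts, total):
    """Reduce side: walk cumulative counts to the bucket holding the median."""
    target = total // 2              # 0-indexed rank of the (upper) median
    seen = 0
    for b in sorted(merged_counts):
        if seen + merged_counts[b] > target:
            return b, target - seen  # bucket id, rank within that bucket
        seen += merged_counts[b]

If the returned bucket fits in memory, one machine pulls its numbers and picks the in-bucket rank directly; otherwise, repeat on that bucket alone, as discussed later in the thread.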
a**********k
Posts: 1953
5
If the numbers have been distributed beforehand, then each bucket will itself be spread across the N machines, and the problem reduces to the original one. Also, the MapReduce infrastructure itself has a cost.

【Quoting b********h】
: If you can use MapReduce, you could partition the numbers into several
: buckets (Mapper) and count how many numbers are in each bucket. Then, you
: know which bucket to look for the median. If the bucket is small enough, we
: could put it into memory and find the median. Otherwise, do another
: mapreduce job.

j********x
Posts: 2330
6
This is a good question; there is real room for discussion here,
unlike those idiotic, boring problems...
b********h
Posts: 119
7
No. After the map phase, every bucket ends up on one machine; that is exactly what the shuffle between map and reduce does.
And yes, the largest cost is the shuffle between map and reduce.

【Quoting a**********k】
: If the numbers have been distributed beforehand, then each bucket will
: be distributed among N machines, then it is reduced to the original problem.
: Also, MapReduce infrastructure itself has a cost.

c*****e
Posts: 74
8
I agree. This seems to be a good way to decrease the communication cost and disk I/O (though I didn't know this two years ago).

【Quoting b********h】
: If you can use MapReduce, you could partition the numbers into several
: buckets (Mapper) and count how many numbers are in each bucket. Then, you
: know which bucket to look for the median. If the bucket is small enough, we
: could put it into memory and find the median. Otherwise, do another
: mapreduce job.

a**********k
Posts: 1953
9
So the map task runs on each machine, and the data is shuffled so that each bucket ends up on one machine. The reduce task on each machine then does the counting and reports it to one master machine, which calculates which machine holds the bucket containing the median, and asks that machine to finish up the rest of the work.

【Quoting b********h】
: No. After the map phase, every bucket will be on one machine. This is due to
: the shuffling between map and reduce.
: Yes, the largest cost is the shuffling between map and reduce.

j*****u
Posts: 1133
10
Actually, MapReduce doesn't reduce the communication cost or the disk I/O. Each M/R phase requires m*r connections and intermediate files (for m map tasks and r reduce tasks; with a thousand of each, that is already a million).

【Quoting c*****e】
: I agree. This seems to be a good way to decrease the communication cost and
: disk I/O (but didn't know this two years ago).

c*****e
Posts: 74
11
Actually, this is a special Map/Reduce scenario. Every machine can pass its distribution statistics over the buckets (which are small compared to the original data) to one machine, which can then determine which bucket the median is in.

【Quoting j*****u】
: actually map reduce doesn't reduce the communication cost nor the disk I/O.
: Each M/R phase requires m*r connections/intermediate files.

a**********k
Posts: 1953
12
Actually, we can do some optimization over classic Map/Reduce: leave the buckets distributed across the N machines as they are; each machine counts the numbers in each of its buckets and passes the counts to a master machine, which figures out which bucket contains the median. Then we merge the data in that particular bucket onto one machine to finish the remaining work. This way we don't need to shuffle the data in all the buckets as in classic MR, which, as someone pointed out earlier, carries the most overhead; we only merge the data in one bucket.

【Quoting c*****e】
: Actually, this is a special Map/Reduce scenario. All the machine can pass
: its distribution (over each bucket) statistic (which is small compared to
: the original data) to one machine and then it can be determined that which
: bucket the median is in.
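
A sketch of that no-shuffle scheme, with the machines faked as in-memory lists and the RPC layer elided; all names and the boundary choice are mine. Each machine ships only a small histogram, and the only bulk transfer is the single winning bucket:

from bisect import bisect_right
from collections import Counter

def local_histogram(numbers, boundaries):
    """Runs on each machine: count its own numbers per bucket; no records move."""
    return Counter(bisect_right(boundaries, x) for x in numbers)

def master_median(machines, boundaries, rank):
    """Master: merge the small per-machine histograms, locate the bucket
    holding `rank`, then pull only that bucket for an exact selection."""
    total = Counter()
    for m in machines:
        total += local_histogram(m, boundaries)   # tiny counter messages
    seen = 0
    for b in sorted(total):
        if seen + total[b] > rank:
            # The one bulk transfer: merge just this bucket across machines.
            merged = sorted(x for m in machines for x in m
                            if bisect_right(boundaries, x) == b)
            return merged[rank - seen]
        seen += total[b]

For N machines with N numbers each and B buckets, rank = N*N // 2 gives the upper median, and the counter traffic is O(N*B) versus O(N^2) records for a full shuffle.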

j*j
Posts: 5564
13
good
【Quoting a**********k】
: Actually we can do some optimization over the classic Map/Reduce:
: Let all buckets distributed on N machines as is, each machine counts
: the numbers in all buckets, and pass the result to a master machine,
: which calculate and figure out which bucket has the median, then,
: we need merge the data in that particular bucket to one machine
: to finish the remaining job. In this case, we don't need to shuffle the data
: in all the buckets as in classic MR, which has the most overhead as
: some one pointed out earlier. We only need to merge data in one bucket.

c*****e
Posts: 74
14
bladesmith suggested repeating the first step if the bucket is still too large.

【Quoting a**********k】
: Actually we can do some optimization over the classic Map/Reduce:
: Let all buckets distributed on N machines as is, each machine counts
: the numbers in all buckets, and pass the result to a master machine,
: which calculate and figure out which bucket has the median, then,
: we need merge the data in that particular bucket to one machine
: to finish the remaining job. In this case, we don't need to shuffle the data
: in all the buckets as in classic MR, which has the most overhead as
: some one pointed out earlier. We only need to merge data in one bucket.

a**********k
Posts: 1953
15
Yes, if the master finds the bucket is still too large to fit in memory, it can ask each machine to divide that particular bucket into more sub-buckets and count each sub-bucket locally. This way the only cross-machine communication overhead is collecting the counters (not shuffling the data records around), plus the final merge: a huge saving in communication bandwidth as well as time and space.

【Quoting c*****e】
: bladesmith suggested to repeat the first step if the bucket is still large.
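
Putting the recursion together, a sketch of the narrowing loop; the fan-out, the memory threshold, and the equal-width sub-buckets are illustrative choices, and the machines are again faked as lists:

from bisect import bisect_right
from collections import Counter

def distributed_select(machines, lo, hi, rank, fanout=64, mem_limit=10**6):
    """Return the rank-th smallest value in [lo, hi] across all machines,
    shipping only counters until the surviving range fits in memory."""
    while True:
        in_range = [[x for x in m if lo <= x <= hi] for m in machines]
        if sum(len(r) for r in in_range) <= mem_limit:
            return sorted(x for r in in_range for x in r)[rank]
        if lo == hi:                    # every surviving value is equal
            return lo
        step = (hi - lo) / fanout       # equal-width sub-buckets
        boundaries = [lo + step * i for i in range(1, fanout)]
        counts = Counter()
        for r in in_range:              # the per-machine counting step
            counts += Counter(bisect_right(boundaries, x) for x in r)
        seen = 0
        for b in range(fanout):
            if seen + counts[b] > rank: # rank falls inside sub-bucket b
                lo = boundaries[b - 1] if b > 0 else lo
                hi = boundaries[b] if b < fanout - 1 else hi
                rank -= seen
                break
            seen += counts[b]

Each round ships only fanout counters per machine and shrinks the candidate value range by roughly a factor of fanout, so only a few rounds are needed before the final merge.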

b********h
Posts: 119
16
True. This is how people do k-means with MapReduce: instead of passing all the objects that belong to one cluster to the reducers, just pass some statistics.

【Quoting c*****e】
: Actually, this is a special Map/Reduce scenario. All the machine can pass
: its distribution (over each bucket) statistic (which is small compared to
: the original data) to one machine and then it can be determined that which
: bucket the median is in.
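
The same trick in the k-means setting, as a minimal sketch (all names are mine): each mapper emits only per-cluster coordinate sums and counts instead of the member points, and the reducer recomputes centroids from those statistics alone.

def map_cluster_stats(points, centroids):
    """Map side: assign each point to its nearest centroid, but emit only
    per-cluster coordinate sums and counts, never the points themselves."""
    stats = {}                          # cluster id -> [coordinate sums, count]
    for p in points:
        c = min(range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
        s = stats.setdefault(c, [[0.0] * len(p), 0])
        s[0] = [a + b for a, b in zip(s[0], p)]
        s[1] += 1
    return stats

def reduce_centroids(per_mapper_stats, old_centroids):
    """Reduce side: merge the small statistics and recompute each centroid."""
    merged = {}
    for stats in per_mapper_stats:
        for c, (sums, n) in stats.items():
            m = merged.setdefault(c, [[0.0] * len(sums), 0])
            m[0] = [a + b for a, b in zip(m[0], sums)]
            m[1] += n
    # Clusters that attracted no points keep their old centroid.
    return [[x / merged[c][1] for x in merged[c][0]] if c in merged else old
            for c, old in enumerate(old_centroids)]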
