由买买提看人间百态

boards

本页内容为未名空间相应帖子的节选和存档,一周内的贴子最多显示50字,超过一周显示500字 访问原贴
JobHunting版 - 问道大数据的题
相关主题
hash_map 的遍历问题一道巨常见的题
算法题简单map reduce mean median, 傻逼回答
谁来解释下hashtable的iterator是怎么实现的#面试题#有100个database,每个存1 million data,如何求出median number of 这些数。
hashtable的遍历要去google onsite的同学们
报个G的电面我发现我竟然学会了12种tree traversal的办法
小弟求问LinkedIn那道Deep Iterator的题问一道题目
再问一道题请问怎样写没有parent pointer的BST iterator?
median of N^2 numbers across N machines一道大数据题
相关话题的讨论汇总
话题: each话题: master话题: elements话题: than话题: server
进入JobHunting版参与讨论
1 (共1页)
m********a
发帖数: 128
1
一共有N个机器,每个机器上有N个数。每个机器最多存O(N)个数并对它们操作。如何找
到N^2个数的中数(median)?
这道题大家有什么好的解法吗?
多谢了!
l*********u
发帖数: 19053
2
每个机器sort N个数。
找第1个机器的N(0),第2个机器的N(1)。。。第N个机器的N(N-1)的median?

【在 m********a 的大作中提到】
: 一共有N个机器,每个机器上有N个数。每个机器最多存O(N)个数并对它们操作。如何找
: 到N^2个数的中数(median)?
: 这道题大家有什么好的解法吗?
: 多谢了!

I*******g
发帖数: 7600
3
你在这里装傻? 还是...?

【在 l*********u 的大作中提到】
: 每个机器sort N个数。
: 找第1个机器的N(0),第2个机器的N(1)。。。第N个机器的N(N-1)的median?

l*********u
发帖数: 19053
4
俺理解错了? :)不是找所有N的median?

【在 I*******g 的大作中提到】
: 你在这里装傻? 还是...?
g*****g
发帖数: 34805
5
Practically, I would put the numbers in buckets, say 10 buckets that divides
Integer range, count the numbers in each bucket and figure out which bucket
has the median number. Redistribute the numbers, repeat the process until
the bucket has less than N number and median is in it. This works best if
you know the range and distribution.
m********a
发帖数: 128
6
Thanks! thinking about the buckets, how do you guarantee the range is well
determined such that the numbers are evenly distributed (since each machine
only stores and operates N number)?

divides
bucket

【在 g*****g 的大作中提到】
: Practically, I would put the numbers in buckets, say 10 buckets that divides
: Integer range, count the numbers in each bucket and figure out which bucket
: has the median number. Redistribute the numbers, repeat the process until
: the bucket has less than N number and median is in it. This works best if
: you know the range and distribution.

g*****g
发帖数: 34805
7
It will be fast if you know the range and distribution, but it still works
if you don't.
Every iteration you end up with fewer numbers and a narrower range from a
single bucket.
And you can always rebalance numbers on each machine. All you need to
remember is an offset to the median number.

machine

【在 m********a 的大作中提到】
: Thanks! thinking about the buckets, how do you guarantee the range is well
: determined such that the numbers are evenly distributed (since each machine
: only stores and operates N number)?
:
: divides
: bucket

j**********3
发帖数: 3211
8
mark
O********e
发帖数: 48
9
A generalized solution:
Suppose you have a master node (or are able to use a consensus protocol to
elect a master from among your servers). The master first queries the
servers for the size of their sets of data, call this n, so that it knows to
look for the k = n/2 largest element.
The master then selects a random server and queries it for a random element
from the elements on that server. The master broadcasts this element to
each server, and each server partitions its elements into those larger than
or equal to the broadcasted element and those smaller than the broadcasted
element.
Each server returns to the master the size of the larger-than partition,
call this m. If the sum of these sizes is greater than k, the master
indicates to each server to disregard the less-than set for the remainder of
the algorithm. If it is less than k, then the master indicates to
disregard the larger-than sets and updates k = k - m. If it is exactly k,
the algorithm terminates and the value returned is the pivot selected at the
beginning of the iteration.
If the algorithm does not terminate, recurse beginning with selecting a new
random pivot from the remaining elements.
Analysis:
Let n be the total number of elements and s be the number of servers.
Assume that the elements are roughly randomly and evenly distributed among
servers (each server has O(n/s) elements). In iteration i, we expect to do
about O(n/(s*2^i)) work on each server, as the size of each servers element
sets will be approximately cut in half (remember, we assumed roughly random
distribution of elements) and O(s) work on the master (for broadcasting/
receiving messages and adding the sizes together). We expect O(log(n/s))
iterations. Adding these up over all iterations gives an expected runtime
of O(n/s + slog(n/s)), and assuming s << sqrt(n) which is normally the case,
this becomes simply (O(n/s)), which is the best you could possibly hope for.
Note also that this works not just for finding the median but also for
finding the kth largest value for any value of k.
g*****g
发帖数: 34805
10
Since the node has O(n) storage, I would use n bucket rather than 2 to speed
up the process.

to
element
than

【在 O********e 的大作中提到】
: A generalized solution:
: Suppose you have a master node (or are able to use a consensus protocol to
: elect a master from among your servers). The master first queries the
: servers for the size of their sets of data, call this n, so that it knows to
: look for the k = n/2 largest element.
: The master then selects a random server and queries it for a random element
: from the elements on that server. The master broadcasts this element to
: each server, and each server partitions its elements into those larger than
: or equal to the broadcasted element and those smaller than the broadcasted
: element.

h**********c
发帖数: 4120
11
老丘体
l*********o
发帖数: 3091
12
每台机器各自对自己的N个数排序。O(n^2logn)
设2N个pointer,分别指向每台机器的最小最大数。
用2个priority queue分别存这2N个pointer的最小最大数.
每次pop一个最小的,一个最大的。O(n^2logn)
最后的一个数就是median.

【在 m********a 的大作中提到】
: 一共有N个机器,每个机器上有N个数。每个机器最多存O(N)个数并对它们操作。如何找
: 到N^2个数的中数(median)?
: 这道题大家有什么好的解法吗?
: 多谢了!

m********a
发帖数: 128
13
oldoldtree and liangmaomao, your solutions seem both require an additional
machine to coordinate the solution. If there is no such additional node?
Then there will be lots of communication among the N nodes.

【在 l*********o 的大作中提到】
: 每台机器各自对自己的N个数排序。O(n^2logn)
: 设2N个pointer,分别指向每台机器的最小最大数。
: 用2个priority queue分别存这2N个pointer的最小最大数.
: 每次pop一个最小的,一个最大的。O(n^2logn)
: 最后的一个数就是median.

1 (共1页)
进入JobHunting版参与讨论
相关主题
一道大数据题报个G的电面
L家的高频题merge k sorted arrays giving iterators求讨论!小弟求问LinkedIn那道Deep Iterator的题
reverse an array再问一道题
implement hash tablemedian of N^2 numbers across N machines
hash_map 的遍历问题一道巨常见的题
算法题简单map reduce mean median, 傻逼回答
谁来解释下hashtable的iterator是怎么实现的#面试题#有100个database,每个存1 million data,如何求出median number of 这些数。
hashtable的遍历要去google onsite的同学们
相关话题的讨论汇总
话题: each话题: master话题: elements话题: than话题: server