How to find 10 most frequent strings in 10 billion string list? - JobHunting版 - 未名存档

本页内容为未名空间相应帖子的节选和存档，一周内的贴子最多显示50字，超过一周显示500字访问原贴

JobHunting版 - How to find 10 most frequent strings in 10 billion string list?

相关主题
● How to convert string to string array (or vector) (转载)	● 问个google老题的最佳解法
● split String into multiple lines	● Interleave Strings那个题目有O(n)时间　O(1)空间算法么？
● 问个题	● 一道电面题，分享下，这个题应该用哪几个data structure?
● 专家们，find the longest common substring of two strings	● Permutation leetcode-
● [合集] 微软面试题一道	● HackerRank find string..
● 请教一个phone interview 问题	● 请问我写的这个代码哪可以改进一下
● 问道看到的面试题	● 问一道题(8)
● Riverbed 面经	● String list如何排序

相关话题的讨论汇总
话题: 10话题: strings话题: string话题: frequent话题: find

进入JobHunting版参与讨论

1

(共1页)

S****h 发帖数: 558	1 It is a small bay area IT firm. They asked first how to find 10 most frequent strings from 10 billion (string, frequency) pairs. Easy, use heap O(10^9*log10). Then they asked the second question. How to find 10 most frequent strings in 10 billion string list? Not much idea. I answered to sweep once to build (string,frequency) pair and then sweep again to find the first 10. Is there any better way?
D*******e 发帖数: 151	2 split the strings into buckets, e.g., from a to z 26 ones count them separately then merge the results
D*******e 发帖数: 151	3 guarantee the splitting is exclusive
f*******t 发帖数: 7549	4 以前有帖讨论过。可以用bloom filter把只出现一次的string去掉，通过的用堆存，算是一种实际应用中比较有效的办法吧。不知道最佳答案是什么
d*******d 发帖数: 2050	5 bloom filter的具体用法能说说么? 【在 f*******t 的大作中提到】 : 以前有帖讨论过。可以用bloom filter把只出现一次的string去掉，通过的用堆存，算 : 是一种实际应用中比较有效的办法吧。 : 不知道最佳答案是什么
f*******t 发帖数: 7549	6 http://en.wikipedia.org/wiki/Bloom_filter 非常节省空间的验证数据是否出现过的数据结构。很简单，就是几个hashtable，每个 key只占1个bit，用来标记是否出现过。会有false positive，也就是说你检验一个数据，在几个hashtable里对应的bit都为1 ，但实际上它没出现过。而出现过的绝对不会漏判。【在 d*******d 的大作中提到】 : bloom filter的具体用法能说说么?
d*******d 发帖数: 2050	7 这个我明白.如何用它来过滤掉只出现一次的呢? 这和直接用一个hashmap来count的区别在哪里呢? 1 【在 f*******t 的大作中提到】 : http://en.wikipedia.org/wiki/Bloom_filter : 非常节省空间的验证数据是否出现过的数据结构。很简单，就是几个hashtable，每个 : key只占1个bit，用来标记是否出现过。 : 会有false positive，也就是说你检验一个数据，在几个hashtable里对应的bit都为1 : ，但实际上它没出现过。 : 而出现过的绝对不会漏判。

1

(共1页)

进入JobHunting版参与讨论

相关主题
● String list如何排序	● [合集] 微软面试题一道
● 判断两个Strings是否相差一个Edit distance	● 请教一个phone interview 问题
● 问一体G onsite，经常出现	● 问道看到的面试题
● G onsite题求讨论	● Riverbed 面经
● How to convert string to string array (or vector) (转载)	● 问个google老题的最佳解法
● split String into multiple lines	● Interleave Strings那个题目有O(n)时间　O(1)空间算法么？
● 问个题	● 一道电面题，分享下，这个题应该用哪几个data structure?
● 专家们，find the longest common substring of two strings	● Permutation leetcode-

相关话题的讨论汇总
话题: 10话题: strings话题: string话题: frequent话题: find

未名新帖统计// 7月16日

#	版面	帖数(主题数)
-	全站	4871 (796)
1	Military	3777 (569)
2	Stock	341 (51)
3	Joke	117 (17)
4	History	116 (3)
5	Automobile	100 (9)
6	USANews	55 (9)
7	Midlife	45 (1)
8	Headline	41 (41)
9	Dreamer	33 (13)
10	FleaMarket	32 (20)
11	Living	30 (7)

* 这里只显示发帖超过25的版面，努力灌水吧:-)