问一道面试题 - Programming版 - 未名存档

本页内容为未名空间相应帖子的节选和存档，一周内的贴子最多显示50字，超过一周显示500字访问原贴

Programming版 - 问一道面试题

相关主题
● URL questions	● 问一道HIVE题关于Efficiency
● 请教一个vignette vcm问题 (转载)	● How to design a database model related to six degree of se (转载)
● Using EXCEL VBA to download information from website	● 一般操作很多的数据用什么数据结构？
● 贡献一下：本版上搜集的 Google 面试题 (转载)	● 有谁能讲讲Cassandra secondary index的？
● 百度面试题，any idea?	● Why Avoid array indexing. Use pointers.
● 请教一个语言选择的弱问题	● php+apache timeout 的问题
● Random Switch Between Two Different URLs	● SQL server optimal query design (转载)
● tinyurl 是怎么做到同一个long url两次得到相同的short url	● 同主题阅读工具

相关话题的讨论汇总
话题: contents话题: query话题: urls话题: indexes

进入Programming版参与讨论

1

(共1页)

g*****u 发帖数: 298	1 In our indexes, we have millions of URLs each of which has a link to some page contents, that is, URL->contents. Now, suppose a user types a query with wild cards , which represent 0 or multiple occurrences of any characters, how do you build the indexes such that such a type of query can be executed efficiently by finding all corresponding URLs->contents efficiently. For example, given a query http://www.oveou.com. You need to find iloveyou.com, itveabcu.com, etc. I got stuck for this probl
c*****t 发帖数: 1879	2 What kind of job is it for? The key would certainly be URL, and value is content. There are plenty of key-value pair databases. So it is not a problem. As for the indexing, I don't think that there exists one for regex or such. So, essentially it is a full table scan of the keys. There could be some optimizations in accessing the B-Tree of keys (like if the prefix is known, then skip a range of B-Tree nodes), but don't think it is that critical. It can be speed up by parition the table across 【在 g****u 的大作中提到】 : In our indexes, we have millions of URLs each of which has a link to some : page contents, that is, URL->contents. Now, suppose a user types a query : with wild cards , which represent 0 or multiple occurrences of any : characters, how do you build the indexes such that such a type of query can : be executed efficiently by finding all corresponding URLs->contents : efficiently. For example, given a query http://www.ove*ou.com. You need to find iloveyou.com, itveabcu.com, etc. I got stuck for this probl
w***g 发帖数: 5958	3 这个问题没什么好的解法。给你一个www..com，结果就有那么多，什么算法都快不了。 can 【在 g***u 的大作中提到】 : In our indexes, we have millions of URLs each of which has a link to some : page contents, that is, URL->contents. Now, suppose a user types a query : with wild cards , which represent 0 or multiple occurrences of any : characters, how do you build the indexes such that such a type of query can : be executed efficiently by finding all corresponding URLs->contents : efficiently. For example, given a query http://www.ove*ou.com. You need to find iloveyou.com, itveabcu.com, etc. I got stuck for this probl
g*****u 发帖数: 298	4 谢谢大家。这是mS SDE的面试题，我从网上抄来的。看了一下，大概提到的有k-gram，suffix tree什么的。

1

(共1页)

进入Programming版参与讨论

相关主题
● 同主题阅读工具	● 百度面试题，any idea?
● cgi测试newbee问题	● 请教一个语言选择的弱问题
● 求救：javascript程序运行中的一个error message	● Random Switch Between Two Different URLs
● 问个技术问题: 怎样把其他网站的搜索结果显示在自己的网站上	● tinyurl 是怎么做到同一个long url两次得到相同的short url
● URL questions	● 问一道HIVE题关于Efficiency
● 请教一个vignette vcm问题 (转载)	● How to design a database model related to six degree of se (转载)
● Using EXCEL VBA to download information from website	● 一般操作很多的数据用什么数据结构？
● 贡献一下：本版上搜集的 Google 面试题 (转载)	● 有谁能讲讲Cassandra secondary index的？

相关话题的讨论汇总
话题: contents话题: query话题: urls话题: indexes

未名新帖统计// 7月16日

#	版面	帖数(主题数)
-	全站	4871 (796)
1	Military	3777 (569)
2	Stock	341 (51)
3	Joke	117 (17)
4	History	116 (3)
5	Automobile	100 (9)
6	USANews	55 (9)
7	Midlife	45 (1)
8	Headline	41 (41)
9	Dreamer	33 (13)
10	FleaMarket	32 (20)
11	Living	30 (7)

* 这里只显示发帖超过25的版面，努力灌水吧:-)