由买买提看人间百态

boards

本页内容为未名空间相应帖子的节选和存档,一周内的贴子最多显示50字,超过一周显示500字 访问原贴
Programming版 - 问一道面试题
相关主题
URL questions问一道HIVE题 关于Efficiency
请教一个vignette vcm问题 (转载)How to design a database model related to six degree of se (转载)
Using EXCEL VBA to download information from website一般操作很多的数据用什么数据结构?
贡献一下:本版上搜集的 Google 面试题 (转载)有谁能讲讲Cassandra secondary index的?
百度面试题,any idea?Why Avoid array indexing. Use pointers.
请教一个语言选择的弱问题php+apache timeout 的问题
Random Switch Between Two Different URLsSQL server optimal query design (转载)
tinyurl 是怎么做到同一个long url两次得到相同的short url同主题阅读工具
相关话题的讨论汇总
话题: contents话题: query话题: urls话题: indexes
进入Programming版参与讨论
1 (共1页)
g*****u
发帖数: 298
1
In our indexes, we have millions of URLs each of which has a link to some
page contents, that is, URL->contents. Now, suppose a user types a query
with wild cards *, which represent 0 or multiple occurrences of any
characters, how do you build the indexes such that such a type of query can
be executed efficiently by finding all corresponding URLs->contents
efficiently. For example, given a query http://www.*o*ve*ou.com. You need to find iloveyou.com, itveabcu.com, etc. I got stuck for this probl
c*****t
发帖数: 1879
2
What kind of job is it for?
The key would certainly be URL, and value is content. There are
plenty of key-value pair databases. So it is not a problem.
As for the indexing, I don't think that there exists one for regex
or such. So, essentially it is a full table scan of the keys.
There could be some optimizations in accessing the B-Tree of keys
(like if the prefix is known, then skip a range of B-Tree nodes),
but don't think it is that critical.
It can be speed up by parition the table across

【在 g*****u 的大作中提到】
: In our indexes, we have millions of URLs each of which has a link to some
: page contents, that is, URL->contents. Now, suppose a user types a query
: with wild cards *, which represent 0 or multiple occurrences of any
: characters, how do you build the indexes such that such a type of query can
: be executed efficiently by finding all corresponding URLs->contents
: efficiently. For example, given a query http://www.*o*ve*ou.com. You need to find iloveyou.com, itveabcu.com, etc. I got stuck for this probl

w***g
发帖数: 5958
3
这个问题没什么好的解法。给你一个www.*.com,结果就有那么多,什么算法都快不了。

can

【在 g*****u 的大作中提到】
: In our indexes, we have millions of URLs each of which has a link to some
: page contents, that is, URL->contents. Now, suppose a user types a query
: with wild cards *, which represent 0 or multiple occurrences of any
: characters, how do you build the indexes such that such a type of query can
: be executed efficiently by finding all corresponding URLs->contents
: efficiently. For example, given a query http://www.*o*ve*ou.com. You need to find iloveyou.com, itveabcu.com, etc. I got stuck for this probl

g*****u
发帖数: 298
4
谢谢大家。这是mS SDE的面试题,我从网上抄来的。
看了一下,大概提到的有k-gram,suffix tree什么的。
1 (共1页)
进入Programming版参与讨论
相关主题
同主题阅读工具百度面试题,any idea?
cgi测试newbee问题请教一个语言选择的弱问题
求救:javascript程序运行中的一个error messageRandom Switch Between Two Different URLs
问个技术问题: 怎样把其他网站的搜索结果显示在自己的网站上tinyurl 是怎么做到同一个long url两次得到相同的short url
URL questions问一道HIVE题 关于Efficiency
请教一个vignette vcm问题 (转载)How to design a database model related to six degree of se (转载)
Using EXCEL VBA to download information from website一般操作很多的数据用什么数据结构?
贡献一下:本版上搜集的 Google 面试题 (转载)有谁能讲讲Cassandra secondary index的?
相关话题的讨论汇总
话题: contents话题: query话题: urls话题: indexes