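I*****y Posts: 6402 | 1 If I want to build a site like indeed.com or myvisajobs.com (the one from iloveOPT), which search engine
is the better choice? indeed.com has apparently said publicly that it uses Lucene. |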
b******y Posts: 9224 | 2 Yes, indeed.com uses Lucene. |
g********g Posts: 2172 | 3 Lucene is only an indexing engine. Nutch is a web crawler, and the pages it
crawls are indexed with Lucene, so they are different products. Indeed uses
Lucene as its index engine but built its own crawler. Nutch is a general-
purpose search engine crawler. It is too much work to modify it into a
vertical search engine crawler.
[Quoting I*****y] : If I want to build a site like indeed.com or myvisajobs.com (the one from iloveOPT), which search engine : is the better choice? indeed.com has apparently said publicly that it uses Lucene.
|
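To illustrate the split described in post 3 (the crawler produces documents, the index engine makes them searchable), here is a minimal Python sketch. It is not Lucene or Nutch code: `InvertedIndex`, `tokenize`, and `fake_crawl` are hypothetical names, and the two job postings are made-up sample data standing in for real crawler output.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Toy stand-in for an index engine: maps each term to the documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return the ids of documents that contain every term in the query."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


def fake_crawl():
    """Hypothetical crawler output: (document id, extracted text) pairs."""
    yield "job/1", "Java developer, H1B sponsorship available"
    yield "job/2", "Python developer, remote"


# The indexer only consumes documents; it does not care how they were fetched.
index = InvertedIndex()
for doc_id, text in fake_crawl():
    index.add(doc_id, text)
```

Because the index only sees `(id, text)` pairs, the crawler can be swapped out (a custom vertical crawler instead of a general-purpose one) without touching the search side, which is the arrangement post 3 attributes to Indeed.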
b******y Posts: 9224 | 4 Good write-up.
Nutch is not good at all for a production environment. It is fine for
experimenting with.
To build a truly scalable crawler for a vertical market, you have to write it
yourself. |
z********s Posts: 22 | 6 Nutch is a crawler built on Lucene.
Here is my search engine based on Nutch:
http://malachi.thechristianlife.com/
It works pretty well.
Here is a tutorial I wrote:
http://peterpuwang.googlepages.com/NutchGuideForDummies.htm
Hope it helps. |
I*****y Posts: 6402 | 7 Thanks, Peter.
[Quoting z********s] : Nutch is a crawler built on Lucene. : Here is my search engine based on Nutch: : http://malachi.thechristianlife.com/ : It works pretty well. : Here is a tutorial I wrote: : http://peterpuwang.googlepages.com/NutchGuideForDummies.htm : Hope it helps.
|
b******y Posts: 9224 | 8 Good, thanks for the info. |
w****n Posts: 48 | 9 Enterprise search engine: Solr, which is based on Lucene.
Good crawler: Heritrix.
So far these are the best tools for building a search engine. Many commercial
sites, including some big companies, use the two in combination. |
b******y Posts: 9224 | 10 Unfortunately, when building a search engine, the crawler is the hardest part.
Search is relatively easy.
You get all sorts of badly formed HTML pages and all sorts of badly behaved websites
to handle... |
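As a small illustration of the messy-HTML problem raised in post 10, here is a hedged Python sketch of link extraction that tolerates unclosed tags, mixed-case markup, and non-HTTP links. It uses only the standard library (`html.parser.HTMLParser`, `urllib.parse.urljoin`, `urllib.parse.urldefrag`); `LinkExtractor`, `extract_links`, and the example.com URLs are hypothetical names chosen for illustration, and a production crawler would need far more (robots.txt, politeness, retries, encoding detection).

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute http(s) links from possibly malformed HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop the #fragment part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value.strip()))
                # Skip javascript:, mailto:, and other non-HTTP schemes, plus duplicates.
                if absolute.startswith(("http://", "https://")) and absolute not in self.links:
                    self.links.append(absolute)


def extract_links(base_url, html):
    """Return the de-duplicated absolute links found in html, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample page with uppercase tags, single quotes, and unclosed elements.
messy_html = (
    '<html><body><A HREF="/jobs/1">one'
    "<a href='jobs/2#top'>two</A>"
    '<a href="javascript:void(0)">x</a>'
    '<a href="/jobs/1">dup<p>unclosed'
)
found = extract_links("http://example.com/list/", messy_html)
```

The parser here never raises on broken markup; it simply reports whatever start tags it can recognize, which is usually the behavior a crawler wants.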
g********g Posts: 2172 | 11 Another option is to use data from Yahoo or Alexa. Otherwise, unless you are targeting a narrow domain,
you will not even be able to afford the crawler's bandwidth costs.
[Quoting w****n] : Enterprise search engine: Solr, which is based on Lucene. : Good crawler: Heritrix. : So far these are the best tools for building a search engine. Many commercial : sites, including some big companies, use the two in combination.
|