c******n posts: 4965 | 1 Thanks, I didn't realize the link shows up differently. Here it is:
########################################################
I'm new to Lucene and search engines, and have been struggling with these
questions recently.
I'd appreciate it a lot if you could shed some light on this.
Let's say I run a query on
dog greyhound
Note that I did not quote them, i.e. this is not a phrase search.
What happens under the hood?
Which term does Lucene use to look up the inverted index?
I read somewhere that Lucene ... |
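A minimal sketch of what the question is getting at, assuming a toy in-memory inverted index rather than Lucene's actual data structures. An unquoted query such as `dog greyhound` is split into separate terms (Lucene's classic query parser combines them with OR by default), each term is looked up in the index on its own, and the postings are merged. The names `build_index` and `search_or`, the sample documents, and the match-count score are all hypothetical; real relevance scoring is more involved.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_or(index, query):
    """Look up every query term separately and merge the postings (OR).

    The score is just the number of query terms a document matches,
    a stand-in for real relevance scoring.
    """
    scores = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Made-up sample documents.
docs = {
    1: "the greyhound is a fast dog",
    2: "my dog sleeps all day",
    3: "greyhound bus schedule",
    4: "cats only",
}
index = build_index(docs)
print(search_or(index, "dog greyhound"))
```

So no single term is "the" lookup key: both terms hit the index, and a document that matches both simply ranks above one that matches only one.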
|
|
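i***c posts: 301 | 3 An index generated by lucene.net can be searched without problems,
but an index crawled by Nutch seems to have a different structure. How can I search it with lucene.net?
Or is the lucene.net index version too old? |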
|
g********g posts: 2172 | 4 Lucene is an indexing engine only. Nutch is a web crawler. The crawled results
are indexed with Lucene, so they are different products. Indeed uses
Lucene as the index engine but built their own crawler. Nutch is a general-
purpose search engine crawler. It is too much work to modify it into a
vertical search engine crawler. |
|
n**a posts: 12 | 5 Hello,
Amazon.com is looking for experienced engineers with a MapReduce/Hadoop/Lucene
and distributed, scalable systems background. Please send your resumes to
n******[email protected]
Many positions open. Location: Seattle, WA
Job description: SDE
Software Dev Engineer, Product Ads
Product Ads is a high-profile, strategic business unit, with support and
interest from all parts of Amazon and top management. We are a highly
motivated, collaborative and fun-loving team building a high-growth business
. ... |
|
I*****y posts: 6402 | 6 I would like to try building a search engine for a particular topic using
Lucene and Nutch.
I've installed Java, Tomcat 6 and Ant on my testing server http://208.64.71.46:8080/
However, I have no idea how to install Lucene. Does anyone know? Please give me
a few pointers. Thanks.
PS: I use CentOS 5.2, by the way. |
|
K*Q posts: 1001 | 7 Oh, Nutch itself includes Lucene.
You do not need to install Lucene again.
me |
|
|
g**********y posts: 14569 | 9 Sorry, I just use Lucene as a search engine in our product. I didn't dive
into how it works.
I did read some documents and code from the Lucene project out of curiosity. My
impression is that it is a C-style Java program, painful to read and use.
Maybe you can contact the developers directly for technical details. |
|
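t*********e posts: 630 | 10 In Google you can use double quotes ("") for an exact-match search. For example,
"Somebody that I used to know" finds every document that contains that sentence.
Lucene's PhraseQuery does the same thing. I'm curious how the index behind this feature is built.
Normally a document is tokenized, and at index time each term is stored with position information
pointing to the documents in which it appears, so keyword search is easy.
Is anyone familiar with how an index is built for exact matching of long phrases, or even whole
sentences? Lucene already handles this, and I'm curious how it is implemented. |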
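A minimal sketch of the idea behind phrase matching, assuming a toy positional index and not Lucene's real implementation. No separate "phrase index" is needed: if the postings for each term record the positions at which it occurs in each document, a phrase matches wherever its terms appear at consecutive positions. The names `build_positional_index` and `phrase_search` and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} for every token in every document."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Return ids of documents where the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents that contain every term can match.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        later = [set(index[t][doc_id]) for t in terms[1:]]
        # A start position matches if term i+1 sits exactly i+1 slots after it.
        if any(all(start + i + 1 in later[i] for i in range(len(later)))
               for start in index[terms[0]][doc_id]):
            hits.append(doc_id)
    return hits

# Made-up sample documents.
docs = {
    1: "somebody that i used to know",
    2: "i used to know somebody",
    3: "know that somebody used i to",
}
index = build_positional_index(docs)
print(phrase_search(index, "Somebody that I used to know"))
print(phrase_search(index, "used to know"))
```

Document 3 contains every word of the phrase but in the wrong order, so the position check rejects it; that check is the only difference from a plain keyword search.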
|
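b******y posts: 9224 | 11 Right, I have read through the entire Lucene source, and built my own search engine too. I think reading the source code is the best
way to learn. If something is unclear, you can ask directly on the Lucene mailing list. |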
|
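z****e posts: 54598 | 12 WildFly is an implementation of the JEE standard; its main developer is Red Hat.
Lucene is an Apache project; its main developer is Apache.
They are different things. Lucene is not a standard JEE component and has no inherent connection to JEE. |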
|
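x****d posts: 1766 | 13 I'm too lazy to look it up. Does the OP know whether Lucene's PhraseQuery is implemented in Solr?
How do you configure it?
It's definitely not CommonGrams...
Lucene is quite a hassle in real use; customers' search requirements can go beyond anything you'd imagine. |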
|
I*****y posts: 6402 | 18 If I want to build a site like indeed.com or iloveOPT's myvisajobs.com, which search engine is
better? Apparently indeed.com has said publicly that it uses Lucene. |
|
b******y posts: 9224 | 19 Yes, indeed.com uses Lucene. |
|
|
w****n posts: 48 | 21 Enterprise search engine: Solr, which is based on Lucene.
Good crawler: Heritrix.
So far these are the best tools for building a search engine. Many commercial sites,
including some big companies, use the two in combination. |
|
K*Q posts: 1001 | 22 I use Solr, which includes Lucene.
me |
|
I*****y posts: 6402 | 23 Interesting. I heard Solr is better than Lucene, right? |
|
K*Q posts: 1001 | 24 Well, Solr is based on Lucene but provides more functionality.
I think Solr is a great tool for building a vertical search engine. |
|
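M****o posts: 117 | 25 Looking for advice.
The use case is searching intranet resources, about 200,000 text documents. I heard of Lucene when I was working in China, but have never used it
myself. I already posted this on the CS board, but I'm asking again here. :) |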
|
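M****o posts: 117 | 26 Is Lucene the best choice? How does it compare with other open-source search engines? Could the experienced folks here give some detailed guidance? |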
|
m*****k posts: 1864 | 27 Lucene is probably the so-called industry standard among open-source search engines. |
|
|
|
a**i posts: 289 | 31 Hi all, after opening Lucene in Eclipse and compiling it, I
can't run TestSearch.class. The error is "cannot find
main type". Can anyone tell me what is going on? |
|
b******y posts: 9224 | 32 Lucene's Chinese search is not particularly good; the word segmentation is weak. It is usable in a pinch, though. |
|
t*g posts: 1758 | 33 I need to dump the tokens out of a Lucene index, possibly a lot of data. How do I do that? Should I use Term?
I'm a beginner. Thanks! |
|
t*******e posts: 684 | 34 Depending on how the index files were created in the first place, Lucene may
store a full copy of the original text being indexed, so that you can
restore the text from the query results. Otherwise, you only get other
fields, such as IDs, from the Hit Documents. |
|
t*g posts: 1758 | 35 We did store the original text. I don't have problems dumping the
original text; I can dump it through the Hit Documents. However, what I
need is to dump the tokenized text, which doesn't exist in the Hit Documents.
It looks like I need to go into the indices to get the tokenized documents. But I'm
new to Lucene and can't find a way to do it. Need help! Thanks. |
|
b******y posts: 9224 | 36 You will need to store the terms in the Lucene index. But I don't see why you
would want to do that. |
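A minimal sketch of the distinction in this exchange, assuming a toy positional index and a made-up analyzer rather than Lucene's API. A stored field gives back the original text, while the tokens exist only in inverted form (term -> documents and positions), so dumping them means either walking the term dictionary or un-inverting the postings for one document. The names `analyze`, `dump_terms` and `tokens_for_doc` are hypothetical; in Lucene itself, features such as term vectors keep per-document terms so this reconstruction is not needed.

```python
from collections import defaultdict

STOPWORDS = {"the", "a", "is"}  # hypothetical analyzer settings

def analyze(text):
    """Toy analyzer: lowercase, split on whitespace, drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

def build_index(docs):
    """Map term -> {doc_id: [positions]} using the analyzed tokens."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(analyze(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def dump_terms(index):
    """List every indexed term with its document frequency."""
    return [(term, len(index[term])) for term in sorted(index)]

def tokens_for_doc(index, doc_id):
    """Rebuild one document's token stream by un-inverting the postings."""
    slots = {}
    for term, postings in index.items():
        for pos in postings.get(doc_id, ()):
            slots[pos] = term
    return [slots[p] for p in sorted(slots)]

# Made-up sample documents.
docs = {1: "Lucene is a search library", 2: "The search index"}
index = build_index(docs)
print(dump_terms(index))
print(tokens_for_doc(index, 1))
```

The rebuilt stream for document 1 lacks "is" and "a": the tokenized text is what the analyzer emitted, not the stored original, which is why it cannot be read back from the hit documents.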
|
t*******e posts: 684 | 37 Luke is a Java program, so it should run on Linux. In general, you are not
supposed to copy or move index files. Having Lucene re-index the
documents to a network drive is a better approach. |
|
q***n posts: 37 | 38 Luke should have some kind of command line, right?
Worst case, why not use Java on Linux and just roll your own code to look at
the Lucene fields? |
|
|
i**e posts: 6810 | 40 I don't know much about the internals of Lucene.
With Solr, it's possible to specify the default
operator as OR or AND. I think you were talking
more about the OR case. It is optional: when
AND gives you a very small number of results,
you could run an OR query to enrich the results. |
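A minimal sketch of the AND-then-OR idea described above, assuming a toy index given as a plain dict of term -> set of document ids. The function name `search_with_fallback`, the `min_hits` threshold, and the sample postings are hypothetical; this illustrates the strategy, not how Solr implements its default operator.

```python
def search_with_fallback(index, query, min_hits=2):
    """Require all terms (AND); if too few documents match, widen to any term (OR)."""
    terms = query.lower().split()
    postings = [index.get(t, set()) for t in terms]
    if not postings:
        return []
    and_hits = set.intersection(*postings)
    if len(and_hits) >= min_hits:
        return sorted(and_hits)
    or_hits = set.union(*postings)
    # Keep the strict matches first, then add the looser ones.
    return sorted(and_hits) + sorted(or_hits - and_hits)

# Made-up postings: term -> ids of documents containing it.
index = {
    "dog": {1, 2, 5},
    "greyhound": {1, 3},
    "racing": {1, 5},
}
print(search_with_fallback(index, "dog racing"))
print(search_with_fallback(index, "dog greyhound"))
```

Listing the AND matches ahead of the OR-only matches keeps the most specific results on top even after the query has been widened.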
|
t*********e posts: 630 | 41 WildFly uses Undertow. What does it have to do with Lucene? |
|
w***g posts: 5958 | 43 I just looked into full-text search. Lucene/Solr is still the mainstream.
It's just a small off-the-shelf component; a day of reading is enough. |
|
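n******g posts: 2201 | 44 ...What a gap. As a postdoc switching careers, I'd need two weeks to chew through it.
[Quoting wdong (万事休):]
:I just looked into full-text search. Lucene/Solr is still the mainstream.
:It's just a small off-the-shelf component; a day of reading is enough. |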
|
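s*********y posts: 6151 | 45 When was a lab product like Lucene ever popular?
Solr is all right, but in recent years it has been trailing behind Elasticsearch, copying this and that.
Elasticsearch is the real deal.
Site search, enterprise search, in-app search, data analytics, data
visualization, backend query layer: it can handle all of them. |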
|
w***g posts: 5958 | 46 But Solr and Elasticsearch both use Lucene. |
|
n******g posts: 2201 | 47 Good idea! This long weekend I plan to load some CSV into Lucene/Solr and then do a bit of analytics. |
|
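c*****1 posts: 3240 | 48 ☆─────────────────────────────────────☆
kzeng (寱语·无味赛百味) wrote on (Sun Sep 23 01:21:31 2012, US Eastern):
(This is a blog post about a very dry technique, a very dry historical text, and some not-so-dry statistics.)
I once read an article on word-frequency statistics for the Quan Song Ci (Complete Song Ci Poetry) and found it interesting. I wanted to process the Zizhi Tongjian
in a similar way, so I spent a few hours on it over the weekend.
Ci poems are written in lines of varying length, so counting two-character words suits them. The Tongjian is classical prose with a different structure, so
I counted frequencies for single characters and for two-, three-, four-, and five-character words. I also recorded
the volumes in which each unit (character or word) appears. The Tongjian has 294 volumes and spans
1,362 years, from the Partition of Jin to the end of the Five Dynasties, so the volume number can serve as a measure of time.
The Quan Song Ci frequencies were computed in R. R is good statistical software and one of my favorites, but
it is not well suited to text analysis, and even less to database operations. So I used C# and Kdb+ 3.0. C#
analyzes the text; .NET is a blessing for the lazy, and multithreading is very simple and greatly speeds up text
processing. Kdb+ stores the data; it is about the best-performing in-memor... |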
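A minimal sketch of the counting step described in this post, written in Python rather than the C# and Kdb+ the author used. It counts character n-grams for n = 1 to 5 and records the volumes in which each one appears. The function name `ngram_stats` and the two sample strings are hypothetical placeholders, not text from the Tongjian, and punctuation handling is left out.

```python
from collections import Counter, defaultdict

def ngram_stats(volumes, max_n=5):
    """Count character n-grams (n = 1..max_n) and record the volumes each appears in.

    `volumes` maps a volume number to that volume's text.
    """
    counts = Counter()
    seen_in = defaultdict(set)
    for vol, text in volumes.items():
        for n in range(1, max_n + 1):
            # Slide a window of width n over the text.
            for i in range(len(text) - n + 1):
                gram = text[i:i + n]
                counts[gram] += 1
                seen_in[gram].add(vol)
    return counts, seen_in

# Placeholder strings standing in for two volumes.
volumes = {1: "天下大乱天下", 2: "天下太平"}
counts, seen_in = ngram_stats(volumes)
print(counts["天下"], sorted(seen_in["天下"]))
```

Because the volumes are in chronological order, the set of volume numbers recorded for an n-gram gives a rough picture of when it is in use, which is the point of tracking it alongside the raw count.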
|
|
t*********e posts: 630 | 50 1. Cassandra
The Cassandra database serves as a "scalable system of record" in the big
data world, says Jonathan Ellis, vice president of the Cassandra project.
Apache received the project from Facebook, which open-sourced Cassandra in
2008. Whereas Hadoop undertakes data analysis, Cassandra provides a data
store for applications, often highly scalable ones on the Web. Netflix, for
example, runs many Cassandra clusters, Ellis says.
2. Cordova
Giving Apache prominence in mobile computing, Cordova... |
|