Roundup of Lucene discussions - 话题女王

c******n
Posts: 4965

From thread: Java board - anybody doing Lucene/Solr?

Thanks, I didn't realize the link shows differently. Here it is:
########################################################
I'm new to Lucene and search engines, and have been struggling with these
questions recently.
I'd appreciate it a lot if you could shed some light on this.
Let's say I do a query on
dog greyhound
Note that I did not quote them, i.e. this is not a phrase search.
What happens under the hood?
Which term does Lucene use to look up the inverted index?
I read somewhere that Lucene ...

L*******r
Posts: 1011

From thread: DotNet board - .Net search engine - Lucene.Net

http://sourceforge.net/projects/lucenedotnet/
Lucene.Net is a complete, up-to-date .NET port of Jakarta Lucene, a
high-performance, full-featured text search engine written entirely in Java. See
http://jakarta.apache.org/lucene for more info on Jakarta Lucene.

i***c
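Posts: 301

From thread: Programming board - Question for everyone: can an index built by Nutch (Lucene) be searched with Lucene.Net?

Searching an index generated by Lucene.Net works fine.
But the index crawled by Nutch seems to have a different structure. How can I search it with Lucene.Net?
Or is Lucene.Net's index version too old?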

g********g
Posts: 2172

From thread: StartUp board - Nutch vs Lucene

Lucene is an indexing engine only. Nutch is a web crawler; the crawled results
are indexed with Lucene. So they are different products. Indeed uses
Lucene as its index engine but built its own crawler. Nutch is a general-
purpose search engine crawler. It is too much work to modify it into a
vertical search engine crawler.

n**a
Posts: 12

From thread: JobMarket board - Amazon.com is looking for experienced engineers with MapReduce/Hadoop/Lucene

Hello,
Amazon.com is looking for experienced engineers with MapReduce/Hadoop/Lucene,
distributed and scalable systems background. Please send your resumes to
n******[email protected]
Many positions open, location: Seattle, WA
Job description: SDE
Software Dev Engineer, Product Ads
Product Ads is a high-profile, strategic business unit, with support and
interest from all parts of Amazon and top management. We are a highly
motivated, collaborative and fun-loving team building a high growth business.
...

I*****y
Posts: 6402

From thread: StartUp board - So how to install Lucene?

I would like to try building a search engine for a particular topic using
Lucene and Nutch.
I've installed Java, Tomcat 6 and Ant on my testing server http://208.64.71.46:8080/
However, I have no idea how to install Lucene. Does anyone know? Please teach me
a little bit. Thanks.
PS: I use CentOS 5.2, by the way.

K*Q
Posts: 1001

From thread: StartUp board - So how to install Lucene?

Oh, Nutch itself includes Lucene.
You do not need to install Lucene again.

n***5
Posts: 86

From thread: CS board - I want to build a search engine; is Lucene up to it?

Lucene is easy to get started with.
http://www.lucenetutorial.com/lucene-in-5-minutes.html

g**********y
Posts: 14569

From thread: Java board - anybody doing Lucene/Solr?

Sorry, I just use Lucene as a search engine in our product. I didn't dive
into how it works.
I did read some documents and code from the Lucene project out of curiosity. My
impression is: it is a C-style Java program, painful to read and use.
Maybe you can directly contact the developers for technical details.

t*********e
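Posts: 630

From thread: Java board - Exact matching in Lucene

In Google you can use "" to run an exact-match search. For example,
"Somebody that I used to know" finds every document that contains this sentence.
Lucene's PhraseQuery can do the same thing. I'm curious how the index behind
this feature is built. Normally a document is tokenized, and at indexing time
each term gets position information pointing to the documents it appears in,
so keyword search is easy.
Is anyone familiar with how an index is built for exact matching of long
phrases, or even whole sentences? Lucene has it all done already; I'm just
curious how it is implemented.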
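
Editor's sketch for the question above: one common way to support exact phrase matching is to record each term's positions in the postings, then check that the query terms occur at consecutive positions in the same document. The Python below is a toy illustration of that idea under stated assumptions (whitespace tokenization, lowercase folding, in-memory dictionaries, and the hypothetical names `build_positional_index` and `phrase_search`). It is not Lucene's actual implementation.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} using whitespace tokenization."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted doc ids where the terms of `phrase` occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents that contain every term can match the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in candidates:
        # A start position p matches if term i occurs at p + i for every i.
        starts = set(index[terms[0]][doc_id])
        for offset, t in enumerate(terms[1:], start=1):
            starts &= {p - offset for p in index[t][doc_id]}
        if starts:
            hits.append(doc_id)
    return sorted(hits)


docs = {
    1: "somebody that I used to know",
    2: "I used to know somebody",
    3: "know that somebody used to",
}
index = build_positional_index(docs)
print(phrase_search(index, "somebody that I used to know"))
print(phrase_search(index, "used to know"))
```

No separate index of whole sentences is needed in this scheme: the per-term positions are enough to verify any phrase length at query time.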

b******y
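Posts: 9224

From thread: Java board - Exact matching in Lucene

Yes, I have read through the Lucene source code, and I built my own search engine. I think reading the source is the best
way to learn. If something is unclear, you can ask directly on the Lucene mailing list.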

z****e
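Posts: 54598

From thread: Java board - Exact matching in Lucene

WildFly is an implementation of the JEE standard; its main developer is Red Hat.
Lucene is an Apache project; its main developer is Apache.
They are different things. Lucene is not a standard JEE component and has no inherent connection to JEE.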

x****d
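Posts: 1766

From thread: Java board - Exact matching in Lucene

I'm too lazy to look it up. Does the OP know whether Lucene's PhraseQuery is implemented in Solr?
How do you configure it?
It's definitely not CommonGrams...
Lucene is quite a hassle in actual use; customers' search requirements can go beyond anything you'd imagine.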

I*****y
Posts: 6402

From thread: StartUp board - Nutch vs Lucene

If I want to build a site like indeed.com or iloveOPT's myvisajobs.com, which search engine is better?
It seems indeed.com has disclosed that it uses Lucene.

b******y
Posts: 9224

From thread: StartUp board - Nutch vs Lucene

ya, indeed.com uses lucene

z********s
Posts: 22

From thread: StartUp board - Nutch vs Lucene

Nutch is a crawler based on Lucene.
Here is my search engine based on Nutch.
http://malachi.thechristianlife.com/
It works pretty well.
Here is a tutorial I wrote.
http://peterpuwang.googlepages.com/NutchGuideForDummies.htm
Hope it helps.

w****n
Posts: 48

From thread: StartUp board - Nutch vs Lucene

Enterprise search engine: Solr, which is based on Lucene.
Good crawler: Heritrix.
So far these are the best tools for building a search engine. Many commercial sites,
including some big companies, use the two in combination.

K*Q
Posts: 1001

From thread: StartUp board - So how to install Lucene?

I use Solr, which includes Lucene.

I*****y
Posts: 6402

From thread: StartUp board - So how to install Lucene?

interesting, I heard solr is better than lucene, right?

K*Q
Posts: 1001

From thread: StartUp board - So how to install Lucene?

Well, Solr is based on Lucene but provides more functionality.
I think Solr is a great tool for creating a vertical search engine.

M****o
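Posts: 117

From thread: BuildingWeb board - I want to build a search engine; is Lucene up to it?

Looking for advice.
The purpose is to search intranet resources, about 200,000 text documents. I heard of Lucene when I worked in China, but I have never
built anything with it myself. I've already posted on the CS board, but I'm asking again here on your board. :)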

M****o
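Posts: 117

From thread: BuildingWeb board - I want to build a search engine; is Lucene up to it?

Is Lucene the best choice? How does it compare with other open-source search engines? Could the veterans here give some detailed pointers?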

m*****k
Posts: 1864

From thread: BuildingWeb board - I want to build a search engine; is Lucene up to it?

Lucene is probably the so-called industry standard among open-source search engines.

M****o
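Posts: 117

From thread: CS board - I want to build a search engine; is Lucene up to it?

Looking for advice.
The purpose is to search intranet resources, about 200,000 text documents. I heard of Lucene when I worked in China, but I have never
built anything with it myself.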

x**y
Posts: 10012

From thread: CS board - I want to build a search engine; is Lucene up to it?

http://lucene.apache.org/nutch/tutorial8.html

c*****s
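Posts: 214

From thread: Java board - Anybody here used apache Lucene?

There are examples at http://jakarta.apache.org/lucene/docs/demo.html
I once got a program working with it; it is simple and also flexible.
As with search in general, you build an index first and then search. You can search anything, as long as you provide the code that reads it.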

a**i
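Posts: 289

From thread: Java board - Urgent! How do I edit Lucene with Eclipse

Hi all, after I opened and compiled Lucene in Eclipse, I
cannot run TestSearch.class. The error is "cannot find
main type". Can anyone tell me what is going on?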

b******y
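Posts: 9224

From thread: Java board - Has anyone used Lucene for Chinese search?

Lucene's Chinese search is not particularly good; the word segmentation is weak. Still, it is usable if you can make do.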

t*g
Posts: 1758

From thread: Java board - Another Lucene question

I need to dump the tokens out of a Lucene index, possibly a lot of data. How do I do that? Should I use Term?
I'm a newbie. Thanks!

t*******e
Posts: 684

From thread: Java board - Another Lucene question

Depending on how index files are created in the first place, Lucene may
store a full copy of the original text to be indexed, such that you can
restore the text from the query results. Otherwise, you only get other
fields like IDs from the Hit Documents.

t*g
Posts: 1758

From thread: Java board - Another Lucene question

We did store the original text. I don't have problems dumping the
original text; I can dump it through the Hit Documents. However, what I
need is to dump the tokenized text, which doesn't exist in the Hit Documents.
It looks like I need to go into the indices to get the tokenized documents. But I'm
new to Lucene and can't find a way to do it. Need help! Thx.

b******y
Posts: 9224

From thread: Java board - Another Lucene question

You will need to store the terms in lucene index. But, I don't see why you
want to do that.
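
Editor's sketch for this thread: the replies distinguish between the stored original text and the indexed terms. The toy Python index below keeps the two separate and rebuilds a document's token stream from positional postings. `TinyIndex`, `add`, and `dump_tokens` are hypothetical names, and whitespace tokenization with lowercasing stands in for a real analyzer. Lucene's own API differs; this only illustrates the concept.

```python
from collections import defaultdict


class TinyIndex:
    """Toy index: optional stored text plus positional postings."""

    def __init__(self):
        self.stored = {}                    # doc_id -> original text, if stored
        self.postings = defaultdict(dict)   # term -> {doc_id: [positions]}

    def add(self, doc_id, text, store=True):
        if store:
            self.stored[doc_id] = text
        # The analyzer here is an assumption: lowercase + whitespace split.
        for pos, term in enumerate(text.lower().split()):
            self.postings[term].setdefault(doc_id, []).append(pos)

    def dump_tokens(self, doc_id):
        """Rebuild a document's token stream by scanning every term's postings."""
        slots = {}
        for term, docs in self.postings.items():
            for pos in docs.get(doc_id, []):
                slots[pos] = term
        return [slots[p] for p in sorted(slots)]


idx = TinyIndex()
idx.add(1, "The quick brown Fox", store=False)
idx.add(2, "Lazy dog", store=True)
print(idx.stored.get(1))    # original text was not stored for doc 1
print(idx.dump_tokens(1))   # tokens are still recoverable from the postings
```

Scanning every term's postings is slow on a large index, which is one reason to keep a per-document list of terms at indexing time when tokens must be dumped in bulk.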

t*******e
Posts: 684

From thread: Java board - Still a Lucene question

Luke is a Java program, so it should run on Linux. In general, you are not
supposed to copy or move index files. Having Lucene re-index the
documents to a network drive is a better approach.

q***n
Posts: 37

From thread: Java board - Still a Lucene question

luke should have some command line right?
Worst case, why not use java on linux and just roll your own code to look at
lucene fields.

c******n
Posts: 4965

From thread: Java board - anybody doing Lucene/Solr?

I'm new, so I posted the following question on the mailing list but
haven't got an answer. Maybe someone here could help? Thanks!
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201104.mbox/browser

i**e
Posts: 6810

From thread: Java board - anybody doing Lucene/Solr?

I don't know much about the internals of Lucene.
With Solr, it's possible to specify the default
operator as OR or AND. I think you were talking
more about the OR case. Optionally, when AND
gives you a very small number of results,
you could fall back to OR to enrich the results.
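
Editor's sketch of the suggestion above: treat an unquoted multi-term query as AND first, and fall back to OR when AND returns too few hits. This is a toy Python illustration on an in-memory inverted index. The names `build_index` and `search` and the `min_hits` threshold are assumptions made for the example, not Solr or Lucene settings.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query, min_hits=2):
    """Try AND across all terms; fall back to OR if AND yields too few hits."""
    terms = query.lower().split()
    if not terms:
        return [], "AND"
    # Every query term is looked up in the inverted index, not just one.
    postings = [index.get(t, set()) for t in terms]
    and_hits = set.intersection(*postings)
    if len(and_hits) >= min_hits:
        return sorted(and_hits), "AND"
    return sorted(set.union(*postings)), "OR"


docs = {
    1: "greyhound dog racing",
    2: "dog food",
    3: "greyhound bus",
    4: "cat food",
}
index = build_index(docs)
print(search(index, "dog greyhound"))
print(search(index, "food"))
```

In a real engine the OR results would also be ranked so that documents matching more of the terms score higher. That ranking step is left out here.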

t*********e
Posts: 630

From thread: Java board - Exact matching in Lucene

WildFly uses Undertow. What does it have to do with Lucene?

w***g
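Posts: 5958

From thread: Programming board - Are search tools like Lucene no longer popular?

I just looked into full-text search. Lucene/Solr is still mainstream.
It's just a small wheel; spend a day reading up on it and you're set.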

n******g
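Posts: 2201

From thread: Programming board - Are search tools like Lucene no longer popular?

.. What a gap. As a postdoc switching careers, I would have to grind at it for two weeks.
[Quoting wdong (万事休):]
:I just looked into full-text search. Lucene/Solr is still mainstream.
:It's just a small wheel; spend a day reading up on it and you're set.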

s*********y
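Posts: 6151

From thread: Programming board - Are search tools like Lucene no longer popular?

When was a lab product like Lucene ever popular?
Solr is decent, but for the past few years it has been trailing behind Elasticsearch, copying this and that.
Elasticsearch is the real deal.
Site search, enterprise search, in-app search, data analytics, data
visualization, backend query layer: it can handle all of them.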

w***g
Posts: 5958

From thread: Programming board - Are search tools like Lucene no longer popular?

But Solr and Elasticsearch both clearly use Lucene.

n******g
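Posts: 2201

From thread: Programming board - Are search tools like Lucene no longer popular?

Good idea! Over this long weekend I plan to stuff some CSV into Lucene/Solr and then do a bit of analytics.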

c*****1
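Posts: 3240

From thread: History board - [Collection] Savoring the Tongjian (脍炙《通鉴》)

☆─────────────────────────────────────☆
kzeng (寱语·无味赛百味) on (Sun Sep 23 01:21:31 2012, US Eastern) wrote:
(This is a blog post about a very dry technology, a very dry historical text, and some not-so-dry statistics.)
I read an article on word-frequency statistics for the Quan Song Ci (《全宋词》), which was quite fun, and wanted to process the Zizhi Tongjian
(《资治通鉴》) in a similar way, so I spent a few hours over the weekend doing it.
Ci poetry consists of lines of uneven length, so counting the frequency of two-character words suits it. The Tongjian is classical prose with a different textual structure, so
I counted frequencies of single characters and of two-, three-, four-, and five-character words. I also recorded
the volumes in which each unit (character or word) appears. The Tongjian has 294 volumes and spans
1,362 years in total, from the Partition of Jin to the end of the Five Dynasties, so the volume number can serve as a measure of time.
The Quan Song Ci word frequencies were done in R. R is good statistical software and one of my favorites, but
R is not well suited to text analysis, and even less to database operations. So I used C# and Kdb+ 3.0. C#
is used to analyze the text: .NET is a blessing for lazy people, and multithreaded computation is very simple, which greatly speeds up text
processing. Kdb+ is used to store the data; it is just about the best-performing in-memor...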
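
Editor's sketch of the counting scheme described in this post: count every run of 1 to 5 consecutive characters and record which volumes each run appears in. The original used C# and Kdb+; the Python below is only a minimal stand-in. The name `ngram_stats` is hypothetical, the sample strings are invented, and the sketch assumes punctuation has already been stripped from each volume.

```python
from collections import Counter, defaultdict


def ngram_stats(volumes, max_n=5):
    """Count character n-grams (n = 1..max_n) and the volumes they occur in.

    `volumes` is a list of strings, one per volume; volume numbers start at 1.
    """
    counts = Counter()
    seen_in = defaultdict(set)
    for vol_no, text in enumerate(volumes, start=1):
        for n in range(1, max_n + 1):
            # Slide a window of width n over the volume's text.
            for i in range(len(text) - n + 1):
                gram = text[i:i + n]
                counts[gram] += 1
                seen_in[gram].add(vol_no)
    return counts, seen_in


# Two made-up "volumes" for illustration only.
volumes = ["天下大乱", "天下太平"]
counts, seen_in = ngram_stats(volumes)
print(counts["天下"], sorted(seen_in["天下"]))
print(counts["大乱"], sorted(seen_in["大乱"]))
```

Because volumes follow chronological order, the sorted volume numbers for an n-gram give a rough timeline of where it occurs.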

a*****o
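Posts: 209

From thread: History board - Savoring the Tongjian (脍炙《通鉴》)

A very interesting experiment.
For the word-frequency analysis, the OP could try Lucene (http://lucene.apache.org/core/), a very mature open-source full-text search system. When it processes text it builds an inverted index, which makes text retrieval far more efficient than any method based on database queries. It also builds indexes very fast; its home page says "over 95GB/hour on modern hardware". The OP said the Tongjian has a bit over two million characters, so the full text is around 10 MB. If you split it by chapter into separate documents during preprocessing, building the index with Lucene should be very quick.
Once the index is built, word-frequency analysis and other more complex analyses can be done once and for all, either
through the Lucene API (e.g., http://lucene.apache.org/core/old_versioned_docs/versions/3_0_2/api/all/org/apache/lucene/index/TermDocs.html#freq()) or through index inspection tools such as Luke (http://code.google.com/p/luke/).
Lucene can easily be extended to handle Chinese; the Chinese Academy of Sciences...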

t*********e
Posts: 630

From thread: Java board - 15 high-impact Apache projects

1. Cassandra
The Cassandra database serves as a "scalable system of record" in the big
data world, says Jonathan Ellis, vice president of the Cassandra project.
Apache received the project from Facebook, which open-sourced Cassandra in
2008. Whereas Hadoop undertakes data analysis, Cassandra provides a data
store for applications, often highly scalable ones on the Web. Netflix, for
example, runs many Cassandra clusters, Ellis says.
2. Cordova
Giving Apache prominence in mobile computing, Cordova...
