All topics - Topic: lucene
c******n
Posts: 4965
1
From thread: Java board - anybody doing Lucene/Solr?
Thanks, I didn't realize the link shows up differently... here it is:
########################################################
I'm new to Lucene and search engines, and have been struggling with these
questions recently.
I'd appreciate it a lot if you could shed some light on this.
Let's say I do a query on
dog greyhound
Note that I did not quote them, i.e. this is not a phrase search.
What happens under the hood?
Which term does Lucene use to look up the inverted index?
I read somewhere that Lucene ...
L*******r
Posts: 1011
2
From thread: DotNet board - .Net search engine - Lucene.Net
http://sourceforge.net/projects/lucenedotnet/
Lucene.Net is a complete, up-to-date .NET port of Jakarta Lucene, a
high-performance, full-featured text search engine written entirely in Java. See
http://jakarta.apache.org/lucene for more info on Jakarta Lucene.
i***c
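Posts: 301
3
An index generated by Lucene.Net can be searched without problems.
But an index crawled by Nutch seems to have a different structure. How can I search it with Lucene.Net?
Or is the Lucene.Net index version too old?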
g********g
Posts: 2172
4
From thread: StartUp board - Nutch vs Lucene
Lucene is an index engine only. Nutch is a web crawler. The crawled results
are indexed with Lucene. So they are different products. Indeed used
Lucene as the index engine but built their own crawler. Nutch is a general-
purpose search engine crawler. It is too much work to modify it into a
vertical search engine crawler.
n**a
Posts: 12
5
Hello,
Amazon.com is looking for experienced engineers with MapReduce/Hadoop/Lucene,
distributed and scalable systems background. Please send your resumes to
n******[email protected]
Many positions open, location- Seattle, WA
Job description: SDE
Software Dev Engineer, Product Ads
Product Ads is a high-profile, strategic business unit, with support and
interest from all parts of Amazon and top management. We are a highly
motivated, collaborative and fun-loving team building a high growth business.
...
I*****y
Posts: 6402
6
From thread: StartUp board - So how to install Lucene?
I would like to try building a search engine for a particular topic using
Lucene and Nutch.
I've installed Java, Tomcat 6 and Ant on my testing server http://208.64.71.46:8080/
However, I have no idea how to install Lucene. Does anyone know? Please teach me
a little bit. Thanks.
PS: I use CentOS 5.2, by the way.
K*Q
Posts: 1001
7
From thread: StartUp board - So how to install Lucene?
oh, nutch itself includes lucene
you do not need to install lucene again
n***5
Posts: 86
g**********y
Posts: 14569
9
From thread: Java board - anybody doing Lucene/Solr?
Sorry, I just use Lucene as a search engine in our product. I didn't dive
into how it works.
I did read some documents and code from the Lucene project out of curiosity. My
impression is: it is a C-style Java program, painful to read and use.
Maybe you can directly contact the developers for technical details.
t*********e
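Posts: 630
10
From thread: Java board - Exact matching in Lucene
In Google you can use double quotes to run an exact-match search. For example,
"Somebody that I used to know" finds every document that contains that sentence.
In Lucene, PhraseQuery does this. I'm curious how the index behind this feature
is built. Normally a document is tokenized, and at index time each term gets
postings pointing to the documents it appears in, so keyword search is easy.
Is anyone familiar with how an index is built for exact matching of long phrases,
or even whole sentences? Lucene already has this done; I'm just curious how it is
implemented.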
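One common answer to the question above is a positional inverted index: each posting records not only the document but also the positions where the term occurs, and a phrase matches when its terms sit at consecutive positions. The sketch below is a toy in plain Python, not Lucene's code or API. The names `build_positional_index` and `phrase_search` are made up for illustration, and tokenization is assumed to be a simple lowercase whitespace split.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]} using a lowercase whitespace split."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return sorted doc ids where the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents that contain every term can possibly match.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in candidates:
        later = [set(index[t][doc_id]) for t in terms[1:]]
        # The phrase starts at p if the i-th following term sits at p + i + 1.
        if any(all(p + i + 1 in s for i, s in enumerate(later))
               for p in index[terms[0]][doc_id]):
            hits.append(doc_id)
    return sorted(hits)


docs = {
    1: "Somebody that I used to know",
    2: "I used to know somebody that",
    3: "know that somebody used I to",
}
index = build_positional_index(docs)
print(phrase_search(index, "used to know"))                  # [1, 2]
print(phrase_search(index, "Somebody that I used to know"))  # [1]
```

Document 3 contains every word but never in adjacent order, so it is filtered out by the position check even though it survives the per-term intersection.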
b******y
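Posts: 9224
11
From thread: Java board - Exact matching in Lucene
Right, I have read through the Lucene source code and built my own search engine. I think
reading the source is the best way to learn. If something is unclear, you can ask directly on the Lucene mailing list.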
z****e
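Posts: 54598
12
From thread: Java board - Exact matching in Lucene
WildFly is an implementation of the JEE standard, developed mainly by Red Hat.
Lucene is an Apache project, developed mainly by Apache.
They are different things. Lucene is not a standard JEE component and has no inherent connection to JEE.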
x****d
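Posts: 1766
13
From thread: Java board - Exact matching in Lucene
I can't be bothered to look it up. Does the OP know whether Lucene's PhraseQuery is implemented in Solr?
How do you configure it?
It's definitely not CommonGrams...
Lucene is quite a hassle in real use; customers' search requirements can go beyond anything you'd imagine.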
I*****y
Posts: 6402
18
From thread: StartUp board - Nutch vs Lucene
If I want to build a site like indeed.com or iloveOPT's myvisajobs.com, which search engine is
better? It seems indeed.com has said publicly that it uses Lucene.
b******y
Posts: 9224
19
From thread: StartUp board - Nutch vs Lucene
ya, indeed.com uses lucene
z********s
Posts: 22
20
From thread: StartUp board - Nutch vs Lucene
nutch is a crawler based on lucene.
here is my search engine based on nutch.
http://malachi.thechristianlife.com/
it works pretty well.
here is a tutorial I wrote.
http://peterpuwang.googlepages.com/NutchGuideForDummies.htm
hope it helps.
w****n
Posts: 48
21
From thread: StartUp board - Nutch vs Lucene
Enterprise search engine: solr, based on lucene.
Good crawler: heritrix.
So far these are the best tools to build a search engine. Many commercial sites,
including some big companies, use the two in combination.
K*Q
Posts: 1001
22
From thread: StartUp board - So how to install Lucene?
I use solr which includes lucene
I*****y
Posts: 6402
23
From thread: StartUp board - So how to install Lucene?
interesting, I heard solr is better than lucene, right?
K*Q
Posts: 1001
24
From thread: StartUp board - So how to install Lucene?
well, solr is based on lucene but provides more functionalities
I think solr is a great tool to create vertical search
M****o
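Posts: 117
25
From thread: BuildingWeb board - Want to build a search engine; will Lucene do?
Looking for advice.
The purpose is to search intranet resources, about 200,000 text documents. I heard of Lucene
when I worked in China but have never used it myself. I already posted on the CS board, but am asking again here. :)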
M****o
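Posts: 117
26
From thread: BuildingWeb board - Want to build a search engine; will Lucene do?
Is Lucene the best choice? How does it compare with other open-source search engines? Could the experienced folks here give some detailed pointers?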
m*****k
Posts: 1864
27
From thread: BuildingWeb board - Want to build a search engine; will Lucene do?
Lucene is arguably the so-called industry standard among open-source search engines.
M****o
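Posts: 117
28
Looking for advice.
The purpose is to search intranet resources, about 200,000 text documents. I heard of Lucene when I worked in China but have never used it myself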
x**y
Posts: 10012
c*****s
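Posts: 214
30
From thread: Java board - Anybody here used apache Lucene?

There are examples at http://jakarta.apache.org/lucene/docs/demo.html
I once got a program working with it; it is simple and flexible.
As with any search, you build the index first and then search. You can search anything, as long as you provide a program that reads it.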
a**i
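Posts: 289
31
From thread: Java board - Urgent! How to edit Lucene with Eclipse
Hi all, after I opened and compiled Lucene in Eclipse, I
cannot run TestSearch.class. The error is "cannot find
main type". Could anyone tell me what is going on?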
b******y
Posts: 9224
32
Lucene's Chinese search is not particularly good; the word segmentation is weak. It is usable in a pinch, though.
t*g
Posts: 1758
33
From thread: Java board - Another Lucene question
I need to dump the tokens out of a Lucene index, possibly a lot of data. How do I do that? Do I need to use Term?
I'm a newbie... thanks!
t*******e
Posts: 684
34
From thread: Java board - Another Lucene question
Depending on how index files are created in the first place, Lucene may
store a full copy of the original text to be indexed, such that you can
restore the text from the query results. Otherwise, you only get other
fields like IDs from the Hit Documents.
t*g
Posts: 1758
35
From thread: Java board - Another Lucene question
We did store the original text. I don't have problems dumping the
original text. I can dump it through the Hit Documents. However, what I
need is to dump the tokenized text. It doesn't exist in the Hit Documents.
Looks like I need to go into indices to get the tokenized documents. But I'm
new to Lucene, I can't find a way to do it. Need help! Thx.
b******y
Posts: 9224
36
From thread: Java board - Another Lucene question
You will need to store the terms in lucene index. But, I don't see why you
want to do that.
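To illustrate why the tokenized form has to come from the index side rather than from the stored text: if the postings keep term positions for each document, the token sequence can be put back together by sorting terms by position. As far as I know, Lucene's term vectors serve this purpose, but the snippet below is only a toy in plain Python and does not use the Lucene API. The `postings` layout and the `rebuild_tokens` name are assumptions for illustration.

```python
def rebuild_tokens(postings, doc_id):
    """Reassemble one document's token sequence from term -> {doc_id: [positions]}."""
    slots = {}
    for term, docs in postings.items():
        for pos in docs.get(doc_id, ()):
            slots[pos] = term
    # Positions give the original token order.
    return [slots[p] for p in sorted(slots)]


# Hypothetical postings for two tiny documents:
# doc 7 = "the quick brown fox", doc 9 = "the fox"
postings = {
    "the": {7: [0], 9: [0]},
    "quick": {7: [1]},
    "brown": {7: [2]},
    "fox": {7: [3], 9: [1]},
}

print(rebuild_tokens(postings, 7))  # ['the', 'quick', 'brown', 'fox']
print(rebuild_tokens(postings, 9))  # ['the', 'fox']
```

Without positions in the postings, only the set of terms per document could be recovered, not their order.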
t*******e
Posts: 684
37
From thread: Java board - Still a Lucene question
luke is a Java program, so it should run on Linux. In general, you are not
supposed to copy or move index files. Having Lucene re-index the
documents to a network drive is a better approach.
q***n
Posts: 37
38
From thread: Java board - Still a Lucene question
luke should have some command line right?
Worst case, why not use java on linux and just roll your own code to look at
lucene fields.
c******n
Posts: 4965
39
From thread: Java board - anybody doing Lucene/Solr?
I'm new, so I posted the following question on the mailing list but
haven't got an answer. Maybe someone here could help? Thanks!
http://mail-archives.apache.org/mod_mbox/lucene-java-user/201104.mbox/browser
i**e
Posts: 6810
40
From thread: Java board - anybody doing Lucene/Solr?
I don't know much about the internals of Lucene.
With Solr, it's possible to specify the default
operator as OR or AND. I think you were talking
more about the OR case. As an option,
when AND gives you a very small number of results,
you could do an OR to enrich the result.
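The OR/AND distinction and the "fall back to OR when AND returns too little" idea can be sketched over a toy inverted index. This is plain Python, not Solr or Lucene code; the function names, the `min_hits` threshold and the sample documents are assumptions for illustration, and scoring/ranking is left out entirely.

```python
def build_index(docs):
    """Map each term to the set of doc ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query, operator="OR"):
    """Look up every query term, then union (OR) or intersect (AND) the postings."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    if not postings:
        return []
    combine = set.union if operator == "OR" else set.intersection
    return sorted(combine(*postings))


def search_with_fallback(index, query, min_hits=2):
    """Try AND first; if it yields fewer than min_hits documents, retry with OR."""
    hits = search(index, query, "AND")
    return hits if len(hits) >= min_hits else search(index, query, "OR")


docs = {
    1: "dog greyhound racing",
    2: "dog food",
    3: "greyhound bus",
    4: "cat food",
}
index = build_index(docs)
print(search(index, "dog greyhound", "OR"))          # [1, 2, 3]
print(search(index, "dog greyhound", "AND"))         # [1]
print(search_with_fallback(index, "dog greyhound"))  # [1, 2, 3]
```

Note that every query term is looked up in the index; the operator only decides how the per-term posting lists are merged.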
t*********e
Posts: 630
41
From thread: Java board - Exact matching in Lucene
WildFly uses Undertow. What does it have to do with Lucene?
w***g
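Posts: 5958
43
From thread: Programming board - Is search (Lucene and the like) out of fashion?
I just looked into full text search. Lucene/Solr is still the mainstream.
It's just a small wheel; spend a day reading up on it and you're set.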
n******g
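Posts: 2201
44
From thread: Programming board - Is search (Lucene and the like) out of fashion?
..What a gap. As a postdoc switching careers, I'd have to grind at it for two weeks.
[Quoting wdong (万事休):]
:I just looked into full text search. Lucene/Solr is still the mainstream.
:It's just a small wheel; spend a day reading up on it and you're set.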
s*********y
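Posts: 6151
45
From thread: Programming board - Is search (Lucene and the like) out of fashion?
When was a lab product like Lucene ever popular?
Solr is decent, but in recent years it has been trailing behind Elasticsearch, copying this and that.
Elasticsearch is the real deal.
Site search, enterprise search, in-app search, data analytics, data
visualization, backend query layer: it handles them all.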
w***g
Posts: 5958
46
From thread: Programming board - Is search (Lucene and the like) out of fashion?
But Solr and Elasticsearch both use Lucene.
n******g
Posts: 2201
47
From thread: Programming board - Is search (Lucene and the like) out of fashion?
Good idea! This long weekend I plan to stuff some CSV into Lucene/Solr and then do a bit of analytics.
发帖数: 3240
48
来自主题: History版 - [合集] 脍炙《通鉴》
☆─────────────────────────────────────☆
kzeng (寱语·无味赛百味) 于 (Sun Sep 23 01:21:31 2012, 美东) 提到:
(这是一篇关于很枯燥的技术,很枯燥的历史文本,和不太枯燥的统计的 blog)
看过一篇关于《全宋词》词频统计文章,挺有趣的,想用类似的方法处理一下《资治通
鉴》,所以就趁周末花了几个小时作了一下。
词是长短句,统计两个字组成的词频比较合适,《通鉴》是古文,文字结构不同,所以
我统计了单字频,两字词词频,三字词词频,四字词词频,和五字词词频。同时也记录
各个统计单位(字或词)出现的卷数。《通鉴》294卷,从三家分晋到五代结束共共
1362年,所以卷数可以作为时间的度量。
《全宋词》的词频是用 R 作的。R 虽然是不错的统计软件,也是我的最爱之一,但是
R 并不适合作文本分析,更不适合来作数据库操作。所以就用了 C# 和 Kdb +3.0。 C#
用来分析文本,.Net 是懒人的福音,并且多线程运算非常简单,能够大大提升文本处
理速度,Kdb+用来储存数据,它差不多是性能最好的 in-memor... 阅读全帖
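The counting scheme described in this post (frequencies of 1- to 5-character units, plus the number of volumes each unit appears in) can be sketched in a few lines. The original work used C# and Kdb+; the Python below is only a stand-in, and the function name `ngram_stats`, the three sample "volumes" and the assumption that punctuation has already been stripped are all made up for illustration.

```python
from collections import Counter, defaultdict


def ngram_stats(volumes, max_n=5):
    """Count every character n-gram (n = 1..max_n) and the volumes it occurs in."""
    freq = Counter()
    seen_in = defaultdict(set)
    for vol_no, text in enumerate(volumes, start=1):
        for n in range(1, max_n + 1):
            for i in range(len(text) - n + 1):
                gram = text[i:i + n]
                freq[gram] += 1
                seen_in[gram].add(vol_no)
    return freq, {gram: len(vols) for gram, vols in seen_in.items()}


# Three made-up "volumes", punctuation assumed to be removed beforehand.
volumes = ["臣光曰臣聞", "臣聞天子", "天子之職"]
freq, vol_count = ngram_stats(volumes, max_n=2)

print(freq["臣"], vol_count["臣"])      # 3 2
print(freq["臣聞"], vol_count["臣聞"])  # 2 2
```

Because volumes are numbered in order, keeping the set of volume numbers (rather than just its size) would also let a unit's usage be plotted over time, as the post suggests.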
a*****o
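Posts: 209
49
From thread: History board - 脍炙《通鉴》 (Savoring the Zizhi Tongjian)
A very interesting experiment.
For the word-frequency analysis, the OP could try Lucene (http://lucene.apache.org/core/), a very mature open-source full-text search system. It builds an inverted index when processing text, and for text retrieval it is far more efficient than any approach based on database queries. It also builds indexes very quickly; its home page says "over 95GB/hour on modern hardware". The OP said the Tongjian has a bit over two million characters, so the full text is around 10 MB. Split it by chapter into separate documents during preprocessing, and building the index with Lucene should be very fast.
On top of that index, word-frequency analysis and other more complex analyses can be done once and for all, either
through the Lucene API (e.g., http://lucene.apache.org/core/old_versioned_docs/versions/3_0_2/api/all/org/apache/lucene/index/TermDocs.html#freq()) or through index inspection tools such as Luke (http://code.google.com/p/luke/).
Lucene can easily be extended to handle Chinese; the Chinese Academy of Sciences...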
t*********e
Posts: 630
50
From thread: Java board - 15 high-impact Apache projects
1. Cassandra
The Cassandra database serves as a "scalable system of record" in the big
data world, says Jonathan Ellis, vice president of the Cassandra project.
Apache received the project from Facebook, which open-sourced Cassandra in
2008. Whereas Hadoop undertakes data analysis, Cassandra provides a data
store for applications, often highly scalable ones on the Web. Netflix, for
example, runs many Cassandra clusters, Ellis says.
2. Cordova
Giving Apache prominence in mobile computing, Cordova...