Page 5 - Roundup of discussions about Lucene - 话题女王

g********e
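Posts: 1142

From topic: SanFrancisco board - A former Jike veteran: what I know about the state of R&D at People's Search (zz)

A former Jike veteran: what I know about the state of R&D at People's Search
May 24, 2013 11:16, 创事记, Weibo author: jikesolider
Jike Search, headed by Deng Yaping, has long attracted attention
Editor's note: Jike Search, owned by People's Daily Online (人民网), has long drawn attention because of its celebrity leadership, official background, and many upheavals. This article comes from 弯曲评论; its author, jikesolider, describes himself as a Jike veteran. It is for reference only.
It has been a few months since I left Jike. Thinking back on nearly three years of working at jike, I have a lot of mixed feelings, so with some spare time I am writing down bits and pieces about Jike for peers, or for anyone later thinking of making a living there.
Jike's predecessor was called People's Search (人民搜索). Back then it had essentially nothing. The leader at the time was Gong (宫), who did not understand search and had no idea where to start, so the first step was a collaboration with the Chinese Academy of Sciences, which built a search engine on the open-source Lucene. Its features and performance could not meet the demands of large-scale search, and the effort later stalled.
Then the world champion arrived, and the world champion was indeed something else. The first move was to partner with 云壤, the company of Liu, a former director at Google China, and to hire Liu as chief scientist, with 云壤 providing technical support and development. The contract signed at the time gave 云壤 a certain amount of equity plus a large sum of money; the money came from taxpayers, of course, so it hardly mattered. After less than a year of development, Liu's company launched the product on June 20, 2011, and it was renamed Jike Search. I remember that when we heard it was being renamed jike, we all laughed, ... Read full post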

x**v
Posts: 100

From topic: SanFrancisco board - Could anybody suggest a tool to do performance testing for Lucene search

I am using JMeter right now. I am wondering whether there are other, better
options?
Thank you so much.

m****v
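Posts: 780

From topic: SanFrancisco board - Help, how can I make my program run faster? Baozi (reward) offered. (repost)

Build an index over A using substrings of different lengths.
Use the entries of B as queries and run an OR search to retrieve the top k (this effectively applies a different distance measure), then apply edit distance (that step may not even be needed).
Lucene search is heavily optimized and very fast.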
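
The approach in this post can be sketched without Lucene at all. The following is only a toy illustration in plain Python, assuming substring lengths of 2 and 3 and a simple "count of shared substrings" score; the function names (`build_index`, `top_k`, `best_match`) are made up for this sketch and are not Lucene APIs.

```python
from collections import Counter, defaultdict


def ngrams(text, sizes=(2, 3)):
    """Return all substrings of the given lengths."""
    return [text[i:i + n] for n in sizes for i in range(len(text) - n + 1)]


def build_index(strings, sizes=(2, 3)):
    """Map each substring to the set of ids of the strings that contain it."""
    index = defaultdict(set)
    for doc_id, s in enumerate(strings):
        for gram in ngrams(s, sizes):
            index[gram].add(doc_id)
    return index


def top_k(index, query, k=3, sizes=(2, 3)):
    """OR search: score each candidate by how many distinct query substrings it shares."""
    scores = Counter()
    for gram in set(ngrams(query, sizes)):
        for doc_id in index.get(gram, ()):
            scores[doc_id] += 1
    return [doc_id for doc_id, _ in scores.most_common(k)]


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def best_match(strings, index, query, k=3):
    """Retrieve the top-k candidates, then rerank them by edit distance."""
    candidates = top_k(index, query, k)
    if not candidates:
        return None
    return min((strings[c] for c in candidates), key=lambda s: edit_distance(query, s))
```

Usage would look like `index = build_index(A)` once, then `best_match(A, index, b)` for each string `b` in B. The point of the design is that the expensive edit distance runs only on the few candidates the index returns, not on every pair.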

o****e
Posts: 916

From topic: SanFrancisco board - Help, how can I make my program run faster? Baozi (reward) offered. (repost)

+1 on lucene, index b4, and query each of a5, select top x, should be
lightning fast. Don't bother to write your own matching algorithm.

m***2
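Posts: 595

From topic: SanFrancisco board - Just arrived in the Bay Area, looking for a Software Engineer job or referral – any kindness will be repaid many times over

Don't casually write "I love programming, I love learning" on your résumé
2015-10-12 CocoaChina
Author: tinyfool. Reposted on this site with permission.
From 纯银's Weibo:
"I have received questions by private message, so here is one reply for everyone. To get into the internet industry, playing with lots of products is enough. For example, try out the top 20-50 products in every App Store category and write up an analysis of each one; two or three months of that beats reading books and collecting certificates by a wide margin. For editorial and operations roles, doing well on Zhihu is a big plus, and using niche creative products such as Jianshu and Lofter helps too. In short, you need to have created something; lurking, reposting, and replying don't count."
"Someone who creates nothing in any form, has few written analyses, does not even have many friends (followers) on any social platform, and just shouts 'I really love...' looks very suspect to me. At the very least, among hundreds of application letters, that person does not deserve a second look."
左耳朵耗子's Weibo post roasting a certain senior technical expert also contributed to the mood behind this piece.
My thoughts follow:
In 1995 my high-school classmate Guo Jun bought a Borland C++ manual. For the next two years the two of us had no chance to touch a real computer and had never seen Turbo C, let alone Borland C++; we just read that book cold, for two years, and never tired of it. If you can't do that, don't casually say "I love programming, I love learning."
In 1997, three days before the college entrance exam, my parents gave me... Read full post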

i***0
Posts: 8469

From topic: Seattle board - Machine Learning Opportunity, Adsymptotic!!

I wanted to connect on 2 opportunities I'm working on. One is a role in
Menlo Park with Adsymptotic. I'm working with the founders from Google/
Admob/Yahoo and backed by Sequoia/KP; well funded, and growing. I've helped
them staff a number of roles and am looking for a Hadoop expert to join the
team. Take a look, they're doing quite well.
I'm also working with the VPE of Mashlogic in Palo Alto and they're looking
for a Java/BigData engineer. They're growing, backed by NEA and Bessemer
Vent... Read full post

p*****b
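Posts: 291

From topic: WashingtonDC board - Knowledge checklist for Java developers

Over the past few years I have used about 80% of the items listed, except for JMX, a few commercial web containers, cryptographic algorithm implementations, and so on.
So far it looks like I am still keeping up with the times.
Solr/Lucene/Tika and the like under Apache should be added to the list as well.
What is the current market rate for a coder who matches this list?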

d******n
Posts: 186

From topic: WashingtonDC board - 10 open IT contractor positions in a company at DC downtown

StreamSage is a Comcast subsidiary in DC downtown that currently has 10 contractor
positions. There is an open house this Friday where they will interview for all of these positions, hoping to find suitable
people as soon as possible. If you are interested, see the description below, which the admin assistant sent out to everyone.
Calling All StreamSagers!
Some of you may have heard a rumor of an upcoming Open House here at
StreamSage.
Well, it is scheduled for this Friday July 20th from 2pm – 6pm.
The purpose of this open house is to fill our 10 open contractor positions
as quickly as possible. So, we will be conducting mass interviews & making
on the spot job offers to qualifie... Read full post

N***M
Posts: 4295

From topic: WashingtonDC board - 10 open IT contractor positions in a company at DC downtown

These days Hadoop, Lucene, and Solr are all hard to find.

d********o
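Posts: 1738

From topic: WaterWorld board - Can anyone tell me how to search on this site

This site is a joke, especially since you can't search. What year is it? Get a couple of people to install Lucene!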

C******y
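Posts: 2007

From topic: WaterWorld board - How enviable, surrounded by a crowd of fans

There are many approaches to text analysis, and the word-frequency analysis that crowd in China uses is too primitive. They don't even have a basic Lucene
index. LSA, random indexing compression, SVM classification: brain-dead as you are, you wouldn't understand these even if I explained them, and 肘子, who is
at the level of writing a MUD script, won't understand them either. Besides, there is always the question of statistical power,
and writing style changes over time. This sort of thing is fine as an academic exercise, but using it to judge someone's literary work
is unfair, and it gets quoted out of context by people like you who don't understand it.
As for the "prostitute" stuff, you are just extrapolating from your own mother's experience. Let me tell you, that is unreliable.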

g********0
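Posts: 6201

From topic: Joke board - A former Jike veteran: what I know about the state of R&D at People's Search (zz) (repost)

[The following text is reposted from the Military board]
Posted by: thanksgiving (###), Board: Military
Title: A former Jike veteran: what I know about the state of R&D at People's Search (zz) (repost)
Site: BBS 未名空间站 (Fri May 24 00:21:13 2013, US Eastern)
Posted by: goldenlife (goldenlife), Board: SanFrancisco
Title: A former Jike veteran: what I know about the state of R&D at People's Search (zz)
Site: BBS 未名空间站 (Thu May 23 23:47:29 2013, US Eastern)
A former Jike veteran: what I know about the state of R&D at People's Search
May 24, 2013 11:16, 创事记, Weibo author: jikesolider
Jike Search, headed by Deng Yaping, has long attracted attention
Editor's note: Jike Search, owned by People's Daily Online (人民网), has long drawn attention because of its celebrity leadership, official background, and many upheavals. This article comes from 弯曲评论; its author, jikesolider, describes himself as a Jike veteran. It is for reference only.
It has been a few months since I left Jike. Thinking back on nearly three years of working at jike, I have a lot of mixed feelings, so with some spare time I am writing down bits and pieces about Jike for peers, or for anyone later thinking of going to Jike to make a... Read full post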

r*****9
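Posts: 75

From topic: Apple board - Newbie question on an interview problem ----- Apple Quartz and PDF (PDF extraction? text mining) -- waiting online for answers, thanks

Today I was interviewed for an intern position at a startup. The interviewer was Indian, I could not understand a word of his English, and the whole
thing was basically two people talking past each other. It ended hastily after 3 minutes.
Afterwards he sent me a question (he had asked it on the phone, but I really could not follow) and asked me to reply with
my own thoughts. After reading the email I still did not understand it, which is embarrassing.
His question, verbatim: As discussed, please share your thoughts on integrating Quartz
from Apple (for PDFs) into the solution. I am interesting in getting your
view how this might help/affect the solution of automating the process.
Rough background: the company does something like text mining, uses Lucene, and needs to extract the text
from PDFs.
I googled Quartz; it is a graphics engine from Apple.
What I want to ask is: how would I use it to extract text from PDFs automatically?
What kind of thoughts should I offer?
Thanks in advance for any guidance.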

i***c
Posts: 301

From topic: BuildingWeb board - My website, please give me your feedback

Can you give me some info about the web crawler?
How do you integrate it with ASP.NET?
I am using Nutch and Lucene, which do not seem easy to use with ASP.NET.

s****y
Posts: 983

From topic: BuildingWeb board - I want to build a search engine; will Lucene do?

Of course it will.

l*******e
Posts: 10

From topic: BuildingWeb board - I want to build a search engine; will Lucene do?

I worked with the Java version of Lucene a few years ago.

w*********m
Posts: 4740

From topic: BuildingWeb board - I want to build a search engine; will Lucene do?

Are there Lucene versions in other languages?

b******y
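Posts: 9224

From topic: BuildingWeb board - I do mobile development now but want to learn web development, both front end and back end

No need. You can do it all on your local machine. For example, I use Tomcat and can build a site directly on my local machine
(you can install MySQL too). You can even take screenshots of the site you build and keep them; they come in
handy when job hunting.
My current site runs on a bare-metal machine with my own Java programs and search engine software. The database is MySQL. It is Linux based
and extremely convenient.
I have also considered using Jetty. I exchanged emails with its founder years ago; the Jetty he built is an embedded Java
server and is also very nice to use.
Besides these, I have also written my own:
-java based template engine (like velocity)
-java based mini web server
-an efficient crawler system based on java
-search libraries like Lucene but with my own IP
and so on.
Happy to keep discussing.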

b******y
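Posts: 9224

From topic: BuildingWeb board - Question about search efficiency

You can no longer use a database for storage; you need an index like Lucene's. In other words, it has to be implemented with search technology.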

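Posts: 1

From topic: BuildingWeb board - I tried building an image sharing site

This idea had actually been brewing for a while. At first it was just the usual geek itch, and I knocked out a very simple
prototype; I only started going all in at the beginning of January this year. I truly found out how exhausting it is for one person to be dev +
product manager + project manager + QA. The only time available was after work each day and on weekends.
After work I often kept going until one or two in the morning, and in the first few months I would from time to time treat a weekend as a 48-hour
hackathon. It was basically two of the largest-size Starbucks coffees a day. The basic framework was done in about 6 or 7 months,
and the following month or two mostly went into adding new features. It dragged on mainly because, when designing the back-end
system, I could never resist aiming for trading-system reliability, and I kept assuming
scenarios such as what to do with tens of millions of images (sometimes I laughed at myself). I often wavered between several options,
unable to decide for a long time. Now essentially all of the system's key workflows can be bounced live-live at any time
without errors, and transactional workflows such as checkout/bookkeeping are all non-locking.
The system is basically a few machines of my own pro... Read full post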

Posts: 1

From topic: BuildingWeb board - I tried building an image sharing site

r*****9
Posts: 75

From topic: ComputerGraphics board - Help with an interview question: about Quartz, text mining, PDF

b*********d
Posts: 139

From topic: CS board - Fellow text miners, can you recommend a good tokenization library?

lucene

x**y
Posts: 10012

From topic: CS board - I want to build a search engine; will Lucene do?

Yes.

M****o
Posts: 117

From topic: CS board - I want to build a search engine; will Lucene do?

Could you give some more detailed advice? I am only just getting started.

l*****g
Posts: 246

From topic: CS board - I want to build a search engine; will Lucene do?

The poster above really does believe that silence is golden.

w**********s
Posts: 291

From topic: CS board - I want to build a search engine; will Lucene do?

Try Lemur?

t***s
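Posts: 48

From topic: CS board - [Help] Text Indexer for Large Volume of ASCII files [thanks in advance]

I need to index roughly four million ASCII files. Could someone recommend a good text indexer? Many thanks.
Put simply, I want something like the indexer part of a text search engine, but with convenient
command line access, preferably on Windows.
More specifically, it should index these four million files and put the results in a repository that is easy to read from the command
line. That can be a relational database or some other proprietary format, as long as it can
be read from the command line or from scripts such as Perl and the results can be written to ASCII files. Of course, being able
to read directly from a database with SQL would be even better.
The simpler the installation the better, ideally all command line.
I have tried Microsoft's Search Server and the open-source Lucene and was not very satisfied with either, mainly because the output is all
web pages. In my case a single keyword yields hundreds of pages of output, which is too much trouble to process.
Thanks again.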

c****e
Posts: 1453

From topic: Database board - What algorithm is fastest for searching a database? Use an index?

You didn't read what I wrote? Body is no different from sender or subject.
If you are interested in the implementation details, take a look at Lucene.
Essentially, you can see each document as a set of fields, and you build a
reverse index over each document. The field concept helps with structured
filtering. That's why it's called faceted search.
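
To make the "document as a set of fields" idea concrete, here is a minimal sketch in plain Python. It is not how Lucene stores its index; the class name `FieldedIndex`, the whitespace tokenizer, and the one-term-per-field query form are all assumptions made for illustration.

```python
from collections import defaultdict


class FieldedIndex:
    """Toy inverted index keyed by (field, term) -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = []

    def add(self, doc):
        """Index every field of a document given as a dict of field name -> text."""
        doc_id = len(self.docs)
        self.docs.append(doc)
        for field, text in doc.items():
            for term in text.lower().split():
                self.postings[(field, term)].add(doc_id)
        return doc_id

    def search(self, **field_terms):
        """AND together one term per field, e.g. search(sender='bob', body='lucene')."""
        result = None
        for field, term in field_terms.items():
            ids = self.postings.get((field, term.lower()), set())
            result = set(ids) if result is None else result & ids
        return sorted(result or [])
```

Because the field name is part of the key, a lookup on `body` costs the same as one on `sender` or `subject`, and a structured filter is just an intersection of posting sets.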

c****e
Posts: 1453

From topic: Database board - What algorithm is fastest for searching a database? Use an index?

d****u
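Posts: 275

From topic: Database board - MySQL fulltext search

The MySQL table has 3 columns, each holding a few hundred characters of plain text, and 1 million rows in total.
At this scale, how efficient is MySQL's full-text search? Is it recommended for online search?
Or do I have to install a natural-language package such as Lucene?
Thanks!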

F****n
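Posts: 3271

From topic: Database board - Why does an RDBMS use only one index? (repost)

A while ago I wrote a real-time durable-write thing with ES, and it felt nowhere near that hard.
Unfortunately I wrote it for my company. When I have time I plan to build an ACID real-time
object DB directly on Lucene and ZooKeeper.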

F****n
Posts: 3271

From topic: Database board - Why does an RDBMS use only one index? (repost)

In theory yes, in practice, not necessarily.
I just implemented a system with real time durable write and query based on
Lucene.
It is almost ACID, only without the repeatable and serializable isolation
levels.
I am thinking about making an open source library with full ACID support.

e***g
Posts: 158

From topic: DotNet board - .Net search engine - Lucene.Net

ah! just port whole apache to .net. or run java on .net.

L*******r
Posts: 1011

From topic: DotNet board - .Net search engine - Lucene.Net

Hehe, C# and java are very similar. Apache project is good. Why not use it?

w********c
Posts: 2632

From topic: DotNet board - .Net search engine - Lucene.Net

JJDD copycats! hehe.

L*******r
Posts: 1011

From topic: DotNet board - .Net search engine - Lucene.Net

hehe. Why?
Think about how template works? copying is not bad. But you have to know the
mechanics behind it. :)
hehe, no need to argue that la. Just develop the best technology. :)

W********n
Posts: 254

From topic: DotNet board - Stack Exchange’s Architecture in Bullet Points

Taken from: http://blog.serverfault.com/2011/02/11/stack-exchanges-architecture-in-bullet-points/
Traffic:
95 Million Page Views a Month
800 HTTP requests a second
180 DNS requests a second
55 Megabits per second
Data Centers:
1 Rack with Peak Internet in OR (Hosts our chat and Data Explorer)
2 Racks with Peer 1 in NY (Hosts the rest of the Stack Exchange Network)
Production Servers*:
12 Web Servers (Windows Server 2008 R2)
2 Database Servers (Windows Server 2008 R2 and SQL Server 2008 R2)
2 Loa... Read full post

G*********a
Posts: 1080

From topic: Java board - Anybody here used apache Lucene?

G*********a
Posts: 1080

From topic: Java board - Anybody here used apache Lucene?

en, but what i want to do is really understand its API and extend its simple
demo into a more flexible and powerful testbed.
for example, how to extend Similarity class etc. if u do play on it, i hope i
can discuss with u.

c*****s
Posts: 214

From topic: Java board - Anybody here used apache Lucene?

The process of creating an index and searching is sort of standard; you can't
make too many changes to it.
Most of the work goes into creating your own document reader. Here a document means
an item of record, which is a pretty abstract thing.
I'm not sure what the Similarity class you mentioned is. If you want to search
within a class, you have to design a reader that parses what you want to search.
What Lucene does is create an index for the source you provide (Lucene also
provides the reader for txt and html f

b******y
Posts: 9224

From topic: Java board - Let's discuss web frameworks

Apache web server, tomcat/jboss/jetty web server, Velocity template engine,
lucene, database, that's pretty much it for the web application.

b******y
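Posts: 9224

From topic: Java board - Large sites that use Java

As Java technology advances, more and more sites are adopting Java as their main language and platform. Here are
some large sites that use Java.
indeed.com
Job search engine, in Texas. Mainly Lucene and Java.
become.com
Large comparison-shopping search site, founded by Koreans. Their crawler is described in this article:
http://java.sun.com/developer/technicalArticles/WebServices/become/
nextag.com
Large comparison-shopping search site. A major player among shopping comparison engines.
shopzilla.com
Large comparison-shopping search site, in LA, California.
LinkedIn.com
Large professional social networking site.
amazon.com
Their platform is gradually being ported from C/C++/Perl to Java.
expedia.com
Large online air-ticket booking site. Their supply chain system is being ported from C/C++ to Java.
Clearly, Java has made its way onto high-performance servers, which are no longer the exclusive domain of C/C++...
Even Google now uses Java.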

t*g
Posts: 1758

From topic: Java board - Has anyone used Lucene for Chinese search?

When I pass Chinese text to my QueryParser, the resulting query comes out empty.

t*******e
Posts: 684

From topic: Java board - Has anyone used Lucene for Chinese search?

Have you specified an appropriate Analyzer for your query? e.g. CJKAnalyzer
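
For context on why the analyzer matters: Lucene's CJKAnalyzer is documented as indexing CJK text as overlapping two-character tokens (bigrams). The sketch below imitates that idea in plain Python. It is not the library's implementation; the `is_cjk` range check covers only the main CJK Unified Ideographs block, and the function names are made up for this example.

```python
def is_cjk(ch):
    """Rough check covering the main CJK Unified Ideographs block only."""
    return "\u4e00" <= ch <= "\u9fff"


def cjk_bigrams(text):
    """Emit overlapping bigrams for CJK runs and lowercase words for other alphanumerics."""
    tokens, run, word = [], [], []

    def flush_run():
        # A lone CJK character becomes a single-character token.
        if len(run) == 1:
            tokens.append(run[0])
        else:
            tokens.extend(run[i] + run[i + 1] for i in range(len(run) - 1))
        run.clear()

    def flush_word():
        if word:
            tokens.append("".join(word).lower())
            word.clear()

    for ch in text:
        if is_cjk(ch):
            flush_word()
            run.append(ch)
        elif ch.isalnum():
            flush_run()
            word.append(ch)
        else:
            # Whitespace and punctuation end whatever token is in progress.
            flush_run()
            flush_word()
    flush_run()
    flush_word()
    return tokens
```

With this scheme the query string and the indexed text are both cut into the same two-character tokens, so a Chinese query can match without a dictionary-based word segmenter.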

k***r
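Posts: 4260

From topic: Java board - Has anyone used Lucene for Chinese search?

I have seen free Java word-segmentation libraries; look around. The good ones all cost money, though, and they are not cheap.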

t*g
Posts: 1758

From topic: Java board - Has anyone used Lucene for Chinese search?

Thx, it worked.

t*g
Posts: 1758

From topic: Java board - Has anyone used Lucene for Chinese search?

We're planning to buy one...Still under evaluation.

t*******e
Posts: 684

From topic: Java board - Another Lucene question

This is impossible. The inverted index in a search engine stores terms
(tokens) in a term index file as the search key, which maps to document IDs,
and returns the matched documents as the query results. It does not work the other way around.
The terms you specified in your query are the tokens you may use to highlight
the original text.
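
A toy model of the term-to-document direction described above, and of highlighting with the query terms, might look like this. It is a sketch in plain Python under simple assumptions (a `\w+` tokenizer, original texts kept in a list); none of these function names come from Lucene.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_postings(docs):
    """Term -> sorted list of document ids; lookups only run in this direction."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(postings, query):
    """Return the ids of documents that contain every query term."""
    term_ids = [set(postings.get(t, ())) for t in tokenize(query)]
    return sorted(set.intersection(*term_ids)) if term_ids else []


def highlight(text, query, mark="**"):
    """Wrap occurrences of the query terms in the original stored text."""
    terms = set(tokenize(query))
    if not terms:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, sorted(terms))) + r")\b", re.IGNORECASE
    )
    return pattern.sub(lambda m: mark + m.group(0) + mark, text)
```

Note that `highlight` never consults the index: it works from the query terms and the original text, which is the point the post makes.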

t*g
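Posts: 1758

From topic: Java board - Another Lucene question

If I open the index, go through all the terms, and then want to print all the other fields of every document that each term
appears in, how do I do it? Thanks! Can this be obtained directly through TermPositions and TermEnum?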
