Slices of life, via MITBBS

This page is an excerpt and archive of the corresponding 未名空间 (mitbbs) thread. Posts less than a week old show at most 50 characters; older posts show up to 500 characters.
DataSciences board - [Data Science Project Case] Fuzzy matching on names
Related threads:
[Data Science Project Case] Generate Categories for Product
Trying to scrape some data from the web, can't get it to work
Thoughts after dabbling in ML
Reviewing opportunity - Call for reviewers: soft computing (forwarded)
About clustering
An interview question, asking this board for advice
What's a good way to compute distances?
Anyone planning to take Cloudera's Data Scientist Certificate?
A question: ~1M points in 2D, each with a weight - how to cluster them?
only average statistics
A new Science paper on clustering (forwarded)
Looking for a DS job - help me analyze
I have roughly 80,000-100,000 time series and want to classify them
A few valuable data-science posts I've seen recently
Anyone bought servers to build clusters and run Hadoop on big data?
Analysis and tests of the new clustering algorithm in Science
Discussion summary for related topics
Topics: names, data, dell, xps, am
c***z
Posts: 6348
1
We have two data sets, one for product views and one for actual
purchases. We don't have all the shopping cart information and need to
infer the missing ones.
To build a training set we need to join the two data sets, and the cart id and item names are the only available keys. The problem is that the items can have many names in both sets, e.g. "Dell 17" XPS" and "Dell XPS Laptop 17 inch" refer to the same item.
I am thinking of two approaches: tf-idf to identify the first three words of item names, or clustering using edit distance.
This would be the first time I am doing a text analysis project, so I
am wondering if I need a lot of data, instead of just a smaller
sample, as well as what would be the best approach and tools. I am
familiar with R, Matlab, Pig and some Scala, and am willing to learn
other languages as well.
Thanks a lot!
C***i
Posts: 486
h********3
Posts: 2075
3
tf-idf definitely won't work here. tf-idf is normally computed over documents of at least a few thousand words each; with only a handful of words per item name, the tf you compute comes from a few samples and is meaningless.
You could consider clustering with edit distance, but that is too slow: comparing all pairs of names is O(N^2), and each comparison is itself quadratic in the name length. A simpler option is the Jaccard index: the size of the intersection of the two token sets divided by the size of their union.
That said, I think the most reliable approach is to first build a dictionary of all the brand names, and then another dictionary of product-category words, so that the matching has some semantic grounding. If you only look at surface tokens, you may well end up putting an iPhone case in the same cluster as an iPhone...
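The Jaccard computation described above takes only a few lines; a minimal sketch in Python, where the tokenizer and the example names are my own illustrative assumptions:

```python
import re

def tokens(name: str) -> set:
    """Lowercase the name and split it into alphanumeric tokens."""
    return {t for t in re.split(r"[^a-z0-9]+", name.lower()) if t}

def jaccard(a: str, b: str) -> float:
    """Jaccard index: |intersection| / |union| of the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not (ta or tb):
        return 0.0
    return len(ta & tb) / len(ta | tb)

# {dell, 17, xps} vs {dell, xps, laptop, 17, inch}: 3 shared of 5 total
print(jaccard('Dell 17" XPS', "Dell XPS Laptop 17 inch"))  # → 0.6
```

Treating names as token sets also sidesteps word order, which pure edit distance would penalize heavily here.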

[in reply to c***z's post above]

l*******m
Posts: 1096
4
I would suggest using search-related frameworks or techniques. All of them are based on indexing, so they are very fast.

[in reply to c***z's post above]

r*****d
Posts: 346
5
No magic trick here...

[in reply to c***z's post above]

N*n
Posts: 456
6
I also think you have to do the analysis the honest, manual way; clustering by edit distance alone is not very reliable.
Take your own example: you need to identify all the possible names that Dell_17_XPS can appear under.
Perl is very handy for this kind of text processing; I'm not sure about other tools.

[in reply to r*****d's post above]
E***1
Posts: 2534
7
Maybe you can try the 're' module in Python. I once saw an example that converts different formats of phone numbers into one canonical style:
(123) 456-7890
123-456-7890
1234567890
+1 (123) 456-7890 (with country code)
123-456-7890 x321 (with extension)
etc.
These are all essentially the same number; the re module can help and save you time.
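A sketch of that idea with the `re` module; the helper name and the US-style 10-digit assumption are mine, not from the example the poster saw:

```python
import re

def normalize_phone(raw: str) -> str:
    """Strip an optional trailing extension (x321 / ext. 321), drop all
    non-digits, and keep the last 10 digits (discards a +1 country code)."""
    main = re.split(r"\s*(?:x|ext\.?)\s*\d+$", raw, flags=re.IGNORECASE)[0]
    digits = re.sub(r"\D", "", main)
    return digits[-10:]

for s in ["(123) 456-7890", "123-456-7890", "1234567890",
          "+1 (123) 456-7890", "123-456-7890 x321"]:
    print(normalize_phone(s))  # each prints 1234567890
```

The same normalize-then-compare pattern carries over to product names: canonicalize first, then exact matching does most of the work.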
c****t
Posts: 19049
8
Isn't this just a regular-expressions problem? Or am I missing something?
Perl used to dominate the regex world. These days every popular language has the relevant packages/libs; on a cluster node you can use Java, Python, or R. There are also dedicated small tools, though I haven't used those on a node.
c*****o
Posts: 1702
9
Just use SQL regex directly.
c***z
Posts: 6348
10
Thank you all!
c***z
Posts: 6348
11
Can you share more details? Thanks a lot!

[in reply to l*******m's post above]

c***z
Posts: 6348
12
Can you share more details?
I used regex in R for a bit, but I don't know of a package that does this kind of job...
Thanks a lot!

[in reply to c****t's post above]

c***z
Posts: 6348
13
I tried the Jaccard index and it worked well. I will take a look at cosine distance and the other suggested methods as well. Thanks again! You guys are very helpful!
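Cosine distance, mentioned above, compares the names as token-count vectors rather than sets; a quick sketch under the same illustrative tokenization (the names are again my own examples):

```python
import math
import re
from collections import Counter

def bag(name: str) -> Counter:
    """Bag-of-words token counts for a product name."""
    return Counter(t for t in re.split(r"[^a-z0-9]+", name.lower()) if t)

def cosine(a: str, b: str) -> float:
    """Cosine similarity between the two count vectors."""
    ca, cb = bag(a), bag(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

print(cosine('Dell 17" XPS', "Dell XPS Laptop 17 inch"))  # ≈ 0.775
```

Unlike the Jaccard index, cosine over counts lets repeated tokens carry extra weight, which matters for the weighting trick discussed later in the thread.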
c***z
Posts: 6348
14
Some updates:
We ran a pilot with the Jaccard index: at the cost of 2 false positives, I was able to add 15 true positives on top of the 5 true matches found by exact matching (i.e. a Jaccard distance of 0).
At a larger scale, I took a sample of 2000 matched records and painfully eyeballed them.
It seems that above a Jaccard index of 0.35 (about 1500 records satisfy this condition, of which about 1000 are exact matches), things start to look good.
For the purchase-inference training set, this increases the sample size by 50%. In other possible applications, such as clustering into categories, product IDs and brands, we may start from 0.35 as the criterion.
Any suggestions and comments are extremely welcome!
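The thresholding step above can be sketched as a filter over candidate pairs; the toy view/purchase lists below are my own stand-ins for the real data:

```python
import re

def tokens(name):
    return {t for t in re.split(r"[^a-z0-9]+", name.lower()) if t}

def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0

views = ['Dell 17" XPS', "Apple iPhone 5"]
purchases = ["Dell XPS Laptop 17 inch", "iPhone 5 screen protector"]

# keep pairs scoring above the 0.35 cutoff found by eyeballing
matches = [(v, p, jaccard(v, p))
           for v in views for p in purchases
           if jaccard(v, p) > 0.35]
# the Dell pair scores 0.6; note the iPhone/accessory pair also sneaks
# through at 0.4 -- the false-positive mode discussed earlier
```

For realistic data sizes, the all-pairs loop would be replaced by blocking or an inverted index, as suggested upthread.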
l******0
Posts: 244
15
You're using the Jaccard index to compute the similarity of two product names? E.g. Dell 17" XPS and Dell XPS Laptop 17?


[in reply to c***z's post above]

c***z
Posts: 6348
16
Correct.
Last Friday we tested this method on item names from NPD, and it worked well (we ran a K-S test on NPD's data against our own).

[in reply to l******0's post above]

l******0
Posts: 244
17
How do you tell apart things like iPhone 5, iPhone 4, iPhone screen, and iPhone battery? Or maybe there is little ambiguity in the product names in your data to begin with, so the results will be decent no matter what you do. If you use a regex to pull out the strings that start with a capital letter and then uniq-sort them, it should be fairly easy to see the patterns in these product names.
One caveat: in casual web text capitalization is sometimes dropped, e.g. iphone or obama. But when scraping product reviews from the web, the product's official name should be available.
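A sketch of the capitalized-string extraction suggested above; the extra alternation for mixed-case brands like iPhone is my own addition:

```python
import re

names = ['Dell 17" XPS Laptop', "iphone 5 case", "Apple iPhone 5"]

# words starting with a capital, plus an iPhone-style lowercase-i pattern
pattern = re.compile(r"\b(?:[A-Z][A-Za-z0-9]*|i[A-Z]\w*)\b")

# the "uniq sort" step: dedupe, then sort
candidates = sorted({m for name in names for m in pattern.findall(name)})
print(candidates)  # ['Apple', 'Dell', 'Laptop', 'XPS', 'iPhone']
```

The all-lowercase "iphone 5 case" contributes nothing here, which is exactly the casual-capitalization caveat raised above.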

[in reply to c***z's post above]

c***z
Posts: 6348
18
Yes, the model/brand/category is a big deal. We append them to the item name a number of times to increase their weight.
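That weighting trick might look like the following sketch; the field names and the repeat count are assumptions on my part:

```python
from collections import Counter

def weighted_bag(name, brand, category, boost=3):
    """Append the brand and category tokens `boost` times so they
    outweigh ordinary name tokens in a bag-of-words similarity."""
    toks = name.lower().split()
    toks += [brand.lower()] * boost + [category.lower()] * boost
    return Counter(toks)

bag = weighted_bag("XPS 17 inch", brand="Dell", category="Laptop")
print(bag)  # dell and laptop count 3, the name tokens count 1 each
```

Feeding these weighted bags into a count-based similarity such as cosine makes brand or category disagreement dominate the score, which is what keeps the iPhone and its case in separate clusters.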