Slices of life, via MITBBS

This page is an excerpt and archive of the corresponding 未名空间 (mitbbs) thread. Posts less than a week old show at most 50 characters; older posts show up to 500 characters.
DataSciences board - [Data Science Project Case] Fuzzy matching on names
Related threads:
[Data Science Project Case] Generate Categories for Product
Trying to scrape some data from the web, can't get it to work
Thoughts after dabbling in ML
Reviewing opportunity - Call for reviewers: soft computing (forwarded)
About clustering
An interview question, asking this board for advice
What's a good way to compute distances?
Anyone planning to take Cloudera's Data Scientist Certificate?
A question: ~1M points in 2D, each with a weight - how to cluster them?
only average statistics
A new Science paper on clustering (forwarded)
Looking for a DS job - help me analyze
I have roughly 80,000-100,000 time series and want to classify them
A few valuable data-science posts I've seen recently
Anyone bought servers to build clusters and run Hadoop on big data?
Analysis and tests of the new clustering algorithm in Science
Discussion summary for related topics
Topics: names, data, dell, xps, am
c***z
Posts: 6348
1
We have two data sets, one for product views and one for actual
purchases. We don't have all the shopping cart information and need to
infer the missing ones.
To build a training set we need to join the two data sets, and the cart id and item names are the only available keys. The problem is that the items can have many names in both sets, e.g. "Dell 17" XPS" and "Dell XPS Laptop 17 inch" refer to the same item.
I am thinking of two approaches: tf-idf to identify the first three words of item names, or clustering using edit distance.
This would be the first time I am doing a text analysis project, so I
am wondering if I need a lot of data, instead of just a smaller
sample, as well as what would be the best approach and tools. I am
familiar with R, Matlab, Pig and some Scala, and am willing to learn
other languages as well.
Thanks a lot!
C***i
Posts: 486
h********3
Posts: 2075
3
tf-idf definitely won't work here. tf-idf is normally computed over documents of at least a few thousand words each; with only a handful of words per item name, the tf you compute comes from a few samples and is meaningless.
You could consider clustering with edit distance, but that is too slow: comparing all pairs of names is O(N^2), and each comparison is itself quadratic in the name length. A simpler option is the Jaccard index: the size of the intersection of the two token sets divided by the size of their union.
That said, I think the most reliable approach is to first build a dictionary of all the brand names, and then another dictionary of product-category words, so that the matching has some semantic grounding. If you only look at surface tokens, you may well end up putting an iPhone case in the same cluster as an iPhone...
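The Jaccard computation described above takes only a few lines; a minimal sketch in Python, where the tokenizer and the example names are my own illustrative assumptions:

```python
import re

def tokens(name: str) -> set:
    """Lowercase the name and split it into alphanumeric tokens."""
    return {t for t in re.split(r"[^a-z0-9]+", name.lower()) if t}

def jaccard(a: str, b: str) -> float:
    """Jaccard index: |intersection| / |union| of the two token sets."""
    ta, tb = tokens(a), tokens(b)
    if not (ta or tb):
        return 0.0
    return len(ta & tb) / len(ta | tb)

# {dell, 17, xps} vs {dell, xps, laptop, 17, inch}: 3 shared of 5 total
print(jaccard('Dell 17" XPS', "Dell XPS Laptop 17 inch"))  # → 0.6
```

Treating names as token sets also sidesteps word order, which pure edit distance would penalize heavily here.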

[in reply to c***z's post above]

l*******m
Posts: 1096
4
I would suggest using search-related frameworks or techniques. All of them are based on indexing, so they are very fast.

[in reply to c***z's post above]

r*****d
Posts: 346
5
No magic trick here...

[in reply to c***z's post above]

N*n
Posts: 456
6
I also think you have to do the analysis the honest, manual way; clustering by edit distance alone is not very reliable.
Take your own example: you need to identify all the possible names that Dell_17_XPS can appear under.
Perl is very handy for this kind of text processing; I'm not sure about other tools.

[in reply to r*****d's post above]
E***1
Posts: 2534
7
Maybe you can try the 're' module in Python. I once saw an example that converts different formats of phone numbers into one canonical style:
(123) 456-7890
123-456-7890
1234567890
+1 (123) 456-7890 (with country code)
123-456-7890 x321 (with extension)
etc.
These are all essentially the same number; the re module can help and save you time.
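A sketch of that idea with the `re` module; the helper name and the US-style 10-digit assumption are mine, not from the example the poster saw:

```python
import re

def normalize_phone(raw: str) -> str:
    """Strip an optional trailing extension (x321 / ext. 321), drop all
    non-digits, and keep the last 10 digits (discards a +1 country code)."""
    main = re.split(r"\s*(?:x|ext\.?)\s*\d+$", raw, flags=re.IGNORECASE)[0]
    digits = re.sub(r"\D", "", main)
    return digits[-10:]

for s in ["(123) 456-7890", "123-456-7890", "1234567890",
          "+1 (123) 456-7890", "123-456-7890 x321"]:
    print(normalize_phone(s))  # each prints 1234567890
```

The same normalize-then-compare pattern carries over to product names: canonicalize first, then exact matching does most of the work.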
c****t
Posts: 19049
8
Isn't this just a regular-expressions problem? Or am I missing something?
Perl used to dominate the regex world. These days every popular language has the relevant packages/libs; on a cluster node you can use Java, Python, or R. There are also dedicated small tools, though I haven't used those on a node.
c*****o
Posts: 1702
9
Just use SQL regex directly.
c***z
Posts: 6348
10
Thank you all!
c***z
Posts: 6348
11
Can you share more details? Thanks a lot!

[in reply to l*******m's post above]

c***z
Posts: 6348
12
Can you share more details?
I used regex in R for a bit, but I don't know of a package that does this kind of job...
Thanks a lot!

[in reply to c****t's post above]

c***z
Posts: 6348
13
I tried the Jaccard index and it worked well. I will take a look at cosine distance and the other suggested methods as well. Thanks again! You guys are very helpful!
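Cosine distance, mentioned above, compares the names as token-count vectors rather than sets; a quick sketch under the same illustrative tokenization (the names are again my own examples):

```python
import math
import re
from collections import Counter

def bag(name: str) -> Counter:
    """Bag-of-words token counts for a product name."""
    return Counter(t for t in re.split(r"[^a-z0-9]+", name.lower()) if t)

def cosine(a: str, b: str) -> float:
    """Cosine similarity between the two count vectors."""
    ca, cb = bag(a), bag(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

print(cosine('Dell 17" XPS', "Dell XPS Laptop 17 inch"))  # ≈ 0.775
```

Unlike the Jaccard index, cosine over counts lets repeated tokens carry extra weight, which matters for the weighting trick discussed later in the thread.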
c***z
Posts: 6348
14
Some updates:
We ran a pilot with the Jaccard index: at the cost of 2 false positives, I was able to add 15 true positives on top of the 5 true matches found by exact matching (i.e. a Jaccard distance of 0).
At a larger scale, I took a sample of 2000 matched records and painfully eyeballed them.
It seems that above a Jaccard index of 0.35 (about 1500 records satisfy this condition, of which about 1000 are exact matches), things start to look good.
For the purchase-inference training set, this increases the sample size by 50%. In other possible applications, such as clustering into categories, product IDs and brands, we may start from 0.35 as the criterion.
Any suggestions and comments are extremely welcome!
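The thresholding step above can be sketched as a filter over candidate pairs; the toy view/purchase lists below are my own stand-ins for the real data:

```python
import re

def tokens(name):
    return {t for t in re.split(r"[^a-z0-9]+", name.lower()) if t}

def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0

views = ['Dell 17" XPS', "Apple iPhone 5"]
purchases = ["Dell XPS Laptop 17 inch", "iPhone 5 screen protector"]

# keep pairs scoring above the 0.35 cutoff found by eyeballing
matches = [(v, p, jaccard(v, p))
           for v in views for p in purchases
           if jaccard(v, p) > 0.35]
# the Dell pair scores 0.6; note the iPhone/accessory pair also sneaks
# through at 0.4 -- the false-positive mode discussed earlier
```

For realistic data sizes, the all-pairs loop would be replaced by blocking or an inverted index, as suggested upthread.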
l******0
Posts: 244
15
You're using the Jaccard index to compute the similarity of two product names? E.g. Dell 17" XPS and Dell XPS Laptop 17?


[in reply to c***z's post above]

c***z
Posts: 6348
16
Correct.
Last Friday we tested this method on item names from NPD, and it worked well (we ran a K-S test on NPD's data against our own).

[in reply to l******0's post above]

l******0
Posts: 244
17
How do you tell apart things like iPhone 5, iPhone 4, iPhone screen, and iPhone battery? Or maybe there is little ambiguity in the product names in your data to begin with, so the results will be decent no matter what you do. If you use a regex to pull out the strings that start with a capital letter and then uniq-sort them, it should be fairly easy to see the patterns in these product names.
One caveat: in casual web text capitalization is sometimes dropped, e.g. iphone or obama. But when scraping product reviews from the web, the product's official name should be available.
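A sketch of the capitalized-string extraction suggested above; the extra alternation for mixed-case brands like iPhone is my own addition:

```python
import re

names = ['Dell 17" XPS Laptop', "iphone 5 case", "Apple iPhone 5"]

# words starting with a capital, plus an iPhone-style lowercase-i pattern
pattern = re.compile(r"\b(?:[A-Z][A-Za-z0-9]*|i[A-Z]\w*)\b")

# the "uniq sort" step: dedupe, then sort
candidates = sorted({m for name in names for m in pattern.findall(name)})
print(candidates)  # ['Apple', 'Dell', 'Laptop', 'XPS', 'iPhone']
```

The all-lowercase "iphone 5 case" contributes nothing here, which is exactly the casual-capitalization caveat raised above.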

[in reply to c***z's post above]

c***z
Posts: 6348
18
Yes, the model/brand/category is a big deal. We append them to the item name a number of times to increase their weight.
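That weighting trick might look like the following sketch; the field names and the repeat count are assumptions on my part:

```python
from collections import Counter

def weighted_bag(name, brand, category, boost=3):
    """Append the brand and category tokens `boost` times so they
    outweigh ordinary name tokens in a bag-of-words similarity."""
    toks = name.lower().split()
    toks += [brand.lower()] * boost + [category.lower()] * boost
    return Counter(toks)

bag = weighted_bag("XPS 17 inch", brand="Dell", category="Laptop")
print(bag)  # dell and laptop count 3, the name tokens count 1 each
```

Feeding these weighted bags into a count-based similarity such as cosine makes brand or category disagreement dominate the score, which is what keeps the iPhone and its case in separate clusters.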