[Data Science Project Case] Topic Learning - DataSciences board

j*******g
Posts: 331

We have to use quite a few really messy datasets, mostly the result of bad
ETL and sloppy client input. Within a single column, the content can vary
wildly. For example, in the column "store information", the content could
be the store name, which is good, or it could be just the brand, or the
address, a short name like "ABC", or some meaningless code or string.
This would be an unsupervised learning problem. There are several things we
want to achieve:
1. Identify the quality of a given column: come up with a probability or a
confidence level for how well the actual content matches the column's topic.
2. Classify the content into several groups based on quality.
3. Generalize the approach so that when any new topic/content comes in, we
have a good idea of how good its quality is and how relevant it is.
It is interesting how people like me learn about a new term: I always google
it, look at the results, and build an idea of what the topic might be. So,
first of all, I want to know whether we can do a similar kind of information
retrieval with a 3rd party API. Since there is so little information in each
cell of the column, it is difficult to do topic modeling the way you would
with a document. If we build a dictionary, we have to take N-grams into
consideration, and I don't know how to deal with that.
I am quite new to the data science world, so any input will be greatly
appreciated.

D**u
Posts: 288

My two cents: N-grams are not going to help much here.
You definitely need to build dictionaries, either for the good values or for
the trash, or both. The next step is then a "term frequency" calculation
problem. Do some research on TF-IDF or BM25, and don't be daunted by the
names; the algorithms are simple ways of counting and weighting term
frequency.
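
To make the dictionary plus term-frequency idea concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration, not something from the thread: the sample cells, the two dictionaries, the helper names (tokenize, idf_weights, cell_score, column_quality), the smoothed IDF formula, and the rule that a cell counts as "good" when its score is above zero. It treats each cell as a tiny document, weights each token by IDF, and compares the weight that falls in the "good" dictionary against the weight that falls in the "trash" dictionary.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_weights(cells):
    """Smoothed inverse document frequency, treating each cell as a document."""
    n = len(cells)
    doc_freq = Counter()
    for cell in cells:
        doc_freq.update(set(tokenize(cell)))
    return {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def cell_score(cell, good_terms, trash_terms, idf):
    """Score in [-1, 1]: positive leans 'good', negative leans 'trash'.

    Repeated tokens are counted each time they occur, so the sum is
    effectively term frequency times IDF.
    """
    good = trash = total = 0.0
    for term in tokenize(cell):
        weight = idf.get(term, 1.0)  # unseen terms get a neutral weight
        total += weight
        if term in good_terms:
            good += weight
        elif term in trash_terms:
            trash += weight
    if total == 0.0:
        return 0.0  # empty cell: no evidence either way
    return (good - trash) / total


def column_quality(cells, good_terms, trash_terms):
    """Fraction of cells whose score is positive (a crude column confidence)."""
    idf = idf_weights(cells)
    scores = [cell_score(c, good_terms, trash_terms, idf) for c in cells]
    return sum(s > 0 for s in scores) / len(cells)


# Hypothetical "store information" column and hand-built dictionaries.
cells = [
    "Walmart Supercenter store",
    "Target store",
    "N/A",
    "xx123 test",
    "Costco Wholesale store",
]
good_terms = {"store", "supercenter", "wholesale"}
trash_terms = {"n", "a", "test", "xx123"}

idf = idf_weights(cells)
for cell in cells:
    print(f"{cell!r}: {cell_score(cell, good_terms, trash_terms, idf):+.2f}")
print("column quality:", column_quality(cells, good_terms, trash_terms))
```

The per-cell score can be bucketed into quality groups with thresholds of your choosing, and BM25 could be swapped in for the IDF weighting if cell length varies a lot.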
