DataSciences board - [Data Science Project Case] Topic Learning
j*******g
Posts: 331
1
We have to work with quite a few badly messed-up datasets, mostly because
of bad ETL and sloppy client input. Within a single column, the content can
vary widely. For example, in the column "store information", a value could
be the store name, which is good, or just the brand, or the address, or a
short name like "ABC", or some meaningless code or string.
This looks like an unsupervised learning problem. There are several things
we want to achieve: 1, assess the quality of a given column, i.e. come up
with a probability or confidence level for how well the actual content
matches the column's topic. 2, classify the content into several groups
based on quality. 3, generalize the approach so that for any topic/content
pair that comes in, we get a good idea of how good the quality is and how
relevant the content is.
Think about how people like me find out about a new term: I always google
it, look at the results, and build an idea of what the topic might be. So
first of all, I want to know whether we can do a similar kind of
information retrieval with a 3rd-party API. Since each value in the column
carries so little information, it is difficult to do topic modeling the way
you would on a document. If we build a dictionary, we have to take N-grams
into consideration, and I don't know how to deal with that.
I am quite new to the data science world; any input will be greatly
appreciated.
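For goal 2, a first pass could be a few hand-written rules before any model is involved. The sketch below is only an illustration: the function name rough_group, the group labels, and the regex rules are all assumptions made up for this example, not something taken from the actual data, and real columns would need rules tuned to what they contain.

```python
import re


def rough_group(value: str) -> str:
    """Assign one column value to a coarse, hand-picked group (hypothetical labels)."""
    text = value.strip()
    if not text:
        return "empty"
    # A leading number followed by a word looks like a street address.
    if re.match(r"^\d+\s+\w+", text):
        return "address-like"
    # A single token that contains digits looks like an internal code.
    if " " not in text and re.search(r"\d", text):
        return "code-like"
    # Very short all-caps tokens look like abbreviations such as "ABC".
    if len(text) <= 4 and text.isupper():
        return "short-name"
    # Everything else is treated as a candidate store or brand name.
    return "name-like"


for sample in ["123 Main Street", "X9-77Q", "ABC", "Joe's Coffee Downtown"]:
    print(sample, "->", rough_group(sample))
```

Counting how many values fall into each group gives a crude per-column quality profile, which could serve as a baseline to compare a learned approach against.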
D**u
Posts: 288
2
My two cents: N-grams are not going to help much here.
You definitely need to build dictionaries, either for the good values or
for the trash, or both. The next step is then a "term frequency"
calculation problem. Do some research on TF-IDF or BM25; don't be daunted
by the names, the algorithms are essentially simple ways of counting and
weighting term frequency.
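To make the dictionary plus TF-IDF idea concrete, here is a minimal sketch in plain Python. It treats every cell of the column as a tiny document, computes a smoothed IDF per token, and scores a value by the share of its IDF-weighted tokens that appear in a "good" dictionary. The function names, the whitespace tokenizer, the smoothing formula, and the sample values are all assumptions chosen for illustration; BM25 would replace the weighting step with its own formula.

```python
import math
from collections import Counter


def tokenize(value):
    # Naive lowercase whitespace tokenizer (an assumption; real data may need more).
    return value.lower().split()


def idf_table(values):
    """Smoothed inverse document frequency, treating each cell as one document."""
    n = len(values)
    doc_freq = Counter()
    for v in values:
        doc_freq.update(set(tokenize(v)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in doc_freq.items()}


def dictionary_score(value, idf, good_terms, default_idf):
    """Share of the value's IDF-weighted term mass that is in the good dictionary."""
    counts = Counter(tokenize(value))
    weights = {t: c * idf.get(t, default_idf) for t, c in counts.items()}
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(w for t, w in weights.items() if t in good_terms) / total


# Hypothetical column content and a hand-built dictionary of good terms.
column_values = ["walmart supercenter", "walmart", "target store", "x9-77q", "abc"]
good_terms = {"walmart", "supercenter", "target", "store"}

idf = idf_table(column_values)
# Tokens never seen in the column get the weight of the rarest possible token.
unseen_idf = math.log(1 + len(column_values)) + 1.0

for v in column_values + ["walmart x9-77q"]:
    print(v, "->", round(dictionary_score(v, idf, good_terms, unseen_idf), 3))
```

A score near 1 means the value is made up of dictionary terms, a score near 0 means it is mostly unknown tokens, and averaging the scores over a column gives one rough confidence number for that column.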