DataSciences board - [Data Science Project Case] Parsing URLs
c***z
Posts: 6348
1
This is something I am working on, and I would like to hear any ideas you might have.
Say we have millions of product names, such as "Xbox 360", "Playstation 4", etc.
We want to extract (tokenize) meaningful information from billions of URLs (click history), and we want to distinguish the 360 in "Xbox 360" (useful) from the 360 in session ids (garbage).
For example, given
www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123&ref=hello%09there
the first 09 is a size (keep) and the second 09 is garbage (drop).
We want: amazon nike running shoes 09 mens buy hello there; and we want to drop abc and 123, as well as the second 09.
Due to the size of the data, manually checking the names is impossible. Does anyone have a clue?
I am thinking about a hash table, but that means the parsing time rises from O(1) to O(N), and N is in the millions!
Thanks!
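For illustration, a minimal sketch of the lookup idea (the product list and helper names below are invented): build a token set from the known product names once, then check each URL token against it. Membership tests on a Python set are O(1) on average, so the per-URL cost depends on the URL length, not on the number of product names.

# Minimal sketch (invented example data): look URL tokens up in a set
# built from the known product names; set membership is O(1) on average.
import re
from urllib.parse import unquote

product_names = ["Xbox 360", "Playstation 4", "Nike running shoes"]   # hypothetical list
vocab = {tok.lower() for name in product_names for tok in name.split()}

def tokenize(url):
    # percent-decode and split into lowercase word/number tokens
    return re.findall(r"[a-z0-9]+", unquote(url).lower())

url = "www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123&ref=hello%09there"
print([tok for tok in tokenize(url) if tok in vocab])   # ['nike', 'running', 'shoes']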
I******y
Posts: 176
2
Not sure whether I understand correctly, just a couple of rough thoughts:
It seems you could classify the URLs by their patterns and then extract the things you want.
Going by your example, if URLs under the amazon domain all follow the pattern domain/brand/item%size/...,
then once that pattern is known you can pull out the pieces you need.
c***z
Posts: 6348
3
Sounds good, will take a look at the patterns. Thanks a lot!
l******n
Posts: 9344
4
Just use regular expression matching.


[Quoting I******y:]
: Not sure whether I understand correctly, just a couple of rough thoughts:
: It seems you could classify the URLs by their patterns and then extract the things you want.
: Going by your example, if URLs under the amazon domain all follow the pattern domain/brand/item%size/...,
: then once that pattern is known you can pull out the pieces you need.

c***z
Posts: 6348
5
Can you give more details?
I did regular expression matching in R on company names; it was a pain in the
butt, and that was only 10k names...
r*******y
Posts: 626
6
One way is to reconstruct a clickstream that leads to a sale. From the item
sold, you can make sense of the URLs clicked along the way.


[Quoting c***z:]
: This is something I am working on and would like to hear if you have any
: clue.
: Say we have millions of product names, such as "Xbox 360", "Playstation 4",
: etc.
: We want to extract (tokenize) meaningful information from billions of URLs (
: click history), and want to distinguish the 360 in "Xbox 360" (useful) and
: the 360 in session ids (garbage).
: For example, given
: www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123&ref=hello%09there
: The first 09 is size (keep) and the second 09 is garbage (drop)

b**L
Posts: 646
7
running-shoes%09mens is not a size 9; %dd is a percent-encoded ASCII code (here %09 is a tab).
So it should be easy to parse these URLs with a regex.
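A rough sketch of that regex idea (the pattern below is tailored to the single example URL and is only a guess at one amazon-style layout; a real system would need one pattern per site):

# Rough sketch: decode the percent-escapes, then pull out path segments
# with a regex and ignore the query string. The pattern is illustrative only.
import re
from urllib.parse import unquote

url = "www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123&ref=hello%09there"
decoded = unquote(url)                      # %09 becomes a literal tab
pattern = re.compile(r"^(?P<domain>[^/]+)/(?P<brand>[^/]+)/(?P<item>[^/]+)/(?P<action>[^/?]+)")
m = pattern.match(decoded)
if m:
    print(m.group("domain"), m.group("brand"), m.group("action"))   # www.amazon.com nike buy
    print(m.group("item").split("\t"))                              # ['running-shoes', 'mens']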
c***z
Posts: 6348
8
Thanks for the input, everyone. I'm really not familiar with URLs, heh; I clearly still have a lot to learn.
b*****o
Posts: 715
9
I don't really understand what you're asking.
In your example, both occurrences of %09 are just percent-escaped characters:
import urllib
urllib.unquote("www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123&ref=hello%09there")
'www.amazon.com/nike/running-shoes\tmens/buy?q=abc&x=123&ref=hello\tthere'
Also, why drop q=... and x=... but keep ref=...? Functionally there is no difference between them;
they are all params of the GET request. Or do you have a whitelist/blacklist of params?
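If a param whitelist is indeed the plan, a minimal sketch (Python 3 this time, using urllib.parse; the whitelist itself is made up) could look like:

# Minimal sketch: split a URL into host, decoded path tokens, and only the
# whitelisted query params. KEEP_PARAMS is a hypothetical whitelist.
from urllib.parse import urlsplit, parse_qsl, unquote

KEEP_PARAMS = {"ref"}    # hypothetical: keep ref, drop q and x

def split_url(url):
    parts = urlsplit("//" + url)    # prepend "//" so the host parses as netloc
    path_tokens = [unquote(seg) for seg in parts.path.split("/") if seg]
    kept_query = [(k, v) for k, v in parse_qsl(parts.query) if k in KEEP_PARAMS]
    return parts.netloc, path_tokens, kept_query

print(split_url("www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123&ref=hello%09there"))
# ('www.amazon.com', ['nike', 'running-shoes\tmens', 'buy'], [('ref', 'hello\tthere')])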


[Quoting c***z:]
: This is something I am working on and would like to hear if you have any
: clue.
: Say we have millions of product names, such as "Xbox 360", "Playstation 4",
: etc.
: We want to extract (tokenize) meaningful information from billions of URLs (
: click history), and want to distinguish the 360 in "Xbox 360" (useful) and
: the 360 in session ids (garbage).
: For example, given
: www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123&ref=hello%09there
: The first 09 is size (keep) and the second 09 is garbage (drop)

c***z
Posts: 6348
10
Ah, I just realized that I know too little about URL parsing. I will ask the
engineers so that I can phrase the question more intelligently.


[Quoting b*****o:]
: I don't really understand what you're asking.
: In your example, both occurrences of %09 are just percent-escaped characters:
: import urllib
: urllib.unquote("www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123&ref=hello%09there")
: 'www.amazon.com/nike/running-shoes\tmens/buy?q=abc&x=123&ref=hello\tthere'
: Also, why drop q=... and x=... but keep ref=...? Functionally there is no difference between them;
: they are all params of the GET request. Or do you have a whitelist/blacklist of params?

l******0
Posts: 244
11
we have millions of product names, such as "Xbox 360"
--- Are the 'millions of product names' known and stored in your database, or unknown?
Do you want to extract company name --> product name from the URL, or something else?
My first impression is to sort all the URL lines alphabetically, which would
make it much easier to spot the different URL patterns used by different sites.
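A trivial sketch of that sorting idea (the sample URLs are invented): group by host first, then sort, so each site's path layout lines up.

# Trivial sketch: group invented URLs by host and sort, so per-site
# patterns become easy to eyeball.
from collections import defaultdict
from urllib.parse import urlsplit

urls = [
    "www.amazon.com/nike/running-shoes%09mens/buy?q=abc",
    "www.bestbuy.com/site/xbox-360-console/1234.p",
    "www.amazon.com/adidas/soccer-ball%095/buy?q=xyz",
]

by_host = defaultdict(list)
for u in urls:
    by_host[urlsplit("//" + u).netloc].append(u)

for host in sorted(by_host):
    for u in sorted(by_host[host]):
        print(u)    # URLs from the same site now sit next to each other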


[Quoting c***z:]
: This is something I am working on and would like to hear if you have any
: clue.
: Say we have millions of product names, such as "Xbox 360", "Playstation 4",
: etc.
: We want to extract (tokenize) meaningful information from billions of URLs (
: click history), and want to distinguish the 360 in "Xbox 360" (useful) and
: the 360 in session ids (garbage).
: For example, given
: www.amazon.com/nike/running-shoes%09mens/buy?q=abc&x=123&ref=hello%09there
: The first 09 is size (keep) and the second 09 is garbage (drop)

l*******s
Posts: 1258
12
this is a sequence labeling task:
a URL is a sequence, and your task is to find the meaningful terms within it.
It is similar to the named entity recognition task.
You can read some papers about it.
Models: CRF, MEMM, HMM
Training data: manually label them
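To make the "manually label them" step concrete, a hypothetical labeled example might pair each character of a URL with a tag marking whether it sits inside a meaningful term (the B/I/O scheme is spelled out in the next post):

# Hypothetical hand-labeled example: one tag per character of the URL,
# B/I for characters inside a meaningful term, O for everything else.
url  = "amazon.com/nike/buy"
tags = "BIIIIIOOOOOBIIIOBII"
print(list(zip(url, tags))[:6])   # [('a', 'B'), ('m', 'I'), ('a', 'I'), ('z', 'I'), ('o', 'I'), ('n', 'I')]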
l*******s
Posts: 1258
13
cont:
Use the tags B, I, O to indicate the beginning, inside, and outside of a word.
Each character in the URL is assigned a tag: B, I, or O.
Then this becomes a classification task with just 3 class labels: B/I/O.
Grab any classifier you want; mine is MaxEnt.
Feature engineering:
Convert each character to a feature vector. The most helpful features will be:
the character n-grams before and after the current character, the length of the URL,
whether the neighboring characters contain a digit or a letter, and of course
the current character itself.
Model training and decoding:
This step is pretty simple, exactly the same as any other classification task.
Tip: use some post-processing rules to improve the output.
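A toy sketch of the character-level setup just described (the two hand-labeled URLs are invented, and scikit-learn's multinomial LogisticRegression stands in for MaxEnt, to which it is equivalent):

# Toy sketch: per-character features + a 3-class (B/I/O) classifier.
# Training data is invented; a real system would need far more labels.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def char_features(s, i):
    # the current character, its neighbors, and digit/letter cues nearby
    window = s[max(0, i - 2): i + 3]
    return {
        "char": s[i],
        "prev": s[i - 1] if i > 0 else "^",
        "next": s[i + 1] if i + 1 < len(s) else "$",
        "digit_near": any(c.isdigit() for c in window),
        "alpha_near": any(c.isalpha() for c in window),
        "position": i / len(s),
    }

train = [                                    # (url, one tag per character)
    ("amazon.com/nike/buy?q=abc", "BIIIIIOOOOOBIIIOBIIOOOOOO"),
    ("shop.com/xbox-360?x=99",    "BIIIOOOOOBIIIIIIIOOOOO"),
]

X, y = [], []
for url, tags in train:
    for i in range(len(url)):
        X.append(char_features(url, i))
        y.append(tags[i])

model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)

test = "amazon.com/sony/buy?q=zzz"
print("".join(model.predict([char_features(test, i) for i in range(len(test))])))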