Problem fetching mitbbs pages with Python urlopen - Programming board - 未名存档

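This page is an excerpt and archive of the corresponding thread on 未名空间 (mitbbs). Posts less than a week old show at most 50 characters; posts older than a week show 500 characters.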

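g******e Posts: 352	1 I'm trying to fetch mitbbs pages with Python urlopen, on Windows, without any extra encode/decode. I've hit a strange problem: some thread pages download fine, but others come back as garbage, and chardet can't tell what encoding those are (it returns None). For the pages that do download correctly, chardet correctly returns gb2312. If I add content.decode('gb2312').encode(type), it fails with: 'gb2312' codec can't decode bytes in position 1-2: illegal multibyte sequence. All mitbbs pages should be gb2312-encoded, shouldn't they? Could some expert take a look at where the problem is? Thanks, baozi (forum reward) offered in return.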
r****t Posts: 10904	2 Post an example that shows the problem. Usually massaging it a bit with a regex is enough.
g******e Posts: 352	3 Here is the code, thanks:
    import urllib
    import urllib2
    import sys
    import chardet
    response = urllib.urlopen('http://www.mitbbs.com/article_t/Programming/31190605.html')
    content = response.read()
    print content
    chardet.detect(content)
    type = sys.getfilesystemencoding()
    print content.decode('gb2312').encode(type)
[Quoting r****t:]
: Post an example that shows the problem. Usually massaging it a bit with a regex is enough.
d****e Posts: 251	4 Check out the FAQ: http://chardet.feedparser.org/docs/faq.html You should first respect the explicit encoding. The auto detection is inaccurate and non-standard.
[Quoting g******e:]
: I'm trying to fetch mitbbs pages with Python urlopen,
: on Windows, without any extra encode/decode.
: I've hit a strange problem: some thread pages download fine,
: but others come back as garbage, and chardet can't tell
: what encoding those are (it returns None). For the pages that do download correctly, chardet correctly returns
: gb2312.
: If I add content.decode('gb2312').encode(type),
: it fails with: 'gb2312' codec can't decode bytes in position 1-2: illegal multibyte
: sequence
: All mitbbs pages should be gb2312-encoded, shouldn't they?
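The advice to respect the explicit encoding first could be sketched roughly as follows. This is a minimal Python 3 sketch, not code from the thread (which runs on Python 2.6); the helper name `declared_charset` and the sample header and HTML values are made up for illustration. It looks for a charset in the HTTP Content-Type header, then in an HTML meta tag, and returns None when neither declares one, which is the point where a detector such as chardet might be tried as a fallback.

```python
import re
from email.message import Message

# Matches charset declarations inside <meta ...> tags, e.g.
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
_META_CHARSET = re.compile(rb'<meta[^>]*charset=["\']?([A-Za-z0-9_\-]+)', re.I)


def declared_charset(content_type, body):
    """Return the explicitly declared charset, or None if there is none.

    content_type: value of the HTTP Content-Type header (str or None).
    body: raw response bytes (hypothetical input, already decompressed).
    """
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()  # lowercased charset, or None
        if charset:
            return charset
    # Only scan the start of the document; declarations normally sit in <head>.
    match = _META_CHARSET.search(body[:4096])
    if match:
        return match.group(1).decode("ascii").lower()
    return None


print(declared_charset("text/html; charset=GB2312", b""))
```

The header is checked before the meta tag because the header is what the server actually declares for the transfer; the meta tag is only consulted when the header carries no charset.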
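g******e Posts: 352	5 Thanks for your reply. The problem isn't really with chardet: even without chardet, with the same Python code, a large share of the mitbbs pages I fetch with urlopen are garbage. They can't be printed at all, they are still garbage when saved to a file, and trying to decode them as gb2312 raises an error. A small share of mitbbs pages do print correctly, though, and I can't figure out where the problem is. My environment is Python 2.6 on Windows XP (Chinese edition). If it's convenient, could someone run these few lines of simple code on their machine? Do they print correctly?
    import urllib
    import urllib2
    import sys
    response = urllib.urlopen('http://www.mitbbs.com/article_t/Programming/31190605.html')
    content = response.read()
    print content
    type = sys.getfilesystemencoding()
    print content.decode('gb2312').encode(type)
[Quoting d****e:]
: Check out the FAQ: http://chardet.feedparser.org/docs/faq.html
: You should first respect the explicit encoding. The auto detection is
: inaccurate and non-standard.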
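One way to narrow down a failure like this is to look at the first raw bytes of the response instead of the printed text. The following is a Python 3 sketch (the thread itself uses Python 2.6); the function name `sniff_payload` and the sample bytes are illustrative assumptions, not something the posters wrote. Gzip streams start with the two bytes 0x1f 0x8b, and 0x8b is not a valid GB2312 lead byte, so a compressed body would plausibly fail to decode right after the first byte.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def sniff_payload(raw):
    """Return a short label describing what the raw bytes look like."""
    if raw[:2] == GZIP_MAGIC:
        return "gzip"
    try:
        raw.decode("gb2312")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte the codec rejected.
        return "not gb2312 (first bad byte at offset %d)" % exc.start
    return "gb2312"


sample_text = "中文页面".encode("gb2312")      # made-up page content
print(sniff_payload(sample_text))                 # plain gb2312 bytes
print(sniff_payload(gzip.compress(sample_text)))  # same bytes, gzipped
```

Checking the magic bytes before attempting a decode keeps the two failure modes (compressed body versus genuinely wrong charset) separate.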
d****e Posts: 251	6 Gzipped (maybe only some of them, based on your experience). You may use the gzip module. You know what to do from there. Good luck.
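What to do from there could look like the sketch below: decompress the body when it is gzipped, then decode it. It is written for Python 3; on Python 2.6, as used in the thread, the decompression step would instead go through gzip.GzipFile wrapped around a StringIO.StringIO of the raw bytes. The function name `decode_page`, its parameters, and the choice of errors="replace" are assumptions for illustration, not something the posters specified.

```python
import gzip


def decode_page(raw, content_encoding=None, charset="gb2312"):
    """Decompress a possibly gzipped HTTP body, then decode it to str.

    raw: bytes as returned by response.read().
    content_encoding: value of the Content-Encoding header, if known.
    charset: encoding to decode with; gb2312 is what the thread expects.
    """
    is_gzip = (content_encoding or "").lower() == "gzip" or raw[:2] == b"\x1f\x8b"
    if is_gzip:
        raw = gzip.decompress(raw)
    # errors="replace" substitutes U+FFFD for stray bytes instead of raising.
    return raw.decode(charset, errors="replace")


page = "<html>测试页面</html>".encode("gb2312")  # made-up page content
print(decode_page(gzip.compress(page)))           # compressed response
print(decode_page(page))                          # uncompressed response
```

With urllib in Python 3 the header value would come from response.headers.get("Content-Encoding"); in Python 2 the counterpart is response.info().getheader("Content-Encoding"). The magic-byte check is kept as a fallback for the case where the header is missing.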
g******e Posts: 352	7 Problem solved. Thanks a lot, here's your baozi. It was gzip.
[Quoting d****e:]
: gzipped (maybe some of them based on your experience), you may use the gzip
: module. You know what to do from there.
: Good luck.
