Java board - How do I extract the content between multiple [tag] pairs inside a [tag] on a web page?
Topics: div, string, extraction, drt, colspan
o********g
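Posts: 14
1
I have read in an HTML file and am now using a regular expression to extract the content inside [tag] elements. If the element contains only one paragraph (a single [tag] pair), the extraction succeeds. But when the element contains two or more paragraphs, nothing inside that whole block is extracted. Does anyone know how to solve this? Any advice is appreciated.
Here is the Java code: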
public static void main(String[] args) throws IOException {
    File source_file = new File("./data/page source.txt");
    FileReader fr = null;
    BufferedReader br = null;
    try {
        fr = new FileReader(source_file);
        br = new BufferedReader(fr);
    } catch (FileNotFoundException e2) {
        e2.printStackTrace();
    }
    String pageSource = null;
    // The HTML tags in this pattern were lost when the page was archived; [tag] marks each gap.
    String regEx = "[tag]([tag].+?)[tag]";

    int i = 0;
    while ((pageSource = br.readLine()) != null) {
        Pattern pat = Pattern.compile(regEx);
        Matcher mat = pat.matcher(pageSource);
        while (mat.find()) { // replace" "
            i++;
            System.out.println(i + " " + mat.group(1));
        }
    }
}
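The regular expression is: String regEx = "[tag](.+?)[tag]"; (the HTML tags in the pattern were lost in archiving and are marked [tag])
Below is a sample of the HTML source being matched: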

I am really excited about taking this course, because as a student, I
have always been very excited about teachers who made a point to incorporate
media and technology in the classroom.  For me, it has always made
learning a bit more fun.  Particularly, in high school, I remember
taking physics (which I really didn't enjoy) but my teacher used a
SmartBoard and incorporated our cell phones into our lessons, and it always
made it that much more intriguing.  I think it is important to learn
how to incorporate such technology into the classroom setting because it
will stimulate the students, even if they don't have a strong interest
in the particular subject.

At the same time, I hope that I am able
to keep up with classroom technology.  As a young teacher-hopeful, I am
very aware of technology in today's society and learning about all of
the new available technologies.  Yet, I know it is a very fast paced
market,and I hope to be able to keep learning as I continue into my career.&#160;


This block can be extracted.

The game I remember was the Oregon Trail. Maybe this is because no other
game stuck in my head or because my classroom was so into it this game that
it stuck in my head. 


The best part of this game was that everyone in my class was involved and
excited about the game. From the very extroverted to those who were not, it
allowed all of my classmates to have input and participate.


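This block cannot be extracted.
Is the problem in the regular expression? I would appreciate advice from anyone who knows this area. Many thanks!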
e*****t
Posts: 1005
2
I'm no regex expert. If it were me, I would just parse the HTML and handle it with XPath.
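
A minimal sketch of the parse-then-XPath idea, using only the JDK's built-in XML parser. It assumes the page is well-formed XML (XHTML) and that the paragraphs of interest are p elements directly inside div elements; both are assumptions, since the thread does not show the actual markup. The class name XPathParagraphs and the method paragraphsInDivs are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the text of every <p> that is a direct child of a <div>.
class XPathParagraphs {

    static List<String> paragraphsInDivs(String xhtml) {
        try {
            // Parse the markup into a DOM tree (works only for well-formed XML).
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Assumed structure: <div><p>...</p><p>...</p></div>
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//div/p", doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input could not be parsed as XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div><p>first block</p></div>"
                + "<div><p>second block, part one</p><p>second block, part two</p></div>"
                + "</body></html>";
        for (String paragraph : paragraphsInDivs(page)) {
            System.out.println(paragraph);
        }
    }
}
```

Real-world HTML is often not well-formed XML (unclosed tags, named entities such as &nbsp;), in which case this parser rejects the input and an HTML-aware parser is the more practical choice.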
m*****r
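Posts: 298
3
Wow, it looks like XPath would have been a good fit for the question I asked a couple of days ago...
http://www.mitbbs.com/article_t/Java/31141889.html

[e*****t wrote:]
: I'm no regex expert. If it were me, I would just parse the HTML and handle it with XPath.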
o********g
Posts: 14
4

OK, thanks for the suggestion. I'll give it a try.

[e*****t wrote:]
: I'm no regex expert. If it were me, I would just parse the HTML and handle it with XPath.
i**w
Posts: 883
5
// The HTML tags in this pattern were lost when the page was archived; [tag] marks each gap.
Pattern p = Pattern.compile("[tag]>(.+?)[tag]");
Matcher m = p.matcher(input);

if (m.matches()) {
    int cnt = m.groupCount();
    System.out.println(cnt);

    String g1 = m.group(1);
    System.out.println(g1);

    String g2 = m.group(2);
    System.out.println(g2);
}
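
If a regex is used anyway, one thing worth checking is that the loop in the original post matches one readLine() result at a time, so a block whose opening and closing tags sit on different lines of the file can never match; by default `.` does not match line terminators either. The sketch below works on the whole input as one string and uses Pattern.DOTALL. The div and p tag names, the class name DivParagraphExtractor, and the method extract are assumptions for illustration, and nested div elements are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: for each <div> block, collects the text of each <p> inside it.
class DivParagraphExtractor {

    // DOTALL lets '.' cross line breaks; \b keeps <div from matching e.g. <divider.
    private static final Pattern DIV = Pattern.compile(
            "<div\\b[^>]*>(.*?)</div>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p\\b[^>]*>(.*?)</p>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static List<List<String>> extract(String html) {
        List<List<String>> blocks = new ArrayList<>();
        Matcher div = DIV.matcher(html);
        while (div.find()) {
            // Second pass: find every paragraph inside this one block.
            List<String> paragraphs = new ArrayList<>();
            Matcher p = PARAGRAPH.matcher(div.group(1));
            while (p.find()) {
                paragraphs.add(p.group(1).trim());
            }
            blocks.add(paragraphs);
        }
        return blocks;
    }

    public static void main(String[] args) {
        // The second block spans several lines, which a line-by-line matcher would miss.
        String page = "<div><p>one</p></div>\n"
                + "<div><p>two</p>\n"
                + "<p>three</p></div>";
        System.out.println(extract(page));
    }
}
```

To run this against a file, the whole file can be read into one string first, for example with Files.readAllBytes, instead of being fed to the matcher line by line.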
o***e
Posts: 65
6
Have you tried jsoup?
b******y
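Posts: 9224
7
You need an HTML parsing tool such as jsoup.
Don't write regexes for this; I have spent a lot of time on this kind of work. You will find that even if what you write works, it is a pain to maintain later. It is not worth the trouble.
That is my takeaway from experience.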