Programming board - parsing bibliography and sorting (repost)
c******n
Posts: 4965
1
[The following text is reposted from the THU board]
From: creation (努力自由泳50m/45sec !), Board: THU
Subject: parsing bibliography and sorting
Site: BBS 未名空间站 (Sun Nov 11 13:27:54 2012, US Eastern time)
My wife was spending a lot of time sorting the bibliography of her thesis.
Because her bib came in plain-text form, I have to parse out the first
author's last name before sorting, so I wrote this little piece of code. Hope
it will be useful for someone too....
Right now it fails to parse single-author entries, because it's difficult to
tell a human name from other words. But biology papers mostly have multiple
authors.
use strict;
use warnings;

# Return the sort key for one bibliography line: the first author's last name.
sub get_first_author {
    my ($line) = @_;
    # Split on commas, the word "and", or digits. \b keeps "and" from
    # matching inside names such as "Anderson".
    my ($author, $second_possible) = split /,|\band\b|\d/, $line, 3;
    my $lastname = find_author_lastname($author);
    # Fall back to the second field, e.g. when the entry starts with a number.
    return $lastname ne '' ? $lastname : find_author_lastname($second_possible);
}

# Pick the longest word that does not look like initials; '' if none is found.
sub find_author_lastname {
    my ($author) = @_;
    return '' unless defined $author;
    my @candidates_for_last;
    foreach my $s (split /[, ]+/, $author) {
        next if uc($s) eq $s;                  # all upper case (or empty)
        next if length($s) == 1;               # only a single letter
        next if $s =~ /^([[:alpha:]]\.)+$/;    # A.B.C. pattern
        push @candidates_for_last, $s;
    }
    @candidates_for_last = sort { length($b) <=> length($a) } @candidates_for_last;
    return @candidates_for_last ? $candidates_for_last[0] : '';
}

# Decorate each input line with its key, sort by key, then print the lines.
print map { $_->[1] } sort { $a->[0] cmp $b->[0] } map { [ get_first_author($_), $_ ] } <>;
l*******s
Posts: 1258
2
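This can be a small problem or a big one.
On the small side: write a bunch of regular expressions and a few rules of
your own, and that should handle most cases.
On the big side: it is a typical Named Entity Recognition problem in NLP. The
mainstream approach is machine learning with some context features. You could
try an existing package, such as opennlp.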
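
As a rough illustration of the "regular expressions plus rules" route, here is a minimal Perl sketch that also covers single-author entries. It assumes (this is not stated in the original post) that every entry begins surname-first, like "Smith, J." or "Smith JA". The names leading_lastname and sort_bib and the list of surname particles are hypothetical, chosen only for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: return the leading surname of an entry, or '' if the
# line does not start with "Surname, F." / "Surname FM" (assumed input format).
sub leading_lastname {
    my ($line) = @_;
    if ($line =~ /^\s*                                   # optional indent
                  ((?:(?:van|von|de|der|den|la|le|di)\s+)*  # lowercase particles (illustrative list)
                   [[:upper:]][[:alpha:]'-]+)            # capitalized surname
                  [,\s]+                                 # comma and or space
                  (?:[[:upper:]]\.|[[:upper:]]{1,3}\b)   # "J." or "JA" style initials
                 /x) {
        return $1;
    }
    return '';
}

# Hypothetical helper: sort entries by lowercased surname using the same
# decorate, sort, undecorate pattern as the original one-liner.
sub sort_bib {
    my @entries = @_;
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { [ lc(leading_lastname($_)), $_ ] } @entries;
}
```

Entries whose surname is not recognized get an empty key and therefore sort to the top, where they are easy to spot and fix by hand. To run it over a file, something like print sort_bib(<>); would replace the last line of the original script.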