c******n (posts: 4965)
My wife was spending a lot of time sorting the bibliography of her thesis.
Because her bib came in plain-text form, I first have to parse out the first
author's last name, so I wrote this little piece of code. Hope it will be
useful for someone else too.
Right now it fails on single-author entries, because it's difficult to tell
a human name from other words. For biology papers, though, most entries have
multiple authors.
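use strict;
use warnings;

# Return the sort key for one bibliography line: the first author's last name.
sub get_first_author {
    my ($line) = @_;
    # Split on a comma, the word "and", or a digit; keep the first two fields.
    my ($author, $second_possible) = split /,|\band\b|\d/, $line, 3;
    my $lastname = find_author_lastname($author);
    # If the first field held only initials, try the second field instead.
    return $lastname ne '' ? $lastname : find_author_lastname($second_possible);
}

# Pick the most likely last name from one author field ('' if none is found).
sub find_author_lastname {
    my ($author) = @_;
    return '' unless defined $author;
    my @candidates_for_last;
    foreach my $s (split /[, ]+/, $author) {
        next if uc($s) eq $s;                  # all upper case (or no letters)
        next if length($s) == 1;               # only a single letter
        next if $s =~ /^([[:alpha:]]\.)+$/;    # A.B.C. pattern
        push @candidates_for_last, $s;
    }
    # Take the longest remaining word as the last name.
    @candidates_for_last = sort { length($b) <=> length($a) } @candidates_for_last;
    return @candidates_for_last ? $candidates_for_last[0] : '';
}

# Read entries (one per line) from STDIN or from files named on the command
# line, and print them sorted by first-author last name.
print join "", map { $_->[1] } sort { $a->[0] cmp $b->[0] }
    map { [ get_first_author($_), $_ ] } <>;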
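To see how the separator split behaves, here is a minimal sketch that prints the first two fields of a line under two patterns: a bare `and` and a word-bounded `\band\b`. The helper name `leading_fields` and both sample entries are made up for illustration; they are not taken from the real bibliography.

```perl
use strict;
use warnings;

# Return the first two fields of $line when split on $sep (hypothetical helper).
sub leading_fields {
    my ($line, $sep) = @_;
    my @fields = split $sep, $line, 3;
    # Missing fields come back as undef; turn them into empty strings.
    return map { defined $_ ? $_ : '' } @fields[0, 1];
}

# Invented sample entries, for illustration only.
my @samples = (
    "Sandberg, J.K., Lee, S. and Wong, T. 2005. A title.\n",
    "Darwin C. On the origin of species. 1859.\n",
);

for my $line (@samples) {
    # Compare a bare "and" with a word-bounded "and".
    for my $sep (qr/,|and|\d/, qr/,|\band\b|\d/) {
        my ($first, $second) = leading_fields($line, $sep);
        print "$sep => [$first] [$second]\n";
    }
}
```

With the bare pattern, the "and" inside "Sandberg" acts as a separator, so the first field is just `S` and the second is `berg`. The word-bounded pattern keeps `Sandberg` whole. For the single-author line, the first field runs all the way to the year under either pattern, so the title words compete with the name. That is the single-author failure described above.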