由买买提看人间百态

boards

本页内容为未名空间相应帖子的节选和存档,一周内的贴子最多显示50字,超过一周显示500字 访问原贴
Programming版 - python code performance --- normal or too slow?
相关主题
How to have another func call printf with va_arg list ?再请教一个c++的简单问题(enumerated constant)
Regular Expressions Cookbook, 2nd Editiontypedef 的一个问题
How to Parsing function in haskell?又问几个c语言编程的题目
parsing bibliography and sorting (转载)xml schema beginner question
问java api的问题Shuffle performance (C#)
parsing file in node: js or python ?10个包子请教一个简单的编程问题
请教一个parser的问题python question
some interview questions求助!这段code如何使用?
相关话题的讨论汇总
话题: geno话题: genotype话题: alt话题: chrs话题: ref
进入Programming版参与讨论
1 (共1页)
i***r
发帖数: 1035
1
file is 2.5GB with 18,217,166 lines
my python script took about 20-30 minutes to finish
seems slow?
Thanks!!
input file data structure (showing first two lines, wrapped):
chromo pos ref alt dc1 dc2 dc3 dtm bas din
crw itb ptw spw isw irw inw ru1 ru2
ru3 im1 im2 im3 im4 xj1 xj2 qh1 qh2
ti1 ti2 glw mxa rwa ysa ysb ysc cac jaa
jac
chr01 242806 G T 0/0 0/0 . 0/0 0/0 0/0
0/0 0/0 0/0 0/0 0/0 0/0 0/0 . 0/0
0/0 0/0 . 0/0 0/0 0/0 0/0 0/0 0/0 0/
0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0
0/0
my python code is to:
1. parse the header and produce first file
2. parse the body and translate 0s and 1s to ATGC etc to produce second file
.
import sys
def geno_to_base(ref, alt, genotype):
assert len(genotype) == 3, "genotype not in 0/1 format"
allele1 = alt if genotype[0] else ref
allele2 = alt if genotype[-1] else ref
return '{} {} '.format(allele1, allele2)
def translate_geno(ref, alt, genotype):
'''genotype needs to be either . or 0/0 format'''
return '0 0 ' if genotype == '.' else geno_to_base(ref, alt, genotype)
def line_parse(line):
chrs, pos, ref, alt, *geno = line.split()
all_genotype = [translate_geno(ref, alt, g) for g in geno]
return chrs, pos, ''.join(all_genotype)
if __name__ == "__main__":
fn = sys.argv[1] # required
fin = open(fn)
tfam = open('out.tfam','w')
tped = open('out.tped', 'w')
# write tfam
header = next(fin)
for i,h in enumerate(header.split()[4:]):
tfam.write('{}t{}t0t0t0t0n'.format(i,h))
# write tped
morgan = 0
for i,l in enumerate(fin):
rs_id = 'snp{}'.format(i+1)
chrs, pos, all_geno = line_parse(l)
chrs = int(chrs[3:]) # only need the number
tped.write( '{} {} {} {} {}n'.format(chrs, rs_id, morgan, pos, all_
geno) )
tfam.close()
tped.close()
w****w
发帖数: 521
2
慢,你这里每一行format和if重复太多。
每行开始的时候建个dictionary:
geno['.']='0 0'
geno['0/0']='{} {}'.format(f[2],f[2])
geno['0/1']='{} {}'.format(f[2],f[3])
geno['1/0']='{} {}'.format(f[2],f[3])
geno['1/1']='{} {}'.format(f[3],f[3])
n***e
发帖数: 723
3
Your i/o could be very slow if you keep reading & writing small blocks of
data each loop. I don't know about python i/o but most programming languages
should already contain stream buffering technology for this issue. To test
it, you can try to limit your i/o access by read & write a bigger block (say
1000 lines) every time, see if it can make a difference.
i***r
发帖数: 1035
4
Thanks. Sounds a good idea.
I benchmarked the code again, it took 20 minutes to finish

【在 w****w 的大作中提到】
: 慢,你这里每一行format和if重复太多。
: 每行开始的时候建个dictionary:
: geno['.']='0 0'
: geno['0/0']='{} {}'.format(f[2],f[2])
: geno['0/1']='{} {}'.format(f[2],f[3])
: geno['1/0']='{} {}'.format(f[2],f[3])
: geno['1/1']='{} {}'.format(f[3],f[3])

i***r
发帖数: 1035
5
Thanks!!
I will benchmark more and report my result here.

languages
test
say

【在 n***e 的大作中提到】
: Your i/o could be very slow if you keep reading & writing small blocks of
: data each loop. I don't know about python i/o but most programming languages
: should already contain stream buffering technology for this issue. To test
: it, you can try to limit your i/o access by read & write a bigger block (say
: 1000 lines) every time, see if it can make a difference.

n****1
发帖数: 1136
6
使用python>=3.2, 把前两个函数合并为:
import functools.lru_cache
@functools.lru_cache
def translate_geno2(ref, alt, genotype):
hsmap = {'0/0': ref+ref, '0/1': ref+alt, '1/0': alt+ref, '1/1': alt+alt,
'.': '0 0'}
return hsmap[genotype]
i***r
发帖数: 1035
7
不会用这个库,最后拿到
.decorating_function..wrapper at
0x7f2bdf52d560>
import sys
import functools
@functools.lru_cache
def translate_geno(g):
d = {'.':'0 0 ',
'0/0':'1 1 ',
'0/1':'1 2 ',
'1/0':'2 1 ',
'1/1':'2 2 '}
return d[g]
def line_parse(line):
chrs, pos, ref, alt, *geno = line.split()
all_genotype = [translate_geno(g) for g in geno]
===> all_genotype 就是我上面说的那个wrapper

alt,

【在 n****1 的大作中提到】
: 使用python>=3.2, 把前两个函数合并为:
: import functools.lru_cache
: @functools.lru_cache
: def translate_geno2(ref, alt, genotype):
: hsmap = {'0/0': ref+ref, '0/1': ref+alt, '1/0': alt+ref, '1/1': alt+alt,
: '.': '0 0'}
: return hsmap[genotype]

i***r
发帖数: 1035
8
下面这个scrip,用了dictionary,快了4~5倍
现在要6分钟
import sys
d = {'.':'0 0 ',
'0/0':'1 1 ',
'0/1':'1 2 ',
'1/0':'2 1 ',
'1/1':'2 2 '}
def line_parse(line):
chrs, pos, ref, alt, *geno = line.split()
all_genotype = [d[g] for g in geno]
return chrs, pos, ''.join(all_genotype)
if __name__ == "__main__":(以下和原来一样)
d******e
发帖数: 2265
9
最牛兄写的好像c 程序。
你可以用csv 包.看上去。

din

jaa

【在 i***r 的大作中提到】
: file is 2.5GB with 18,217,166 lines
: my python script took about 20-30 minutes to finish
: seems slow?
: Thanks!!
: input file data structure (showing first two lines, wrapped):
: chromo pos ref alt dc1 dc2 dc3 dtm bas din
: crw itb ptw spw isw irw inw ru1 ru2
: ru3 im1 im2 im3 im4 xj1 xj2 qh1 qh2
: ti1 ti2 glw mxa rwa ysa ysb ysc cac jaa
: jac

i***r
发帖数: 1035
10
我入门的时候用的C
~2年后,改matlab
又~2年后,改python
还是C思维,希望过渡到OOP。。。。
csv package以前用来输出过csv file,回头仔细看看

【在 d******e 的大作中提到】
: 最牛兄写的好像c 程序。
: 你可以用csv 包.看上去。
:
: din
:
: jaa

n****1
发帖数: 1136
11
6分钟不错了,2.5G的数据做个copy都得几分钟吧.
python里面最贴心的还是主流语言里独有的list comprehension和generator
expression, 用好了极爽无比. 个人项目里OOP倒不是那么重要.
推荐你看看:
http://docs.python.org/3.4/howto/functional.html
1 (共1页)
进入Programming版参与讨论
相关主题
求助!这段code如何使用?问java api的问题
Python做计算怎么只用一个核?parsing file in node: js or python ?
水木上python大坑啊: 疑Google员工把8w行Python项目用4w行Java重写了请教一个parser的问题
怎么用python把这个data读进去,变成dict结构some interview questions
How to have another func call printf with va_arg list ?再请教一个c++的简单问题(enumerated constant)
Regular Expressions Cookbook, 2nd Editiontypedef 的一个问题
How to Parsing function in haskell?又问几个c语言编程的题目
parsing bibliography and sorting (转载)xml schema beginner question
相关话题的讨论汇总
话题: geno话题: genotype话题: alt话题: chrs话题: ref