python code performance --- normal or too slow? - Programming版 - 未名存档

本页内容为未名空间相应帖子的节选和存档，一周内的贴子最多显示50字，超过一周显示500字访问原贴

Programming版 - python code performance --- normal or too slow?

相关主题
● How to have another func call printf with va_arg list ?	● 再请教一个c++的简单问题（enumerated constant）
● Regular Expressions Cookbook, 2nd Edition	● typedef 的一个问题
● How to Parsing function in haskell?	● 又问几个c语言编程的题目
● parsing bibliography and sorting (转载)	● xml schema beginner question
● 问java api的问题	● Shuffle performance (C#)
● parsing file in node: js or python ?	● 10个包子请教一个简单的编程问题
● 请教一个parser的问题	● python question
● some interview questions	● 求助！这段code如何使用？

相关话题的讨论汇总
话题: geno话题: genotype话题: alt话题: chrs话题: ref

进入Programming版参与讨论

1

(共1页)

i***r 发帖数: 1035	1 file is 2.5GB with 18,217,166 lines my python script took about 20-30 minutes to finish seems slow? Thanks!! input file data structure (showing first two lines, wrapped): chromo pos ref alt dc1 dc2 dc3 dtm bas din crw itb ptw spw isw irw inw ru1 ru2 ru3 im1 im2 im3 im4 xj1 xj2 qh1 qh2 ti1 ti2 glw mxa rwa ysa ysb ysc cac jaa jac chr01 242806 G T 0/0 0/0 . 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 . 0/0 0/0 0/0 . 0/0 0/0 0/0 0/0 0/0 0/0 0/ 0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 my python code is to: 1. parse the header and produce first file 2. parse the body and translate 0s and 1s to ATGC etc to produce second file . import sys def geno_to_base(ref, alt, genotype): assert len(genotype) == 3, "genotype not in 0/1 format" allele1 = alt if genotype[0] else ref allele2 = alt if genotype[-1] else ref return '{} {} '.format(allele1, allele2) def translate_geno(ref, alt, genotype): '''genotype needs to be either . or 0/0 format''' return '0 0 ' if genotype == '.' else geno_to_base(ref, alt, genotype) def line_parse(line): chrs, pos, ref, alt, *geno = line.split() all_genotype = [translate_geno(ref, alt, g) for g in geno] return chrs, pos, ''.join(all_genotype) if __name__ == "__main__": fn = sys.argv[1] # required fin = open(fn) tfam = open('out.tfam','w') tped = open('out.tped', 'w') # write tfam header = next(fin) for i,h in enumerate(header.split()[4:]): tfam.write('{}t{}t0t0t0t0n'.format(i,h)) # write tped morgan = 0 for i,l in enumerate(fin): rs_id = 'snp{}'.format(i+1) chrs, pos, all_geno = line_parse(l) chrs = int(chrs[3:]) # only need the number tped.write( '{} {} {} {} {}n'.format(chrs, rs_id, morgan, pos, all_ geno) ) tfam.close() tped.close()
w****w 发帖数: 521	2 慢，你这里每一行format和if重复太多。每行开始的时候建个dictionary: geno['.']='0 0' geno['0/0']='{} {}'.format(f[2],f[2]) geno['0/1']='{} {}'.format(f[2],f[3]) geno['1/0']='{} {}'.format(f[2],f[3]) geno['1/1']='{} {}'.format(f[3],f[3])
n***e 发帖数: 723	3 Your i/o could be very slow if you keep reading & writing small blocks of data each loop. I don't know about python i/o but most programming languages should already contain stream buffering technology for this issue. To test it, you can try to limit your i/o access by read & write a bigger block (say 1000 lines) every time, see if it can make a difference.
i***r 发帖数: 1035	4 Thanks. Sounds a good idea. I benchmarked the code again, it took 20 minutes to finish 【在 w****w 的大作中提到】 : 慢，你这里每一行format和if重复太多。 : 每行开始的时候建个dictionary: : geno['.']='0 0' : geno['0/0']='{} {}'.format(f[2],f[2]) : geno['0/1']='{} {}'.format(f[2],f[3]) : geno['1/0']='{} {}'.format(f[2],f[3]) : geno['1/1']='{} {}'.format(f[3],f[3])
i***r 发帖数: 1035	5 Thanks!! I will benchmark more and report my result here. languages test say 【在 n***e 的大作中提到】 : Your i/o could be very slow if you keep reading & writing small blocks of : data each loop. I don't know about python i/o but most programming languages : should already contain stream buffering technology for this issue. To test : it, you can try to limit your i/o access by read & write a bigger block (say : 1000 lines) every time, see if it can make a difference.
n****1 发帖数: 1136	6 使用python>=3.2, 把前两个函数合并为： import functools.lru_cache @functools.lru_cache def translate_geno2(ref, alt, genotype): hsmap = {'0/0': ref+ref, '0/1': ref+alt, '1/0': alt+ref, '1/1': alt+alt, '.': '0 0'} return hsmap[genotype]
i***r 发帖数: 1035	7 不会用这个库，最后拿到 .decorating_function..wrapper at 0x7f2bdf52d560> import sys import functools @functools.lru_cache def translate_geno(g): d = {'.':'0 0 ', '0/0':'1 1 ', '0/1':'1 2 ', '1/0':'2 1 ', '1/1':'2 2 '} return d[g] def line_parse(line): chrs, pos, ref, alt, geno = line.split() all_genotype = [translate_geno(g) for g in geno] ===> all_genotype 就是我上面说的那个wrapper alt, 【在 n***1 的大作中提到】 : 使用python>=3.2, 把前两个函数合并为： : import functools.lru_cache : @functools.lru_cache : def translate_geno2(ref, alt, genotype): : hsmap = {'0/0': ref+ref, '0/1': ref+alt, '1/0': alt+ref, '1/1': alt+alt, : '.': '0 0'} : return hsmap[genotype]
i***r 发帖数: 1035	8 下面这个scrip，用了dictionary，快了4~5倍现在要6分钟 import sys d = {'.':'0 0 ', '0/0':'1 1 ', '0/1':'1 2 ', '1/0':'2 1 ', '1/1':'2 2 '} def line_parse(line): chrs, pos, ref, alt, *geno = line.split() all_genotype = [d[g] for g in geno] return chrs, pos, ''.join(all_genotype) if __name__ == "__main__":（以下和原来一样）
d******e 发帖数: 2265	9 最牛兄写的好像c 程序。你可以用csv 包.看上去。 din jaa 【在 i***r 的大作中提到】 : file is 2.5GB with 18,217,166 lines : my python script took about 20-30 minutes to finish : seems slow? : Thanks!! : input file data structure (showing first two lines, wrapped): : chromo pos ref alt dc1 dc2 dc3 dtm bas din : crw itb ptw spw isw irw inw ru1 ru2 : ru3 im1 im2 im3 im4 xj1 xj2 qh1 qh2 : ti1 ti2 glw mxa rwa ysa ysb ysc cac jaa : jac
i***r 发帖数: 1035	10 我入门的时候用的C ~2年后，改matlab 又~2年后，改python 还是C思维，希望过渡到OOP。。。。 csv package以前用来输出过csv file，回头仔细看看【在 d******e 的大作中提到】 : 最牛兄写的好像c 程序。 : 你可以用csv 包.看上去。 : : din : : jaa
n****1 发帖数: 1136	11 6分钟不错了,2.5G的数据做个copy都得几分钟吧. python里面最贴心的还是主流语言里独有的list comprehension和generator expression, 用好了极爽无比. 个人项目里OOP倒不是那么重要. 推荐你看看: http://docs.python.org/3.4/howto/functional.html

1

(共1页)

进入Programming版参与讨论

相关主题
● 求助！这段code如何使用？	● 问java api的问题
● Python做计算怎么只用一个核？	● parsing file in node: js or python ?
● 水木上python大坑啊：疑Google员工把8w行Python项目用4w行Java重写了	● 请教一个parser的问题
● 怎么用python把这个data读进去，变成dict结构	● some interview questions
● How to have another func call printf with va_arg list ?	● 再请教一个c++的简单问题（enumerated constant）
● Regular Expressions Cookbook, 2nd Edition	● typedef 的一个问题
● How to Parsing function in haskell?	● 又问几个c语言编程的题目
● parsing bibliography and sorting (转载)	● xml schema beginner question

相关话题的讨论汇总
话题: geno话题: genotype话题: alt话题: chrs话题: ref

未名新帖统计// 7月16日

#	版面	帖数(主题数)
-	全站	4871 (796)
1	Military	3777 (569)
2	Stock	341 (51)
3	Joke	117 (17)
4	History	116 (3)
5	Automobile	100 (9)
6	USANews	55 (9)
7	Midlife	45 (1)
8	Headline	41 (41)
9	Dreamer	33 (13)
10	FleaMarket	32 (20)
11	Living	30 (7)

* 这里只显示发帖超过25的版面，努力灌水吧:-)