What format is typically used to store Molecular Index data? - Biology board - MITBBS (未名) archive

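n******7 (posts: 12463)	1
Many applications now use a Molecular Index (MI). Once the raw FASTQ has been demultiplexed by MI, what format is this kind of data usually stored in?
I have seen people use BAM, presumably to record the MI in BAM's very flexible tags, and many tools can extract that information. But BAM feels like overkill to me, since there is no alignment information in the file at all.
I have also heard of people splitting the FASTQ by MI into many subfolders, which then caused problems.
I am thinking of storing the MI directly in the FASTQ ID line, since the data is essentially still FASTQ. The drawback is that recording the MI this way is fairly ad hoc and not a common standard.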
s******s (posts: 13035)	2 (in reply to n******7)
Do you mean the index sequence? FASTQ already supports that.
n******7 (posts: 12463)	3 (in reply to s******s)
Thanks, I was actually hoping you would be the one to answer :)
I looked it up. This is the format used since Illumina Casava 1.8, with the index at the end of the ID line:
With Casava 1.8 the format of the '@' line has changed:
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
EAS139	the unique instrument name
136	the run id
FC706VJ	the flowcell id
2	flowcell lane
2104	tile number within the flowcell lane
15343	'x'-coordinate of the cluster within the tile
197393	'y'-coordinate of the cluster within the tile
1	the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y	Y if the read is filtered, N otherwise
18	0 when none of the control bits are on, otherwise it is an even number
ATCACG	index sequence
I checked, and this field is generated automatically when demultiplexing by i7/i5. What I want to handle is an inline barcode sequence, which is a different thing.
This does raise another question, though: is the information in the FASTQ ID actually useful? I do not think I have ever paid attention to read IDs. The only part that might matter is the paired-end /1 and /2 suffix. I believe some older code used it to tell the two reads apart, but nowadays the two reads are stored in separate files.
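For reference, the field layout listed above can be split apart with a few lines of Python. This is a minimal sketch that assumes the header has exactly the seven-plus-four colon-separated fields shown in the example; the class and function names are invented for illustration.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    read: int            # member of a pair, 1 or 2
    is_filtered: bool    # True when the filter flag is 'Y'
    control_bits: int
    index: str           # index sequence


def parse_casava18_header(line: str) -> IlluminaHeader:
    """Parse an '@' line laid out as in the Casava 1.8 example above."""
    # The read name and the comment are separated by a single space.
    name, comment = line.strip().lstrip("@").split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = name.split(":")
    read, filtered, control, index = comment.split(":")
    return IlluminaHeader(
        instrument, int(run_id), flowcell, int(lane), int(tile),
        int(x), int(y), int(read), filtered == "Y", int(control), index,
    )


header = parse_casava18_header(
    "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
)
```

Here `header.index` holds the index sequence and `header.read` the pair member, which are the two fields the thread treats as potentially useful.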
s******s (posts: 13035)	4 (in reply to n******7)
Yes, most paired-end data comes as two files now.
I think most tools discard this information by default during alignment. In theory you could use it for batch effect analysis, but people probably feel the data volume is large enough that this kind of fine-tuning is unnecessary and BQSR is sufficient. Lately I have even heard that instrument quality is now good enough that BQSR may not be needed either.
Broad is a loyal BAM user. Reportedly, the output of their machines goes straight to unaligned BAM, with no intermediate FASTQ stage at all. Is anyone from BI here who can confirm?
By the way, Stanford seems to have put out a paper recently saying that the new chemistry on the HiSeq 4000 has a problem, with a 5%-10% error rate when multiplexing. I wonder whether Illumina will come out and distance itself from it.
n******7 (posts: 12463)	5 (in reply to s******s)
Thanks.
As I recall, BAM does not record the FASTQ ID line, so I have decided to just do it my own way.
Broad does like BAM. The person I mentioned earlier who stores demultiplexed reads in BAM came out of Broad. I have not read the source yet, but it looks like it is built on Picard. I asked whether fastq.gz could be used instead, and he said FASTQ is only a temporary format.
For storing sequences alone, I still prefer fastq.gz. It is simple and clear, and compatible with every read-processing tool; at most you pipe it through gzip. Unaligned BAM should have a similar compression ratio, but most third-party tools downstream do not support it. My guess is that Broad likes to build the whole toolchain themselves.
The HiSeq 4000 error rate you mentioned is alarming. Illumina will surely either deny it or fix it quickly.
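As a small illustration of why fastq.gz is easy to work with, the sketch below reads gzip-compressed FASTQ using only the Python standard library. The function name `iter_fastq_gz` is hypothetical, and the code assumes strict 4-line records with no blank lines.

```python
import gzip
import io
from typing import BinaryIO, Iterator, Tuple, Union


def iter_fastq_gz(source: Union[str, BinaryIO]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from a gzip-compressed FASTQ.

    `source` may be a file path or an open binary file object.
    Assumes strict 4-line FASTQ records.
    """
    with gzip.open(source, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:  # end of file
                break
            seq = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line
            qual = handle.readline().rstrip("\n")
            yield header.rstrip("\n")[1:], seq, qual


# In-memory demo so the sketch runs without any file on disk.
demo = gzip.compress(b"@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n")
records = list(iter_fastq_gz(io.BytesIO(demo)))
```

With a real file, the same function would be called with a path such as `iter_fastq_gz("reads.fastq.gz")`, where the file name is just a placeholder.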
s******s (posts: 13035)	6 (in reply to n******7)
http://biorxiv.org/content/early/2017/04/09/125724
HiSeq 4000 problems
```
We discovered that up to 5-10% of sequencing reads (or signals) are incorrectly assigned from a given sample to other samples in a multiplexed pool. We provide evidence that this "spreading-of-signals" arises from low levels of free index primers present in the pool. These index primers can prime pooled library fragments at random via complementary 3′ ends, and get extended by DNA polymerase, creating a new library molecule with a new index before binding to the patterned flow cell to generate a cluster for sequencing. This causes the resulting read from that cluster to be assigned to a different sample, causing the spread of signals within multiplexed samples.
```
n******7 (posts: 12463)	7 (in reply to s******s)
Damn, if this is true, it is a big deal.
Going by the abstract, this is not limited to the HiSeq 4000; the HiSeq 3000 and X Ten have the same problem:
In 2015, a new chemistry of cluster generation was introduced in the newer Illumina machines (HiSeq 3000/4000/X Ten) called exclusion amplification (ExAmp), which was a fundamental shift from the earlier method of random cluster generation by bridge amplification on a non-patterned flow cell.
The latest NovaSeq may have this problem too. If these machines are used to sequence tumor samples, the results would be completely ruined.
---
I looked at the main text. It talks about the HiSeq 4000 throughout because that is the only instrument they tested:
Since the HiSeq 3000 and HiSeq X Ten share the same chemistry as the HiSeq 4000, it is possible that such index switching may also occur at a similar rate using these sequencers, although we have not tested this directly.
s******s (posts: 13035)	8 (in reply to n******7)
Waiting for Illumina to come out and spin this.
z*t (posts: 863)	9 (in reply to n******7)
Naive question: would the NextSeq 500/550 be affected?
n******7 (posts: 12463)	10 (in reply to z*t)
No.
n******7 (posts: 12463)	11 (in reply to s******s)
http://www.illumina.com/content/dam/illumina-marketing/documents/products/whitepapers/index-hopping-white-paper-770-2017-004.pdf?linkId=36607862
Here comes the spin.
