NGS (next-generation sequencing) analysis: please weigh in - Biology board

This page is an archived excerpt of the corresponding thread on MITBBS (未名空间).


n********k
Posts: 2818

Just got some of my ChIP-seq data back.
For the histone marks it looked great: the core told me anything above 70% mappable reads is great, and peak calling came out with an FDR of <5%.
For my TF, the data make sense to me, but the core said it is trash/useless: 9-20% mappable reads (out of 9-11M; we meant to get 20M) and peak calling with an FDR of 100%.
Luckily my TF has been ChIPed many, many times and has a very conserved binding site. I randomly picked some of the mapped peaks, and most of them have at least one high-confidence binding site... so biology tells me the data is very good...
With that, I am thinking I need to get a bit bioinformatic myself... Could any guru in the field comment on the data-mining/analysis part?
Also, any recommendations for de novo motif finders for ChIP data? How about duplicate reads and repetitive regions? Pretty frustrated as of now... either I haven't found the right places/software, or data mining/analysis for NGS is still at a pretty primitive stage...

a******g
Posts: 129

Your core is right. Normally, 70% mappable reads is mandatory for a good ChIP-seq. 20% is way too low to convince people that this ChIP-seq is valid.

[In reply to n********k]

R****n
Posts: 708

If I want to do a methylation ChIP-seq on SOLiD or Illumina's system, how many reads should I request from the core? I asked a third-party company, and they said each run can fit 10 samples, with about 6.6 GB of data per sample. I don't know how to choose the best throughput for methylation ChIP.
Thank you!

[In reply to a******g]

a******g
Posts: 129

I haven't done methylation ChIP-seq. For TF ChIP-seq on the GAII from Illumina, 20M reads should be fine.

[In reply to R****n]

n********t
Posts: 1079

With that awful mapping rate, it is difficult to justify the result.

s******s
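Posts: 13035

Never done mapping myself, but FYI:
1. The ChIP-seq protocols for histone marks and for TFs should be very different; I don't know whether you used the same protocol for both.
2. HiSeq now gives roughly 200M reads per lane; at 50 bp, that is a bit under 40G of data.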

[In reply to n********k]

n********k
Posts: 2818

I did the ChIP part and the core prepared the libraries; I believe the same protocol was used for all samples. The TF samples apparently had much less DNA to start with, and I was wondering whether that makes background noise a much bigger issue, so that the percentage of mappable reads ends up very low. I have three conditions, and the percentage of mappable reads decreases in step with the starting amount of ChIP DNA. Could over-amplification cause the problem?
Nonetheless, among the mapped peaks, most are correct ones, while the rest need to be verified further.

[In reply to s******s]

n********k
Posts: 2818

Sure, I would agree from a statistics standpoint. However, since the gene, as you know, has a very conserved binding site and has been ChIPed so many times, what would be your thoughts on this fact: most of the peaks (1200 for one condition and 800 for a second; the numbers also agree with what I expected between the two conditions) are confirmed targets, while the rest could be potential ones...

[In reply to a******g]

L*******a
Posts: 293

Generally, 70% mappable reads is an empirical criterion (a rule of thumb).
FYI: the ENCODE project requires 10M mappable reads for human ChIP-seq experiments, and the modENCODE project requires 4M for worm and fly.
From a genomics perspective, a few classic genes bound by peaks are not enough to prove that your ChIP-seq data is valid in terms of saturation.
You had better have some replicates anyway. Each replicate should meet the saturation and mappability criteria, and the reproducibility rate should also be in an acceptable range (e.g. 80% overlap).
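The "80% overlap" check between replicates can be sketched roughly as below. This is only a minimal illustration: the post does not say how overlap is defined, so counting a peak as reproduced when it shares at least one base with a peak in the other replicate is an assumption, and the coordinates are made up.

```python
from collections import defaultdict

def overlap_fraction(peaks_a, peaks_b):
    """Fraction of peaks in A that overlap at least one peak in B.

    Peaks are (chrom, start, end) tuples with half-open coordinates.
    'Overlap' here means sharing at least 1 bp (an assumed definition).
    """
    if not peaks_a:
        return 0.0
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks_b:
        by_chrom[chrom].append((start, end))
    hits = 0
    for chrom, start, end in peaks_a:
        # Two half-open intervals overlap if each starts before the other ends.
        if any(start < b_end and b_start < end for b_start, b_end in by_chrom[chrom]):
            hits += 1
    return hits / len(peaks_a)

# Made-up peak lists for two replicates.
rep1 = [("chr1", 100, 300), ("chr1", 1000, 1200), ("chr2", 50, 250),
        ("chr2", 900, 1100), ("chr3", 10, 200)]
rep2 = [("chr1", 150, 350), ("chr1", 1100, 1300), ("chr2", 0, 100),
        ("chr2", 1000, 1050)]

frac = overlap_fraction(rep1, rep2)
print(f"{frac:.0%} of replicate 1 peaks overlap a replicate 2 peak")
```

The fraction is not symmetric, so it is worth computing in both directions (rep1 against rep2 and rep2 against rep1).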

j*p
Posts: 411

1. Which species?
2. Has your TF been ChIP-seqed before?
3. Does the core use MACS to call peaks for your TF?

[In reply to n********k]


n********k
Posts: 2818

1. Rats.
2. Sure, many times.
3. Yes.

[In reply to j*p]

n********k
Posts: 2818

Thanks for the info; I was looking at the guidelines... so, what would be the minimum number of cells for a TF ChIP?

[In reply to L*******a]

j*p
Posts: 411

"For my TF, the data make sense to me, but the core said it is trash/useless: 9-20% mappable reads (out of 9-11M; we meant to get 20M) and peak calling with an FDR of 100%."

A mouse sample with 20% x 11M = 2.2M mapped reads is useless for publication, but it is still potentially usable for troubleshooting.

Possible reasons (most likely to least likely):
1. The antibody doesn't work and did not pull down anything, so there is no signal enrichment at the sites that are supposed to be bound by the TF. The signal should then look no different from your input: you would see a flat line (except at centromeres) across the chromosomes.
2. Library overload. This gives you fewer output reads; 11M is kind of low, since a GAII usually generates 20-30M reads, with a raw .fastq file size of 3-6G. If this is the only reason, you should still be able to see ChIP-enriched sites, but the fold enrichment should be low. Upload your mapped data to a genome browser or IGV, go to some of your positive controls, and check whether their promoters/enhancers have signal.
3. Sequence mapping. If the raw reads are long, trimming them will give you slightly better mapping, but it won't change the distribution of the mappable reads. That means if you did not see enrichment with 2.2M reads, you probably won't see enrichment with 4.4M reads either.

"Luckily my TF has been ChIPed many, many times and has a very conserved binding site."

If the same TF has been ChIP-seqed many times and has a conserved motif, you can use its motif to search for binding sites in reverse with FIMO (part of the MEME Suite).

"I randomly picked some of the mapped peaks, and most of them have at least one high-confidence binding site... so biology tells me the data is very good..."

This doesn't make sense to me. I have never heard of "mapped peaks"; it should be either "mapped reads" or "called peaks". If it is "mapped reads", then of course if you BLAT them against the reference genome they will land somewhere, but whether many of the reads pile up together is another question. I don't think this can be used as evidence that your data is very good.

"With that, I am thinking I need to get a bit bioinformatic myself... Could any guru in the field comment on the data-mining/analysis part? Also, any recommendations for de novo motif finders for ChIP data? How about duplicate reads and repetitive regions? Pretty frustrated as of now... either I haven't found the right places/software, or data mining/analysis for NGS is still at a pretty primitive stage..."

I wrote a brief introduction a few days ago, if you are interested.

Suggestion: ask your core to show you the mapped reads in a browser (seeing is believing), and take a close look at the called peaks (I guess most of them have an FDR of 100%). Then you'll know whether you should trust them.
BTW, the number of peaks means nothing, because one can always call more or fewer peaks by manipulating the thresholds.
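The "does the known motif sit near the peak center" spot check discussed above can be done systematically over all called peaks rather than by eye. The sketch below is a crude stand-in for a proper FIMO scan, not a replacement: it matches an IUPAC consensus string on both strands with a regular expression instead of scoring a position weight matrix. The consensus `CAGGTGW`, the 50 bp window, and the sequences are all made-up placeholders.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def consensus_to_regex(consensus):
    return re.compile("".join(IUPAC[base] for base in consensus))

def has_motif_near_center(peak_seq, consensus, window=50):
    """True if the consensus occurs on either strand within +/- window bp
    of the midpoint of the peak sequence."""
    seq = peak_seq.upper()
    center = len(seq) // 2
    region = seq[max(0, center - window):min(len(seq), center + window)]
    forward = consensus_to_regex(consensus)
    reverse = consensus_to_regex(reverse_complement(consensus))
    return bool(forward.search(region) or reverse.search(region))

CONSENSUS = "CAGGTGW"  # hypothetical motif, for illustration only

peak_seqs = {
    "peak_fwd": "T" * 30 + "CAGGTGA" + "T" * 30,   # motif on the forward strand
    "peak_rev": "T" * 30 + "ACACCTG" + "T" * 30,   # motif on the reverse strand
    "peak_none": "T" * 67,                         # no motif at all
    "peak_far": "CAGGTGA" + "T" * 300,             # motif far from the center
}

calls = {name: has_motif_near_center(seq, CONSENSUS) for name, seq in peak_seqs.items()}
fraction_with_motif = sum(calls.values()) / len(calls)
print(calls, fraction_with_motif)
```

To be informative, the fraction should be compared against the same scan on size-matched background regions (for example, flanking or shuffled sequences), since a short motif will hit random sequence at some rate.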

j*p
Posts: 411

People use 10-100M cells for a normal TF ChIP-seq. New techniques are being developed to ChIP from small numbers of cells, as few as ~ten thousand, as someone claimed (for example: "Single-tube linear DNA amplification (LinDA) for robust ChIP-seq").

[In reply to n********k]

n********k
Posts: 2818

This is great, thank you very much.
1. The antibody is at least decent: this gene has been ChIPed many, many times by many labs, and I confirmed the ChIP by qPCR...
2. I have been suspecting a library overload problem or an over-amplification issue (if that makes sense). The core followed the histone protocol and meant to get 20-30M reads, and it did for the histone marks and the input DNA. For my TF, however, the first sample gave 9M reads with 9% mappable (I expect fewer binding events there than in the second one), and the second gave 11M reads with 23% mappable, all after excluding duplicate reads.
3. Yes, I meant called peaks. Many of the called peaks look very nice to me in IGV, showing clear enrichment; the fold enrichment ranges from 5 to 2000 across the genome (many are low). Among the many I have checked, the known binding site sits right at (or very close to) the center of the peak. That's what prompted me to think the data is still usable, but that I might need to increase my starting material to reach saturation. I used the new SOLiD/invitro kit, which claims to need only 10-300K cells for ChIP-seq. I used about 300K cells for the histone marks, with half going to library preparation, and it worked nicely. I used about 2-4M cells for the TF, and the above is what I got...
4. Could you please check your mailbox? Thanks.
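For a sense of scale, the numbers in point 2 can be turned into mapped-read counts and set against the 10M mappable-read figure quoted earlier in the thread for human ENCODE experiments. That figure is used here only as a rough yardstick (the sample is not human), and the calculation assumes the mappable fraction would stay the same at higher sequencing depth, which may well not hold.

```python
import math

# Threshold quoted earlier in the thread for human ChIP-seq (used as a yardstick only).
ENCODE_HUMAN_MIN_MAPPED = 10_000_000

# (total reads, mappable fraction) as reported in the post above.
tf_samples = {
    "TF sample 1": (9_000_000, 0.09),
    "TF sample 2": (11_000_000, 0.23),
}

def mapped_reads(total_reads, mappable_fraction):
    """Number of reads that map, rounded to the nearest read."""
    return round(total_reads * mappable_fraction)

def reads_needed(target_mapped, mappable_fraction):
    """Total reads required to reach target_mapped at a fixed mappable fraction."""
    return math.ceil(target_mapped / mappable_fraction)

for name, (total, fraction) in tf_samples.items():
    mapped = mapped_reads(total, fraction)
    needed = reads_needed(ENCODE_HUMAN_MIN_MAPPED, fraction)
    print(f"{name}: {mapped:,} mapped reads; "
          f"about {needed:,} total reads needed to reach "
          f"{ENCODE_HUMAN_MIN_MAPPED:,} mapped at {fraction:.0%}")
```

Under these assumptions the two samples have roughly 0.8M and 2.5M mapped reads, so simply sequencing deeper would take on the order of 100M and 40M+ total reads respectively.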

[In reply to j*p]

n********k
Posts: 2818

I was reading about LinDA; any experience or comments on it? Thanks.

[In reply to j*p]

m***c
Posts: 177

No matter how many reads you get from your core (raw data), only the mapped reads tell the truth. From your mapped reads, you got results that are in line with your previous experiments, which tells you that the mapped reads in your experiment make sense. The minimum total number of mapped reads for a ChIP-seq shouldn't be a fixed number; it varies with the genome and with the binding protein or histone mark you are studying. Usually a histone mark requires more mapped reads, simply because histone marks are distributed more evenly and broadly. For some binding proteins, you may never be able to get enough mapped reads, because binding is so specific and there are not many binding sites.
With your mapped reads, calling enriched genomic regions is going to be important for you. I personally think MACS does a fairly good job for binding-protein ChIP-seq.
Good luck!

[In reply to n********k]

i*****g
Posts: 11893

Never done the assay before, so I cannot tell.

n********k
Posts: 2818

Thanks all the same for your friendly response...

[In reply to i*****g]

b****r
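Posts: 17995

I'm a layman here.
But I do know what FDR is: it means false discovery rate, right? "Peak calling with an FDR of 100%": doesn't that amount to having no positive results at all?
Also, only 20% of your sequences map to the genome, so what is everything else? Have you looked at it? Could it be contamination from something, or is it just pure noise? For example, if it is contamination from bacteria or leftovers from other experiments, can you identify the source and subtract that background? With this many sequences you should be able to work out what they are, if they are not noise.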
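On what an FDR of 100% might mean in practice: one way peak callers estimate an empirical FDR is a sample swap. Peaks are called on ChIP versus control and again on control versus ChIP, and at each score cutoff the FDR is taken as (number of control peaks) / (number of ChIP peaks). As far as I recall this is how early MACS releases defined it, but whether the core's software did exactly this is an assumption, and the scores below are made up.

```python
def empirical_fdr(chip_scores, control_scores):
    """Sample-swap FDR for each ChIP peak score.

    For a peak with score s:
        FDR(s) = (# control peaks with score >= s) / (# ChIP peaks with score >= s)
    capped at 1.0. Scores are assumed to be 'higher is better'.
    """
    fdrs = []
    for s in chip_scores:
        n_chip = sum(1 for x in chip_scores if x >= s)
        n_control = sum(1 for x in control_scores if x >= s)
        fdrs.append(min(1.0, n_control / n_chip))
    return fdrs

# Made-up peak scores: two strong ChIP peaks and three weak ones,
# against a control that also yields several weak "peaks" when swapped.
chip_scores = [300, 120, 80, 60, 55]
control_scores = [90, 70, 65, 62, 58, 52]

fdrs = empirical_fdr(chip_scores, control_scores)
for score, fdr in zip(chip_scores, fdrs):
    print(f"score {score}: FDR {fdr:.0%}")
```

In this toy case the strongest peaks get an FDR of 0% while the weakest reach 100%, meaning the swapped call finds at least as many peaks of that strength in the control. Under this definition, a reported 100% says the peaks at that cutoff cannot be told apart from control by score alone; it does not by itself say what any individual peak is.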

j*p
Posts: 411

Agree.
Unmapped reads could be caused by (not limited to):
1. Sequencing errors. These reads probably won't map to any genome.
2. Bacterial/viral contamination during library preparation. It won't be easy to identify the contaminant if you don't have any candidates ahead of time; if you do, it is pretty easy to confirm. We recently found that ~90% of our unmapped reads could be mapped to a bacterial genome. This bacterium was used in place of beads to pull down the protein, while in our input lanes the majority of the reads mapped to the human genome.
Nevertheless, removing background won't help to increase local enrichment. The worse case is that you have quite a decent mappable rate, but the reads are just randomly distributed over the entire genome and look no different from input.
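A first look at "what are the other 80%" does not need much tooling. The sketch below pulls the unmapped reads out of SAM-format alignment text (flag bit 0x4 marks an unmapped read in the SAM specification) and writes the most frequent sequences as FASTA records that could be pasted into a BLAST search. The function names and the tiny SAM snippet are made up for illustration; a real run would read the aligner's output file instead.

```python
import io
from collections import Counter

UNMAPPED_FLAG = 0x4  # SAM flag bit: segment unmapped

def unmapped_reads(sam_lines):
    """Yield (read_name, sequence) for every unmapped record in SAM text."""
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header lines and blank lines
        fields = line.rstrip("\n").split("\t")
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        if flag & UNMAPPED_FLAG:
            yield name, seq

def top_unmapped_fasta(sam_lines, n=3):
    """FASTA text for the n most frequent unmapped read sequences."""
    counts = Counter(seq for _, seq in unmapped_reads(sam_lines))
    records = [
        f">unmapped_{i} count={count}\n{seq}"
        for i, (seq, count) in enumerate(counts.most_common(n), start=1)
    ]
    return "\n".join(records)

# Made-up SAM records: r1 and r5 are mapped, r2-r4 are unmapped.
sam_text = (
    "@HD\tVN:1.0\tSO:unsorted\n"
    "r1\t0\tchr1\t100\t30\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII\n"
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGCCCC\tIIIIIIII\n"
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGGCCCC\tIIIIIIII\n"
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTAAAA\tIIIIIIII\n"
    "r5\t16\tchr2\t500\t30\t8M\t*\t0\t0\tCCCCAAAA\tIIIIIIII\n"
)

fasta = top_unmapped_fasta(io.StringIO(sam_text))
print(fasta)
```

The most abundant identical reads are often adapters or primer dimers rather than a contaminating genome, so it may also be worth BLASTing a random sample of the unmapped reads, not just the top few.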

[In reply to b****r]

b****r
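Posts: 17995

I think if the noise comes from one or a few biological sources, it should still be easy to find: assemble a short stretch and BLAST it at NCBI, and it will turn up, since it compares against all kinds of genomes and vectors.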

[In reply to j*p]
