CisGenome Documentation

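Data sources

ChIP-chip
BPMAP files. Source: /jihk/CisGenome/index_files/download.htm. Meaning:
CEL files. Source: /sites/entrez?db=gds&cmd=search&term=GSE11062. Meaning:

ChIP-Seq
Files. Source: /jihk/CisGenome/index_files/download.htm. Meaning:

Functions

1 Installation

(1) Download the CisGenome archive from the following page: /jihk/CisGenome/index_files/download.htm
(2) Unzip the downloaded file. This completes the installation.

Notes:
(i) Do not unzip the CisGenome folder into a path that contains spaces, such as C:\Documents and Settings\CisGenome; otherwise CisGenome may not work properly.
(ii) CisGenome cannot recognize file names that contain spaces, such as "my test.CEL". Rename such a file to my_test.CEL or another valid name.
(iii) Open CisGenome.ini in a text editor such as Notepad or WordPad and set the location of CisGenome.exe, for example CisGenome=D:\Projects\cisgenome_project
(iv) Save and close CisGenome.ini.
(v) Double-click CisGenome.exe to open the CisGenome interface.

If the interface appears, the installation succeeded and you can start exploring its functions.

2(a). Tiling Array Analysis

a.1 Creating a tiling array project

Before analyzing tiling array data, we need to create a tiling array data set. Here we describe how to build a data set from Affymetrix CEL and BPMAP files.

In the interface, click File > Load Data > Tiling Array Dataset > Import from Affymetrix CEL+BPMAP, as shown in the figure below.

A dialog then appears in which you select a folder, so that all BPMAP files in it are loaded into the list. Select them and click "add" to add them to the project, use "MoveUP" and "MoveDown" to order the BPMAP files, and click "Next" to continue.

Next, add the CEL files to the data set, following the red circles in the figure. The CEL files are loaded into the list on the left. Then, in the order shown by the red circles, import the samples (e.g., IP1, IP2, Input1, Input2) into the project. Each CEL file corresponds to one BPMAP file; you can use "MoveUP" and "MoveDown" to match them.

Each sample needs a Sample ID and a Group ID. The Group ID is used in later analyses. For example, to compare IP and Input in a ChIP-chip experiment, set Group ID to 1 for all IP samples and to 2 for all Input samples; later you can ask the program to look for the pattern "1>2".

Click "finish" to create the data set. If there are no errors, double-clicking a CEL file displays the raw array image.

Saving and opening projects is not covered here, but remember to save the project frequently to avoid losing data.

a.2 Tiling Array Analysis (cont.)

2.2.1 Data Normalization

Before looking for peaks (protein-DNA binding regions in a ChIP-chip experiment), we need to normalize the array data to remove systematic bias. Click Tiling Array > Normalization > Quantile (CEL+BPMAP). (Here we try Quantile first; MAT normalization will be incorporated.) Then choose where to save the results, and normalization starts. When it finishes, new items (BAR files) have been added to the Project Explorer, and we can go on to study the real signal.

Question: the guide shows Sample1 and Sample2, but what we actually got were the original Sample IDs. We do not know why.

Data meaning (BAR file):

2.2.2 Peak Detection

To find peaks in the data, click Tiling Array > Peak Detection. A dialog appears in which we need to do the following:
1. Select a data set to analyze.
2. Select the type of analysis (OneSample, TwoSample, or multi-sample).
3. Select a pattern (e.g., 1>2; when we created the data set, 1=IP and 2=Input, so 1>2 here means IP>Input, i.e., ChIP-chip binding regions).
4. Select a file to save the results.
(The other two advanced options can be ignored here.)

When the run finishes, a new item appears under Genomic Regions (BED, COD). Its name is the one you chose, with the extension "_all.cod". It is simply a tab-delimited text file containing the peaks you detected. Select a peak and click its first column to view the peak in CisGenome Browser.

2(b). ChIP-Seq Data Analysis

b.1 Data Preparation

b.1.1 Downloading the data

/jihk/CisGenome/index_files/download.htm

b.1.2 Preparing the alignment files

NRSF ChIP-seq Data (from Johnson et al., Science, 316:1497-1502, 2007, Human hg17). Find this link on the site above and unzip the file into the same directory as the hg17 folder; otherwise you will not get the expected results.

b.2 Creating a ChIP-seq Analysis project

b.2.1 Loading Genome Database

b.2.2 Loading Alignment Files

Note: the alignment files (*.aln) must be in the same directory as the ChIP-Seq folder.

b.2.3 Save Your Project

b.3 ChIP-Seq One Sample Analysis

b.3.1 Exploration

To search for binding signals genome-wide, we need to understand the statistical properties of how millions of sequence reads are distributed. This step is called "Exploration": a parametric model is fitted to the read counts in fixed-length windows. The menu is "Sequencing > One Sample Analysis > Exploration"; then follow the red circles in the figure in order. The model fitting results are shown in the figure below.

Question: according to the figure above, CisGenome can open the TXT file, but the TXT file we generated could not be opened in CisGenome, only in Notepad.

b.3.2 Peak Detection

The menu is "Sequencing > One Sample Analysis > Peak Detection". Note: increase Step Size S, for example to 100, instead of using the default value; otherwise you may well get no results. When the computation finishes, a new item appears under "Genomic Regions (BED, COD)".

Once you see CisGenome Browser, you have reached one of the most exciting parts of the data analysis: you can now look at the sequencing data and check the data quality of each individual peak.

b.4 ChIP-Seq Two Sample Analysis

b.4.1 Exploration

Similar to the one sample analysis, a two sample analysis starts with exploration. In this step, we divide the genome into non-overlapping windows of fixed length W and count how many IP reads (k1) and how many control reads (k2) fall in each window. Conditional on the total window count n=k1+k2, we estimate the false positive rate for each (k1,k2), which helps us define the significance level. To perform exploration, click the menu "Sequencing > Two Sample Analysis > Exploration".
2. Select the BAR file that contains the read-to-genome alignments for the ChIP sample (red circle 2);
3. Select the BAR file that contains the read-to-genome alignments for the control (Mock IP or Input) sample (red circle 3);

When the program finishes running, it summarizes the model fitting results in a table (see below).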
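The window counting described in b.4.1 can be sketched in Python as follows. This is a minimal illustration, not CisGenome's implementation: the read representation (chromosome, position), the function names, and the toy reads are all assumptions made for the example.

```python
from collections import Counter


def window_counts(reads, window):
    """Count reads per non-overlapping window of fixed length `window`.

    `reads` is a list of (chromosome, position) pairs (a hypothetical
    in-memory representation, not the BAR/ALN file format).
    """
    counts = Counter()
    for chrom, pos in reads:
        counts[(chrom, pos // window)] += 1
    return counts


def paired_counts(ip_reads, control_reads, window):
    """Return {(chrom, window_index): (k1, k2)} for IP and control reads."""
    ip = window_counts(ip_reads, window)
    ctrl = window_counts(control_reads, window)
    return {key: (ip.get(key, 0), ctrl.get(key, 0))
            for key in set(ip) | set(ctrl)}


# Toy data, invented for illustration only.
ip_reads = [("chr1", 10), ("chr1", 40), ("chr1", 120), ("chr2", 5)]
control_reads = [("chr1", 30), ("chr1", 250)]

counts = paired_counts(ip_reads, control_reads, window=100)
for key in sorted(counts):
    k1, k2 = counts[key]
    print(key, "k1 =", k1, "k2 =", k2, "n =", k1 + k2)
```

Only windows containing at least one read appear in the result; empty windows would have to be added separately if the full distribution over the genome is needed.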
Similar to the one sample analysis, the table reports how many windows (col2) have exactly n reads (col1, n=k1+k2), what percentage of windows (col3) have n reads, what percentage of windows is expected to have n reads under the Poisson model (col4) and under the negative binomial model (col6), and the ratio between the expected percentage and the observed percentage under the two models (col5 and col7).

In addition to these numbers, there is a new number called "dP0_hat" (red circle 6). This is an important quantity that tells you, within a background window, what percentage of reads is expected to come from the IP sample (i.e., dP0_hat = E[k1]/n). dP0_hat helps us determine how big k1-k2 should be in order to call a window significant. Based on this number, the false discovery rate for each (k1,k2) is estimated, and the results are saved to a file named "outputfile_name.fdr". Here outputfile_name is the name you gave in red circle 4 above.

When we search for binding regions, windows with too few reads will not be considered.
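As a rough illustration of the observed-versus-expected columns in this table, the sketch below computes the Poisson part only. It assumes the Poisson rate is simply the mean read count per window, which may differ from how CisGenome fits its background model; the histogram values are invented, and the negative binomial columns are omitted.

```python
import math


def poisson_table(histogram):
    """Build rows (n, windows, observed, expected, expected/observed).

    `histogram` maps a read count n to the number of windows with exactly
    n reads. The rate is the mean count per window (an assumption).
    """
    total = sum(histogram.values())
    lam = sum(n * c for n, c in histogram.items()) / total
    rows = []
    for n in sorted(histogram):
        if histogram[n] == 0:
            continue  # avoid dividing by an observed fraction of zero
        observed = histogram[n] / total
        expected = math.exp(-lam) * lam ** n / math.factorial(n)
        rows.append((n, histogram[n], observed, expected, expected / observed))
    return lam, rows


# Hypothetical window-count histogram, for illustration only.
histogram = {0: 600, 1: 300, 2: 80, 3: 15, 8: 5}
lam, rows = poisson_table(histogram)
for n, windows, observed, expected, ratio in rows:
    print(n, windows, round(observed, 4), round(expected, 6), round(ratio, 6))
```

A ratio far below 1 at a high read count means that such windows occur much more often than the background model predicts, which is the kind of evidence the false positive rates in the table summarize.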
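In our example here, a window with 8 reads corresponds to a false positive rate of 6.96% under the negative binomial model (red circle 7), so we may choose to exclude windows with [...] Two Sample Analysis > Peak Detection".

At this point you know how to use CisGenome to analyze ChIP-Seq data. Next comes the CisGenome browser and data visualization.

3 Visualization

One of the coolest parts of CisGenome is its browser, CisGenome Browser. It can visualize many kinds of data (e.g., array signals alongside gene structures, cross-species conservation scores and sequences, raw array images, sequence logos of transcription factor binding motifs, and so on). The browser starts automatically when the CisGenome GUI is opened. Pick any peak and double-click the number in its first column to open a view like the one in the figure below. By clicking the button, you can set the height of the track, the plot type, and so on.

3.3.1 Preparing Genome Information and Visualization

Each page in the browser is called a session, and each session has a session file that specifies what information to display and how to display it. You can click the Open Config File button and open the file in any text editor. Once it is open, you will see that it is a standard INI file, i.e., a text file in $KEY$=$VALUE$ format without any whitespace characters (such as spaces or tabs). The file first specifies the session information, for example:

    [session]
    type=genome
    [genome]
    num_tracks=2
    region=chr13:63574264-63579774

Here, "type=genome" means that the data displayed are genomic regions, and "num_tracks=2" means that two tracks will be displayed. The session file then specifies information for each track, for example:

    title=my_peak
    type=signal
    src_filename=D:\Projects\peak_1.fc.bar
    [track2]

Here, "type=signal" means that the track is used to display tiling array signals stored in Affymetrix BAR format. Every session file has the ".ini" extension, and you can find all session files under browser\wwwroot\sessions. Programmers and software developers can have their own programs generate a session file automatically and so make full use of CisGenome Browser to display their data.

How do you load a session file? Follow the red circles in the figure. This creates an empty session. Then click the "add track" button to add a track (red circle 7 in the figure). The file to load is a BAR file, for example one saved locally during data normalization. Repeat the operation to load multiple files.

3.3.2 Visualizing Gene Annotations (add gene and DNA tracks)

1. Add Gene Annotations Automatically
Method: File > Load Data > Genome Database. The remaining steps are the same as at the start of the ChIP-seq analysis above.

2. Add Gene Annotations Manually
DNA sequences and conservation scores are added manually with similar steps. For example, to add sequences, choose track type = sequence and folder = D:\genomes\mouse\mm8. To add conservation scores, set track type = conservation and folder = D:\genomes\mouse\mm8\conservation\phastcons.

By now we have learned a lot about CisGenome Browser. So far, however, we have only used it to visualize genomic regions. CisGenome Browser can also be used to view raw array images and motif sequence logos. Visualization of raw arrays is explained in Session 2(a): tiling array analysis; visualization of motif sequence logos is covered in Session 5: motif analysis.

Now that you have learned one of the most important functions of CisGenome, we move on to sequence and motif analysis.

4 Retrieving Annotations and Sequences

4.1 Link Genomic Regions to Genes

Here is a sample COD file:

    region1 chr1 0 200 +
    region2 chr3 5000 5200 -

Coordinates are 0-based. To add a COD file to a CisGenome project, click File > Load Data > Genomic Region > COD. To get the closest gene for each of a list of genomic regions, click Genome > Annotate with Closest Gene.

Notes on the figure:

First, how to define "the closest", i.e., how to define the distance between a gene and a region (or peak).
If you choose "TSS-up, TES-down":
If a peak is located upstream of the transcription start site (TSS), distance = |TSS-Peak_center|;
If a peak is located downstream of the transcription end site (TES), distance = |TES-Peak_center|;
If a peak is located between TSS and TES, distance = 0.
If you choose "TSS-up, TSS-down": distance = |TSS-Peak_center|.
If you choose "TES-up, TES-down": distance = |TES-Peak_center|.
Similarly, distance can be defined using the coding region start (CDSS) and coding region end (CDSE) as reference points.

Second, when the closest gene of a peak is located too far away, you may not want to annotate the peak. In the "Maximal distance allowed" section, you can define what you mean by "too far away".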
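The three distance rules in section 4.1 can be sketched as below. This is an illustrative sketch, not CisGenome's code: the gene tuples are invented, and the strand handling (for a "-" gene, upstream means a larger coordinate than the TSS) is an assumption about how "upstream" and "downstream" are interpreted.

```python
def distance_to_gene(peak_center, tss, tes, strand, mode="TSS-up, TES-down"):
    """Distance between a peak center and a gene under one of the three rules."""
    if mode == "TSS-up, TSS-down":
        return abs(tss - peak_center)
    if mode == "TES-up, TES-down":
        return abs(tes - peak_center)
    if mode != "TSS-up, TES-down":
        raise ValueError("unknown mode: " + mode)
    sign = 1 if strand == "+" else -1          # orientation of transcription
    if sign * (peak_center - tss) < 0:         # peak lies upstream of the TSS
        return abs(tss - peak_center)
    if sign * (peak_center - tes) > 0:         # peak lies downstream of the TES
        return abs(tes - peak_center)
    return 0                                   # peak lies between TSS and TES


def closest_gene(region, genes, mode="TSS-up, TES-down"):
    """Return (gene_name, distance) of the closest gene on the same chromosome."""
    chrom, start, end = region
    center = (start + end) // 2
    best = None
    for name, gene_chrom, tss, tes, strand in genes:
        if gene_chrom != chrom:
            continue
        d = distance_to_gene(center, tss, tes, strand, mode)
        if best is None or d < best[1]:
            best = (name, d)
    return best


# Hypothetical gene annotations: (name, chromosome, TSS, TES, strand).
genes = [("geneA", "chr1", 500, 1500, "+"),
         ("geneB", "chr1", 5000, 3000, "-")]

# region1 from the sample COD file: chr1 0 200
print(closest_gene(("chr1", 0, 200), genes))
```

A maximal-distance filter would be a simple extra check on the returned distance before a region is annotated.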
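"Upstream =" specifies the maximal distance from TSS, and "Downstream [...] Annotate with Neighboring Genes", which works much like the closest-gene function; its usage is left for you to explore.

4.2 Retrieving DNA sequences

To retrieve DNA sequences for a list of genomic regions, click the menu Genome > Get Sequence. After setting all parameters, click OK. A new FASTA file is generated and added to the Project Explorer window (under Sequences (FASTA)). Double-click it to see all the sequences.

5 Motif Mapping

After a ChIP-chip experiment, you can identify transcription factor binding regions at a resolution of about 500 bp. If you know the transcription factor's binding motif, mapping the motif to the binding regions lets you pinpoint each transcription factor binding site to a 6-30 bp region. Such high-resolution binding site information is very useful when you are designing knock-out animals. Mapping transcription factor binding sites to genomic sequences is also useful in many other contexts (e.g., computational prediction of important regulatory elements in co-regulated genes). With CisGenome, mapping transcription factor binding motifs becomes a fairly simple task. Both motif matrices and motif consensus sequences can be mapped.

5.1 Motif Matrix Mapping

A motif matrix is stored in a MAT-format file such as the following:

    1 1 1 100
    1 10 90 1
    1 5 95 1
    1 1 100 1
    10 1 1 90
    1 1 100 1
    5 5 90 1

The columns correspond to A, C, G, T, and the rows correspond to positions in the motif. Once the motif matrix is ready, click "File > Load Data > Motif Matrix > MAT" to load the MAT file into CisGenome.

Question: there is no source for MAT files; they can be generated by the operations in step 6 and then used here.

To map a motif to a set of genomic regions, click Motif > Known Motif Mapping > Single Matrix - COD.

Notes on the figure:
Red circle 2: a motif (i.e., a MAT file)
Red circle 3: the genome on which the mapping is based
Red circle 4: a COD file specifying the genomic regions to which the motif is mapped
Red circle 5: the output file for the results

When the matrix is mapped to the genome, it is compared at each location to a Markov background model. A location is called a TFBS if the likelihood ratio (LR) between the motif model and the Markov background model is bigger than a cutoff. By default, the cutoff is set to LR=500, but you may change it to make the mapping more or less stringent. You may also choose to use pre-computed Markov background models (which we recommend), or a background model estimated from the input genomic regions. As an optional parameter, you may choose to filter your TFBS by cross-species conservation, say, keep only TFBS located within the 10% most conserved part of the genome, or keep only TFBS whose phastCons conservation score is greater than 40 (Note: phastCons scores range from 0 to 1; here we linearly scaled them to 0-255 to fit into one byte). After you set all parameters, click "OK". The program will start to run.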
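The likelihood-ratio test used in matrix mapping can be illustrated with the sample matrix from section 5.1. The sketch below makes several simplifying assumptions that are not taken from CisGenome: the background is a uniform zeroth-order model (each base has probability 0.25) rather than a fitted Markov model, matrix rows are normalized directly into probabilities, and only the forward strand is scanned.

```python
BASES = "ACGT"

# Sample motif matrix from section 5.1 (columns A, C, G, T; one row per position).
COUNTS = [
    [1, 1, 1, 100],
    [1, 10, 90, 1],
    [1, 5, 95, 1],
    [1, 1, 100, 1],
    [10, 1, 1, 90],
    [1, 1, 100, 1],
    [5, 5, 90, 1],
]


def to_probabilities(counts):
    """Normalize each matrix row so that it sums to 1."""
    return [[c / sum(row) for c in row] for row in counts]


def scan(sequence, counts, background=None, cutoff=500.0):
    """Return (start, site, LR) for every window whose LR exceeds the cutoff."""
    if background is None:
        background = {b: 0.25 for b in BASES}   # assumed uniform background
    pwm = to_probabilities(counts)
    width = len(pwm)
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width]
        if any(base not in BASES for base in site):
            continue                            # skip windows with N or other symbols
        lr = 1.0
        for pos, base in enumerate(site):
            lr *= pwm[pos][BASES.index(base)] / background[base]
        if lr > cutoff:
            hits.append((start, site, lr))
    return hits


hits = scan("ACGTTGGGTGGACGT", COUNTS)
for start, site, lr in hits:
    print(start, site, round(lr, 1))
```

Raising the cutoff makes the scan more stringent, as the text describes; a higher-order background model would replace the per-base `background[base]` term with a probability conditioned on the preceding bases.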
After it finishes, the results are added to the Project Explorer window, under the section "Genomic Regions (BED, COD)". A new window is also opened to list all TFBS.

With CisGenome, you can easily map a motif to a whole genome. For example, to map a motif to the mouse genome assembly mm8, you only need to know where its database is installed. In that folder you will find a file named "chrlist.cod". First click "File > Load Data > Genomic Region" to load it into CisGenome, then use the motif mapping function described above to map the motif to the chrlist.cod file. The result is a list of TFBS (transcription factor binding sites). On the author's machine, a whole-genome mapping usually takes 30-60 minutes.

You can also map a motif matrix to FASTA sequences: click Motif > Known Motif Mapping > Single Matrix - FASTA. The usage is very similar to what is described above.

5.2 Motif Consensus Mapping

In CisGenome, consensus sequences are stored in CONS format. Here is a sample consensus: TGGG[T]AGGT[C]A,G,T. In this example, the T at position 5 may be replaced by A, and the C at the last position may be replaced by A, G, or T. For this motif, we call TGGGTGGTC the consensus, whereas TGGG[T]AGGT[C]A,G,T is defined as the degenerate consensus.

If a genomic sequence has the pattern TGAGAGGTG, we say that the sequence has 3 mismatches to the consensus (MC=3) and 1 mismatch to the degenerate consensus (MD=1). When mapping a motif to genomes, you will be asked to set the maximal number of MC and the maximal number of MD allowed. All genomic loci that satisfy your criteria will be reported as TFBS.

The procedure for consensus mapping is quite similar to matrix mapping. First, use the menu "File > Load Data > Motif Consensus > CONS" to load the consensus into your project. Then click the menu "Motif > Known Motif Mapping > Single Consensus - COD" or "Motif > Known Motif Mapping > Single Consensus - FASTA". A dialog will guide you through the remaining procedure, which is quite straightforward.

Several public databases, including TRANSFAC and JASPAR, store a large number of known transcription factor binding motifs. Sometimes the transcription factor you study has no known motif. In that case, you can discover one from the ChIP-chip binding data.

6 De novo motif discovery

6.1 Novel Motif Discovery

One way to discover a new motif is to look for sequence patterns that are enriched in the transcription factor binding regions identified by ChIP-chip. Such a search can be performed with the Gibbs Motif Sampler provided by CisGenome. To run the Gibbs motif sampler, first retrieve the DNA sequences of the target genomic regions and save them to a FASTA file. The menu is "Motif > New Motif Discovery > Gibbs Motif Sampler".

Notes on the figure above:
1. Input sequences (i.e., a FASTA file);
2. Output file for the results (only the file header needs to be specified).

After you click OK, the motif sampler starts scanning the sequences. The running time depends on the length of the input sequences and the number of motifs searched for. When it finishes, a number of new motif matrices (*.mat) and a list of new motifs (*.matl) are added to the Project Explorer window.

De novo motif discovery such as Gibbs Motif Sampling usually returns many motifs. If you do not know the motif of the transcription factor, a natural question is: "Which of the discovered motifs is the one I am looking for?" Our recent work has shown that this can be addressed by comparing the motifs' relative enrichment levels, computed using carefully chosen matched genomic control regions. Next, we show how to do this comparison using CisGenome.

6.2 Determine the Key Motif

6.2.1 Choose Matched Genomic Control Regions

Often, from a collection of motifs discovered from ChIP-chip data, we wish to identify the key motif that induces sequence-specific protein-DNA binding. For this purpose, we propose to rank motifs according to their relative enrichment levels. In other words, for each motif, we first compute its occurrence rate in ChIP-chip binding regions (or any collection of positive regions where you think the motif should be enriched) as well as its occurrence rate in some negative control regions. We then compare the two occurrence rates to derive a relative fold enrichment. Finally, we rank all motifs by enrichment. The hypothesis is that the key motif should have the highest enrichment and therefore rank No. 1.

Ji et al. (2006) showed that this is generally true, but only when the negative control regions are selected carefully to match the physical properties of the ChIP-chip binding regions. The paper proposed a simple way to choose matched genomic controls, which has now been incorporated into CisGenome.

To choose matched genomic control regions for a collection of positive regions (e.g., ChIP-chip binding regions), first click the menu "Genome > Get Matched Control Region".

Notes on the figure below:
1. A COD file that contains the coordinates of your positive regions (e.g., ChIP-chip regions, see red circle 1 below);
2. A genome on which your coordinates are based (red circle 2);
3. An output file to save the results (red circle 3).

You can also specify the length of each negative control region (red circle 4) and how many negative regions to return (red circle 5). When you specify how many regions to return, you can set the number to one times, two times, etc. the number of input positive regions.

After you finish, click "OK". Soon you will get a list of negative control regions, listed in a new window. A COD file that contains these regions is also added to the Project Explorer window, under the section "Genomic Regions (BED, COD)".

De novo motif discovery such as Gibbs Motif Sampling usually returns multiple motifs. If you don't know the motif of your transcription fac
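The relative enrichment ranking described in section 6.2.1 can be sketched as follows. This is a hedged illustration rather than CisGenome's procedure: the occurrence rate is assumed to be sites per base pair, the motif names and counts are invented, and a motif with no sites in the control regions is simply given infinite enrichment.

```python
def fold_enrichment(pos_sites, pos_bp, ctrl_sites, ctrl_bp):
    """Ratio of occurrence rates (sites per bp) in positive vs. control regions."""
    pos_rate = pos_sites / pos_bp
    ctrl_rate = ctrl_sites / ctrl_bp
    if ctrl_rate == 0:
        return float("inf") if pos_rate > 0 else 0.0
    return pos_rate / ctrl_rate


def rank_by_enrichment(motif_stats):
    """Return [(motif_name, fold_enrichment)] sorted from most to least enriched."""
    scored = [(name, fold_enrichment(*stats)) for name, stats in motif_stats.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical counts: (sites in positive regions, positive bp,
#                       sites in control regions, control bp).
motif_stats = {
    "motif1": (120, 50000, 20, 50000),
    "motif2": (200, 50000, 180, 50000),
    "motif3": (30, 50000, 10, 50000),
}

ranking = rank_by_enrichment(motif_stats)
for name, fold in ranking:
    print(name, round(fold, 2))
```

Note that motif2 has the most sites in the positive regions but ranks last, because it is almost as frequent in the control regions. This is why the choice of matched controls matters for the ranking.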
