




版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、Chapter 4Chapter 4Using mouse candidate cancer genes tonarrow down the candidates in regions of copynumber change in human cancers4.1IntroductionAs discussed in Section 1.3.3, copy number changes are a common feature of cancer genomes, and can be identified using comparative genomic hybridisation (C
2、GH)-based techniques. However, regions of copy number change are often large and encompass many genes, making it difficult to identify the “critical” genes that contribute to the tumourigenic process. Candidate cancer genes identified by insertional mutagenesis in the mouse can be used in a cross-sp
3、ecies oncogenomics approach to narrow down the candidates within regions of copy number change in human tumours. The use of cross- species comparative analysis for cancer gene discovery is discussed in Section 1.5. In this chapter, mouse candidate cancer genes are used to identify orthologous candid
4、ates within regions of copy number change in 713 human cancer cell lines generated usingSNP array CGH. The analyses were performed as part of a collaboration with theNetherlands Cancer Institute (NKI), published in Cell (Uren., 2008), and therefore,rather than using the mouse candidate cancer genes
5、generated from the work described inChapters 2 and 3, lists of candidates were provided by the NKI.The datasets are introduced in Section 4.2.This is followed, in Section 4.3, by adescription of the methods used to process the copy number data into regions of copy number change, and gains and losses
6、 within the human cancer cell lines are characterisedin Section 4.4.In Section 4.5.1, the mouse and human datasets are compared todetermine whether retroviral insertional mutagenesis is relevant to the discovery of amplified and deleted cancer genes in humans. Promising cancer gene candidates that a
7、re both disrupted by insertional mutagenesis in the mouse and amplified or deleted in human cancers are presented in Section 4.5.2. A range of algorithms have been developed for identifying regions of copy number change within CGH data, and these are described and compared in Section 4.6. Finally, i
8、n Section 4.7, the mouse candidate cancer genes are combined with copy number variation (CNV) data from apparently healthy individuals todetermine whether there is any overlap between candidates and regions of CNV.161Chapter 4Since the ploidy of the cell lines, and therefore the exact copy number of
9、 alterations, is difficult to establish, the terms “gain and “amplicon” are used interchangeably throughout this thesis to mean any gain of copy number, irrespective of the size or nature of thealteration.4.2Description of the datasetsAs well as the datasets described below, the set of known cancer
10、genes from the CancerGene Census (Futreal., 2004) was also used. This is described in Section .2.1Mouse candidate cancer genes identified by retroviral insertionalmutagenesisAs mentioned in the introduction, some of the work described in this chapter wasundertaken as part of a collaboration w
11、ith the NKI (Uren., 2008, reprinted on p.365).Te lists used in this chapter were therefore provided by the NKI but were generatedfrom the analysis of insertion sites identified in the retroviral insertional mutagenesis screen described in Chapter 2. There were 6 lists of putative tumour suppressor g
12、enes. These included 3 lists comprising all genes in which there were insertions in the entire transcribed region, including UTRs and introns, only in the translated region (no UTRs) but including introns, and only in the coding region (no UTRs or introns). These lists are described throughout this
13、thesis as genes in the transcribed region, translated region, and coding region, respectively. A further 3 lists contained genes with insertions in the same regions, but only where insertions comprised 2 or more sequence reads. Insertions represented by only 1 read are considered less likely to cont
14、ribute to tumourigenesis (see Section 2.8) and are therefore predicted to have a reduced overlap with human deletions. 2 additional lists contained genes that were closest to CISs with P-values of less than 0.05 and 0.001, as determined using the kernel convolution (KC)-based statistical method (de
15、Ridder et al., 2006, see Sections .2 and 2.10.2). From these, lists were also generated for genes that were adjacent to CISs of P<0.05 and P<0.001 but were further away than the closest gene. For each gene list, the human orthologues and their genomic coordinates were extracted from Ens
16、embl version 37 using Ensembl BioMart (see Section 3.2.1). Table 4.1 shows the number of mouse genes and human orthologues in each genelist. The P<0.001 and P<0.05 CISs and their associated nearest and further mouse genes162Chapter 4Number of mouse Number of human Number ofgenes with human ort
17、hologues in CIS% of human orthologues in CIS gene list17.1Gene ListORF onlymouse genes266orthologuesgene list240754122216ORF only (no singletons)8630248.315.246.258.523.530.1Translated region onlyTranslated region only (no singletons) Transcribed region onlyTranscribed region only (no sin
18、gletons) CIS nearest P<0.05CIS nearest P<0.001 CIS further P<0.05 CIS further P<0.0012647377333162755593555053134242653622191961558566Table 4.1.Descriptionof the lists ofmouse candidatecancer genesused forcomparison with human cancer copy number data.“ORF, Translated region,Transcribed r
19、egion only” are lists of genes containing insertions only in the open readingframe, translated region (but including introns) or transcribed region, respectively. “no singletons” means that the list does not include genes that only contain insertions represented by a single read. “CIS nearest P<0
20、.05” and “CIS nearest P<0.001” contain genes nearest to CISs identified by the kernel convolution (KC)-based method. “CIS further P<0.05” and “CIS further P<0.001” contain genes that flank CISs identified by the KC-based method but are not the nearest genes. The columns labelled “Number/% o
21、f human orthologues in CIS gene lists” show the overlap of each list with the list of candidate cancer genes generated and described in Chapters 2 and 3.163Chapter 4are listed in Appendix D. Due to their length, the lists of candidate tumour suppressorgenes are not included, but are available on req
22、uest.Table 4.1 also shows the overlap of each gene list with the list of candidate cancer genes generated and described in Chapters 2 and 3 (shown in Appendix B2 and referred to here as the CIS gene list). The CIS gene list contains only genes that are associated with a significant CIS and this, tog
23、ether with the fact that the screen identifies mainly oncogenes, accounts for the small overlap with the tumour suppressor gene lists, in which genes may contain any number of insertions. The differences between the CIS gene list and the remaining lists may reflect differences in gene selection, i.e
24、. a more sophisticated method was used to assign insertions to genes in the CIS gene list, and in read and insertion site processing, which were more conservative for the CIS gene list. Candidates from the CIS gene list are used in Chapter 5, where it is compared to higher resolution human CGH data
25、(Section 5.3), as well as to the CGH data described in this chapter(Section 5.4).4.2.2Copy number data for human cancer cell linesComparative genomic hybridisation (CGH) data were generated by the Wellcome TrustSanger Institute (WTSI) Cancer Genome Project for 713 human cancer cell lines from 29 tis
26、sues. A list of all cell lines and their tissue of origin is provided in Appendix E and issummarised in Table 4.2.None of the chosen cell lines had a common ancestor,according to cell line identity typing also performed by the WTSI Cancer Genome Project().Thisisimportant, since an amplicon or deleti
27、on might otherwise appear to be recurrent simply because it is within synonymous cell lines. CGH was performed using two Affymetrix GeneChip® Human Mapping Arrays. The 10K array, which comprises 11,555 SNPs, was used for 313 cell lines, while the 10K 2.0 array, comprising 10,204 SNPs, was used
28、for the remaining 400 lines. 10,136 SNPs were shared between the two arrays, and both used the Affymetrix GeneChip® Mapping 10K assay, described in Section . The SNPs were mapped to the NCBI 35 human genome assembly. The mean distance between SNPs was 258.50 (±634.21) kb in the 10K
29、array, and 292.82 (±683.49) kb in the10K 2.0 array. The minimum distance was 2 bp and 11 bp for the 10K and 10K 2.0arrays, respectively, and theum distance was 24.81 Mb for both arrays. 9.4% ofhuman protein-coding genes in Ensembl v37 (extracted using Ensembl BioMart, see164Chapter 4Number of c
30、ell linesTissue of originLung13111743424039382923212020191918141311111110653222211Haematopoietic and lymphoid BreastSkinCentral nervous system UnknownLarge intestine Autonomic ganglia BoneKidney Soft tissueOesophagus StomachUpper aerodigestive tract OvaryPancreas Urinary tract Liver Thyroid CervixEn
31、dometrium Biliary TractPleuraProstate Eye PlacentaAdrenal gland Small intestineTotal713Table 4.2. Tissues of origin of human cancer cell lines used in the 10K SNP array CGH analysis.165Chapter 4Section 3.2.1) contained at least one SNP in the 10K array, while 9.0% contained at leastone SNP in the 10
32、K 2.0 array.Genes were defined as the longest Ensembl genetranscript. The 10K and 10K 2.0 arrays contained an average of 0.176 (±0.735) and 0.157 (±0.648) SNPs per protein-coding gene, respectively. The interSNP distances and number of SNPs per gene are shown in Figures 4.1 and 4.2, respec
33、tively. The largestgaps between adjacent SNPs occur at the centromeres, while some gaps correspond toother regions of tsequences.ome that have not been assembled, e.g. due to highly repetitiveFor each cell line. the raw intensity values were normalised internally. This involved calculating the value
34、 for each SNP as a total of all the SNPs on the array, and obtaining a copy number ratio for each SNP by dividing the SNP value by the value for the sameSNP from a pool of reference normal samples. This is the point at which I received thedata.The copy number data for all cell lines are available fo
35、r download fromftp:/ftp.sanger.ac.uk/pub/CGP/10kData. Data generated on the 10K and 10K 2.0 arraysis pooled in subsequent analyses and is collectively referred to as 10K data.4.2.3Copy number variants (CNVs)CNVs are regions within tome that vary in copy number. Germline CNV regionsidentified within
36、270 HapMap samples from Redon. (2006) were downloaded from. Merged CNVs identified using theWhole Genome Tilepath (WGTP) array and Affymetrix GeneChip Human Mapping 500K early access array (500K EA) were used. The WGTP array comprises 26,574 BAC clones, while the 500K EA array covers 474,642 SNPs. T
37、he WGTP and 500K EA platforms are complementary, since they are able to detect smaller and larger CNVs,respectively (Kehrer-Sawatzki, 2007). There are 1,447 merged CNVs that cover 12% oftome. 1,390 CNVs that mapped to autosomes in the NCBI 35 human build wereused in this analysis.4.3Processing the c
38、opy number dataThe copy number ratios at individual SNPs must be processed into regions of copynumber change. As discussed in Section , a variety of methods have been166Chapter 4ABFigure 4.1. The distance between t 10K (A) and 10K 2.0 (B) SNP arrays.omic coordinates of adjacent SNPs on theABF
39、igure 4.2. The number of SNPs per human protein-coding gene on the 10K (A) and 10K 2.0 (B) SNP arrays.167Chapter 4developed for this purpose. At the time of the analysis, most of the available algorithmshad been developed primarily for conventional array CGH, i.e. using large genomicclones (see Sect
40、ion ). In a comparison of 11 methods, DNAcopy (Olshen.,2004) performed consistently well (Lai., 2005), and a comparison of 3 segmentationmethods by Willenbrock and Fridlyand (2005) demonstrated that DNAcopy performed better than GLAD (Hupe et al., 2004) and HMM (Fridlyand et al., 2004). A fur
41、therbenefit of DNAcopy is that it is freely available as an R package in BioConductor(). BioConductor is an open source software project thatprovides tools, mostly written in R, for analysing genomic data.DNAcopy (version1.4.0) was therefore chosen as the method for detecting regions of copy number
42、change in the 10K CGH data.DNAcopy uses a method called circular binary segmentation (CBS) to identify change-points in CGH data, which is input as log2 intensity ratios at consecutive positions in thegenome. The change-points correspond to positions in tome where the DNA copynumber has significantl
43、y changed. For each cell line, the copy number ratios for all SNPs were converted to log2-ratios and were smoothed, using a method within DNAcopy, to remove single point outliers before segmentation. Copy number ratios of 0 were given a log2-ratio of -6. Change-points may result from local trends in
44、 the data, and therefore allchange-points that were less than 3 standard deviations apart were removed. Defaultparameters were used for the segmentation.Different values were tested for theparameter alpha but, upon visual inspection of the graphical outputs, the default value of alpha=0.01 appeared
45、to be most suitable. Increasing alpha increases the sensitivity, resulting in more change-points but, potentially, more false positive change-points. Decreasing alpha results in fewer change-points, and regions of copy number change may therefore be missed. Increasing the number of standard deviatio
46、ns below which change- points were removed resulted in the loss of potentially important change-points. Figure4.3 shows an example of how changing the parameters can affect the output of DNAcopy for chromosomes 1 and 6 of ovarian cancer cell line 41M-CISR. The removal of change- points less than 3 s
47、tandard deviations apart results in the loss of a change-point in chromosome 1 (Figure 4.3B). However, the slight difference in copy number between the 2 arms of the chromosome may be due to trends in the data, and the difference in copy number is small. Increasing the number of standard deviations
48、to 4 results in the loss of a change-point in chromosome 6, for which there is a clear step in copy number that doeslook real (Figure 4.3C). Increasing alpha from 0.01 to 0.05 results in the inclusion of168Chapter 4Chromosome 1Chromosome 6ABCDEFFigure 4.3.Altering the values for parameters in DNAcop
49、y leads to differences inthe regions of copy number change detected by the algorithm, as demonstrated for chromosomes 1 and 6 of ovarian cancer cell line 41M-CISR. (A) Default parameters.(B) Default parameters and removal of change-points less than 3 standard deviations apart. (C) Default parameters
50、, smoothing and removal of change-points less than 4 standard deviations apart. (D) Alpha = 0.05 plus smoothing. (E) Copy number for chromosome 1, with values averaged across 3 consecutive SNPs. (F) Copy number for chromosome 6, with values averaged across 3 consecutive SNPs. Figures E and Fare (acr
51、osstakenfromtheWTSICancerGenomeProjectwebsite) and give a clearer picture of the copy number Figures A-D are extracted from the output of DNAcopy.the chromosome.Removing change-points that are close together results in fewer regions being detected,and the larger the number of standard deviations bel
52、ow which change-points are removed, the more regions are missed. Increasing alpha leads to the inclusion of additional change- points and, therefore, regions of copy number change.169Chapter 4additional change-points in chromosome 1 (Figure 4.3D). Since the data is relatively low resolution, it is h
53、ighly possible that a region of copy number change may be represented by just 1 or 2 SNPs. However, it is also possible that such SNPs are anomalies and, toavoid the identification of false positives, this is the preferred assumption.DNAcopy identifies changes in DNA copy number but does not indicat
54、e which regions are unchanged and which are gains or losses. It is therefore the responsibility of the user to set thresholds for calling gains and losses based on the mean log2-ratios of predicted segments. A disadvantage of DNAcopy is that it operates on individual chromosomes rather than the enti
55、re genome and the mean log2-ratios of segments representing no copy number change, or representing a gain or loss of a certain number of copies, will differ slightly across the genome. This makes it difficult to determine what is “normal” and therefore to call gains and losses, and the exact number
56、of copies within a gain or loss cannot be clearly determined. Willenbrock and Fridlyand (2005) have developed an algorithm called MergeLevels that merges segments across the genome that are not significantly different from one another and so produces a more interpretable set of copy number levels. C
57、ombining DNAcopy and MergeLevels was found to be more effective than using DNAcopy alone (Willenbrock and Fridlyand, 2005). MergeLevels is freely available within an R/BioConductor package called aCGH. Therefore, for each cell line, the DNAcopy segmentation results were merged across all autosomes using MergeLevels with default parameters, which were considered appropriate upon inspection of the graphical outputs. Example outputs for the kidney cancer cell line 786-0 and endometrial cancer cell line AN3-CA are shown in Figure 4.4. The merged segments with a log2-ratio cl
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 水饺猪肉采购方案(3篇)
- DB23-T2852-2021-白浆土水稻田生物炭应用技术规程-黑龙江省
- 工程维修小组管理制度
- 内部往来收据管理制度
- 公司检查评比管理制度
- 公司软件权限管理制度
- 专属管护方案么(3篇)
- 后院水井改造方案(3篇)
- 四S店化学品管理制度
- 展厅设计开放方案(3篇)
- 电工技术培训方案
- 中国偏头痛诊治指南(第一版)2023解读
- 2024年四川省绵阳市中考语文试卷与参考答案
- 北京市西城区2021-2022学年八年级下学期期末道德与法治试题(试题+答案)
- GB/T 44294-2024电主轴电动机通用技术规范
- 湖北省宜昌市2023-2024学年六年级下学期期末检测数学试题
- GB/T 4706.48-2024家用和类似用途电器的安全第48部分:加湿器的特殊要求
- 《高等数学(第2版)》 高职 全套教学课件
- 信息技术在旅游信息平台的建设与优化考核试卷
- 建设工程造价鉴定规范
- 医院培训课件:《静脉血栓栓塞症(VTE)专题培训》
评论
0/150
提交评论