PLINK 1.9 GWAS data processing workflow.docx

Uploaded by: T***  IP location: Jiangxi  Upload date: 2019-11-22  Format: DOCX  Pages: 17  Size: 41.51 KB


Data management

Generate binary fileset

--make-bed

--make-bed creates a new PLINK 1 binary fileset, after applying sample/variant filters and other operations below. For example,

    plink --file text_fileset --maf 0.05 --make-bed --out binary_fileset

does the following:

1. Autogenerate binary_fileset-temporary.bed + .bim + .fam. (The MAF filter has not yet been applied at this stage. See the Order of operations page for more details.)
2. Read binary_fileset-temporary.bed + .bim + .fam. Calculate MAFs. Remove all variants with MAF < 0.05 from the current analysis.
3. Generate binary_fileset.bed + .bim + .fam. Any samples/variants removed from the current analysis are also not present in this fileset. (This is the --make-bed step.)
4. Delete binary_fileset-temporary.bed + .bim + .fam.

In contrast, the fileset left behind by --keep-autoconv is just the result of step 1.

--make-just-bim
--make-just-fam

--make-just-bim is a variant of --make-bed which only generates a .bim file, and --make-just-fam plays the same role for .fam files. Unlike most other PLINK commands, these do not require the main input to include a .bed file (though you won't have access to many filtering flags when using these in no-.bed mode).

Use these cautiously. It is very easy to desynchronize your binary genotype data and your .bim/.fam indexes if you use these commands improperly. If you have any doubt, stick with --make-bed.

Generate text fileset

--recode
--recode-allele [filename]

--recode creates a new text fileset, after applying sample/variant filters and other operations. By default, the fileset includes a .ped and a .map file, readable with --file.

• The '12' modifier causes A1 (usually minor) alleles to be coded as '1' and A2 alleles to be coded as '2', while '01' maps A1→0 and A2→1. (PLINK forces you to combine '01' with --output-missing-genotype when this is necessary to prevent missing genotypes from becoming indistinguishable from A1 calls.)
• The '23' modifier causes a 23andMe-formatted file to be generated.
  This can only be used on a single sample's data (a one-line --keep file may come in handy here). There is currently no special handling of the XY pseudo-autosomal region.
• The 'AD' modifier causes an additive (0/1/2) + dominant (het = 1, otherwise 0) component file, suitable for loading from R, to be generated. 'A' is the same, except without the dominance component.
  o By default, A1 alleles are counted; this can be customized with --recode-allele. --recode-allele's input file should have variant IDs in the first column and allele IDs in the second.
  o By default, the header line for .raw files only names the counted alleles. To include the alternate allele codes as well, add the 'include-alt' modifier.
  o Haploid additive components are 0/2-valued instead of 0/1-valued, to maintain a consistent scale on the X chromosome.
  See also --R.
• The 'A-transpose' modifier causes a variant-major additive component file to be generated. This can also be used with --recode-allele.
• The 'beagle' modifier causes unphased per-autosome .dat and .map files, readable by BEAGLE 3.3 and earlier, to be generated, while 'beagle-nomap' generates a single .dat file (no chromosome splitting occurs in this case).
• The 'bimbam' modifier causes a BIMBAM-formatted fileset to be generated. If your input data only contains one chromosome, you can use 'bimbam-1chr' instead to write a two-column .pos.txt file.
• If all allele codes are single-character, you can use the 'compound-genotypes' modifier to omit the space between each pair of allele codes in a single genotype call when generating a .ped + .map fileset. You will need to use the --compound-genotypes flag to load this data in PLINK 1.07, but it's not needed for PLINK 1.9.
• The 'fastphase' modifier causes per-chromosome fastPHASE files to be generated. If your input data only contains one chromosome, you can use 'fastphase-1chr' instead to exclude the chromosome number from the file extension.
• The 'HV' modifier causes a Haploview-format .ped + .info fileset to be generated per chromosome.
  'HV-1chr' is analogous to 'fastphase-1chr'.
• The 'lgen' modifier causes a long-format fileset, loadable with --lfile, to be generated. 'lgen-ref' is equivalent to PLINK 1.07 --recode-lgen --with-reference.
• The 'list' modifier causes a genotype-based list to be generated. This does not produce a .fam or .map file.
• The 'oxford' modifier causes an Oxford-format .gen + .sample fileset to be generated. If you also include the 'gen-gz' modifier, the .gen file is gzipped.
• The 'rlist' modifier causes a rare-genotype fileset to be generated (similar to --list's output, but with .fam and .map files, and without homozygous major genotypes).
• With the 'list' and 'rlist' formats, the 'omit-nonmale-y' modifier causes nonmale genotypes to be omitted on the Y chromosome.
• The 'structure' modifier causes a Structure-format file to be generated.
• The 'transpose' modifier causes a transposed text fileset, loadable with --tfile, to be generated.
• The 'vcf', 'vcf-fid', and 'vcf-iid' modifiers result in production of a VCFv4.2 file. 'vcf-fid' and 'vcf-iid' cause family IDs and within-family IDs respectively to be used for the sample IDs in the last header row, while 'vcf' merges both IDs and puts an underscore between them (in this case, a warning will be given if an ID already contains an underscore).
  If the 'bgz' modifier is added, the VCF file is block-gzipped. (Gzipping of other --recode output files is not currently supported.)
  The A2 allele is saved as the reference and normally flagged as not based on a real reference genome ('PR' INFO field value). When it is important for reference alleles to be correct, you'll usually also want to include --a2-allele and --real-ref-alleles in your command.
• The 'tab' modifier makes the output mostly tab-delimited instead of mostly space-delimited when the format permits both delimiters. 'tabx' and 'spacex' force all tabs and all spaces, respectively.
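The additive and dominance components described for the 'A' and 'AD' modifiers can be illustrated with a small Python sketch. This is not PLINK's implementation: the function name `recode_ad`, the tuple representation of a genotype call, and the use of `None` for a missing call are all assumptions made for illustration. It only restates the rules given in the list: count copies of the chosen allele, mark heterozygotes with a dominance value of 1, and scale haploid calls to 0/2.

```python
def recode_ad(call, counted_allele):
    """Return (additive, dominance) components for one genotype call.

    call: a 2-tuple of allele codes for a diploid call, a 1-tuple for a
          haploid call, or None for a missing call (hypothetical encoding).
    counted_allele: the allele whose copies are counted (A1 by default in
          the text; --recode-allele changes which allele is counted).
    """
    if call is None:
        return (None, None)
    copies = sum(1 for allele in call if allele == counted_allele)
    if len(call) == 1:
        # Haploid: use a 0/2 scale so values stay comparable with diploid calls.
        return (2 * copies, 0)
    # Diploid: additive is 0/1/2, dominance is 1 only for heterozygotes.
    is_het = call[0] != call[1]
    return (copies, 1 if is_het else 0)


if __name__ == "__main__":
    for example in [("A", "A"), ("A", "G"), ("G", "G"), ("A",), None]:
        print(example, recode_ad(example, "A"))
```

Dropping the second element of each returned pair gives the 'A' style output, which has no dominance component.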
  (Guidelines on swapping tabs/spaces in other contexts are given elsewhere in the PLINK documentation.)

For example,

    plink --bfile binary_fileset --recode --out new_text_fileset

generates new_text_fileset.ped and new_text_fileset.map from the data in binary_fileset.bed + .bim + .fam, while

    plink --bfile binary_fileset --recode vcf-iid --out new_vcf

generates new_vcf.vcf from the same data, removing family IDs in the process.

Irregular output coding

--output-chr [MT code]

Normally, autosomal/sex/mitochondrial chromosome codes in PLINK output files are numeric, e.g. '23' for human X. --output-chr lets you specify a different coding scheme by providing the desired human mitochondrial code; supported options are '26' (default), 'M', 'MT', '0M', 'chr26', 'chrM', and 'chrMT'. (PLINK 1.9 correctly interprets all of these encodings in input files.)

--output-missing-genotype [char]
--output-missing-phenotype [string]

--output-missing-genotype allows you to change the character (normally the --missing-genotype value) used to represent missing genotypes in PLINK output files, while --output-missing-phenotype changes the string (normally the --missing-phenotype value) representing missing phenotypes.

Note that these flags do not affect --bmerge/--merge-list or the autoconverters, since they generate files that may be reloaded during the same run. Add --make-bed if you want to change missing genotype/phenotype coding when performing those operations.

Set blocks of genotype calls to missing

--zero-cluster [filename]

If clusters have been defined, --zero-cluster takes a file with variant IDs in the first column and cluster IDs in the second, and sets all the corresponding genotype calls to missing.
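As a rough illustration of the effect just described, the following Python sketch blanks every call whose variant and sample cluster match a listed (variant ID, cluster ID) pair. The in-memory dictionaries and the name `zero_cluster` are hypothetical stand-ins chosen for readability; PLINK itself works on the binary fileset and, as the text notes elsewhere, does not hold the whole genotype matrix in memory.

```python
def zero_cluster(genotypes, sample_cluster, zero_pairs):
    """Set selected genotype calls to missing (None).

    genotypes:      {variant_id: {sample_id: call}}   (hypothetical layout)
    sample_cluster: {sample_id: cluster_id}
    zero_pairs:     iterable of (variant_id, cluster_id) rows, i.e. the two
                    columns of the --zero-cluster input file
    """
    targets = set(zero_pairs)
    result = {}
    for variant_id, calls in genotypes.items():
        result[variant_id] = {
            sample_id: (
                None
                if (variant_id, sample_cluster.get(sample_id)) in targets
                else call
            )
            for sample_id, call in calls.items()
        }
    return result


if __name__ == "__main__":
    genotypes = {
        "rs1": {"s1": "AA", "s2": "AG", "s3": "GG"},
        "rs2": {"s1": "CC", "s2": "CT", "s3": "TT"},
    }
    clusters = {"s1": "batch1", "s2": "batch2", "s3": "batch2"}
    # Blank rs1 for every sample in batch2; rs2 is left untouched.
    print(zero_cluster(genotypes, clusters, [("rs1", "batch2")]))
```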
See the PLINK 1.07 documentation for an example.

This flag must now be used with --make-bed and no other output commands (since PLINK no longer keeps the entire genotype matrix in memory).

Heterozygous haploid errors

--set-hh-missing

Normally, heterozygous haploid and nonmale Y chromosome genotype calls are logged to plink.hh and treated as missing by all analysis commands, but left undisturbed by --make-bed and --recode (since, once gender and/or chromosome code errors have been fixed, the calls are often valid). If you actually want --make-bed/--recode to erase this information, use --set-hh-missing. (The scope of this flag is a bit wider than for PLINK 1.07, since commands like --list and --recode-rlist which previously did not respect --set-hh-missing have been consolidated under --recode.)

Note that the most common source of heterozygous haploid errors is imported data which doesn't follow PLINK's convention for representing the X chromosome pseudo-autosomal region. This should be addressed with --split-x below, not --set-hh-missing.

--set-mixed-mt-missing

Mitochondrial DNA is subject to heteroplasmy, so PLINK 1.9 permits 'heterozygous' genotypes and treats MT more like a diploid than a haploid chromosome. However, some analytical methods don't use mixed MT genotype calls, and instead assume that no heterozygous MT calls exist. The --set-mixed-mt-missing flag can be used with --make-bed/--recode to export a dataset with mixed MT calls erased.

X chromosome pseudo-autosomal region

--split-x [last bp position of head] [first bp position of tail]
--split-x [build code]
--merge-x

PLINK prefers to represent the X chromosome's pseudo-autosomal region as a separate 'XY' chromosome (numeric code 25 in humans); this removes the need for special handling of male X heterozygous calls. However, this convention has not been widely adopted, and as a consequence, heterozygous haploid errors are commonplace when PLINK 1.07 is used to handle X chromosome data.
The new --split-x and --merge-x flags address this problem.

Given a dataset with no preexisting XY region, --split-x takes the base-pair position boundaries of the pseudo-autosomal region, and changes the chromosome codes of all variants in the region to XY. As (typo-resistant) shorthand, you can use one of the following build codes:

• 'b36'/'hg18': NCBI build 36/UCSC human genome 18, boundaries 2709521 and 154584237
• 'b37'/'hg19': GRCh37/UCSC human genome 19, boundaries 2699520 and 154931044
• 'b38'/'hg38': GRCh38/UCSC human genome 38, boundaries 2781479 and 155701383

By default, PLINK errors out if no variants would be affected by the split. This behavior may break data conversion scripts which are intended to work on e.g. VCF files regardless of whether or not they contain pseudo-autosomal region data; use the 'no-fail' modifier to force PLINK to always proceed in this case.

Conversely, in preparation for data export, --merge-x changes chromosome codes of all XY variants back to X (and 'no-fail' has the same effect). Both of these flags must be used with --make-bed and no other output commands.

Mendel errors

--set-me-missing

In combination with --make-bed, --set-me-missing scans the dataset for Mendel errors and sets implicated genotypes (as defined in the --mendel table) to missing. --mendel-duos causes samples with only one parent in the dataset to be checked, while --mendel-multigen causes (great-)^n grandparental data to be referenced when a parental genotype is missing. It is no longer necessary to combine this with e.g. '--me 1 1' to prevent the Mendel error scan from being skipped. Results may differ slightly from PLINK 1.07 when overlapping trios are present, since genotypes are no longer set to missing before scanning is complete.

Fill in missing calls

--fill-missing-a2

It can be useful to fill in all missing calls in a dataset, e.g. in preparation for using an algorithm which cannot handle them, or as a 'decompression' step when all variants not included in a fileset can be assumed to be homozygous reference matches and there are no explicit missing calls that still need to be preserved.

For the first scenario, a sophisticated imputation program such as BEAGLE or IMPUTE2 should normally be used, and --fill-missing-a2 would be an information-destroying operation bordering on malpractice.

However, sometimes the accuracy of the filled-in calls isn't important for whatever reason, or you're dealing with the second scenario. In those cases you can use the --fill-missing-a2 flag (in combination with --make-bed and no other output commands) to simply replace all missing calls with homozygous A2 calls. When used in combination with --zero-cluster/--set-hh-missing/--set-me-missing, this always acts last.

You may want to combine this with --a2-allele below.

Update variant information

--set-missing-var-ids [template string]
--new-id-max-allele-len [n]
--missing-var-code [missing ID string]

Whole-exome and whole-genome sequencing results frequently contain variants which have not been assigned standard IDs. If you don't want to throw out all of that data, you'll usually want to assign them chromosome-and-position-based IDs.

--set-missing-var-ids provides one way to do this. The parameter taken by these flags is a special template string, with a '@' where the chromosome code should go, and a '#' where the base-pair position belongs. (Exactly one '@' and one '#' must be present.) For example, given a .bim file starting with

    chr1 . 0 10583 A G
    chr1 . 0 886817 C T
    chr1 . 0 886817 CATTTT C
    chrMT . 0 64 T C

"--set-missing-var-ids @:#[b37]" would name the first variant 'chr1:10583[b37]', the second variant 'chr1:886817[b37]', and then error out when naming the third variant, since it would be given the same name as the second variant.
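The template expansion and the resulting name collision can be reproduced with a short Python sketch. It assumes '@' marks the chromosome code and '#' the base-pair position, with the template written as `@:#[b37]`. The function name `assign_missing_ids`, the tuple layout of a .bim row, and the choice to raise `ValueError` on a duplicate are illustrative assumptions, not PLINK's actual code or error message.

```python
def assign_missing_ids(bim_rows, template, missing_code="."):
    """Name unnamed variants from a chromosome/position template.

    bim_rows: list of (chrom, variant_id, cm, bp, allele1, allele2) tuples.
    template: must contain exactly one '@' (chromosome) and one '#' (position).
    """
    if template.count("@") != 1 or template.count("#") != 1:
        raise ValueError("template needs exactly one '@' and one '#'")
    # Assumption: a generated ID must not clash with any ID already in use.
    seen = {row[1] for row in bim_rows if row[1] != missing_code}
    named = []
    for chrom, variant_id, cm, bp, a1, a2 in bim_rows:
        if variant_id == missing_code:
            variant_id = template.replace("@", chrom).replace("#", str(bp))
            if variant_id in seen:
                raise ValueError(f"duplicate variant ID {variant_id!r}")
            seen.add(variant_id)
        named.append((chrom, variant_id, cm, bp, a1, a2))
    return named


# The four-variant example from the text.
EXAMPLE_BIM = [
    ("chr1", ".", "0", 10583, "A", "G"),
    ("chr1", ".", "0", 886817, "C", "T"),
    ("chr1", ".", "0", 886817, "CATTTT", "C"),
    ("chrMT", ".", "0", 64, "T", "C"),
]

if __name__ == "__main__":
    # The first two variants can be named without trouble.
    print(assign_missing_ids(EXAMPLE_BIM[:2], "@:#[b37]"))
    # The third shares chromosome and position with the second, so naming fails.
    try:
        assign_missing_ids(EXAMPLE_BIM, "@:#[b37]")
    except ValueError as err:
        print("naming failed:", err)
```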
(Note that this position overlap is actually present in 1000 Genomes Project phase 1 data.)

To maintain unique IDs in this situation, you can include '$1' and '$2' in your template string as well; these refer to the first and second allele names in ASCII-sort order. So, if we're using a bash shell, we can try again with

    --set-missing-var-ids @:#[b37]\$1,\$2

which would name the first variant 'chr1:10583[b37]A,G', the second variant 'chr1:886817[b37]C,T', the third variant 'chr1:886817[b37]C,CATTTT', and the fourth variant 'chrMT:64[b37]C,T'. Note the extra backslashes: they are necessary in bash because '$' is a reserved character there.

You may still get a small number of duplicate ID errors when using '$1' and '$2'. If indels are involved, it is likely that the ambiguity cannot be resolved by PLINK 1 at all, because it matters which allele is the reference allele [1]. Instead, you must e.g. use a shell script to manually name variants in your original VCF file; see the blog post by Giulio Genovese for a detailed discussion. We apologize for the inconvenience; PLINK 2.0 will extend --set-missing-var-ids to support REF/ALT-based naming templates.

Allele names associated with indels are occasionally very, very long, and the synthetic variant ID names which would be generated from such long alleles are very inconvenient to work with. As a result, when an allele name exceeds 23 characters, it is automatically truncated in the variant ID generated by --set-missing-var-ids. You can use --new-id-max-allele-len to change the limit.

If your pipeline does not use '.' to represent unnamed variants, you can use --missing-var-code to specify a different string to match. For example, "--missing-var-code NA" would be appropriate for a .bim file starting with

    chr1 NA 0 10583 A G
    chr1 NA 0 886817 C T
    chr1 NA 0 886817 CATTTT C
    chrMT NA 0 64 T C

[1] Technically, if you never forget to use --keep-allele-order between VCF file conversion and variant naming, an A1/A2-allele-based naming template would work.
But we cannot justify directly supporting this workflow, since (i) it's too error-prone, and (ii) if you're an advanced user who can get this right, you can just use an awk one-liner on the post-conversion .bim file.

--update-chr [filename] {chr col. number} {variant ID col.} {skip}
--update-cm [filename] {cm col. number} {variant ID col.} {skip}
--update-name [filename] {new ID col. number} {old ID col.} {skip}
--update-map [filename] {bp col. number} {variant ID col.} {skip}
--update-alleles [filename]
--allele1234
--alleleACGT

(Also see --cm-map, which is an alternative to --update-cm.)

--update-chr, --update-cm, --update-map, and --update-name update variant chromosomes, centimorgan positions, base-pair positions, and IDs, respectively. By default, the new value is read from column 2 and the (old) variant ID from column 1, but you can adjust these positions with the second and third parameters. The optional fourth 'skip' parameter is either a nonnegative integer, in which case it indicates the number of lines to skip at the top of the file, or a single nonnumeric character, which causes each line with that leading character to be skipped. (Note that, if you want to specify '#' as the skip character, you need to surround it with single- or double-quotes in some Unix shells.)

Strictly speaking, you can use Unix tail, cut, paste, and/or sed to perform the same job (albeit with more time and hassle) as the three optional parameters we have introduced. If you have not used these Unix commands before, we recommend that you familiarize yourself with what they do because they are still likely to come in handy in other scenarios.

You can combine --update-chr, --update-cm, and/or --update-map in the same run. (However, to avoid confusion regarding whether old or new variant IDs apply, we force --update-name to be run separately.)

When invoking --update-chr, you now must use --make-bed in the same run, and no other output commands. Otherwise, we still recommend that you use --make-bed once instead of --update-...
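The column and skip parameters of the --update-* flags can be pictured with the following Python sketch, which reads an update file into a lookup table and applies it to base-pair positions in the style of --update-map. It is a simplified model under stated assumptions: the function names `read_update_file` and `apply_update_map` are hypothetical, fields are assumed to be whitespace-delimited, and variants absent from the update file are left unchanged.

```python
def read_update_file(lines, new_col=2, id_col=1, skip=None):
    """Parse update-file lines into {variant_id: new_value}.

    new_col / id_col: 1-based column numbers (defaults: value in column 2,
                      variant ID in column 1, as described in the text).
    skip: None, a nonnegative int (number of leading lines to drop), or a
          single nonnumeric character (drop every line starting with it).
    """
    lines = list(lines)
    skip_char = None
    if isinstance(skip, int):
        lines = lines[skip:]
    elif skip is not None:
        skip_char = skip
    updates = {}
    for line in lines:
        if skip_char is not None and line.startswith(skip_char):
            continue
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        updates[fields[id_col - 1]] = fields[new_col - 1]
    return updates


def apply_update_map(bim_rows, updates):
    """Replace base-pair positions for variants listed in `updates`."""
    return [
        (chrom, vid, cm, int(updates.get(vid, bp)), a1, a2)
        for chrom, vid, cm, bp, a1, a2 in bim_rows
    ]


if __name__ == "__main__":
    update_lines = ["#id new_bp", "rs1 1500", "rs2 2500"]
    table = read_update_file(update_lines, skip="#")
    bim = [("1", "rs1", "0", 1000, "A", "G"), ("1", "rs3", "0", 3000, "C", "T")]
    print(apply_update_map(bim, table))
```

Passing `skip=1` instead of `skip="#"` gives the same table for this input, which mirrors the two forms of the skip parameter.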
