1 Setup

In order to run the examples in this section you might need to obtain some software and data:

software source documentation
perl www.perl.org www.perl.org/docs, perl one-liners
bcftools samtools.github.io/bcftools/ BCFtools HowTo (Danecek et al. 2021)
64-bit Java www.java.com FAQ
beagle JAR file docs (Browning, Zhou, and Browning 2018)

1.1 Linux

Users of Debian/Ubuntu Linux can easily install these software dependencies with apt:

sudo apt install bcftools beagle
 
# test java setup  
sudo R CMD javareconf

1.2 Windows

Windows users can instead use the WSL, which can be used to install Ubuntu, or MobaXterm. In the latter case, you can install bcftools as follows:

apt-get install perl_base make git gcc-core zlib-devel liblzma-devel libbz2-devel libcurl-devel
git clone --recurse-submodules https://github.com/samtools/htslib.git
git clone https://github.com/samtools/bcftools.git
cd bcftools/
make
# this might throw errors, but the main binary will be produced anyway
ls -ltr bcftools.exe

64-bit Java should be installed from www.java.com

The beagle JAR file should be downloaded and placed in a folder accessible from MobaXterm.

1.3 Sample data and code

For this course we will be using two compressed VCF files that will be provided in the Moodle platform.

However, external users can still follow the examples by cloning this course to get copies of sample data and code, which are in folders test_data and test_code:

git clone https://github.com/eead-csic-compbio/bioinformatics.git

2 Practicals

2.1 The VCF format

The de facto standard data format for genotyping studies is the Variant Call Format (VCF). In this section we will see what the VCF format looks like. Let’s inspect the example below, which is modified from the official VCFv4.3 format specification:

##fileformat=VCFv4.3
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a..,species="Homo sapiens",taxonomy=x>
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP 0/0:49:3 0/1:3:5 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP 1/2:21:6 2/1:2:2 ./.:.:.
20 1230237 . T . 47 PASS NS=3;DP=13 GT:GQ:DP 0|0:54:7 0|0:48:4 0/0:61:2

First, take your time to inspect the metadata in the header of the file, which are the lines that start with ##. These lines inform you about

  • the format (ie VCFv4.3)
  • the software that created the file
  • the reference genome sequence where variants polymorphisms are mapped (ie human chr 20).

In addition you should check the INFO, FILTER and FORMAT lines, as they define the fields that you will see in the actual data lines below. Note that these can change depending on the software / protocols used to produced the VCF file.

Then you can see the header line, with some fixed columns, starting with #CHROM, and then the actual names of the samples in the file (ie NA0001).

Data lines show actual data for all loci genotyped. In the example there are only three loci, in particular positions 17330, 1110696 and 1230237 of the human chr20. The (human) samples are diploid.

The first one is a biallelic Single Nucleotide Polymorphism (SNP) where the reference genome is T (thymine) and the alternative allele is A (adenine, see their structure here). Note that in this case the first sample is TT (0/0), the second is TA (0/1) and the third again TT (0/0). Note also that for each sample in the example there are three fields separated by “:”, in the order GT:GQ:DP. Thus, you now know that the read depth (DP) of these calls is 3, 5 and 3 respectively.

The second locus is a named SNP with two alternative alleles and the last sample is missing data.

The third locus shows two phased genotypes (0|0), which differ from unphased in that the underlying haplotypes are known.

2.2 The BCF format

This is how the BCF format is described at the official VCFv4.2 format specification:

VCF is very expressive, accommodates multiple samples, and is widely used in the community. Its biggest drawback is that it is big and slow. Files are text and therefore require a lot of space on disk. A normal batch of ̃10 exomes is a few GB, but large-scale VCFs with thousands of exome samples quickly become hundreds of GBs.

BCF2 is a binary, compressed equivalent of VCF that can be indexed with tabix and can be efficiently decoded from disk or streams. For efficiency reasons BCF2 only supports a subset of VCF, in that all info and genotype fields must have their full types specified.

2.3 Preparing, publishing and sharing your own VCF files

The following statements are taken from (Beier et al. 2022):

Genotyping data is often published and shared without sufficient metadata to ensure interoperability and reuse. Some information that should be recorded cannot be easily retrieved from the analysis results, such as the identification of biological material studied, or the reference genome assembly and version used.

Depending on who is handling the data and what skills are associated with the role, the difficulty of providing well-formatted metadata will vary.

  • Bioinformaticians who have directly performed the genotyping analyses and thus the creation of the VCF files will consider it a comparatively simple task to enter metadata directly into the file.
  • Gathering experimental data from conversations with wet lab colleagues or in a laboratory information management system (LIMS) search will be the more laborious.
  • Experimentalists who have little or no experience with the required metadata formats are most likely to be overwhelmed without a simple GUI or input template.
  • Principal investigators who want to submit the data at the end of an experiment may have similar difficulties.

From these observations, there is an urgent need for supporting tools or APIs for the structural and content validation of VCFs. We recommend performing both metadata and data validation. For the validation of VCF files, we recommend EBI’s VCF validator.

When the time comes to publish your own genotyping data, we encourage to follow the guidelines at (Beier et al. 2022) and to submit the data to the European Variation Archive to ensure long-term availability (Cezard et al. 2021).

2.4 Mixing allele calls from different experiments

It is common to mix genotyping data from different experiments and laboratories. For instance you might want to enrich your own data with other genotypes already published from collaborators or from you own previous experiments.

In this section we will visit frequent common problems that arise in this situation.

Note that having some genotypes, ie a couple barley cultivars, analyzed in separate experiments will be of great help, as we can use them as controls and guides to make sure the data is correct. For instance, at EEAD-CSIC we will typically add cv Morex to our new genotyping experiments.

2.4.1 Different reference genome sequence

If you have two properly curated VCF files, inspecting the metadata will tell you whether the reference sequence used to map variants is the same or not. Note that reference sequences are usually in FASTA format, which looks like this:

>chr1H
CGTGCACGACCATCGAGACGTCGCGGAAACTCGTCGTTTTTGTCGTTCGGGCCACTTTCA
TGGGCTATAGCACACGGTATTGGGGTCCCGATTCAATTTTGGATTCTC...

Let’s check the headers of two VCF files that use two versions of the barley reference genome (cultivar Morex) that correspond to assemblies GCA_902498975.1 and GCA_904849725.1:

##fileformat=VCFv4.3
##fileDate=20181104
##reference_ac=GCA_902498975.1
##reference_url=“ftp.ncbi.nlm.nih.gov/genomes/all/GCA/902/498/975/GCA_902498975.1_Morex_v2.0/GCA_902498975.1_Morex_v2.0_genomic.fna.gz”

and

##fileformat=VCFv4.3
##fileDate=20220112
##reference_ac=GCA_904849725.1
##reference_url=“https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/904/849/725/GCF_904849725.1_MorexV3_pseudomolecules_assembly/GCF_904849725.1_MorexV3_pseudomolecules_assembly_genomic.fna.gz”

If the metadata in the “##contig” line contains a md5 sum that can also be used to check the reference, but note that a different FASTA header is enough to change it.

Another quick check is to check the chromosome/contig names of the VCF files, which you can do as follows:

# use cat if file not compressed
zcat test_data/vcf/samples50.vcf.gz | grep -v "^#" | cut -f 1 | sort -u

If you are not sure then you must check explicitly whether the references are the same before any joint analysis can be done. There are different ways to do this; for instance you can use the command line:

# get two VCF files, ie those at test_data/vcf/
zcat test_data/vcf/samples50.vcf.gz | wc
zcat test_data/vcf/samples100.vcf.gz | wc

# extract CHR,POS,REF columns from first VCF
zcat test_data/vcf/samples50.vcf.gz | perl -lane 'next if(/^#/); print "$F[0]\t$F[1]"' > pos50.txt

# compute intersection with grep
zgrep -f pos50.txt test_data/vcf/samples50.vcf.gz | cut -f 1,2,4 | head 
zgrep -f pos50.txt test_data/vcf/samples100.vcf.gz | cut -f 1,2,4 | head

If the references are indeed different you have two choices:

  • Remap the reads used to build one of the VCF files on the same reference used for the other. This is probably the safest, as there might be large rearrangements among different genome versions, and some contigs might even change strand. However, this is time consuming and requires a little bioinformatics.

  • Lift-over positions from reference1 to reference2 and then edit the VCF files accordingly. Most likely this will result is dropping a fraction of the original mappings. This task also requires bioinformatics, you can follow my notes in folder test_code/liftover

2.4.2 Different platforms

Mixing genotype calls from different platforms might cause problems as SNPs and alleles might not be named equally or places in the same chromosome strand, even the reference genome sequence is the same.

To illustrate this we’ll have a look at genotypes called with Illumina GoldenGate and Infinium Assays, which are quite common in barley breeding. When you get the results from your genotyping provider you should receive a manifest file, which for the so called barley Illumina 50k chips looks like this:

head test_data/manifest.csv

The key columns in the manifest are CustomerStrand and IlmnStrand, which according to the documentation (Illumina 2021), are defined as follows:

  • IlmnStrand: The strand used to design each probe is designated as TOP/BOT for SNPs as indicated in the ‘IlmnStrand’ column of the manifest. The alleles in the ‘SNP’ column are on the IlmnStrand.

  • CustomerStrand: The strand submitted to the Illumina designer by Illumina or a customer is designated as TOP/BOT as indicated in the ‘SourceStrand’ column of the manifest.

These two pieces of information really mean that the actual allele used to design a probe might be actually different to the one submitted, which is usually what the researcher want. This system has created a lot of confusion (Nelson et al. 2012; Verma et al. 2014; Zhao et al. 2017) as SNPs genotyped in other platforms don’t follow this convention.

It is frequently assumed that the first allele in a SNP (ie A in [A/T]) corresponds to the allele found in the reference genome forward sequence. If that’s the case and that’s the type of sequence submitted to Illumina to produce a genotyping kit, then you can convert Illumina calls as follows:

\[ \begin{cases} call \text{ if } CustomerStrand = \text{TOP} \\ complement(call) \text{ if } CustomerStrand = \text{BOT} \end{cases} \]

Nevertheless, you might encounter cases where the reference allele is not the one in the genome sequence but perhaps the most common in a large population. For this reason, care should be taken when doing these conversions to make sure they’re correct. The recipe about computing identity should help.

2.5 Handling VCF files with the Linux terminal (bcftools, beagle, one-liners)

In this section we will see typical operations that are done on VCF files to prepare for further analyses.

The following examples probably require some setup.

2.5.1 Compressing and indexing VCF files

A compressed VCF file, usually with extension .vcf.gz, takes less disk space and can be efficiently handled as well. However, it needs to be indexed before. Note that the compression format is called bgzip.

bcftools view -I -o sample.vcf.gz -O z sample.vcf

You could check the format and index it as follows:

file sample.vcf.gz

bcftools index sample.vcf.gz
# should produce sample.vcf.gz.csi

If a VCF is uncompressed there’s no need to index it.

2.5.2 Renaming chromosome names and removing unwanted data fields from VCF files

Should you need to edit the chromosome names or to remove unwanted SNP data you can do that with a command such as the next one, which removes fields GQ, AD, RO, QR and QA:

bcftools annotate --rename-chrs chrnames.tsv \
        -x INFO,FORMAT/GQ,FORMAT/AD,FORMAT/RO,FORMAT/QR,FORMAT/QA -o GBS.vcf \
        sample.vcf

Note this requires a TSV file named chrnames:

LR890096.1      chr1H
LR890097.1      chr2H
LR890098.1      chr3H
LR890099.1      chr4H
LR890100.1      chr5H
LR890101.1      chr6H
LR890102.1      chr7H

2.5.3 Extracting samples from VCF files

Often you might need to extract only selected samples from a larger VCF file. To do we should first find out which genotypes are included in such a file:

bcftools query -l sample.vcf.gz

You can use that information to prepare a subset.txt file such as this one:

sample1
sample3
sample25

Finally, you can now substract the relevant VCF content like this:

bcftools view -S subset.txt -O z -o subset.sample.vcf.gz sample.vcf.gz
bcftools index subset.sample.vcf.gz

2.5.4 Merging separate VCF files

bcftools merge -o samples12.vcf.gz -O z samples1.vcf.gz samples2.vcf.gz

2.5.5 Computing sequence identity among samples

A good quality control before you merge different VCF files is to check whether the called genotypes of repeated control samples, ie cultivar Morex in barley, match across experiments.

The following example code does this for two samples in the same VCF file, which we know are the same from the same cultivar. Note that in this case they correspond to columns 10 and 17 of the file, you should adapt this to your own data:

# cut is 1-based, first column is 1
zcat sample.vcf.gz | cut -f 10,17 | perl -lane \
  'next if(/\.\/\./); if(/(\S\/\S).*?\t(\S\/\S)/){ if($1==$2){ $eq++ } else { $neq++} } END{ printf("%1.3f n=%d\n",100*$eq/($eq+$neq),$eq+$neq)}'

On a real dataset with DP=5 at hand I obtained:

99.565 n=41570

Note that the numbers I got with DP=3 where the following:

99.056 n=51082

This shows that read depth does have an effect on the accuracy of called genotypes.

2.5.6 Filtering out by depth

As we saw in the previous example, read depth matters. Here we will see how to filter a VCF file by read depth. In this example we take multiallelic SNPs with DP>=5, assuming that DP is the second field ($d[1]):

# perl is 0-based, first elem is 0, second is 1, last if $#F
zcat sample.vcf.gz | perl -lane \
  'if(/^#/){ print } elsif(length($F[4]) > 1){ next } else { foreach $c (9 .. $#F){ @d = split(/:/,$F[$c]); if($d[1]<5){$d[0]="./."}; $F[$c]=join(":",@d)  }; print join("\t",@F) }'  > sample.DP5.vcf

2.5.7 Filtering out by missing data

To select SNPs with a max fraction of missing data (./.) the following one-liner would do:

zcat sample.vcf.gz | perl -lane 'if(/^#/){ print } else { $M=0; while(/\.\/\.:/g){$M++}; $M/=($#F-8); print if($M <= 0.2) }' > sample.MISS0.2.vcf

2.5.8 Filtering by frequency of minor allele (MAF)

# assumes individuals are essentially homozygous
# R is the reference allele
# A is the alternative allele
# MAF is set to 0.01
zcat sample.vcf.gz | perl -lne \
  'if(/^#/){ print } else { $R=$A=0; while(/0\/0:/g){$R++}; while(/1\/1:/g){$A++}; $t=$A+$R; print if($A && $R && $A/$t >= 0.01 && $R/$t >= 0.01) } > sample.MAF0.01.vcf

2.5.9 Taking only biallelic data

After cleaning out the data often the next analysis requires biallelic SNPs from high-quality samples. This can be done with:

bcftools view -m2 -M2 -v snps -S good.subset.txt -O z -o good.samples.bi.vcf.gz samples.vcf

2.5.10 Imputing missing data

After the previous steps you might want to impute the missing data, a step in which missing genotype calls are inferred from the existing calls in order to increase the total number of marker SNPs. We can do this with the software Beagle v5 (Browning, Zhou, and Browning 2018), which is efficient and has been shown to work well in wheat (He et al. 2015) and yields an accuracy of [0.977 -0.991] in our tests with a panel of 500 barleys genotyped by sequencing (GBS) with MAF=0.01.

# Linux binary
beagle gt=samples.vcf.gz out=samples.imputed

# downloaded JAR
java -jar beagle.22Jul22.46e.jar gt=samples.vcf.gz out=samples.imputed

# if using several CPU threads
java -jar beagle.22Jul22.46e.jar nthreads=8 gt=samples.vcf.gz out=samples.imputed

# if you need more RAM
java -Xmx14564m -jar beagle.22Jul22.46e.jar gt=samples.vcf.gz out=samples.imputed

Note that beagle will add phased genotypes, ie 0|0 instead of the standard unphased 0/0, and by default will leave only the GT field for each called genotype. This means read depth (DP) is lost after this step.

Bibliography

Beier, S, A Fiebig, C Pommier, I Liyanage, M Lange, PJ Kersey, S Weise, et al. 2022. “Recommendations for the Formatting of Variant Call Format (VCF) Files to Make Plant Genotyping Data FAIR [Version 2; Peer Review: 2 Approved].” F1000Research 11 (231). https://doi.org/10.12688/f1000research.109080.2.
Browning, Brian L., Ying Zhou, and Sharon R. Browning. 2018. “A One-Penny Imputed Genome from Next-Generation Reference Panels.” The American Journal of Human Genetics 103 (3): 338–48. https://doi.org/https://doi.org/10.1016/j.ajhg.2018.07.015.
Cezard, Timothe, Fiona Cunningham, Sarah E Hunt, Baron Koylass, Nitin Kumar, Gary Saunders, April Shen, et al. 2021. The European Variation Archive: a FAIR resource of genomic variation for all species.” Nucleic Acids Research 50 (D1): D1216–20. https://doi.org/10.1093/nar/gkab960.
Danecek, P., J. K. Bonfield, J. Liddle, J. Marshall, V. Ohan, M. O. Pollard, A. Whitwham, et al. 2021. Twelve years of SAMtools and BCFtools.” Gigascience 10 (2).
He, S., Y. Zhao, M. F. Mette, R. Bothe, E. Ebmeyer, T. F. Sharbel, J. C. Reif, and Y. Jiang. 2015. Prospects and limits of marker imputation in quantitative genetic studies in European elite wheat (Triticum aestivum L.).” BMC Genomics 16 (March): 168.
Illumina, Inc. 2021. How to interpret DNA strand and allele information for Infinium genotyping array data.” https://emea.support.illumina.com/bulletins/2017/06/how-to-interpret-dna-strand-and-allele-information-for-infinium-.html.
Nelson, Sarah C., Kimberly F. Doheny, Cathy C. Laurie, and Daniel B. Mirel. 2012. “Is ‘Forward’ the Same as ‘Plus’?…and Other Adventures in SNP Allele Nomenclature.” Trends in Genetics 28 (8): 361–63. https://doi.org/https://doi.org/10.1016/j.tig.2012.05.002.
Verma, S. S., M. de Andrade, G. Tromp, H. Kuivaniemi, E. Pugh, B. Namjou-Khales, S. Mukherjee, et al. 2014. Imputation and quality control steps for combining multiple genome-wide datasets.” Front Genet 5: 370. https://doi.org/10.3389/fgene.2014.00370.
Zhao, Shilin, Wang Jing, David C Samuels, Quanghu Sheng, Yu Shyr, and Yan Guo. 2017. Strategies for processing and quality control of Illumina genotyping arrays.” Briefings in Bioinformatics 19 (5): 765–75. https://doi.org/10.1093/bib/bbx012.