Installing Dependencies in R

For this tutorial you need a few R libraries. Bioconductor has packages for bioinfmatics. The following command will install the libraries used in the tutorial.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.18")
BiocManager::install(c("rtracklayer", "ggbio", "knitr", "ggplot2", "tidyr", "dplyr","VariantAnnotation"))

The documentation of the packages used is in:

Reading the VCF file

The VCF file is a file format that contains SNPs in a database. We will be working with a selection of SNPs from the following paper:

Elizabeth S. A. Sollars, et al (2016) Genome sequence and genetic diversity of European ash trees. Nature 541: 212–216 doi:10.1038/nature20786

Loading the data

For this analysis we will need several R libraries to do the analysis, filtering and plot.

library(rtracklayer)
library(knitr)
library(ggplot2)
library(ggbio)
library(tidyr)
library(dplyr)
library(VariantAnnotation)
vcf <- readVcf(file = "extractedContig1.6.23.vcf.gz")

VCF header

You can use the following command to see all the data that is stored in the VCF

head(vcf)

## class: CollapsedVCF 
## dim: 6 37 
## rowRanges(vcf):
##   GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
##   DataFrame with 45 columns: NS, DP, DPB, AC, AN, AF, RO, AO, PRO, PAO, QR, ...
## info(header(vcf)):
##                        Number Type    Description                              
##    NS                  1      Integer Number of samples with data              
##    DP                  1      Integer Total read depth at the locus            
##    DPB                 1      Float   Total read depth per bp at the locus; ...
##    AC                  A      Integer Total number of alternate alleles in c...
##    AN                  1      Integer Total number of alleles in called geno...
##    AF                  A      Float   Estimated allele frequency in the rang...
##    RO                  1      Integer Reference allele observation count, wi...
##    AO                  A      Integer Alternate allele observations, with pa...
##    PRO                 1      Float   Reference allele observation count, wi...
##    PAO                 A      Float   Alternate allele observations, with pa...
##    QR                  1      Integer Reference allele quality sum in phred    
##    QA                  A      Integer Alternate allele quality sum in phred    
##    PQR                 1      Float   Reference allele quality sum in phred ...
##    PQA                 A      Float   Alternate allele quality sum in phred ...
##    SRF                 1      Integer Number of reference observations on th...
##    SRR                 1      Integer Number of reference observations on th...
##    SAF                 A      Integer Number of alternate observations on th...
##    SAR                 A      Integer Number of alternate observations on th...
##    SRP                 1      Float   Strand balance probability for the ref...
##    SAP                 A      Float   Strand balance probability for the alt...
##    AB                  A      Float   Allele balance at heterozygous sites: ...
##    ABP                 A      Float   Allele balance probability at heterozy...
##    RUN                 A      Integer Run length: the number of consecutive ...
##    RPP                 A      Float   Read Placement Probability: Phred-scal...
##    RPPR                1      Float   Read Placement Probability for referen...
##    RPL                 A      Float   Reads Placed Left: number of reads sup...
##    RPR                 A      Float   Reads Placed Right: number of reads su...
##    EPP                 A      Float   End Placement Probability: Phred-scale...
##    EPPR                1      Float   End Placement Probability for referenc...
##    DPRA                A      Float   Alternate allele depth ratio.  Ratio b...
##    ODDS                1      Float   The log odds ratio of the best genotyp...
##    GTI                 1      Integer Number of genotyping iterations requir...
##    TYPE                A      String  The type of allele, either snp, mnp, i...
##    CIGAR               A      String  The extended CIGAR representation of e...
##    NUMALT              1      Integer Number of unique non-reference alleles...
##    MEANALT             A      Float   Mean number of unique non-reference al...
##    LEN                 A      Integer allele length                            
##    MQM                 A      Float   Mean mapping quality of observed alter...
##    MQMR                1      Float   Mean mapping quality of observed refer...
##    PAIRED              A      Float   Proportion of observed alternate allel...
##    PAIREDR             1      Float   Proportion of observed reference allel...
##    technology.ILLUMINA A      Float   Fraction of observations supporting th...
##    ANN                 .      String  Functional annotations: 'Allele | Anno...
##    LOF                 .      String  Predicted loss of function effects for...
##    NMD                 .      String  Predicted nonsense mediated decay        
## geno(vcf):
##   List of length 8: GT, GQ, GL, DP, RO, QR, AO, QA
## geno(header(vcf)):
##       Number Type    Description                                               
##    GT 1      String  Genotype                                                  
##    GQ 1      Float   Genotype Quality, the Phred-scaled marginal (or uncondi...
##    GL G      Float   Genotype Likelihood, log10-scaled likelihoods of the da...
##    DP 1      Integer Read Depth                                                
##    RO 1      Integer Reference allele observation count                        
##    QR 1      Integer Sum of quality of the reference observations              
##    AO A      Integer Alternate allele observation count                        
##    QA A      Integer Sum of quality of the alternate observations

Looking at the qualities of the

To get the a GRanges object with the common data the function rowRanges can be used. This let us query questions like the distribution of the quality of the SNPs.

variations <- rowRanges(vcf)
head(variations, 3)

## GRanges object with 3 ranges and 5 metadata columns:
##                    seqnames    ranges strand | paramRangeID            REF
##                       <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet>
##    Contig23:70_A/T Contig23        70      * |           NA              A
##    Contig23:72_T/A Contig23        72      * |           NA              T
##   Contig23:129_T/G Contig23       129      * |           NA              T
##                                   ALT      QUAL      FILTER
##                    <DNAStringSetList> <numeric> <character>
##    Contig23:70_A/T                  T   1826.92           .
##    Contig23:72_T/A                  A   1884.80           .
##   Contig23:129_T/G                  G   4029.34           .
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

Summary of qualities

summary(variations$QUAL)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    50.17  3830.37  8543.81  7804.08 11484.65 56504.60

Visalising the distribution

To get an idea of the overall quality of the data, a histogram is useful. The SNPs with lower quality tend to be noisy. For this project, we validated some random SNPs using KASP across different qualitues, from 50 to 500 and we concluded that 300 was a good spot. This histogram helped us to decided the thresholds to test, as most of the tutorials online deem QUAL>50 as good quality.

ggplot(variations, aes(QUAL)) + geom_histogram(binwidth = 50) + theme_bw() +
   coord_cartesian(xlim=c(0,20000)) + geom_vline(xintercept=300, color="red")

Depth for individual samples

We can look at the depth of coverage of various samples. This can be useful to find if a particular sample has low quality.

DP <-geno(vcf)$DP 
summary_dp <- DP[is.na(DP)] <- 0
long_dp <- gather(data.frame(DP),value = "DP", key="Sample")
long_dp %>%
  group_by(Sample) %>%
  summarise(
    n = n(),
    mean_DP = mean(DP, na.rm = TRUE),
    std_DP  = sd(DP, na.rm = TRUE)
  ) -> summary_dp

Sample	n	mean_DP	std_DP
Fex1	20620	7.306353	9.729400
Fex10	20620	10.898739	14.177238
Fex11	20620	9.175849	13.543446
Fex12	20620	12.205286	16.032310
Fex13	20620	14.063531	19.484195
Fex14	20620	9.818671	13.411106
Fex15	20620	10.319981	13.646079
Fex16	20620	13.047769	18.056399
Fex17	20620	12.447769	16.089833
Fex18	20620	14.220369	22.543101
Fex19	20620	10.075412	14.050866
Fex2	20620	12.068865	16.972320
Fex20	20620	15.964161	19.647557
Fex21	20620	12.637876	15.395132
Fex22	20620	11.289137	15.800327
Fex23	20620	11.863822	17.689616
Fex24	20620	10.070466	18.857208
Fex25	20620	8.327158	9.791787
Fex26	20620	13.012221	17.368032
Fex27	20620	10.450534	13.977195
Fex28	20620	12.704850	18.177778
Fex29	20620	10.298885	14.201916
Fex3	20620	9.199466	12.967614
Fex30	20620	10.940349	14.018360
Fex31	20620	7.026285	9.781015
Fex32	20620	8.080068	10.486800
Fex33	20620	9.358584	11.326032
Fex34	20620	10.241804	10.995112
Fex35	20620	9.575121	10.794247
Fex36	20620	10.823909	11.976831
Fex37	20620	8.476285	10.591030
Fex4	20620	9.694956	15.275143
Fex5	20620	10.755383	13.830538
Fex6	20620	9.284627	11.856945
Fex7	20620	9.811930	12.764079
Fex8	20620	8.646848	11.660582
Fex9	20620	7.525897	9.669511

Can you make a summary per contig?
Can you make a summary per contig and per sample?

Distribution across sample

You can focus on the distribution of a few samples.

long_dp %>% filter( Sample %in% c("Fex1", "Fex2")) %>% 
ggplot() +  
geom_density(aes(x=DP, colour=Sample, fill=Sample), alpha=0.5) + 
coord_cartesian(xlim = c(0,100))

Displaying all the samples

For all the samples, you can show just a boxplot,at the cost of having less resultion of the shape of the distribution

long_dp %>% 
ggplot() +  
geom_boxplot(aes(y=DP, x=Sample), alpha=0.5) + 
coord_cartesian(ylim = c(0,25)) + 
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Questions for you

How many seqnames are present?
Plot the count of genotypes on each sequence.
Can you calculate the allele frequency (\(\frac{R0}{DP}\)) for each sample? Plot 2 or 3 samples.
Plot the distribution of several samples together (Team1: Sample Fex1-Fex5, Team 2: Fex6-Fex10, etc)
Are there differences in in DP across different scaffolds? Try to figure out how to extract the data for a single reference.
- Maybe directly from the VCF object, or exporting the DP matrix and filter by the rowname.
Do you think the samples are homozygous, heterozygous or a mix of both (you can calculate the distribution of allele frequencies or look at the GT column).
To compare all the sample, overlapping all the distributions looks messy. Plot the DP for all the samples. Try geom_boxplot and facet_grid. Which one you find clearer? Is there any sample that has a significant lower coverage?

How many `seqnames` are present?

seqnames(vcf)

## factor-Rle of length 20620 with 3 runs
##   Lengths:     5473     9517     5630
##   Values : Contig23 Contig1  Contig6 
## Levels(3): Contig23 Contig1 Contig6

Plot the count of genotypes on each sequence.

We need to produce a dataframe with the contig names and the GT column

GT <-geno(vcf)$GT
GT <- data.frame(GT)
GT$seqnames <- data.frame(seqnames(variations))[,1]
GT %>% gather( -seqnames ,key = "Sample", value = "GT") -> long_gt

long_gt %>%
  group_by(Sample, GT, seqnames) %>%
  summarise(
    n = n(),
    GT = unique(GT, na.rm = TRUE)
  )

## `summarise()` has grouped output by 'Sample', 'GT'. You can override using the
## `.groups` argument.

## # A tibble: 492 × 4
## # Groups:   Sample, GT [185]
##    Sample GT    seqnames     n
##    <chr>  <chr> <fct>    <int>
##  1 Fex1   .     Contig23   201
##  2 Fex1   .     Contig1   1322
##  3 Fex1   .     Contig6    290
##  4 Fex1   ./.   Contig1      1
##  5 Fex1   0/0   Contig23   200
##  6 Fex1   0/0   Contig1    512
##  7 Fex1   0/0   Contig6    217
##  8 Fex1   0/1   Contig23   131
##  9 Fex1   0/1   Contig1    165
## 10 Fex1   0/1   Contig6    113
## # ℹ 482 more rows

Plot the count of genotypes on each sequence.

Now we can plot from the GT data frame and plot the counts

 long_gt %>% 
filter( Sample %in% c("Fex1", "Fex2", "Fex3", "Fex4", "Fex5")) %>% 
ggplot( aes(GT, fill=GT)) + geom_bar( ) + facet_grid(seqnames~Sample)

Can you calculate the allele frequency (\(\frac{RO}{DP}\)) for each sample? Plot 2 or 3 samples.

DP <-geno(vcf)$DP 
RO <-geno(vcf)$RO 
ro_freq <- data.frame(RO/DP)

ggplot(ro_freq, aes(Fex1)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 1813 rows containing non-finite values (`stat_bin()`).

Plot the distribution of several samples together (Team1: Sample Fex1-Fex5, Team 2: Fex6-Fex10, etc)

long_ro_freq <- gather(ro_freq ,value = "RO_freq", key="Sample")
long_ro_freq %>% filter( Sample %in% c("Fex1", "Fex3","Fex4", "Fex2","Fex5","Fex6")) %>% 
ggplot() +  
geom_histogram(aes(x=RO_freq), alpha=0.5) + facet_wrap(Sample ~ .)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 8212 rows containing non-finite values (`stat_bin()`).

Are there differences in in DP across different scaffolds? Try to figure out how to extract the data for a single reference.

Hint: directly from the VCF object, or exporting the DP matrix and filter by the rowname.

DP <-geno(vcf)$DP
DP <- data.frame(DP)
DP$seqnames <- data.frame(seqnames(variations))[,1]
DP %>% gather(  -seqnames ,value = "DP", key="Sample") -> long_dp

The coverage is very similar across sequences

long_dp %>%
  group_by(seqnames) %>%
  summarise(
    n = n(),
    DP = mean(DP, na.rm = TRUE)
  )

## # A tibble: 3 × 3
##   seqnames      n    DP
##   <fct>     <int> <dbl>
## 1 Contig23 202501  11.7
## 2 Contig1  352129  11.0
## 3 Contig6  208310  11.3

Most of the variation comes from the sample

long_dp %>%
  group_by(Sample) %>%
  summarise(
    n = n(),
    DP = mean(DP, na.rm = TRUE)
  )

## # A tibble: 37 × 3
##    Sample     n    DP
##    <chr>  <int> <dbl>
##  1 Fex1   20620  8.01
##  2 Fex10  20620 11.8 
##  3 Fex11  20620  9.75
##  4 Fex12  20620 12.8 
##  5 Fex13  20620 14.7 
##  6 Fex14  20620 10.5 
##  7 Fex15  20620 11.1 
##  8 Fex16  20620 13.7 
##  9 Fex17  20620 13.2 
## 10 Fex18  20620 14.8 
## # ℹ 27 more rows

To think about…

Do you think the samples are homozygous, heterozygous or a mix of both (you can calculate the distribution of allele frequencies or look at the GT column).

The samples look homozygous, with small regions with heterozygous calls.

To compare all the sample, overlapping all the distributions looks messy. Plot the DP for all the samples. Try geom_boxplot and facet_grid. Which one you find clearer? Is there any sample that has a significant lower coverage? This is a personal choice. The plots above reflect it.

Filtering.

Based on the discussion before, using bcftools filger out for files with quality over 300. Read the documentation in: https://samtools.github.io/bcftools/bcftools.html#expressions to find more expression.

First simplify the VCF file to only include the GT,AO and RO files.

bcftools annotate -x "INFO,^FORMAT/GT,FORMAT/AO,Format/RO" extractedContig1.6.23.vcf.gz | bgzip -c > extractedContig1.6.23.anno.vcf.gz

beagle requires that even missing values correspond to the ploidy, so we need to change . in the genotype to ./.

gunzip -c extractedContig1.6.23.anno.vcf.gz | sed -e 's/    \.:/    .\/.:/g' | bgzip -c >  extractedContig1.6.23.anno.fixed.vcf.gz

Now we can run the filtering of the formatted file

bcftools filter -e 'QUAL<300' extractedContig1.6.23.anno.fixed.vcf.gz  -O z -o extractedContig1.6.23.anno.fixed.QUAL300.vcf.gz

Can you design a filter to remove only with heterozygous calls on line Fex10? Hint: look for FORMAT/DP in the documentation above.
How does the number of heterozygous calls are affected on each line after filtering? Hint: You need to look at the GT tutorial above.

Phasing.

We will use beagle to do the phasing and the imputing: http://faculty.washington.edu/browning/beagle/beagle.html

The documentation is here: http://faculty.washington.edu/browning/beagle/beagle_5.1_08Nov19.pdf

java -jar beagle.21Sep19.ec3.jar gt=extractedContig1.6.23.anno.fixed.QUAL300.vcf.gz  out=out.gt

Now, you can load out.gt.vcf.gz to see the results

Do you see a change in the number of heterozygous samples?

Data filtering

Installing Dependencies in R

Reading the VCF file

Loading the data

VCF header

Looking at the qualities of the

Summary of qualities

Visalising the distribution

Depth for individual samples

Distribution across sample

Displaying all the samples

Questions for you

How many seqnames are present?

Plot the count of genotypes on each sequence.

Plot the count of genotypes on each sequence.

Can you calculate the allele frequency (\(\frac{RO}{DP}\)) for each sample? Plot 2 or 3 samples.

Plot the distribution of several samples together (Team1: Sample Fex1-Fex5, Team 2: Fex6-Fex10, etc)

Are there differences in in DP across different scaffolds? Try to figure out how to extract the data for a single reference.

To think about…

Filtering.

Phasing.

How many `seqnames` are present?