Data filtering

Ricardo H. Ramirez Gonzalz

14/12/2023

Installing Dependencies in R

For this tutorial you need a few R libraries. Bioconductor has packages for bioinfmatics. The following command will install the libraries used in the tutorial.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.18")
BiocManager::install(c("rtracklayer", "ggbio", "knitr", "ggplot2", "tidyr", "dplyr","VariantAnnotation"))

The documentation of the packages used is in:

Reading the VCF file

The VCF file is a file format that contains SNPs in a database. We will be working with a selection of SNPs from the following paper:

Elizabeth S. A. Sollars, et al (2016) Genome sequence and genetic diversity of European ash trees. Nature 541: 212–216 doi:10.1038/nature20786

Loading the data

For this analysis we will need several R libraries to do the analysis, filtering and plot.

library(rtracklayer)
library(knitr)
library(ggplot2)
library(ggbio)
library(tidyr)
library(dplyr)
library(VariantAnnotation)
vcf <- readVcf(file = "extractedContig1.6.23.vcf.gz")

VCF header

You can use the following command to see all the data that is stored in the VCF

head(vcf)
## class: CollapsedVCF 
## dim: 6 37 
## rowRanges(vcf):
##   GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
##   DataFrame with 45 columns: NS, DP, DPB, AC, AN, AF, RO, AO, PRO, PAO, QR, ...
## info(header(vcf)):
##                        Number Type    Description                              
##    NS                  1      Integer Number of samples with data              
##    DP                  1      Integer Total read depth at the locus            
##    DPB                 1      Float   Total read depth per bp at the locus; ...
##    AC                  A      Integer Total number of alternate alleles in c...
##    AN                  1      Integer Total number of alleles in called geno...
##    AF                  A      Float   Estimated allele frequency in the rang...
##    RO                  1      Integer Reference allele observation count, wi...
##    AO                  A      Integer Alternate allele observations, with pa...
##    PRO                 1      Float   Reference allele observation count, wi...
##    PAO                 A      Float   Alternate allele observations, with pa...
##    QR                  1      Integer Reference allele quality sum in phred    
##    QA                  A      Integer Alternate allele quality sum in phred    
##    PQR                 1      Float   Reference allele quality sum in phred ...
##    PQA                 A      Float   Alternate allele quality sum in phred ...
##    SRF                 1      Integer Number of reference observations on th...
##    SRR                 1      Integer Number of reference observations on th...
##    SAF                 A      Integer Number of alternate observations on th...
##    SAR                 A      Integer Number of alternate observations on th...
##    SRP                 1      Float   Strand balance probability for the ref...
##    SAP                 A      Float   Strand balance probability for the alt...
##    AB                  A      Float   Allele balance at heterozygous sites: ...
##    ABP                 A      Float   Allele balance probability at heterozy...
##    RUN                 A      Integer Run length: the number of consecutive ...
##    RPP                 A      Float   Read Placement Probability: Phred-scal...
##    RPPR                1      Float   Read Placement Probability for referen...
##    RPL                 A      Float   Reads Placed Left: number of reads sup...
##    RPR                 A      Float   Reads Placed Right: number of reads su...
##    EPP                 A      Float   End Placement Probability: Phred-scale...
##    EPPR                1      Float   End Placement Probability for referenc...
##    DPRA                A      Float   Alternate allele depth ratio.  Ratio b...
##    ODDS                1      Float   The log odds ratio of the best genotyp...
##    GTI                 1      Integer Number of genotyping iterations requir...
##    TYPE                A      String  The type of allele, either snp, mnp, i...
##    CIGAR               A      String  The extended CIGAR representation of e...
##    NUMALT              1      Integer Number of unique non-reference alleles...
##    MEANALT             A      Float   Mean number of unique non-reference al...
##    LEN                 A      Integer allele length                            
##    MQM                 A      Float   Mean mapping quality of observed alter...
##    MQMR                1      Float   Mean mapping quality of observed refer...
##    PAIRED              A      Float   Proportion of observed alternate allel...
##    PAIREDR             1      Float   Proportion of observed reference allel...
##    technology.ILLUMINA A      Float   Fraction of observations supporting th...
##    ANN                 .      String  Functional annotations: 'Allele | Anno...
##    LOF                 .      String  Predicted loss of function effects for...
##    NMD                 .      String  Predicted nonsense mediated decay        
## geno(vcf):
##   List of length 8: GT, GQ, GL, DP, RO, QR, AO, QA
## geno(header(vcf)):
##       Number Type    Description                                               
##    GT 1      String  Genotype                                                  
##    GQ 1      Float   Genotype Quality, the Phred-scaled marginal (or uncondi...
##    GL G      Float   Genotype Likelihood, log10-scaled likelihoods of the da...
##    DP 1      Integer Read Depth                                                
##    RO 1      Integer Reference allele observation count                        
##    QR 1      Integer Sum of quality of the reference observations              
##    AO A      Integer Alternate allele observation count                        
##    QA A      Integer Sum of quality of the alternate observations

Looking at the qualities of the

To get the a GRanges object with the common data the function rowRanges can be used. This let us query questions like the distribution of the quality of the SNPs.

variations <- rowRanges(vcf)
head(variations, 3)
## GRanges object with 3 ranges and 5 metadata columns:
##                    seqnames    ranges strand | paramRangeID            REF
##                       <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet>
##    Contig23:70_A/T Contig23        70      * |           NA              A
##    Contig23:72_T/A Contig23        72      * |           NA              T
##   Contig23:129_T/G Contig23       129      * |           NA              T
##                                   ALT      QUAL      FILTER
##                    <DNAStringSetList> <numeric> <character>
##    Contig23:70_A/T                  T   1826.92           .
##    Contig23:72_T/A                  A   1884.80           .
##   Contig23:129_T/G                  G   4029.34           .
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths

Summary of qualities

summary(variations$QUAL)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    50.17  3830.37  8543.81  7804.08 11484.65 56504.60

Visalising the distribution

To get an idea of the overall quality of the data, a histogram is useful. The SNPs with lower quality tend to be noisy. For this project, we validated some random SNPs using KASP across different qualitues, from 50 to 500 and we concluded that 300 was a good spot. This histogram helped us to decided the thresholds to test, as most of the tutorials online deem QUAL>50 as good quality.

ggplot(variations, aes(QUAL)) + geom_histogram(binwidth = 50) + theme_bw() +
   coord_cartesian(xlim=c(0,20000)) + geom_vline(xintercept=300, color="red")

Depth for individual samples

We can look at the depth of coverage of various samples. This can be useful to find if a particular sample has low quality.

DP <-geno(vcf)$DP 
summary_dp <- DP[is.na(DP)] <- 0
long_dp <- gather(data.frame(DP),value = "DP", key="Sample")
long_dp %>%
  group_by(Sample) %>%
  summarise(
    n = n(),
    mean_DP = mean(DP, na.rm = TRUE),
    std_DP  = sd(DP, na.rm = TRUE)
  ) -> summary_dp
Sample n mean_DP std_DP
Fex1 20620 7.306353 9.729400
Fex10 20620 10.898739 14.177238
Fex11 20620 9.175849 13.543446
Fex12 20620 12.205286 16.032310
Fex13 20620 14.063531 19.484195
Fex14 20620 9.818671 13.411106
Fex15 20620 10.319981 13.646079
Fex16 20620 13.047769 18.056399
Fex17 20620 12.447769 16.089833
Fex18 20620 14.220369 22.543101
Fex19 20620 10.075412 14.050866
Fex2 20620 12.068865 16.972320
Fex20 20620 15.964161 19.647557
Fex21 20620 12.637876 15.395132
Fex22 20620 11.289137 15.800327
Fex23 20620 11.863822 17.689616
Fex24 20620 10.070466 18.857208
Fex25 20620 8.327158 9.791787
Fex26 20620 13.012221 17.368032
Fex27 20620 10.450534 13.977195
Fex28 20620 12.704850 18.177778
Fex29 20620 10.298885 14.201916
Fex3 20620 9.199466 12.967614
Fex30 20620 10.940349 14.018360
Fex31 20620 7.026285 9.781015
Fex32 20620 8.080068 10.486800
Fex33 20620 9.358584 11.326032
Fex34 20620 10.241804 10.995112
Fex35 20620 9.575121 10.794247
Fex36 20620 10.823909 11.976831
Fex37 20620 8.476285 10.591030
Fex4 20620 9.694956 15.275143
Fex5 20620 10.755383 13.830538
Fex6 20620 9.284627 11.856945
Fex7 20620 9.811930 12.764079
Fex8 20620 8.646848 11.660582
Fex9 20620 7.525897 9.669511

Distribution across sample

You can focus on the distribution of a few samples.

long_dp %>% filter( Sample %in% c("Fex1", "Fex2")) %>% 
ggplot() +  
geom_density(aes(x=DP, colour=Sample, fill=Sample), alpha=0.5) + 
coord_cartesian(xlim = c(0,100))

Displaying all the samples

For all the samples, you can show just a boxplot,at the cost of having less resultion of the shape of the distribution

long_dp %>% 
ggplot() +  
geom_boxplot(aes(y=DP, x=Sample), alpha=0.5) + 
coord_cartesian(ylim = c(0,25)) + 
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Questions for you

How many seqnames are present?

seqnames(vcf)
## factor-Rle of length 20620 with 3 runs
##   Lengths:     5473     9517     5630
##   Values : Contig23 Contig1  Contig6 
## Levels(3): Contig23 Contig1 Contig6

Plot the count of genotypes on each sequence.

We need to produce a dataframe with the contig names and the GT column

GT <-geno(vcf)$GT
GT <- data.frame(GT)
GT$seqnames <- data.frame(seqnames(variations))[,1]
GT %>% gather( -seqnames ,key = "Sample", value = "GT") -> long_gt

long_gt %>%
  group_by(Sample, GT, seqnames) %>%
  summarise(
    n = n(),
    GT = unique(GT, na.rm = TRUE)
  )
## `summarise()` has grouped output by 'Sample', 'GT'. You can override using the
## `.groups` argument.
## # A tibble: 492 × 4
## # Groups:   Sample, GT [185]
##    Sample GT    seqnames     n
##    <chr>  <chr> <fct>    <int>
##  1 Fex1   .     Contig23   201
##  2 Fex1   .     Contig1   1322
##  3 Fex1   .     Contig6    290
##  4 Fex1   ./.   Contig1      1
##  5 Fex1   0/0   Contig23   200
##  6 Fex1   0/0   Contig1    512
##  7 Fex1   0/0   Contig6    217
##  8 Fex1   0/1   Contig23   131
##  9 Fex1   0/1   Contig1    165
## 10 Fex1   0/1   Contig6    113
## # ℹ 482 more rows

Plot the count of genotypes on each sequence.

Now we can plot from the GT data frame and plot the counts

 long_gt %>% 
filter( Sample %in% c("Fex1", "Fex2", "Fex3", "Fex4", "Fex5")) %>% 
ggplot( aes(GT, fill=GT)) + geom_bar( ) + facet_grid(seqnames~Sample)

Can you calculate the allele frequency (\(\frac{RO}{DP}\)) for each sample? Plot 2 or 3 samples.

DP <-geno(vcf)$DP 
RO <-geno(vcf)$RO 
ro_freq <- data.frame(RO/DP)

ggplot(ro_freq, aes(Fex1)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1813 rows containing non-finite values (`stat_bin()`).

Plot the distribution of several samples together (Team1: Sample Fex1-Fex5, Team 2: Fex6-Fex10, etc)

long_ro_freq <- gather(ro_freq ,value = "RO_freq", key="Sample")
long_ro_freq %>% filter( Sample %in% c("Fex1", "Fex3","Fex4", "Fex2","Fex5","Fex6")) %>% 
ggplot() +  
geom_histogram(aes(x=RO_freq), alpha=0.5) + facet_wrap(Sample ~ .)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8212 rows containing non-finite values (`stat_bin()`).

Are there differences in in DP across different scaffolds? Try to figure out how to extract the data for a single reference.

Hint: directly from the VCF object, or exporting the DP matrix and filter by the rowname.

DP <-geno(vcf)$DP
DP <- data.frame(DP)
DP$seqnames <- data.frame(seqnames(variations))[,1]
DP %>% gather(  -seqnames ,value = "DP", key="Sample") -> long_dp

The coverage is very similar across sequences

long_dp %>%
  group_by(seqnames) %>%
  summarise(
    n = n(),
    DP = mean(DP, na.rm = TRUE)
  )
## # A tibble: 3 × 3
##   seqnames      n    DP
##   <fct>     <int> <dbl>
## 1 Contig23 202501  11.7
## 2 Contig1  352129  11.0
## 3 Contig6  208310  11.3

Most of the variation comes from the sample

long_dp %>%
  group_by(Sample) %>%
  summarise(
    n = n(),
    DP = mean(DP, na.rm = TRUE)
  )
## # A tibble: 37 × 3
##    Sample     n    DP
##    <chr>  <int> <dbl>
##  1 Fex1   20620  8.01
##  2 Fex10  20620 11.8 
##  3 Fex11  20620  9.75
##  4 Fex12  20620 12.8 
##  5 Fex13  20620 14.7 
##  6 Fex14  20620 10.5 
##  7 Fex15  20620 11.1 
##  8 Fex16  20620 13.7 
##  9 Fex17  20620 13.2 
## 10 Fex18  20620 14.8 
## # ℹ 27 more rows

To think about…

The samples look homozygous, with small regions with heterozygous calls.

Filtering.

Based on the discussion before, using bcftools filger out for files with quality over 300. Read the documentation in: https://samtools.github.io/bcftools/bcftools.html#expressions to find more expression.

First simplify the VCF file to only include the GT,AO and RO files.

bcftools annotate -x "INFO,^FORMAT/GT,FORMAT/AO,Format/RO" extractedContig1.6.23.vcf.gz | bgzip -c > extractedContig1.6.23.anno.vcf.gz

beagle requires that even missing values correspond to the ploidy, so we need to change . in the genotype to ./.

gunzip -c extractedContig1.6.23.anno.vcf.gz | sed -e 's/    \.:/    .\/.:/g' | bgzip -c >  extractedContig1.6.23.anno.fixed.vcf.gz

Now we can run the filtering of the formatted file

bcftools filter -e 'QUAL<300' extractedContig1.6.23.anno.fixed.vcf.gz  -O z -o extractedContig1.6.23.anno.fixed.QUAL300.vcf.gz

Phasing.

We will use beagle to do the phasing and the imputing: http://faculty.washington.edu/browning/beagle/beagle.html

The documentation is here: http://faculty.washington.edu/browning/beagle/beagle_5.1_08Nov19.pdf

java -jar beagle.21Sep19.ec3.jar gt=extractedContig1.6.23.anno.fixed.QUAL300.vcf.gz  out=out.gt

Now, you can load out.gt.vcf.gz to see the results