Ricardo H. Ramirez Gonzalz
14/12/2023
For this tutorial you need a few R libraries. Bioconductor has packages for bioinfmatics. The following command will install the libraries used in the tutorial.
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install(version = "3.18")
BiocManager::install(c("rtracklayer", "ggbio", "knitr", "ggplot2", "tidyr", "dplyr","VariantAnnotation"))
The documentation of the packages used is in:
The VCF file is a file format that contains SNPs in a database. We will be working with a selection of SNPs from the following paper:
Elizabeth S. A. Sollars, et al (2016) Genome sequence and genetic diversity of European ash trees. Nature 541: 212–216 doi:10.1038/nature20786
For this analysis we will need several R libraries to do the analysis, filtering and plot.
You can use the following command to see all the data that is stored in the VCF
## class: CollapsedVCF
## dim: 6 37
## rowRanges(vcf):
## GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
## DataFrame with 45 columns: NS, DP, DPB, AC, AN, AF, RO, AO, PRO, PAO, QR, ...
## info(header(vcf)):
## Number Type Description
## NS 1 Integer Number of samples with data
## DP 1 Integer Total read depth at the locus
## DPB 1 Float Total read depth per bp at the locus; ...
## AC A Integer Total number of alternate alleles in c...
## AN 1 Integer Total number of alleles in called geno...
## AF A Float Estimated allele frequency in the rang...
## RO 1 Integer Reference allele observation count, wi...
## AO A Integer Alternate allele observations, with pa...
## PRO 1 Float Reference allele observation count, wi...
## PAO A Float Alternate allele observations, with pa...
## QR 1 Integer Reference allele quality sum in phred
## QA A Integer Alternate allele quality sum in phred
## PQR 1 Float Reference allele quality sum in phred ...
## PQA A Float Alternate allele quality sum in phred ...
## SRF 1 Integer Number of reference observations on th...
## SRR 1 Integer Number of reference observations on th...
## SAF A Integer Number of alternate observations on th...
## SAR A Integer Number of alternate observations on th...
## SRP 1 Float Strand balance probability for the ref...
## SAP A Float Strand balance probability for the alt...
## AB A Float Allele balance at heterozygous sites: ...
## ABP A Float Allele balance probability at heterozy...
## RUN A Integer Run length: the number of consecutive ...
## RPP A Float Read Placement Probability: Phred-scal...
## RPPR 1 Float Read Placement Probability for referen...
## RPL A Float Reads Placed Left: number of reads sup...
## RPR A Float Reads Placed Right: number of reads su...
## EPP A Float End Placement Probability: Phred-scale...
## EPPR 1 Float End Placement Probability for referenc...
## DPRA A Float Alternate allele depth ratio. Ratio b...
## ODDS 1 Float The log odds ratio of the best genotyp...
## GTI 1 Integer Number of genotyping iterations requir...
## TYPE A String The type of allele, either snp, mnp, i...
## CIGAR A String The extended CIGAR representation of e...
## NUMALT 1 Integer Number of unique non-reference alleles...
## MEANALT A Float Mean number of unique non-reference al...
## LEN A Integer allele length
## MQM A Float Mean mapping quality of observed alter...
## MQMR 1 Float Mean mapping quality of observed refer...
## PAIRED A Float Proportion of observed alternate allel...
## PAIREDR 1 Float Proportion of observed reference allel...
## technology.ILLUMINA A Float Fraction of observations supporting th...
## ANN . String Functional annotations: 'Allele | Anno...
## LOF . String Predicted loss of function effects for...
## NMD . String Predicted nonsense mediated decay
## geno(vcf):
## List of length 8: GT, GQ, GL, DP, RO, QR, AO, QA
## geno(header(vcf)):
## Number Type Description
## GT 1 String Genotype
## GQ 1 Float Genotype Quality, the Phred-scaled marginal (or uncondi...
## GL G Float Genotype Likelihood, log10-scaled likelihoods of the da...
## DP 1 Integer Read Depth
## RO 1 Integer Reference allele observation count
## QR 1 Integer Sum of quality of the reference observations
## AO A Integer Alternate allele observation count
## QA A Integer Sum of quality of the alternate observations
To get the a GRanges
object with the common data the
function rowRanges
can be used. This let us query questions
like the distribution of the quality of the SNPs.
## GRanges object with 3 ranges and 5 metadata columns:
## seqnames ranges strand | paramRangeID REF
## <Rle> <IRanges> <Rle> | <factor> <DNAStringSet>
## Contig23:70_A/T Contig23 70 * | NA A
## Contig23:72_T/A Contig23 72 * | NA T
## Contig23:129_T/G Contig23 129 * | NA T
## ALT QUAL FILTER
## <DNAStringSetList> <numeric> <character>
## Contig23:70_A/T T 1826.92 .
## Contig23:72_T/A A 1884.80 .
## Contig23:129_T/G G 4029.34 .
## -------
## seqinfo: 3 sequences from an unspecified genome; no seqlengths
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50.17 3830.37 8543.81 7804.08 11484.65 56504.60
To get an idea of the overall quality of the data, a histogram is useful. The SNPs with lower quality tend to be noisy. For this project, we validated some random SNPs using KASP across different qualitues, from 50 to 500 and we concluded that 300 was a good spot. This histogram helped us to decided the thresholds to test, as most of the tutorials online deem QUAL>50 as good quality.
We can look at the depth of coverage of various samples. This can be useful to find if a particular sample has low quality.
DP <-geno(vcf)$DP
summary_dp <- DP[is.na(DP)] <- 0
long_dp <- gather(data.frame(DP),value = "DP", key="Sample")
long_dp %>%
group_by(Sample) %>%
summarise(
n = n(),
mean_DP = mean(DP, na.rm = TRUE),
std_DP = sd(DP, na.rm = TRUE)
) -> summary_dp
Sample | n | mean_DP | std_DP |
---|---|---|---|
Fex1 | 20620 | 7.306353 | 9.729400 |
Fex10 | 20620 | 10.898739 | 14.177238 |
Fex11 | 20620 | 9.175849 | 13.543446 |
Fex12 | 20620 | 12.205286 | 16.032310 |
Fex13 | 20620 | 14.063531 | 19.484195 |
Fex14 | 20620 | 9.818671 | 13.411106 |
Fex15 | 20620 | 10.319981 | 13.646079 |
Fex16 | 20620 | 13.047769 | 18.056399 |
Fex17 | 20620 | 12.447769 | 16.089833 |
Fex18 | 20620 | 14.220369 | 22.543101 |
Fex19 | 20620 | 10.075412 | 14.050866 |
Fex2 | 20620 | 12.068865 | 16.972320 |
Fex20 | 20620 | 15.964161 | 19.647557 |
Fex21 | 20620 | 12.637876 | 15.395132 |
Fex22 | 20620 | 11.289137 | 15.800327 |
Fex23 | 20620 | 11.863822 | 17.689616 |
Fex24 | 20620 | 10.070466 | 18.857208 |
Fex25 | 20620 | 8.327158 | 9.791787 |
Fex26 | 20620 | 13.012221 | 17.368032 |
Fex27 | 20620 | 10.450534 | 13.977195 |
Fex28 | 20620 | 12.704850 | 18.177778 |
Fex29 | 20620 | 10.298885 | 14.201916 |
Fex3 | 20620 | 9.199466 | 12.967614 |
Fex30 | 20620 | 10.940349 | 14.018360 |
Fex31 | 20620 | 7.026285 | 9.781015 |
Fex32 | 20620 | 8.080068 | 10.486800 |
Fex33 | 20620 | 9.358584 | 11.326032 |
Fex34 | 20620 | 10.241804 | 10.995112 |
Fex35 | 20620 | 9.575121 | 10.794247 |
Fex36 | 20620 | 10.823909 | 11.976831 |
Fex37 | 20620 | 8.476285 | 10.591030 |
Fex4 | 20620 | 9.694956 | 15.275143 |
Fex5 | 20620 | 10.755383 | 13.830538 |
Fex6 | 20620 | 9.284627 | 11.856945 |
Fex7 | 20620 | 9.811930 | 12.764079 |
Fex8 | 20620 | 8.646848 | 11.660582 |
Fex9 | 20620 | 7.525897 | 9.669511 |
You can focus on the distribution of a few samples.
For all the samples, you can show just a boxplot,at the cost of having less resultion of the shape of the distribution
seqnames
are present?VCF
object, or exporting the DP
matrix and filter by the rowname
.DP
for all the samples. Try
geom_boxplot
and facet_grid
. Which one you
find clearer? Is there any sample that has a significant lower
coverage?seqnames
are present?## factor-Rle of length 20620 with 3 runs
## Lengths: 5473 9517 5630
## Values : Contig23 Contig1 Contig6
## Levels(3): Contig23 Contig1 Contig6
We need to produce a dataframe with the contig names and the
GT
column
GT <-geno(vcf)$GT
GT <- data.frame(GT)
GT$seqnames <- data.frame(seqnames(variations))[,1]
GT %>% gather( -seqnames ,key = "Sample", value = "GT") -> long_gt
long_gt %>%
group_by(Sample, GT, seqnames) %>%
summarise(
n = n(),
GT = unique(GT, na.rm = TRUE)
)
## `summarise()` has grouped output by 'Sample', 'GT'. You can override using the
## `.groups` argument.
## # A tibble: 492 × 4
## # Groups: Sample, GT [185]
## Sample GT seqnames n
## <chr> <chr> <fct> <int>
## 1 Fex1 . Contig23 201
## 2 Fex1 . Contig1 1322
## 3 Fex1 . Contig6 290
## 4 Fex1 ./. Contig1 1
## 5 Fex1 0/0 Contig23 200
## 6 Fex1 0/0 Contig1 512
## 7 Fex1 0/0 Contig6 217
## 8 Fex1 0/1 Contig23 131
## 9 Fex1 0/1 Contig1 165
## 10 Fex1 0/1 Contig6 113
## # ℹ 482 more rows
Now we can plot from the GT
data frame and plot the
counts
DP <-geno(vcf)$DP
RO <-geno(vcf)$RO
ro_freq <- data.frame(RO/DP)
ggplot(ro_freq, aes(Fex1)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1813 rows containing non-finite values (`stat_bin()`).
long_ro_freq <- gather(ro_freq ,value = "RO_freq", key="Sample")
long_ro_freq %>% filter( Sample %in% c("Fex1", "Fex3","Fex4", "Fex2","Fex5","Fex6")) %>%
ggplot() +
geom_histogram(aes(x=RO_freq), alpha=0.5) + facet_wrap(Sample ~ .)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 8212 rows containing non-finite values (`stat_bin()`).
Hint: directly from the VCF
object, or exporting the DP
matrix and filter by the rowname
.
DP <-geno(vcf)$DP
DP <- data.frame(DP)
DP$seqnames <- data.frame(seqnames(variations))[,1]
DP %>% gather( -seqnames ,value = "DP", key="Sample") -> long_dp
The coverage is very similar across sequences
## # A tibble: 3 × 3
## seqnames n DP
## <fct> <int> <dbl>
## 1 Contig23 202501 11.7
## 2 Contig1 352129 11.0
## 3 Contig6 208310 11.3
Most of the variation comes from the sample
## # A tibble: 37 × 3
## Sample n DP
## <chr> <int> <dbl>
## 1 Fex1 20620 8.01
## 2 Fex10 20620 11.8
## 3 Fex11 20620 9.75
## 4 Fex12 20620 12.8
## 5 Fex13 20620 14.7
## 6 Fex14 20620 10.5
## 7 Fex15 20620 11.1
## 8 Fex16 20620 13.7
## 9 Fex17 20620 13.2
## 10 Fex18 20620 14.8
## # ℹ 27 more rows
The samples look homozygous, with small regions with heterozygous calls.
DP
for all the samples. Try
geom_boxplot
and facet_grid
. Which one you
find clearer? Is there any sample that has a significant lower coverage?
This is a personal choice. The plots above reflect it.Based on the discussion before, using bcftools
filger
out for files with quality over 300. Read the documentation in: https://samtools.github.io/bcftools/bcftools.html#expressions
to find more expression.
First simplify the VCF file to only include the GT,AO and RO files.
bcftools annotate -x "INFO,^FORMAT/GT,FORMAT/AO,Format/RO" extractedContig1.6.23.vcf.gz | bgzip -c > extractedContig1.6.23.anno.vcf.gz
beagle requires that even missing values correspond to the ploidy, so
we need to change .
in the genotype to ./.
gunzip -c extractedContig1.6.23.anno.vcf.gz | sed -e 's/ \.:/ .\/.:/g' | bgzip -c > extractedContig1.6.23.anno.fixed.vcf.gz
Now we can run the filtering of the formatted file
bcftools filter -e 'QUAL<300' extractedContig1.6.23.anno.fixed.vcf.gz -O z -o extractedContig1.6.23.anno.fixed.QUAL300.vcf.gz
Can you design a filter to remove only with heterozygous calls on
line Fex10
? Hint: look for FORMAT/DP
in the
documentation above.
How does the number of heterozygous calls are affected on each line after filtering? Hint: You need to look at the GT tutorial above.
We will use beagle to do the phasing and the imputing: http://faculty.washington.edu/browning/beagle/beagle.html
The documentation is here: http://faculty.washington.edu/browning/beagle/beagle_5.1_08Nov19.pdf
Now, you can load out.gt.vcf.gz
to see the results