In Genomics and Computational Biology, annotating a sequence means decorating it with quantums of biological information, usually about its molecular function, the biological process(es) in which it takes part and the cellular compartment where that happens (see the Gene Ontology). Other relevant annotations include the evolutionary conservation of a sequence or the biological sample where it was found. The next figure summarizes example genomic features whose annotation has been recognized of importance in wheat breeding by (Borrill, Harrington, and Uauy 2019):
The annotation process requires at least two data sources:
A collection of previously annotated sequences.
Indexed literature in repositories such as PubMed or EuroPMC, where publications are uniquely identified with accession codes such as the DOI.
The best source of annotated protein sequences from plants is the database UniProt, which contains both expert-review sequences (SwissProt) and automatically annotated entries (learn more here in Spanish). The next table summarizes total reviewed plant proteins as of Feb 2020:
annot.stats <- read.csv(file="test_data/uniprot_stats.tsv", sep="\t", comment.char=";", header=F)
annot.stats = annot.stats[,1:2]
names(annot.stats) <- c("Species", "reviewed proteins")
## Print the table
kable(annot.stats,format.args = list(big.mark=","))
Species | reviewed proteins |
---|---|
Viridiplantae | 40,218 |
Eudicotyledons | 27,063 |
Liliopsida | 8,305 |
others | 4,850 |
NA | |
Arabidopsis thaliana | 15,922 |
Nicotiana tabacum | 857 |
Glycine max | 650 |
Solanum lycopersicum | 480 |
Populus trichocarpa | 210 |
Vitis vinifera | 229 |
Brassica napus | 158 |
Helianthus annuus | 145 |
Phaseolus vulgaris | 159 |
Medicago truncatula | 116 |
NA | |
Oryza sativa subsp japonica | 4,070 |
Zea mays | 831 |
Triticum aestivum | 379 |
Sorghum bicolor | 353 |
Hordeum vulgare | 351 |
Brachypodium distachyon | 37 |
NA | |
Marchantia polymorpha | 158 |
Physcomitrella patens | 149 |
Chlamydomonas reinhardtii | 356 |
Genomic sequences are most often nucleotide and peptide sequences. The first can be large genome fragments such as contigs or smaller genes, transcripts, coding sequences or non-coding RNAs; peptides are usually proteins translated from open reading frames encoded in coding sequences.
In this session we will focus on coding sequences, or in other words, genes that encode proteins.
A natural way of comparing protein or nucleic acid molecules is to align their (primary) sequences. This is due to the general expectation that sequence drives folding and to the fact that sequences are easy to work with in a computer or even your notebook. At least easier than structure.
When two sequences are aligned residues from one sequence are matched one-by-one to residues in the other. Matches are obvious when the sequences are nearly identical, but less so when mutations accumulate. A simple way to compute how similar two sequences are is to compute their edit distance.
Another way to compute how similar two sequences are is to compute their % sequence identity.
Sequence identity is a simple way of measuring conservation. However, it lacks resolution and handles all residue substitutions the same way. This is not ideal as we know that purine (A,G) and pyrimidine (C,T) nucleotides, or aromatic amino acid resides if we talk about proteins, are often not interchanged with equal probability in real genes or proteins:
These preferences are captured by computing log-odds ratios of frequencies of observed mutations (a,b) with respect to estimates assuming no preference:
\[ s(a,b) = \lambda \ log(\frac{f_{ab}}{f_{a} f_{b}}) \approx log\frac{f_{homologues}}{f_{bychance}} \] These log-odds are additive.
The most frequent substitution matrices used to score protein alignments are the BLOSUM matrices. These matrices are described by a number X, as in BLOSUM50, derived from the analysis of alignments of protein blocks/domains with identities < X percent. Below you can see BLOSUM50:
These log-odds have been scaled so that they can be accurately represented by integers.
BLOSUM matrices are scaled to 1/2-bit units (Pearson 2013); a substitution score such as s(E,E) can be broken down to:
\[ s(E,E) = 6 = 2.0 \ log_{2}(\frac{f_{EE}}{f_{E} f_{E}})\] \[ \frac{f_{EE}}{f_{E} f_{E}} = 2^3 \]
S(E,E) is thus 8 times more likely to occur because of homology than by chance.
By using matrices such as BLOSUM it is possible to compute the similarity between two aligned sequences, which is added up along the alignment:
In addition to residue substitutions, insertions and deletions are usually considered while computing similarity. This can be done in many ways. The simplest is to assume a linear cost for insertions, proportional to their length. However, it is more accurate to compute affine gap costs, which charge a fix cost to openning a gap (a) and then a linear cost proportional to the gap length (bk). In this way, a gap of k residues receives a total score of -(a+bk)
When more than two sequences are to be aligned we talk about multiple alignments, which can be computed in many ways. The most instuitive way, comparing them all, requires quadratically more resources as more sequences are added.
In this section we will visit some of the most frequent algorithms used to align sequences. The goal is to learn what you can and cannot do with each of them so that you can apply them correctly.
These are the simplest alignments as they involve only two sequences (of length \(m\) and \(n\)). There are several flavours, but the most important are global and local alignments, depicted in this figure taken from (Mäkinen et al. 2015):
The Needleman-Wunsch (NW) algorithm is the original deterministic approach for the alignment of pairs of protein sequences end-to-end (Needleman and Wunsch 1970). It was subsequently optimized with more ellaborated algorithms (Gotoh 1982). These algorithms belong to the family of dynamic programming (DP) algorithms. The NW and related algorithms break the initial problem in subalignments which are recursively solved. In order to reach the final solution, an alignment matrix DP must be filled iteratively:
\[ DP(0,0) = 0 \\ DP(i,0) = -id , \text{ for } 1\leq i \leq m \\ DP(0,j) = -jd , \text{ for } 1\leq j \leq n \\ \]
\[DP(i,j) = max \begin{cases} DP(i-1,j-1) + s(x_{i},y_{j})\\ DP(i-1,j)-d\\ DP(i,j-1)-d \end{cases} \]
During the computation of the DP matrix, another matrix is used to record the previous cell which was visited before the current cell under consideration. This second matrix is used to perform the traceback step, in which the alignment is reconstructed backwards from the bottom right cell, as illustrated in this figure from (Durbin et al. 1998):
Local pairwise alignment is not end-to-end. Instead it assumes that the sequences to be compared might have different domain structure and perhaps share only part of the sequence by homology. The Smith-Waterman (SW) algorithm was designed for this purpose (Smith and Waterman 1981). It is a modification of NW with two differences:
\[DP(i,j) = max \begin{cases} 0\\ DP(i-1,j-1) + s(x_{i},y_{j})\\ DP(i-1,j)-d\\ DP(i,j-1)-d \end{cases} \]
In addition, the traceback step now starts in the cell of DP with the highest score, instead of the bottom right corner (Durbin et al. 1998):
The algorithms presented so far have linear gap costs. Affine gap costs, with different cost for openning (\(d\)) and extending (\(e\)) an insertion, are believed to be more accurate for protein sequences. They also introduce extra complexity in the algorithm, as they require creating two extra matrices to store indels in sequences \(x\) and \(y\) (Durbin et al. 1998):
\[DP(i,j) = max \begin{cases} DP(i-1,j-1) + s(x_{i},y_{j})\\ I_{x}(i-1,j-1) + s(x_{i},y_{j})\\ I_{y}(i-1,j-1) + s(x_{i},y_{j}) \end{cases}\]
\[I_{x}(i,j) = max \begin{cases} DP(i-1,j) -d\\ I_{x}(i-1,j) -e \end{cases}\]
\[I_{y}(i,j) = max \begin{cases} DP(i,j-1) -d\\ I_{y}(i,j-1) -e \end{cases}\]
The DP algorithms just described (NW, SW) are expensive if one wants to compare a query sequence to a large collection of sequences (database). For each pair of sequences the computation time and memory used is proportional to the product of the sequence lengths. Thus, for a collection of millions of sequences this is unfeasible.
For this reason other approaches have been developed, aiming at running times proportional to the sum of the lengths of the query and database sequences. These approaches are heuristic, meaning that they work fine in most cases, but cannot guarantee to produce always optimal alignments. The most standard of these methods is probably BLAST (Altschul et al. 1997).
BLAST saves computing time by pre-processing the sequences producing indexed tables of words of size 3 and 11 for BLASTP and BLAST, respectively. For each word, all similar words with high scores (by default BLOSUM62 in BLASTP) are short-listed. BLAST relies on two fundamental assumptions (Dwyer 2003):
Most high-scoring local alignments contain several high-scoring pairs of words. These can be used as seeds from which high-scoring alignments can be extended.
Homologous proteins contain segments that can be aligned without indels. This simplifies the process of extending seed alignments.
The natural data structure for indexing words of length \(k\), or \(k\)-mers, is the \(k\)-mer index. The figure below depicts the index for string ‘AGAGCGAGAGCGCGC’ and \(k=2\) (Mäkinen et al. 2015):
High throughput sequencing has found many applications in all fields of Science in recent years. The primary product of this array of technologies are sequence reads, usually in FASTQ format, of either Single-End (SE) or Paired-End (PE) sequences. Long-reads, several Kb long, are increasingly popular as well. PE reads usually have the following arrangement (Minikel 2012):
Once produced, reads are usually aligned, or mapped, to the closest reference genome (Mäkinen et al. 2015):
Since a single sequencing experiment produces millions of reads, and the reference genomes can be very large (see for instance some plant genomes at RSAT::Plants), algorithms for read mapping need to be very efficient in terms of speed and disk space. As sequencers are not error-free, read aligners must also be flexible and tolerate mismatches (Marco-Sola et al. 2012). In addition, these approaches must compute mapping quality, as repeated genomic segments imply that some reads could be placed in several locations with equal probability (Li, Ruan, and Durbin 2008).
They achieve those goals by using a data structure called suffix array (SA) which is more flexible than the \(k\)-mer index, as it supports queries of variable length (Mäkinen et al. 2015) (see here how to build one, in Spanish):
Burrows-Wheeler indexes are space-efficient SA that can be queried and aligned by dynamic programming (Mäkinen et al. 2015). See here an example of how the Burrows-Wheeler transform works (in Spanish).
The next diagram summarizes these and other algorithmic choices in the context of approximate protein sequence matching (Koskinen and Holm 2012):
In theory, the DP algorithms described earlier can be generalized to sets of more than two sequences. However, the recursive functions grow in complexity with every new sequence added, as well as the computational resources required. For this reason, multiple alignments are usually computed using a progressive approach, in which first all sequences are first aligned by pairs, and then these are progressively merged Mäkinen et al. (2015). This is illustrated in the following diagram (Caporaso 2018):
One of the many software packages available for multiple alignment, which follows this approach and produces scalable alignments of high quality, is Clustal Omega (Sievers et al. 2011).
Multiple sequence alignments (MSA) have been used to convey evolutionary information, particularly about protein families, and have been called sequence profiles in that context (Gribskov, McLachlan, and Eisenberg 1987). Profiles are essentially position-specific substitution matrices. The most successful way of encoding protein profiles are Hidden Markov Models (HMM) (Eddy 2004), as those built with HMMER for the Pfam database of protein families. The next figure illustrates how a MSA can be used to build a HMM (Lindly 2018):
Another standard tool used to build and scan sequence profiles, both protein and DNA, is MEME.
Profiles can be aligned to single sequences and also to other profiles (Soding 2005).
Coding sequences are usually abbreviated as CDS. This is a definition from UniProt:
A CoDing Sequence (CDS) is a region of DNA or RNA whose sequence determines the sequence of amino acids in a protein. It should not be mixed up with an Open Reading Frame (ORF), which is a series of DNA codons that does not contain any STOP codons.
This is an example ORF sequence:
5' atg ttc cgc tcc gtg atc cca gtt tcc tgt tcc 3'
Most genes in plants contain CDS made up of several concatenated exons. This means they contain introns, that is, sequence fragments that must be spliced in order to produce a processed mRNA/transcript.
Nucleotide sequences are the physical media of coding sequences, which are ultimately translated into proteins. If mutations occur, the corresponding CDS sequencen might be altered. Mutations can be Single Nucleotide Polymorphisms (SNPs), insertions or deletions.
CDS and ORF sequences are nucleotide sequences, written in the A,C,G,T alphabet. By using a codon table, trinucleotides in an ORF can be translated into peptide sequences, which take a third of the original length and a use the 20 amino acid alphabet.
Peptide sequences describe the primary structure of proteins. However, proteins are molecules that must fold in order to carry out their functions, and thus their linear sequences is often not enough to represent them.
To neutralize polar charges of the peptidic backbone, proteins adopt conformations that can be generally classified as \(\alpha\)-helices (H), \(\beta\)-sheets (E), loops and disordered regions (C). These are secondary structure states, which can be encoded as strings of the alphabet H,E and C.
The following example shows that kind of 3-state encoding for a sequence:
MFRSVIPVLL FLIPLLLSAQ AANSLRACGP...
EEEECCEEEE HHHHHHHHHH CCCCCCCCCC...
Tertiary structure is the folding of secondary structure elements as globular or transmembrane units.
Minimal evolutionary and folding units are called domains (Porter and Rose 2012). As protein domains are both folding and functional units, guessing which domains are contained in a sequence provides clues on possible functions. Of course the concept of function is a bit subjective, and that’s why resources such as the Gene Ontology are valuable.
Two key resources for annotating proteins domains in input sequences are:
Another resource comparable to Pfam is PANTHER, which is related to the Gene Ontology.
The next figure depicts the tertiary structure of a protein with two domains, adapted from (Wuerges et al. 2006). Note that strands are shown as arrows and helices as cilinders:
You can read more about these topics in Algoritmos en Bioinformática Estructural (in Spanish), which includes a chapter of AlphaFold and related approaches.
The gene ontology is a standard language that describes the functions of gene products (protein, RNA) using a controlled and unified vocabularies across all species.
These vocabulary terms (GO terms) annotate the gene products at three different levels: - Biological Process (BP): to which process or pathway is the gen contributing - Molecular Function (MF): the molecular activities of individual gene products - Cellular Component (CC): where the gene products are active
They are organized as a hierarchy and connected to each other through parent-child relationships. The next example is from QuickGO (Binns et al. 2009):
See here and here examples on how to navigate the GO ontology with scripts (in Spanish).
Orthologous genes are derived by speciation, in contrast to paralogues, which evolve through duplication. Comparative genomic studies on a variety of clades suggest that orthologues among species typically are the most similar genes in terms of sequence, structure, domain architecture and function (Gabaldon and Koonin 2013). There are however many exceptions and thus orthology should be used as a proxy for similarity. In benchmarks with curated GO terms, it seems that orthology improves the annotation of sequences particularly on the Molecular Function side of the Gene Ontology (Huerta-Cepas et al. 2017).
Ensembl Plants and PLAZA are resources providing plant-specific comparative genomics tools for plants, such as orthology calls and sintenic/collinear genes.
The goal of this exercise is to learn how to produce different kinds of sequence alignments with standard software. You will need to download the following required software:
software | source | documentation |
---|---|---|
BLAST | ftp.ncbi.nlm.nih.gov/blast/executables/blast+ | URL |
Clustal Omega | www.clustal.org/omega | URL |
HMMER | hmmer.org | URL |
The first task is to format our test sequence set, which was obtained from UniProt:
cd test_data/
gunzip uniprot_Atha.fasta.gz
/path/to/ncbi-blast/bin/makeblastdb -dbtype prot -in uniprot_Atha.fasta
Exe1) How many sequences have been formatted and how does this affect the E-value of BLAST searches?
Here you will work with protein ARF6. Its protein and transcript sequence is in files test.faa and test.fna, respectively.
>AT1G30330.2 [Arabidopsis thaliana] ARF6 (AUXIN RESPONSE FACTOR 6)
MRLSSAGFNPQPHEVTGEKRVLNSELWHACAGPLVSLPPVGSRVVYFPQGHSEQVAASTNKEVDAHIPNYPSLHPQL...
The transcript is:
>AT1G30330.2 [Arabidopsis thaliana] ARF6 (AUXIN RESPONSE FACTOR 6) transcript
ATGAGATTATCTTCAGCTGGGTTTAATCCTCAACCTCATGAAGTTACAGGAGAGAAAAGAGTTCTTAATTCTGAGCTCTG
GCATGCTTGTGCTGGTCCTCTTGTCTCACTACCTCCTGTTGGAAGCAGAGTTGTGTATTTTCCTCAAGGTCACAGTGAAC
AGGTTGCTGCTTCGACCAACAAAGAAGTAGATGCTCATATACCAAATTATCCGAGCTTGCATCCGCAGCTTATCTGT...
We shall now look for similar sequences in our collection:
/path/to/ncbi-blast/bin/blastp -db uniprot_Atha.fasta -query test.faa -outfmt 6
/path/to/ncbi-blast/bin/blastx -db uniprot_Atha.fasta -query test.fna -outfmt 6
Exe2) Can you redirect the output to separate files called test.faa.blast and test.fna.blast?
Exe3) What is the default alignment format, can you show an example?
Exe4) Are there differences in the results retrieved in both searches?
Here you will learn to make a sequence profile out of similar sequences matched with three iterations of BLASTP, using PSI-BLAST:
/path/to/ncbi-blast/bin/psiblast -db uniprot_Atha.fasta -query test.faa -num_iterations 3 -out_ascii_pssm profile.out
Exe 5) Can you explain the contents of the output file profile.out?
Search for homologous sequences and define protein domains of AT1G30330.2 with HHPred.
Note that you can also search for proteins with AlphaFold structural predictions with the EBI BLAST against AlphaFoldDB.
Exe 7) Produce a table with i) domains defined by the boundaries of matched entries from the Protein Data Bank and Pfam and ii) similar sequences in AlphaFoldDB.
Search for functional annotations for protein AT1G30330.2 with help from eggNOG-mapper. Make sure you set one-to-one orthologues only.
Exe 8) What are the GO terms and eggNOG orthology groups of this protein?
In this exercise we will use the GO-web browser QuickGo as it is an intuitive and weekly updated resource. Although we won’t be using them in this session, there are many other related tools out there, such as UniProt or Ensembl Plants.
The goal of this task is for you to learn how to: - Search for GO annotations using different inputs (protein IDs, gene IDs and GO terms). - Search for all products annotated to specific GO term or GO ID. - Customize the research to better fit the user preferences.
Let’s practice then:
Search for the GO terms and the functional categories of the following GO IDs GO:0009414, GO:0035618, GO:0016491. Tip : For multiple search GO IDs needs to be separated by a space.
What are the GO ID and the functional category corresponding to photosynthesis?
What are the immediate parent(s) and children of the photosynthesis GO term?
Search for the GO annotation terms of the following protein A0A068LKP4,A0A097PR28, A0A059Q6N8? What do you observe?
How many gene products are involved in leaf development? Give the GO ID corresponding to this term.
How many proteins of Arabidopsis thaliana, Prunus perisca and Zea mays are assigned to the leaf development GO term. Tip : Zea mays. Taxonomy ID=4577
Check the total number of BP annotations and proteins supported by the experimental evidence codes in both Arabidopsis thaliana and Prunus persica. (see the evidence codes) Tip : check the ‘Statistics’ box.
Exe 9) Summarize your results in a table.
Your report should be a single document, preferably in PDF format, with solutions Exe1 to Exe9.