1 Introduction

In Genomics and Computational Biology, annotating a sequence means decorating it with quantums of biological information, usually about its molecular function, the biological process(es) in which it takes part and the cellular compartment where that happens (see the Gene Ontology). Other relevant annotations include the evolutionary conservation of a sequence or the biological sample where it was found. The next figure summarizes example genomic features whose annotation has been recognized of importance in wheat breeding by (Borrill, Harrington, and Uauy 2019):

The annotation process requires at least two data sources:

  • A collection of previously annotated sequences.

  • Indexed literature in repositories such as PubMed or EuroPMC, where publications are uniquely identified with accession codes such as the DOI.

The best source of annotated protein sequences from plants is the database UniProt, which contains both expert-review sequences (SwissProt) and automatically annotated entries (learn more here in Spanish). The next table summarizes total reviewed plant proteins as of Feb 2020:

annot.stats <- read.csv(file="test_data/uniprot_stats.tsv", sep="\t", comment.char=";", header=F)
annot.stats = annot.stats[,1:2]
names(annot.stats) <- c("Species", "reviewed proteins")

## Print the table
kable(annot.stats,format.args = list(big.mark=","))
Species reviewed proteins
Viridiplantae 40,218
Eudicotyledons 27,063
Liliopsida 8,305
others 4,850
NA
Arabidopsis thaliana 15,922
Nicotiana tabacum 857
Glycine max 650
Solanum lycopersicum 480
Populus trichocarpa 210
Vitis vinifera 229
Brassica napus 158
Helianthus annuus 145
Phaseolus vulgaris 159
Medicago truncatula 116
NA
Oryza sativa subsp japonica 4,070
Zea mays 831
Triticum aestivum 379
Sorghum bicolor 353
Hordeum vulgare 351
Brachypodium distachyon 37
NA
Marchantia polymorpha 158
Physcomitrella patens 149
Chlamydomonas reinhardtii 356

Genomic sequences are most often nucleotide and peptide sequences. The first can be large genome fragments such as contigs or smaller genes, transcripts, coding sequences or non-coding RNAs; peptides are usually proteins translated from open reading frames encoded in coding sequences.

In this session we will focus on coding sequences, or in other words, genes that encode proteins.

1.1 Sequence comparison

A natural way of comparing protein or nucleic acid molecules is to align their (primary) sequences. This is due to the general expectation that sequence drives folding and to the fact that sequences are easy to work with in a computer or even your notebook. At least easier than structure.

1.2 Pairwise alignment: edit distance

When two sequences are aligned residues from one sequence are matched one-by-one to residues in the other. Matches are obvious when the sequences are nearly identical, but less so when mutations accumulate. A simple way to compute how similar two sequences are is to compute their edit distance.

1.3 Pairwise alignment: sequence identity

Another way to compute how similar two sequences are is to compute their % sequence identity.

1.4 Substitution matrices

Sequence identity is a simple way of measuring conservation. However, it lacks resolution and handles all residue substitutions the same way. This is not ideal as we know that purine (A,G) and pyrimidine (C,T) nucleotides, or aromatic amino acid resides if we talk about proteins, are often not interchanged with equal probability in real genes or proteins:

These preferences are captured by computing log-odds ratios of frequencies of observed mutations (a,b) with respect to estimates assuming no preference:

\[ s(a,b) = \lambda \ log(\frac{f_{ab}}{f_{a} f_{b}}) \approx log\frac{f_{homologues}}{f_{bychance}} \] These log-odds are additive.

1.5 BLOSUM substitution matrices

The most frequent substitution matrices used to score protein alignments are the BLOSUM matrices. These matrices are described by a number X, as in BLOSUM50, derived from the analysis of alignments of protein blocks/domains with identities < X percent. Below you can see BLOSUM50:

These log-odds have been scaled so that they can be accurately represented by integers.

BLOSUM matrices are scaled to 1/2-bit units (Pearson 2013); a substitution score such as s(E,E) can be broken down to:

\[ s(E,E) = 6 = 2.0 \ log_{2}(\frac{f_{EE}}{f_{E} f_{E}})\] \[ \frac{f_{EE}}{f_{E} f_{E}} = 2^3 \]

S(E,E) is thus 8 times more likely to occur because of homology than by chance.

1.6 Pairwise alignment: similarity

By using matrices such as BLOSUM it is possible to compute the similarity between two aligned sequences, which is added up along the alignment:

1.7 Pairwise alignment: handling insertions and deletion (indels)

In addition to residue substitutions, insertions and deletions are usually considered while computing similarity. This can be done in many ways. The simplest is to assume a linear cost for insertions, proportional to their length. However, it is more accurate to compute affine gap costs, which charge a fix cost to openning a gap (a) and then a linear cost proportional to the gap length (bk). In this way, a gap of k residues receives a total score of -(a+bk)

1.8 Multiple alignment

When more than two sequences are to be aligned we talk about multiple alignments, which can be computed in many ways. The most instuitive way, comparing them all, requires quadratically more resources as more sequences are added.

2 Algorithms for sequence alignment

In this section we will visit some of the most frequent algorithms used to align sequences. The goal is to learn what you can and cannot do with each of them so that you can apply them correctly.

2.1 Pairwise alignments

These are the simplest alignments as they involve only two sequences (of length \(m\) and \(n\)). There are several flavours, but the most important are global and local alignments, depicted in this figure taken from (Mäkinen et al. 2015):

2.1.1 Global alignment

The Needleman-Wunsch (NW) algorithm is the original deterministic approach for the alignment of pairs of protein sequences end-to-end (Needleman and Wunsch 1970). It was subsequently optimized with more ellaborated algorithms (Gotoh 1982). These algorithms belong to the family of dynamic programming (DP) algorithms. The NW and related algorithms break the initial problem in subalignments which are recursively solved. In order to reach the final solution, an alignment matrix DP must be filled iteratively:

  • The top and left margins of DP are filled with gaps of increasing length, and the origin set to zero:

\[ DP(0,0) = 0 \\ DP(i,0) = -id , \text{ for } 1\leq i \leq m \\ DP(0,j) = -jd , \text{ for } 1\leq j \leq n \\ \]

  • The alignment DP matrix can now be computed from top left to bottom right according to the following recursive function, where \(s(x,y)\) is a scoring function/substitution matrix and \(d\) a linear gap cost:

\[DP(i,j) = max \begin{cases} DP(i-1,j-1) + s(x_{i},y_{j})\\ DP(i-1,j)-d\\ DP(i,j-1)-d \end{cases} \]

During the computation of the DP matrix, another matrix is used to record the previous cell which was visited before the current cell under consideration. This second matrix is used to perform the traceback step, in which the alignment is reconstructed backwards from the bottom right cell, as illustrated in this figure from (Durbin et al. 1998):

2.1.2 Local alignment of two subsequences

Local pairwise alignment is not end-to-end. Instead it assumes that the sequences to be compared might have different domain structure and perhaps share only part of the sequence by homology. The Smith-Waterman (SW) algorithm was designed for this purpose (Smith and Waterman 1981). It is a modification of NW with two differences:

\[DP(i,j) = max \begin{cases} 0\\ DP(i-1,j-1) + s(x_{i},y_{j})\\ DP(i-1,j)-d\\ DP(i,j-1)-d \end{cases} \]

In addition, the traceback step now starts in the cell of DP with the highest score, instead of the bottom right corner (Durbin et al. 1998):

2.1.3 Affine gap penalties

The algorithms presented so far have linear gap costs. Affine gap costs, with different cost for openning (\(d\)) and extending (\(e\)) an insertion, are believed to be more accurate for protein sequences. They also introduce extra complexity in the algorithm, as they require creating two extra matrices to store indels in sequences \(x\) and \(y\) (Durbin et al. 1998):

\[DP(i,j) = max \begin{cases} DP(i-1,j-1) + s(x_{i},y_{j})\\ I_{x}(i-1,j-1) + s(x_{i},y_{j})\\ I_{y}(i-1,j-1) + s(x_{i},y_{j}) \end{cases}\]

\[I_{x}(i,j) = max \begin{cases} DP(i-1,j) -d\\ I_{x}(i-1,j) -e \end{cases}\]

\[I_{y}(i,j) = max \begin{cases} DP(i,j-1) -d\\ I_{y}(i,j-1) -e \end{cases}\]

2.1.4 Heuristic pairwise alignments

The DP algorithms just described (NW, SW) are expensive if one wants to compare a query sequence to a large collection of sequences (database). For each pair of sequences the computation time and memory used is proportional to the product of the sequence lengths. Thus, for a collection of millions of sequences this is unfeasible.

For this reason other approaches have been developed, aiming at running times proportional to the sum of the lengths of the query and database sequences. These approaches are heuristic, meaning that they work fine in most cases, but cannot guarantee to produce always optimal alignments. The most standard of these methods is probably BLAST (Altschul et al. 1997).

BLAST saves computing time by pre-processing the sequences producing indexed tables of words of size 3 and 11 for BLASTP and BLAST, respectively. For each word, all similar words with high scores (by default BLOSUM62 in BLASTP) are short-listed. BLAST relies on two fundamental assumptions (Dwyer 2003):

  • Most high-scoring local alignments contain several high-scoring pairs of words. These can be used as seeds from which high-scoring alignments can be extended.

  • Homologous proteins contain segments that can be aligned without indels. This simplifies the process of extending seed alignments.

The natural data structure for indexing words of length \(k\), or \(k\)-mers, is the \(k\)-mer index. The figure below depicts the index for string ‘AGAGCGAGAGCGCGC’ and \(k=2\) (Mäkinen et al. 2015):

2.1.5 Read alignment or mapping

High throughput sequencing has found many applications in all fields of Science in recent years. The primary product of this array of technologies are sequence reads, usually in FASTQ format, of either Single-End (SE) or Paired-End (PE) sequences. Long-reads, several Kb long, are increasingly popular as well. PE reads usually have the following arrangement (Minikel 2012):

Once produced, reads are usually aligned, or mapped, to the closest reference genome (Mäkinen et al. 2015):

Since a single sequencing experiment produces millions of reads, and the reference genomes can be very large (see for instance some plant genomes at RSAT::Plants), algorithms for read mapping need to be very efficient in terms of speed and disk space. As sequencers are not error-free, read aligners must also be flexible and tolerate mismatches (Marco-Sola et al. 2012). In addition, these approaches must compute mapping quality, as repeated genomic segments imply that some reads could be placed in several locations with equal probability (Li, Ruan, and Durbin 2008).

They achieve those goals by using a data structure called suffix array (SA) which is more flexible than the \(k\)-mer index, as it supports queries of variable length (Mäkinen et al. 2015) (see here how to build one, in Spanish):

Burrows-Wheeler indexes are space-efficient SA that can be queried and aligned by dynamic programming (Mäkinen et al. 2015). See here an example of how the Burrows-Wheeler transform works (in Spanish).

The next diagram summarizes these and other algorithmic choices in the context of approximate protein sequence matching (Koskinen and Holm 2012):

2.2 Multiple alignments and sequence profiles

In theory, the DP algorithms described earlier can be generalized to sets of more than two sequences. However, the recursive functions grow in complexity with every new sequence added, as well as the computational resources required. For this reason, multiple alignments are usually computed using a progressive approach, in which first all sequences are first aligned by pairs, and then these are progressively merged Mäkinen et al. (2015). This is illustrated in the following diagram (Caporaso 2018):

One of the many software packages available for multiple alignment, which follows this approach and produces scalable alignments of high quality, is Clustal Omega (Sievers et al. 2011).

Multiple sequence alignments (MSA) have been used to convey evolutionary information, particularly about protein families, and have been called sequence profiles in that context (Gribskov, McLachlan, and Eisenberg 1987). Profiles are essentially position-specific substitution matrices. The most successful way of encoding protein profiles are Hidden Markov Models (HMM) (Eddy 2004), as those built with HMMER for the Pfam database of protein families. The next figure illustrates how a MSA can be used to build a HMM (Lindly 2018):

Another standard tool used to build and scan sequence profiles, both protein and DNA, is MEME.

Profiles can be aligned to single sequences and also to other profiles (Soding 2005).

3 Annotation of protein-coding sequences

Coding sequences are usually abbreviated as CDS. This is a definition from UniProt:

A CoDing Sequence (CDS) is a region of DNA or RNA whose sequence determines the sequence of amino acids in a protein. It should not be mixed up with an Open Reading Frame (ORF), which is a series of DNA codons that does not contain any STOP codons.

This is an example ORF sequence:

5' atg ttc cgc tcc gtg atc cca gtt tcc tgt tcc 3'

3.1 CDS and exons

Most genes in plants contain CDS made up of several concatenated exons. This means they contain introns, that is, sequence fragments that must be spliced in order to produce a processed mRNA/transcript.

Nucleotide sequences are the physical media of coding sequences, which are ultimately translated into proteins. If mutations occur, the corresponding CDS sequencen might be altered. Mutations can be Single Nucleotide Polymorphisms (SNPs), insertions or deletions.

3.2 Peptide sequences

CDS and ORF sequences are nucleotide sequences, written in the A,C,G,T alphabet. By using a codon table, trinucleotides in an ORF can be translated into peptide sequences, which take a third of the original length and a use the 20 amino acid alphabet.

Peptide sequences describe the primary structure of proteins. However, proteins are molecules that must fold in order to carry out their functions, and thus their linear sequences is often not enough to represent them.

3.3 Secondary and tertiary structure and protein domains

To neutralize polar charges of the peptidic backbone, proteins adopt conformations that can be generally classified as \(\alpha\)-helices (H), \(\beta\)-sheets (E), loops and disordered regions (C). These are secondary structure states, which can be encoded as strings of the alphabet H,E and C.

The following example shows that kind of 3-state encoding for a sequence:

MFRSVIPVLL FLIPLLLSAQ AANSLRACGP...
EEEECCEEEE HHHHHHHHHH CCCCCCCCCC...

Tertiary structure is the folding of secondary structure elements as globular or transmembrane units.

Minimal evolutionary and folding units are called domains (Porter and Rose 2012). As protein domains are both folding and functional units, guessing which domains are contained in a sequence provides clues on possible functions. Of course the concept of function is a bit subjective, and that’s why resources such as the Gene Ontology are valuable.

Two key resources for annotating proteins domains in input sequences are:

  • Pfam, which defines sequence families (domains) by calculating and refining multiple alignments which are ultimately converted to hidden Markov models. It is integrated in a larger collection called InterPro.
  • HHPred, which explicitely uses secondary structure, sequence profiles and tertiary structure classifications to call protein domains and annotate protein sequences.

Another resource comparable to Pfam is PANTHER, which is related to the Gene Ontology.

The next figure depicts the tertiary structure of a protein with two domains, adapted from (Wuerges et al. 2006). Note that strands are shown as arrows and helices as cilinders:

You can read more about these topics in Algoritmos en Bioinformática Estructural (in Spanish), which includes a chapter of AlphaFold and related approaches.

3.4 Gene Ontology

The gene ontology is a standard language that describes the functions of gene products (protein, RNA) using a controlled and unified vocabularies across all species.

These vocabulary terms (GO terms) annotate the gene products at three different levels: - Biological Process (BP): to which process or pathway is the gen contributing - Molecular Function (MF): the molecular activities of individual gene products - Cellular Component (CC): where the gene products are active

They are organized as a hierarchy and connected to each other through parent-child relationships. The next example is from QuickGO (Binns et al. 2009):

See here and here examples on how to navigate the GO ontology with scripts (in Spanish).

3.5 Phylogenetic analysis of sequences

Orthologous genes are derived by speciation, in contrast to paralogues, which evolve through duplication. Comparative genomic studies on a variety of clades suggest that orthologues among species typically are the most similar genes in terms of sequence, structure, domain architecture and function (Gabaldon and Koonin 2013). There are however many exceptions and thus orthology should be used as a proxy for similarity. In benchmarks with curated GO terms, it seems that orthology improves the annotation of sequences particularly on the Molecular Function side of the Gene Ontology (Huerta-Cepas et al. 2017).

Ensembl Plants and PLAZA are resources providing plant-specific comparative genomics tools for plants, such as orthology calls and sintenic/collinear genes.

4 Exercises

4.1 Annotating coding sequences by alignment to other sequences

The goal of this exercise is to learn how to produce different kinds of sequence alignments with standard software. You will need to download the following required software:

software source documentation
BLAST ftp.ncbi.nlm.nih.gov/blast/executables/blast+ URL
Clustal Omega www.clustal.org/omega URL
HMMER hmmer.org URL

4.1.1 Formatting a sequence collection for BLAST

The first task is to format our test sequence set, which was obtained from UniProt:

cd test_data/
gunzip uniprot_Atha.fasta.gz
/path/to/ncbi-blast/bin/makeblastdb -dbtype prot -in uniprot_Atha.fasta

Exe1) How many sequences have been formatted and how does this affect the E-value of BLAST searches?

4.1.2 Querying the collection with a sample coding sequence

Here you will work with protein ARF6. Its protein and transcript sequence is in files test.faa and test.fna, respectively.

>AT1G30330.2 [Arabidopsis thaliana]  ARF6 (AUXIN RESPONSE FACTOR 6)
MRLSSAGFNPQPHEVTGEKRVLNSELWHACAGPLVSLPPVGSRVVYFPQGHSEQVAASTNKEVDAHIPNYPSLHPQL...

The transcript is:

>AT1G30330.2 [Arabidopsis thaliana]  ARF6 (AUXIN RESPONSE FACTOR 6)  transcript 
ATGAGATTATCTTCAGCTGGGTTTAATCCTCAACCTCATGAAGTTACAGGAGAGAAAAGAGTTCTTAATTCTGAGCTCTG
GCATGCTTGTGCTGGTCCTCTTGTCTCACTACCTCCTGTTGGAAGCAGAGTTGTGTATTTTCCTCAAGGTCACAGTGAAC
AGGTTGCTGCTTCGACCAACAAAGAAGTAGATGCTCATATACCAAATTATCCGAGCTTGCATCCGCAGCTTATCTGT...

We shall now look for similar sequences in our collection:

/path/to/ncbi-blast/bin/blastp -db uniprot_Atha.fasta -query test.faa -outfmt 6

/path/to/ncbi-blast/bin/blastx -db uniprot_Atha.fasta -query test.fna -outfmt 6

Exe2) Can you redirect the output to separate files called test.faa.blast and test.fna.blast?

Exe3) What is the default alignment format, can you show an example?

Exe4) Are there differences in the results retrieved in both searches?

4.1.3 Producing a sequence profile

Here you will learn to make a sequence profile out of similar sequences matched with three iterations of BLASTP, using PSI-BLAST:

/path/to/ncbi-blast/bin/psiblast -db uniprot_Atha.fasta -query test.faa -num_iterations 3 -out_ascii_pssm profile.out

Exe 5) Can you explain the contents of the output file profile.out?

4.1.4 Making a Hidden Markov Model (HMM) with aligned sequences

This task actually comprises four steps:

  • Create a FASTA file with the complete protein sequences of the matches of your protein search with bit score > 200. You might find one-liners useful for this.

  • Compute a multiple alignment of these sequences with Clustal Omega. Check the available output formats.

  • Build a HMM out of these aligned sequences with hmmbuild

  • Scan the HMM against your sequence collection with hmmerscan and write a short report on the results. This should be deliverable Exe6.

4.1.5 Annotating function by predicted structural similarity

Search for homologous sequences and define protein domains of AT1G30330.2 with HHPred.

Note that you can also search for proteins with AlphaFold structural predictions with the EBI BLAST against AlphaFoldDB.

Exe 7) Produce a table with i) domains defined by the boundaries of matched entries from the Protein Data Bank and Pfam and ii) similar sequences in AlphaFoldDB.

4.1.6 Annotating function on orthology grounds

Search for functional annotations for protein AT1G30330.2 with help from eggNOG-mapper. Make sure you set one-to-one orthologues only.

Exe 8) What are the GO terms and eggNOG orthology groups of this protein?

4.1.7 Annotating function with Gene Ontology (GO) terms

In this exercise we will use the GO-web browser QuickGo as it is an intuitive and weekly updated resource. Although we won’t be using them in this session, there are many other related tools out there, such as UniProt or Ensembl Plants.

The goal of this task is for you to learn how to: - Search for GO annotations using different inputs (protein IDs, gene IDs and GO terms). - Search for all products annotated to specific GO term or GO ID. - Customize the research to better fit the user preferences.

Let’s practice then:

  1. Search for the GO terms and the functional categories of the following GO IDs GO:0009414, GO:0035618, GO:0016491. Tip : For multiple search GO IDs needs to be separated by a space.

  2. What are the GO ID and the functional category corresponding to photosynthesis?

  3. What are the immediate parent(s) and children of the photosynthesis GO term?

  4. Search for the GO annotation terms of the following protein A0A068LKP4,A0A097PR28, A0A059Q6N8? What do you observe?

  5. How many gene products are involved in leaf development? Give the GO ID corresponding to this term.

  6. How many proteins of Arabidopsis thaliana, Prunus perisca and Zea mays are assigned to the leaf development GO term. Tip : Zea mays. Taxonomy ID=4577

  7. Check the total number of BP annotations and proteins supported by the experimental evidence codes in both Arabidopsis thaliana and Prunus persica. (see the evidence codes) Tip : check the ‘Statistics’ box.

Exe 9) Summarize your results in a table.

4.1.8 Your report

Your report should be a single document, preferably in PDF format, with solutions Exe1 to Exe9.

Bibliography

Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. “Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs.” Nucleic Acids Res. 25 (17): 3389–3402. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC146917.
Binns, D., E. Dimmer, R. Huntley, D. Barrell, C. O’Donovan, and R. Apweiler. 2009. QuickGO: a web-based tool for Gene Ontology searching.” Bioinformatics 25 (22): 3045–46. https://dx.doi.org/10.1093/bioinformatics/btp536.
Borrill, P., S. A. Harrington, and C. Uauy. 2019. Applying the latest advances in genomics and phenomics for trait discovery in polyploid wheat.” Plant J 97 (1): 56–72. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6378701/.
Caporaso, G. 2018. “An Introduction to Applied Bioinformatics.” https://github.com/caporaso-lab/An-Introduction-To-Applied-Bioinformatics.
Durbin, R., S. Eddy, A. Krogh, and G. Mitchison. 1998. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press. http://eddylab.org/cupbook.html.
Dwyer, R. A. 2003. Genomic Perl: From Bioinformatics Basics to Working Code. Cambridge University Press. http://www.cambridge.org/052180177X.
Eddy, Sean R. 2004. What is a hidden Markov model? Nat Biotechnol. 10 (22): 1315–16. https://www.ncbi.nlm.nih.gov/pubmed/15470472.
Gabaldon, T., and E. V. Koonin. 2013. Functional and evolutionary implications of gene orthology.” Nat. Rev. Genet. 14 (5): 360–66. http://www.ncbi.nlm.nih.gov/pubmed/23552219.
Gotoh, O. 1982. “An Improved Algorithm for Matching Biological Sequences.” J.Mol.Biol. 162 (3): 705–708. https://doi.org/10.1016/0022-2836(82)90398-9.
Gribskov, M., A. D. McLachlan, and D. Eisenberg. 1987. “Profile Analysis: Detection of Distantly Related Proteins.” PNAS 84 (13): 4355–4358. http://www.pnas.org/content/84/13/4355.long.
Hogeweg, P., and B. Hesper. 1984. The alignment of sets of sequences and the construction of phyletic trees: an integrated method.” J. Mol. Evol. 20 (2): 175–86. https://www.ncbi.nlm.nih.gov/pubmed/6433036.
Huerta-Cepas, J., K. Forslund, L. P. Coelho, D. Szklarczyk, L. J. Jensen, C. von Mering, and P. Bork. 2017. Fast Genome-Wide Functional Annotation through Orthology Assignment by eggNOG-Mapper.” Mol Biol Evol 34 (8): 2115–22. 10.1093/molbev/msx148.
Koskinen, J. P., and L. Holm. 2012. Bioinformatics 28 (18): i438–43. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3436844/.
Li, H., J. Ruan, and R. Durbin. 2008. Mapping short DNA sequencing reads and calling variants using mapping quality scores.” Genome Res. 18 (11): 1851–58. https://www.ncbi.nlm.nih.gov/pubmed/18714091.
Lindly, A. 2018. “Hidden Markov Models.” http://slideplayer.com/user/4159968.
Mäkinen, Veli, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. 2015. Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press. http://www.genome-scale.info.
Marco-Sola, S., M. Sammeth, R. Guigo, and P. Ribeca. 2012. The GEM mapper: fast, accurate and versatile alignment by filtration.” Nat. Methods 9 (12): 1185–88. https://www.ncbi.nlm.nih.gov/pubmed/23103880.
Minikel, E. 2012. “Forward and Reverse Reads in Paired-End Sequencing.” http://www.cureffi.org/2012/12/19/forward-and-reverse-reads-in-paired-end-sequencing.
Needleman, S. B., and C. D. Wunsch. 1970. “A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins.” J.Mol.Biol. 48: 443–53. https://www.ncbi.nlm.nih.gov/pubmed/5420325.
Pearson, W. R. 2013. Selecting the Right Similarity-Scoring Matrix.” Curr Protoc Bioinformatics 43: 1–9. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3848038/.
Porter, L. L., and G. D. Rose. 2012. A thermodynamic definition of protein domains.” Proc. Natl. Acad. Sci. U.S.A. 109 (24): 9420–25. http://www.pnas.org/content/109/24/9420.long.
Sievers, F., A. Wilm, D. Dineen, T. J. Gibson, K. Karplus, W. Li, R. Lopez, et al. 2011. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.” Mol. Syst. Biol. 7 (October): 539. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3261699/.
Smith, T. F., and M.S. Waterman. 1981. Identification of Common Molecular Subsequences. J.Mol.Biol. 147: 195–97. https://www.ncbi.nlm.nih.gov/pubmed/7265238.
Soding, J. 2005. Protein homology detection by HMM-HMM comparison.” Bioinformatics 21: 951–60. http://www.ncbi.nlm.nih.gov/pubmed/15531603.
Wuerges, J., G. Garau, S. Geremia, S. N. Fedosov, T. E. Petersen, and L. Randaccio. 2006. Structural basis for mammalian vitamin B12 transport by transcobalamin.” Proc. Natl. Acad. Sci. U.S.A. 103: 4386–91. http://www.pnas.org/content/103/12/4386.long.