In this session we will continue annotating sequences, but now with a focus on non-coding sequences. If we revisit the next figure, you will see that non-coding sequences usually make up the largest fraction of genomes (Mäkinen et al. 2015):
For instance, the genome of the model grass Brachypodium distachyon is about 270 Mbp long, but its genes take up only about 122 Mbp. The gap widens as genomes grow larger, as you can see in the next table:
library(knitr)  # provides kable()
annot.stats <- read.csv(file="test_data/Ensembl_stats.tsv", sep="\t", comment.char=";")
kable(annot.stats,format.args = list(big.mark=","))
genome | genome.size | gene.space | percentage |
---|---|---|---|
Arabidopsis thaliana | 119,667,750 | 77,640,474 | 64.9 |
Arabidopsis halleri | 196,243,198 | 67,469,378 | 34.4 |
Prunus dulcis | 227,498,357 | 94,209,302 | 41.4 |
Brachypodium distachyon | 271,163,419 | 122,468,185 | 45.2 |
Brassica rapa | 283,822,783 | 83,199,064 | 29.3 |
Trifolium pratense | 304,842,038 | 133,448,980 | 43.8 |
Arabis alpina | 308,032,609 | 45,780,039 | 14.9 |
Cucumis melo | 357,857,370 | 99,736,823 | 27.9 |
Citrullus lanatus | 365,450,462 | 81,219,268 | 22.2 |
Oryza sativa | 375,049,285 | 130,066,257 | 34.7 |
Setaria viridis | 395,731,502 | 123,250,596 | 31.1 |
Vitis vinifera | 486,265,422 | 153,535,016 | 31.6 |
Rosa chinensis | 515,588,973 | 117,296,819 | 22.8 |
Camelina sativa | 641,356,059 | 214,928,593 | 33.5 |
Malus domestica | 702,961,352 | 145,783,855 | 20.7 |
Olea europaea | 1,140,987,834 | 153,461,985 | 13.4 |
Zea mays | 2,135,083,061 | 168,014,276 | 7.9 |
Helianthus annuus | 3,027,844,945 | 199,817,746 | 6.6 |
Aegilops tauschii | 4,224,915,394 | 348,412,472 | 8.2 |
Triticum turgidum | 10,463,058,104 | 475,613,274 | 4.5 |
Therefore, when dealing with genome variation, polymorphisms are more likely to fall in non-coding regions, simply because these cover more of the genome. Among non-coding sequences, probably the most interesting are regulatory sequences and repeated elements.
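The percentage column in the table above is simply gene space divided by genome size. As a minimal sketch, the following base R snippet recomputes it for three rows copied by hand from the table and also derives the remainder that lies outside genes. The columns `outside.Mbp` and `outside.perc` are illustrative names that do not exist in the original file.

```r
# Three rows copied from the table above (sizes in bp)
genomes <- data.frame(
  genome      = c("Arabidopsis thaliana", "Oryza sativa", "Zea mays"),
  genome.size = c(119667750, 375049285, 2135083061),
  gene.space  = c(77640474, 130066257, 168014276)
)

# Percentage of the genome covered by genes, rounded as in the table
genomes$percentage <- round(100 * genomes$gene.space / genomes$genome.size, 1)

# Sequence outside genes, in Mbp and as a percentage (hypothetical column names)
genomes$outside.Mbp  <- round((genomes$genome.size - genomes$gene.space) / 1e6, 1)
genomes$outside.perc <- round(100 - genomes$percentage, 1)

print(genomes)
```

With these three genomes the share of sequence outside genes grows from roughly one third in Arabidopsis thaliana to more than 90% in Zea mays.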
The goal of this session is to learn how regulatory sequences can be discovered in promoter sequences using statistical tests and aligned to build DNA motifs.
The annotation of Transposable Elements (TEs) within plant genomes can help in the interpretation of observed phenotypes, as TEs sometimes affect the expression of neighboring genes. It also helps in computational tasks such as whole genome alignment, promoter analysis or pan-genome analysis.
Usually TEs are annotated by alignment to curated libraries of repeated elements such as RepetDB (Amselem et al. 2019), where each sequence or element is classified according to the Wicker classification (Wicker et al. 2007). Class I elements (retrotransposons) are “copy and paste”, while Class II elements (DNA transposons) are “cut and paste”. The next figure summarizes this taxonomy of TEs, which resembles that of protein domains:
More recent approaches do not require any curation; instead, they simply identify repeated elements by counting words (\(K\)-mers) along the genome. An example of such tools is the Repeat Detector (Red; Girgis 2015), which has been used to routinely mask plant genomes (Contreras-Moreira et al. 2021).
The next table summarizes the fraction of repeated sequences in diverse plant genomes, as annotated with REdat and Red, and as reported in the original papers describing those genomes:
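To make the word-counting idea concrete, here is a toy sketch in base R. It only counts overlapping \(K\)-mers in a short string and flags the frequent ones; it is not the algorithm implemented in the Repeat Detector, and the function name `count_kmers`, the toy sequence and the cutoff of 3 are assumptions made for illustration.

```r
# Count every overlapping k-mer in a DNA string (base R only)
count_kmers <- function(dna, k) {
  dna <- toupper(dna)
  n <- nchar(dna) - k + 1
  if (n < 1) return(table(character(0)))   # sequence shorter than k
  starts <- seq_len(n)
  kmers <- substring(dna, starts, starts + k - 1)
  sort(table(kmers), decreasing = TRUE)
}

# Toy sequence: a (CA)n run embedded between unique flanks
toy <- "GATTGCACACACACACACATTCGGA"
counts <- count_kmers(toy, k = 4)

# k-mers seen at least 3 times are flagged as candidate repeats
repeated <- names(counts)[as.vector(counts) >= 3]
print(head(counts))
print(repeated)
```

In this toy case only the two words that tile the (CA)n run, CACA and ACAC, pass the cutoff, while every flanking word occurs once. On a real genome both the word length and the cutoff would have to scale with genome size.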
annot.stats <- read.csv(file="test_data/Ensembl_repeats.tsv", sep="\t", comment.char=";")
annot.stats <- annot.stats[,c(1,2,6,8,10)]  # keep genome, size and the three repeat percentages
kable(annot.stats,format.args = list(big.mark=","))
genome | size | perc_REdat | perc_Red | perc_literature |
---|---|---|---|---|
Arabidopsis thaliana | 119,667,750 | 14.2 | 36.7 | 19.0 |
Arabidopsis halleri | 196,243,198 | 15.5 | 31.1 | 32.7 |
Prunus dulcis | 227,498,357 | 6.5 | 33.4 | 37.6 |
Brachypodium distachyon | 271,163,419 | 27.4 | 31.1 | 21.4 |
Brassica rapa | 283,822,783 | 8.6 | 32.8 | 32.3 |
Trifolium pratense | 304,842,038 | 11.0 | 30.0 | 41.8 |
Arabis alpina | 308,032,609 | 15.1 | 37.7 | 47.9 |
Cucumis melo | 357,857,370 | 8.2 | 39.9 | 44.0 |
Citrullus lanatus | 365,450,462 | 6.8 | 40.8 | 45.2 |
Oryza sativa | 375,049,285 | 32.3 | 37.0 | 35.0 |
Setaria viridis | 395,731,502 | 18.5 | 40.8 | 46.0 |
Vitis vinifera | 486,265,422 | 9.0 | 40.0 | 41.4 |
Rosa chinensis | 515,588,973 | 8.4 | 48.1 | 67.9 |
Camelina sativa | 641,356,059 | 15.8 | 36.0 | 28.0 |
Malus domestica | 702,961,352 | 9.1 | 41.6 | 59.5 |
Olea europaea | 1,140,989,389 | 18.0 | 45.2 | 43.0 |
Zea mays | 2,135,083,061 | 59.5 | 79.0 | 85.0 |
Helianthus annuus | 3,027,844,945 | 10.0 | 73.6 | 74.7 |
Aegilops tauschii | 4,224,915,394 | 68.7 | 81.8 | 85.9 |
Triticum turgidum | 10,463,058,104 | 71.7 | 82.2 | 82.2 |
When the repeats themselves are not of interest for subsequent analyses, they are usually masked out, that is, marked so that downstream tools can skip them. Hard-masking replaces the sequence of each repeat with a run of Ns. Soft-masking keeps the repeated sequence but writes it in lower case:
ACTAGACTACGNNNNNNNNNNNATATATCA # hard-masked
ACTAGACTACGtttttttttttATATATCA # soft-masked
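The two conventions are easy to convert in one direction only. The following base R sketch turns the soft-masked example above into its hard-masked form and computes the masked fraction. The helper name `masked_fraction` is made up for this example, and the code assumes sequences contain only A, C, G, T and N.

```r
soft <- "ACTAGACTACGtttttttttttATATATCA"

# Soft -> hard: replace every lower-case base with N
hard <- gsub("[acgtn]", "N", soft)

# Soft -> unmasked: upper-casing recovers the original sequence,
# which is not possible once the bases have been replaced by N
unmasked <- toupper(soft)

# Fraction of positions that are masked (lower case or N)
masked_fraction <- function(dna) {
  bases <- strsplit(dna, "")[[1]]
  mean(bases %in% c("a", "c", "g", "t", "n", "N"))
}

print(hard)
print(unmasked)
print(masked_fraction(soft))
```

Because soft-masking is reversible, it is the safer choice when a later analysis might still need the repeat sequence.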
The regulation of gene expression is one of the fundamental topics in Genetics. Here you will learn about transcription factors (TFs) and cis-regulatory elements (CREs). TFs are proteins that bind specifically to DNA sequences called CREs and affect the expression of nearby genes.
DNA-binding proteins contain DNA-binding domains and have a specific or general affinity for either single- or double-stranded DNA. Here we will concentrate mostly on transcription factors, which generally recognize cis-regulatory elements in double-stranded DNA molecules.
Transcription factors recognize target DNA sequences through a binding interface, composed of protein residues and DNA stretches in intimate contact. The best descriptions of protein-DNA interfaces are provided by structural biology, usually by X-ray crystallography or NMR experiments.