1 Introduction

In this session we will continue annotating sequences, now with a focus on non-coding sequences. If we revisit the next figure, you will see that non-coding sequences usually make up the largest fraction of a genome (Mäkinen et al. 2015):

For instance, the genome of the model grass Brachypodium distachyon is about 270 Mbp long, but its genes span only about 122 Mbp. The gap widens as genomes grow larger, as you can see in the next table:

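library(knitr) # provides kable()
# read tab-separated stats; lines starting with ';' are treated as comments
annot.stats <- read.csv(file="test_data/Ensembl_stats.tsv", sep="\t", comment.char=";")
# render as a table, with thousands separators in large numbers
kable(annot.stats, format.args = list(big.mark=","))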
genome genome.size gene.space percentage
Arabidopsis thaliana 119,667,750 77,640,474 64.9
Arabidopsis halleri 196,243,198 67,469,378 34.4
Prunus dulcis 227,498,357 94,209,302 41.4
Brachypodium distachyon 271,163,419 122,468,185 45.2
Brassica rapa 283,822,783 83,199,064 29.3
Trifolium pratense 304,842,038 133,448,980 43.8
Arabis alpina 308,032,609 45,780,039 14.9
Cucumis melo 357,857,370 99,736,823 27.9
Citrullus lanatus 365,450,462 81,219,268 22.2
Oryza sativa 375,049,285 130,066,257 34.7
Setaria viridis 395,731,502 123,250,596 31.1
Vitis vinifera 486,265,422 153,535,016 31.6
Rosa chinensis 515,588,973 117,296,819 22.8
Camelina sativa 641,356,059 214,928,593 33.5
Malus domestica 702,961,352 145,783,855 20.7
Olea europaea 1,140,987,834 153,461,985 13.4
Zea mays 2,135,083,061 168,014,276 7.9
Helianthus annuus 3,027,844,945 199,817,746 6.6
Aegilops tauschii 4,224,915,394 348,412,472 8.2
Triticum turgidum 10,463,058,104 475,613,274 4.5
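
The percentage column is simply the gene space divided by the genome size. As a minimal sketch, the following recomputes it for four rows copied from the table above and checks the trend with a rank correlation. The data frame name `toy` and the extra column `noncoding.Mbp` are introduced here only for illustration and are not part of the course data files:

```r
# Four rows copied from the table above; 'toy' is a hypothetical name
toy <- data.frame(
  genome      = c("Arabidopsis thaliana", "Brachypodium distachyon",
                  "Zea mays", "Triticum turgidum"),
  genome.size = c(119667750, 271163419, 2135083061, 10463058104),
  gene.space  = c(77640474, 122468185, 168014276, 475613274)
)

# Percentage of the genome occupied by genes, rounded as in the table
toy$percentage <- round(100 * toy$gene.space / toy$genome.size, 1)

# Remainder of the genome outside gene space, in Mbp (illustrative column)
toy$noncoding.Mbp <- round((toy$genome.size - toy$gene.space) / 1e6, 1)

print(toy)

# Spearman rank correlation between genome size and gene percentage;
# a negative value means larger genomes devote a smaller share to genes
rho <- cor(toy$genome.size, toy$percentage, method = "spearman")
print(rho)
```

With only four genomes this is an illustration of the trend rather than a statistical test; the same two lines can be applied to the full `annot.stats` table.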

Therefore, when dealing with genome variation, polymorphisms are more likely to fall in non-coding regions than in genes. Among non-coding sequences, probably the most interesting are regulatory sequences and repeated elements.

The goal of this session is to learn how regulatory sequences can be discovered in promoter sequences using statistical tests and aligned to build DNA motifs.

2 Genomic repeated sequences

The annotation of Transposable Elements (TEs) within plant genomes can help in the interpretation of observed phenotypes, as TEs sometimes affect the expression of neighboring genes. It also helps in computational tasks such as whole-genome alignment, promoter analysis or pan-genome analysis.

Usually TEs are annotated by alignment to curated libraries of repeated elements such as RepetDB (Amselem et al. 2019), where each sequence or element is classified according to the Wicker classification (Wicker et al. 2007). Class I elements are “copy and paste”, while Class II are “cut and paste”. The next figure summarizes this taxonomy of TEs, which resembles that of protein domains:

More recent approaches do not require any curated library; instead, they identify repeated elements simply by counting words (\(K\)-mers) along the genome. An example of such tools is the Repeat Detector (Red, Girgis 2015), which has been used to routinely mask plant genomes (Contreras-Moreira et al. 2021).
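
To make the idea of counting words concrete, here is a minimal sketch in base R. It is a toy illustration only, not the algorithm implemented in Red; the function name `count_kmers`, the example sequence and the `min_count` threshold are all assumptions made for this example:

```r
# Count all overlapping K-mers in a DNA string (toy sketch, not Red's method)
count_kmers <- function(dna, k) {
  dna <- toupper(dna)
  n <- nchar(dna)
  if (n < k) return(table(character(0)))
  starts <- seq_len(n - k + 1)
  kmers <- substring(dna, starts, starts + k - 1)
  table(kmers)
}

# Hypothetical sequence with an embedded run of "AT" dinucleotides
dna <- "GCTAGCATATATATATATCGGAC"
counts <- count_kmers(dna, k = 4)

# K-mers seen at least 'min_count' times are candidate repeated words
min_count <- 3
repeated <- sort(counts[counts >= min_count], decreasing = TRUE)
print(repeated)
```

In this toy sequence only the words ATAT and TATA pass the threshold, and both come from the embedded repeat. Real tools work on whole genomes, use longer words and model the expected counts, so the fixed threshold used here should be read as a simplification.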

The next table summarizes the fraction of repeated sequences in diverse plant genomes, as annotated with REdat, Red and the original papers describing those genomes:

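# read tab-separated repeat stats; lines starting with ';' are treated as comments
annot.stats <- read.csv(file="test_data/Ensembl_repeats.tsv", sep="\t", comment.char=";")
# keep only genome, size and the three repeat percentage columns
annot.stats <- annot.stats[,c(1,2,6,8,10)]
kable(annot.stats, format.args = list(big.mark=","))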
genome size perc_REdat perc_Red perc_literature
Arabidopsis thaliana 119,667,750 14.2 36.7 19.0
Arabidopsis halleri 196,243,198 15.5 31.1 32.7
Prunus dulcis 227,498,357 6.5 33.4 37.6
Brachypodium distachyon 271,163,419 27.4 31.1 21.4
Brassica rapa 283,822,783 8.6 32.8 32.3
Trifolium pratense 304,842,038 11.0 30.0 41.8
Arabis alpina 308,032,609 15.1 37.7 47.9
Cucumis melo 357,857,370 8.2 39.9 44.0
Citrullus lanatus 365,450,462 6.8 40.8 45.2
Oryza sativa 375,049,285 32.3 37.0 35.0
Setaria viridis 395,731,502 18.5 40.8 46.0
Vitis vinifera 486,265,422 9.0 40.0 41.4
Rosa chinensis 515,588,973 8.4 48.1 67.9
Camelina sativa 641,356,059 15.8 36.0 28.0
Malus domestica 702,961,352 9.1 41.6 59.5
Olea europaea 1,140,989,389 18.0 45.2 43.0
Zea mays 2,135,083,061 59.5 79.0 85.0
Helianthus annuus 3,027,844,945 10.0 73.6 74.7
Aegilops tauschii 4,224,915,394 68.7 81.8 85.9
Triticum turgidum 10,463,058,104 71.7 82.2 82.2

When the repeats themselves are not of interest for subsequent analyses, they are usually masked out. This means that they are marked so that downstream tools can skip them. Hard-masking means replacing the sequence of each repeat with a run of N characters of the same length. Soft-masking means writing the repeated sequences in lower-case, so that the original bases are preserved:

ACTAGACTACGNNNNNNNNNNNATATATCA  # hard-masked
ACTAGACTACGtttttttttttATATATCA  # soft-masked
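
The two conventions are easy to convert and measure. As a minimal sketch in base R, the helpers below turn a soft-masked sequence into a hard-masked one and compute the masked fraction; the function names `hard_mask` and `masked_fraction` are hypothetical and chosen only for this example:

```r
# Convert a soft-masked sequence to hard-masked: lower-case bases become N.
# [[:lower:]] is used instead of [a-z], whose meaning can depend on the locale
hard_mask <- function(dna) gsub("[[:lower:]]", "N", dna)

# Fraction of positions that are masked (lower-case or N)
masked_fraction <- function(dna) {
  chars <- strsplit(dna, "")[[1]]
  mean(grepl("[[:lower:]N]", chars))
}

soft <- "ACTAGACTACGtttttttttttATATATCA"  # soft-masked example from the text
hard <- hard_mask(soft)
print(hard)
print(masked_fraction(soft))
```

Note that the conversion only goes one way: once a repeat is hard-masked, the original bases cannot be recovered, which is why soft-masked genomes are usually the more flexible starting point.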

3 DNA-binding proteins and regulatory sequences

The regulation of gene expression is one of the fundamental topics in Genetics. Here you will learn about transcription factors (TFs) and cis-regulatory elements (CREs). TFs are proteins that bind specifically to DNA sequences called CREs and affect the expression of nearby genes.

3.1 Protein-DNA recognition

DNA-binding proteins contain DNA-binding domains and have a specific or general affinity for either single- or double-stranded DNA. Here we will concentrate mostly on transcription factors, which generally recognize cis-regulatory elements in double-stranded DNA molecules.

3.1.1 Dissecting a protein-DNA interface

Transcription factors recognize target DNA sequences through a binding interface, composed of protein residues and DNA stretches in intimate contact. The best descriptions of protein-DNA interfaces are provided by structural biology, usually by X-ray crystallography or NMR experiments.

Details of the interface of Lac repressor and its operator site (Lewis et al. 1996)