1 Introduction

In this session we will continue annotating sequences, but now with a focus on non-coding sequences. If we revisit the next figure, I hope you’ll see that non-coding sequences are usually the largest fraction of genomes (Mäkinen et al. 2015):

For instance, the genome of the model grass Brachypodium distachyon is about 270Mbp long, but its genes take only about 122Mbp. The difference increases as the genomes grow larger, as you can see in the next table:

annot.stats <- read.csv(file="test_data/Ensembl_stats.tsv", sep="\t", comment.char=";")
kable(annot.stats,format.args = list(big.mark=","))
genome genome.size gene.space percentage
Arabidopsis thaliana 119,667,750 77,640,474 64.9
Arabidopsis halleri 196,243,198 67,469,378 34.4
Prunus dulcis 227,498,357 94,209,302 41.4
Brachypodium distachyon 271,163,419 122,468,185 45.2
Brassica rapa 283,822,783 83,199,064 29.3
Trifolium pratense 304,842,038 133,448,980 43.8
Arabis alpina 308,032,609 45,780,039 14.9
Cucumis melo 357,857,370 99,736,823 27.9
Citrullus lanatus 365,450,462 81,219,268 22.2
Oryza sativa 375,049,285 130,066,257 34.7
Setaria viridis 395,731,502 123,250,596 31.1
Vitis vinifera 486,265,422 153,535,016 31.6
Rosa chinensis 515,588,973 117,296,819 22.8
Camelina sativa 641,356,059 214,928,593 33.5
Malus domestica 702,961,352 145,783,855 20.7
Olea europaea 1,140,987,834 153,461,985 13.4
Zea mays 2,135,083,061 168,014,276 7.9
Helianthus annuus 3,027,844,945 199,817,746 6.6
Aegilops tauschii 4,224,915,394 348,412,472 8.2
Triticum turgidum 10,463,058,104 475,613,274 4.5

Therefore, when dealing with genome variation, polymorphisms have a greater chance to occur in non-coding regions. Among these, probably the most interesting are regulatory sequences and repeated elements.

The goal of this session is to learn how regulatory sequences can be discovered in promoter sequences using statistical tests and aligned to build DNA motifs.

2 Genomic repeated sequences

The annotation of Transposable Elements (TEs) within plant genomes can help in the interpretation of observed phenotypes, as sometimes TEs affect the expression of neighbor genes, and in computational tasks such as promoter whole genome alignment, promoter or pan-genome analyses.

Usually TEs are annotated by alignment to curated libraries of repeated elements such as RepetDB (Amselem et al. 2019), where each sequence or element is classified according to the Wicker classification (Wicker et al. 2007). Class I elements are “copy and paste”, while Class II are “cut and paste”. The next figure summarizes this taxonomy of TEs, which ressembles that of protein domains:

More recently, there are other approaches which don’t require any curation; instead, these simply identify repeated elements by counting words (\(K\)-mers) along the genome. An example of such tools is the Repeat Detector (Girgis 2015), which has been used to routinely mask plant genomes (Contreras-Moreira et al. 2021).

The next table summarizes the fraction of repeated sequences in diverse plant genomes, as annotated with REdat, Red and the original papers describing those genomes:

annot.stats <- read.csv(file="test_data/Ensembl_repeats.tsv", sep="\t", comment.char=";")
annot.stats = annot.stats[,c(1,2,6,8,10)]
kable(annot.stats,format.args = list(big.mark=","))
genome size perc_REdat perc_Red perc_literature
Arabidopsis thaliana 119,667,750 14.2 36.7 19.0
Arabidopsis halleri 196,243,198 15.5 31.1 32.7
Prunus dulcis 227,498,357 6.5 33.4 37.6
Brachypodium distachyon 271,163,419 27.4 31.1 21.4
Brassica rapa 283,822,783 8.6 32.8 32.3
Trifolium pratense 304,842,038 11.0 30.0 41.8
Arabis alpina 308,032,609 15.1 37.7 47.9
Cucumis melo 357,857,370 8.2 39.9 44.0
Citrullus lanatus 365,450,462 6.8 40.8 45.2
Oryza sativa 375,049,285 32.3 37.0 35.0
Setaria viridis 395,731,502 18.5 40.8 46.0
Vitis vinifera 486,265,422 9.0 40.0 41.4
Rosa chinensis 515,588,973 8.4 48.1 67.9
Camelina sativa 641,356,059 15.8 36.0 28.0
Malus domestica 702,961,352 9.1 41.6 59.5
Olea europaea 1,140,989,389 18.0 45.2 43.0
Zea mays 2,135,083,061 59.5 79.0 85.0
Helianthus annuus 3,027,844,945 10.0 73.6 74.7
Aegilops tauschii 4,224,915,394 68.7 81.8 85.9
Triticum turgidum 10,463,058,104 71.7 82.2 82.2

When the repeats themsleves are not of interest for subsequent analyses, they are said to be masked out. This means that they are marked so that they can be avoided. Hard-masking means replacing the sequences of repeats with polyN oligonucleotides. Soft-masking means leaving the repeated sequences in lower-case.

3 DNA-binding proteins and regulatory sequences

The regulation of gene expression is one of the fundamental topics in Genetics. Here you will learn about transcription factors (TFs) and cis-regulatory elements (CREs). TFs are proteins that bind specifically to DNA sequences called CREs and affect the expression of nearby genes.

3.1 Protein-DNA recognition

DNA-binding proteins contain DNA-binding domains and have a specific or general affinity for either single or double stranded DNA. Here we will concentrate mostly on transcription factors, which generally recognize cis-regulatory elements in double-stranded DNA molecules.

3.1.1 Dissecting a protein-DNA interface

Transcription factors recognize target DNA sequences through a binding interface, composed of protein residues and DNA stretches in intimate contact. The best descriptions of protein-DNA interfaces are provided by structural biology, usually by X-ray or NMR experiments.

Details of the interface of Lac repressor and its operator site ( Lewis et al. (1996)) Details of the interface of Lac repressor and its operator site ( Lewis et al. (1996)) Cartoon dissection of the interface of a NarL dimer (PDB entry 1JE8)

3.1.2 Atomic interactions between protein and DNA residues

The process of recognition of DNA sequences by proteins involves readout mechanisms, and also stabilizing atomic interactions that do not confer specificity.

Direct readout

  • Hydrogen bonds: direct + water-mediated
  • Hydrophobic interactions

Indirect readout

  • Sequence-specific deformation of DNA base steps

Stabilizing interactions

  • Not sequence-specific, involving DNA backbone

3.1.3 Direct readout: hydrogen bonds

Donor and acceptor hydrogen bond atoms in nitrogen bases, from Luscombe, Laskowski, and Thornton (2001)
Donor and acceptor hydrogen bond atoms in nitrogen bases, from Luscombe, Laskowski, and Thornton (2001)

3.1.4 Direct readout: Van der Waals interactions

DNA bases display VdW interactions involving mainly C atoms, from Luscombe, Laskowski, and Thornton (2001)
DNA bases display VdW interactions involving mainly C atoms, from Luscombe, Laskowski, and Thornton (2001)
Sequence-specific VdW interactions involve fundamentally thymine-C5M group, adapted from Luscombe, Laskowski, and Thornton (2001)
Sequence-specific VdW interactions involve fundamentally thymine-C5M group, adapted from Luscombe, Laskowski, and Thornton (2001)

3.1.5 Indirect/shape readout

Besides atomic interactions between protein and DNA, sequence-dependent deformability of duplexes, deduced from crystal complexes, implies that sequence recognition also involves DNA shape.

DNA step geometry descriptors, from Lu and Olson (2008)
DNA step geometry descriptors, from Lu and Olson (2008)

DNA deformation is described by the increase in energy brought about by instantaneous fluctuations of the step parameters from their equilibrium values:

\(deformation = \displaystyle\sum_{i=0}^6 \displaystyle\sum_{j=0}^6 spring_{ij} \Delta\theta_{i,st} \Delta\theta_{j,st}\) (Olson et al. (1998))

Roll and twist of DNA steps are different for purine and pyrimidine steps, from Olson et al. (1998)
Roll and twist of DNA steps are different for purine and pyrimidine steps, from Olson et al. (1998)

The accumulation of experimental and molecular dynamics data of DNA molecules currently supports predictive algorithms, such as DNAshape, which predict the geometry of DNA sequences:

Geometry descriptors of a DNA duplex by sliding a sequence window, from Zhou et al. (2013). MGW=Major Groove Width.
Geometry descriptors of a DNA duplex by sliding a sequence window, from Zhou et al. (2013). MGW=Major Groove Width.

3.1.6 Protein-DNA interface graphs

Interfaces can be explored as generic bipartite graphs (Sathyapriya, Vijayabaskar, and Vishveshwara (2008)) or with sub-graphs that focus only on specific sequence recognition:

Interface graph of NarL, where solid bases indicate indirect readout and arrows direct readout interactions. Adapted from 3d-footprint (Contreras-Moreira (2010))
Interface graph of NarL, where solid bases indicate indirect readout and arrows direct readout interactions. Adapted from 3d-footprint (Contreras-Moreira (2010))

3.2 Comparison of DNA-binding proteins

A great variety of DNA-binding proteins have been observed in nature, which can be analyzed and compared in terms of the features introduced above, such as readout, or instead with an evolutionary or topological perspective.

Example HTH transcription factors from the work of Luscombe et al. (2000)
Example HTH transcription factors from the work of Luscombe et al. (2000)

The Structural Classification of Proteins (SCOP) systematically groups protein folds in superfamilies, including DNA-binding proteins. The next table shows superfamilies with more than 20 non-redundant complexes in the Protein Data Bank as of Jun2023, as annotated in the database 3d-footprint:

SCOP superfamily Number of complexes
Winged helix (WH) 115
Homeodomain-like (H) 79
Glucocorticoid-receptor-like (GR) 40
Restriction endonuclease-like (RE) 21
Homing endonuclease (HE) 38
p53-like (P53) 24
Lambda-repressor-like (LR) 37

DNA-binding proteins can also be grouped in terms of the organisms in which they are found. For instance, the Plant-TFClass structural classification currently accepts 56 plant transcription factor types, which include up to 37 DNA-binding domains absent in mammals (Blanc-Mathieu et al. 2023).

Examples of plant transcription factor DNA binding domains in the Plant-TFClass structural classification, from Blanc-Mathieu et al. (2023)
Examples of plant transcription factor DNA binding domains in the Plant-TFClass structural classification, from Blanc-Mathieu et al. (2023)
Number of hydrogen bonds and VdW interactions across some SCOP superfamilies of DNA-binding proteins (right) and total atomic contacts per monomer (left). ZF=C2H2 Zinc Finger, RHH=Ribbon-helix-helix, from Contreras-Moreira, Sancho, and Angarica (2010)
Number of hydrogen bonds and VdW interactions across some SCOP superfamilies of DNA-binding proteins (right) and total atomic contacts per monomer (left). ZF=C2H2 Zinc Finger, RHH=Ribbon-helix-helix, from Contreras-Moreira, Sancho, and Angarica (2010)
Frequency of distorted DNA steps (indirect readout) of target DNA sites of some SCOP superfamilies of DNA-binding proteins, from Contreras-Moreira, Sancho, and Angarica (2010)
Frequency of distorted DNA steps (indirect readout) of target DNA sites of some SCOP superfamilies of DNA-binding proteins, from Contreras-Moreira, Sancho, and Angarica (2010)

3.2.1 Analysis of protein-DNA interfaces in transcription factors families

The available experimental structures of protein-DNA complexes in the PDB support the annotation of interface residues, those involved directly in sequence recognition, within protein families. The database footprintDB does exactly this.

Annotation of interface residues by transference from PDB complexes in the footprintDB database
Annotation of interface residues by transference from PDB complexes in the footprintDB database

Several examples in the literature have demonstrated the correlation between interface patterns and the bound DNA motifs within large transcription factor families, such as the work of Noyes et al. (2008):

Relation between (annotated) interface residues and cognate DNA motifs or homeobox transcription factors, from Noyes et al. (2008).
Relation between (annotated) interface residues and cognate DNA motifs or homeobox transcription factors, from Noyes et al. (2008).

The study of interfaces must be done in the appropriate biological context, for instance considering the oligomerization state of TFs in vivo, as each family of transcription factors has singularities, such as these (compiled by Álvaro Sebastián):

Family Motifs Multimeric Multidomain
Homeodomain TAATkr,TGAyA Sometimes Unusual
Basic helix-loop-helix (bHLH) CACGTG,CAsshG Always (homodimers, heterodimers) Never
Basic leucine zipper (bZIP) CACGTG,-ACGT-,TGAGTC Always (homodimers, heterodimers) Never
MYB GkTwGkTr Common (multimers) Common
High mobility group (HMG) mTT(T)GwT,TTATC,ATTCA Sometimes Unusual
GAGA GAGA Never Never
Fork head TrTTTr Unusual Never
Fungal Zn(2)-Cys(6) binuclear cluster CGG Common (homodimers) Never
Ets GGAw Common (homodimers, heterodimers, multimers) Never
Rel homology domain (RHD) GGnnwTyCC Always (homodimers, heterodimers) Never
Interferon regulatory factor AAnnGAAA Always (homodimers, heterodimers, multimers) Never

3.2.2 Aligning cis elements by superposition of atomic coordinates of interfaces

Here I briefly describe a structural alignment approach for the comparison of DNA-binding proteins and their interfaces, as discussed in Siggers, Silkov, and Honig (2005) and Sebastian and Contreras-Moreira (2013). In this context superpositions might guide the correct alignment of cis elements bound by homologous proteins, as illustrated in the figure.

Sequence and structure alignment of cis-elements bound by two homologous PDB complexes: 3A01_A1 and 1FJL_B1. Note that this optimal sequence alignment does not match equivalent interface positions, adapted from Sebastian and Contreras-Moreira (2013).
Sequence and structure alignment of cis-elements bound by two homologous PDB complexes: 3A01_A1 and 1FJL_B1. Note that this optimal sequence alignment does not match equivalent interface positions, adapted from Sebastian and Contreras-Moreira (2013).

3.2.3 Specificity-based classification of DNA-binding proteins

DNA motifs inferred from protein-complexes can be analyzed in terms of their information content:

Information content of structure-derived DNA motifs of proteins belonging to several SCOP superfamilies as of Jun2023, after Contreras-Moreira (2010)
Information content of structure-derived DNA motifs of proteins belonging to several SCOP superfamilies as of Jun2023, after Contreras-Moreira (2010)

4 Discovering cis elements in genomes by counting words (\(k\)-mers)

Now that we know how transcription factors recognize their target DNA sites, we will how learn how enrichment of DNA words in upstream gene promoter sequences can be computed in order to discover putative regulatory elements. We will follow the method of Helden, Andre, and Collado-Vides (1998).

Let’s have a look at the sample data:

# a set of upstream sequences of a group of co-regulated genes on Brachypodium distachyon
# see that they have been cut with coordinates -1000,+200 with respect to 
# transcription start site (TSS)
head test_data/regulon.raw.fna
echo

# same sequences but repeat-masked
head test_data/regulon.rm.fna
echo

# frequency of hexamers in those raw upstream sequences computed with RSAT  
# oligo-analysis -v 1 -seqtype dna -l 6 -return occ,freq -quick -str 2 -ovlp -sort
head -35 test_data/6nt_upstream_regulon.raw.tab | grep -v ";"
echo

# frequency of hexamers in the repeat-masked sequences 
head -35 test_data/6nt_upstream_regulon.rm.tab | grep -v ";"
echo

# frequency of hexamers in all (rm) upstream sequences of that genome
head -35 test_data/6nt_upstream_genome.tab | grep -v ";"
>BRADI1G02630   Bradi1g02630; upstream from -1000 to +200; size: 1201; feature type:gene; location: Brachypodium_distachyon.v1.0.37:1:1782739:1783939:R; upstream neighbour: BRADI1G02640 (distance: 5670)
TAAGTATTATCTTTTCAAGGGAAGGAACCTCCTGGCCTTCGCATCCTCGGATGCACACAA
CCATGTTTTTATTTGTAAGAAAAATACCACACACCAAGATCGAACACATGAGTTTACAAA
AAAAAACAACAACTAGAGAACATCTTTCTAAAACTGGTCTCTCGGCATCCATCTTCTCCA
TAACCTACAAAGGTGACAATCTTACCCATGGTTAACAACCCCCATGGATTTTTGATTCAA
ATGGTATGGCATAAGTTGCTATTTTCGATTTTCGCCCACCAGTGGTAATCCATATGTAAA
ACCTATTAGTCATGAGTAGAGCAGGATGCAAGTTAAATTAATGGAAAAAGATTAGATGTG
ATGTGTTTATATGAGTTTAATTTGACTAAACCCTGTCTCTAGCTACCAATCCTCAATTCC
TCATAGCGTGTCGTCGTCTCTGGCATCAGAGCAAATTCAAATCCTGAGACGATTGTGACT
AAGGCCTCATTCGTTCCATAGGAATTTCACGGGAAACTCAAAGGATTCAGGATTCCATAA

>BRADI1G02630   Bradi1g02630; upstream from -1000 to +200; size: 1201; feature type:gene; location: Brachypodium_distachyon.v1.0.37:1:1782739:1783939:R; upstream neighbour: BRADI1G02640 (distance: 5670)
TAAGTATTATCTTTTCAAGGGAAGGAACCTCCTGGCCTTCGCATCCTCGGATGCACACAA
CCATGTTTTTATTTGTAAGAAAAATACCACACACCAAGATCGAACACATGAGTTTACNNN
NNNNNNCAACAACTAGAGAACATCTTTCTAAAACTGGTCTCTCGGCATCCATCTTCTCCA
TAACCTACAAAGGTGACAATCTTACCCATGGTTAACAACCCCCATGGATTTTTGATTCAA
ATGGTATGGCATAAGTTGCTATTTTCGATTTTCGCCCACCAGTGGTAATCCATATGTAAA
ACCTATTAGTCATGAGTAGAGCAGGATGCAAGTTAAATTAATGGAAAAAGATTAGATGTG
ATGTGTTTATATGAGTTTAATTTGACTAAACCCTGTCTCTAGCTACCAATCCTCAATTCC
TCATAGCGTGTCGTCGTCTCTGGCATCAGAGCAAATTCAAATCCTGAGACGATTGTGACT
AAGGCCTCATTCGTTCCATAGGAATTTCACGGGAAACTCAAAGGATTCAGGATTCCATAA

#seq    id  obs_freq    occ
aaaaaa  aaaaaa|tttttt   0.0044876667803 1712
cgccgc  cgccgc|gcggcg   0.0024666439487 941
ccgccg  ccgccg|cggcgg   0.0023172298094 884
gccgcc  gccgcc|ggcggc   0.0023067446067 880
aaaaat  aaaaat|attttt   0.0020288867336 774

#seq    id  obs_freq    occ
cgccgc  cgccgc|gcggcg   0.0019319011803 693
gccgcc  gccgcc|ggcggc   0.0018761464565 673
ccgccg  ccgccg|cggcgg   0.0018677832479 670
aagaaa  aagaaa|tttctt   0.0016196747269 581
ctcctc  ctcctc|gaggag   0.0015806464203 567

#seq    id  obs_freq    occ
aaaaaa  aaaaaa|tttttt   0.0048170011903 307261
aaaaac  aaaaac|gttttt   0.0014555523659 92845
aaaaag  aaaaag|cttttt   0.0016166515755 103121
aaaaat  aaaaat|attttt   0.0023315959462 148725
aaaaca  aaaaca|tgtttt   0.0015835569442 101010

Now let’s inspect the R code which will be used in this demo. We’ll start by checking the function that computes the significance of hexamers (words of length \(K=6\)):

#' @author Najla Nksouri <najlaksouri@gmail.com> 
#' Revised by Jacques van Helden <Jacques.van-Helden@univ-amu.fr>
#' @description Compare the statistics between query and background k_mer counts
#' @param query.oligo.file file containing k-mer counts computed from the query sequences, as produced by RSAT oligo-analysis.
#' This file must contain 
#' @param background.oligo.file file containing background k-mer occurrences

kmer.stats <- function(query.oligo.file,
                       background.oligo.file) {
  
  ## Check existence of input files
  if (!file.exists(query.oligo.file)) {
    stop("kmer.stats() error: query.oligo.file does not exist: ", query.oligo.file)
  }
  if (!file.exists(background.oligo.file)) {
    stop("kmer.stats() error: background.oligo.file does not exist: ", ... = background.oligo.file)
  }
  
  ## Read tables and rename first column
  oligo.query.freq.table <- 
    read.table(query.oligo.file, sep = "\t", comment.char = ";", header = TRUE)
  names(oligo.query.freq.table)[1] <- "seq"
  
  oligo.genome.freq.table <- 
    read.table(background.oligo.file, sep = "\t", comment.char = ";", header = TRUE)
  names(oligo.genome.freq.table)[1] <- "seq"
  
  oligo.table <- merge(
    oligo.query.freq.table,
    oligo.genome.freq.table,
    by=c("seq", "id"), suffixes=c("_query", "_background")
  )
  
  ## Compute the p-value of the observed occurrences in query frequencies
  nb.oligos <- nrow(oligo.table)
  N <- sum(oligo.table$occ_query) 
  
  ## Compute th number of occurrences expected in random background fragments
  ## of the same size as the query sequences.
  oligo.table$exp.occ <- oligo.table$obs_freq_background * N
  
  ### Compute the nominal (unadjusted) p-value of each oligonucleotide
  
  ## Test the over-representation: righ-tailed test
  oligo.table$pval.over <- pbinom(q = oligo.table$occ_query -1, size = N, 
                                  prob = oligo.table$obs_freq_background, lower.tail=FALSE)
  
  ## Compute log p-value to avoid the floating point limitation to 1e-320
  oligo.table$log.pval.over <- pbinom(q = oligo.table$occ_query -1, size = N, 
                                      prob = oligo.table$obs_freq_background, lower.tail=FALSE, log=TRUE)
  
  ## Test under-representation: left-tailed test
  oligo.table$pval.under <- pbinom(q = oligo.table$occ_query, size = N, 
                                   prob = oligo.table$obs_freq_background, lower.tail=TRUE)
  
  oligo.table$log.pval.under <- pbinom(q = oligo.table$occ_query, size = N, 
                                       prob = oligo.table$obs_freq_background, lower.tail=TRUE, log=TRUE)
  
  ## Two-sided test
  oligo.table$pval <- pmin(oligo.table$pval.under, oligo.table$pval.over) * 2
  oligo.table$log.pval <- pmin(oligo.table$log.pval.under, oligo.table$log.pval.over) * 2
  
  ## Compute the E-value (Bonferroni multiple testing correction).`
  ##  E-value indicates the expected number of false positives 
  ## (i.e. how many oligos would be expected to be be declared positive under 
  ## the null hypothesis)
  oligo.table$eval <- oligo.table$pval * nb.oligos
  
  ## Compute the significance
  ## sig = -log10(eval)
  oligo.table$sig <- -log10(nb.oligos) - oligo.table$log.pval/log(10)
  
  ## Compute maximal frequency for both files
  max.axis <- max(c(oligo.table$obs_freq_query, oligo.table$obs_freq_background))
  
  ## Log2 ratio observed/expected
  oligo.table$freq.ratio <- oligo.table$obs_freq_query/oligo.table$obs_freq_background
  oligo.table$log2.ratio <- log2(oligo.table$freq.ratio)
  
  return(oligo.table)
}

Now we shall look at the plotting function:

#' @title Scatter plot of background versus query k-mer frequencies. 
#' @author Najla Nksouri <najlaksouri@gmail.com> 
#' Revised by Jacques van Helden <Jacques.van-Helden@univ-amu.fr>
#' @description Draw a scatter plot comparing background and query k-mer frequencies for a given size. 
#' @param kmer.stat.table a table with k-mer statistics produced by kmer.stats()
#' @param k oligonucleotide (k-mer) size
#' @param sig.threshold=2 Threshold on significance for the volcano plot. 
#' @param volcano.ymax=350
#' @return a data.frame with merged query and background k-mer frequencies + derived statistics
#' @examples
#' ## Plot query versus background 4-mer frequencies
#' plot.kmers(k=8)
#' 
plot.kmers <- function(kmer.stat.table,
                       k,
                       sig.threshold = 2,
                       log2FC.threshold = 1, ## Enrichment threshold
                       volcano.ymax = 350) {
  message(k, "-mer frequency plot")

  ## Define dot colors  
  dot.colors <- ifelse(
    (kmer.stat.table$sig > sig.threshold & kmer.stat.table$log2.ratio > 0.1), "darkred",
         ifelse((kmer.stat.table$sig > sig.threshold & kmer.stat.table$log2.ratio < -0.1),
                "darkblue", "darkgray"))

  ## Draw a scatter plot to compare k-mer frequencies
  max.freq <- max(kmer.stat.table[, c("obs_freq_query","obs_freq_background")])
  plot(x = kmer.stat.table$obs_freq_background,
       y = kmer.stat.table$obs_freq_query,
       
       ## Set axis limits
       xlim = c(0, max.freq),
       ylim = c(0, max.freq),
       
       ## Set labels
       xlab = paste("background ", k,"-mer frequencies", sep = ""),
       ylab = paste("query ", k,"-mer frequencies", sep = ""),
       main = paste("background vs query ", k,"-mer frequencies", sep = ""),
       
       col = dot.colors,
       
       ## Insert grid
       panel.first=grid(),
       
       ## Change symbol
       pch=1
  )
  ## Insert diagonal
  abline(0, 1, col="black", lwd=1, lty = 2)
  
  ## MA plot : log-ratio as a function of the log counts
  ## Note: usually, the MA plot represent the log-ratio versus log(mean) of the two measurements.
  ## However, in our case the query sequences are a subset of the background sequences. 
  ## We will thus plot the log(background freq) on the X axis
  plot(x = log2(kmer.stat.table$obs_freq_background),
       y = kmer.stat.table$log2.ratio,
       
       ## Set labels
       xlab = paste("log2(background)"),
       ylab = paste("log2(query/background)"),
       main = paste(k,"-mer frequencies", sep = ""),
       
       col = dot.colors,
       
       ## Insert grid
       panel.first=grid(),
       
       ## Change symbol
       pch=1
  )
  ## Insert diagonal
  abline(h=0, col="black", lwd=1, lty = 2)
  
  
  ## Draw a volcano plot
  ##
  ## Abcsissa indicates the effect,i.e. the log2-ratio query versus background. 
  ## Ordinate indicates the significance as computed above. 
  ##
  ## Mark in red the over-represented k-mers with sig > given threshold.
  ## and in blue the under-represented k-mers with sig > given threshold.
  
  ## Set axis limit 
  x.lim <- abs(max(kmer.stat.table$log2.ratio))
  ## Plotting
  with(kmer.stat.table,plot(x=log2.ratio, 
                        y=sig, 
                        pch = 1,
                        xlim = c(-x.lim, x.lim),
                        ylim = c (0, volcano.ymax),
                        
                        ## Set labels
                        xlab = paste("Log2.ratio ", k,"-mer frequencies", sep = ""),
                        ylab = paste("Significance ", k,"-mer frequencies", sep = ""),
                        main = paste("Volcano Plot ", k,"-mer frequencies", sep = ""),
                        # Insert grid
                        panel.first=grid(),
                        
                        #set the colors 
                        col = dot.colors))
  
  ## Denote significance exceeding the threshold by 
  above.ymax <- kmer.stat.table$sig > volcano.ymax
  with(kmer.stat.table[above.ymax,],
       points(x=log2.ratio, y=rep(x = volcano.ymax, length.out=sum(above.ymax)), pch=17, col="red",type="p"))
  
  ## insert diagonals
  abline(h=sig.threshold, col= "black", lty = 2)
  abline(v=0, col= "black", lty = 2)
  abline(v=-log2FC.threshold, col= "black", lty = 2)
  abline(v=log2FC.threshold, col= "black", lty = 2)
}

With this data and functions we can now compute \(k\)-mer enrichment in the promoters of a sample set of co-regulated genes or regulon:

Cluster of co-expressed genes
Cluster of co-expressed genes
# adjust as required 
dir_data = "test_data/" # where pre-computed k-mer freq files are 
dir_code = "test_code/" 

source(file.path(dir_code, "kmer_stats.R"))
source(file.path(dir_code, "plot_kmers.R"))

# set word/k-mer size
k = 6 

# set files with pre-computed words/k-mer frequencies
# raw sequences
#oligo.upstream.freq.file = file.path(dir_data,"6nt_upstream_regulon.raw.tab")
# recommended repeat-masked sequences
oligo.upstream.freq.file = file.path(dir_data,"6nt_upstream_regulon.rm.tab")
  
oligo.genome.freq.file = file.path(dir_data, "6nt_upstream_genome.tab")
 
# check the code of this function to see how the probabilities of word occurrences 
# are computed with the binomial function
kmer.stat.table <- kmer.stats(query.oligo.file = oligo.upstream.freq.file, 
                               background.oligo.file = oligo.genome.freq.file)
#View(kmer.stat.table)
  
plot.kmers(kmer.stat.table, k, volcano.ymax = 60) 

4.1 Exercises

The plant-dedicated mirror of the Regulatory Sequence Analysis Tools (RSAT, http://plants.rsat.eu) offers specialized options for researchers dealing with plant transcriptional regulation. The web site contains whole-sequenced species regularly updated from Ensembl Plants and other sources, and supports an array of tasks frequently required for the analysis of regulatory sequences, such as retrieving upstream sequences, motif discovery, motif comparison, pattern matching. RSAT::Plants also integrates the footprintDB collection of DNA motifs. This exercise consists on following a step-by-step protocol on how to discover DNA motifs in regulatory regions of clusters of co-expressed genes in plants. It also explains how to empirically control the significance of the result, and how to annotate any discovered motis.

This protocol has been published at (Contreras-Moreira et al. 2016) and updated at (Ksouri et al. 2021). Other related protocols can be found at (Castro-Mondragon et al. 2016; Contreras-Moreira and Sebastian 2016).

4.1.1 Background

Transcriptome data (microarrays, RNA-seq) have been extensively used as a proxy for genetic regulation in many organisms, as the analysis of genome-wide profiles of gene transcription under different treatments uncovers clusters of genes with correlated behaviors, which may result from direct or indirect co-regulation. A classical application of this approach was done by Beer and Tavazoie (2004) with yeast microarray data sets obtained in a variety of experimental conditions. In that experiment, expression data-mining was demonstrated to be an effective strategy for finding regulons, groups of genes that share regulatory mechanisms and functional annotations.

Other studies have unveiled that the outcome of these approaches largely depends on the genomic background of the species under study. For instance, Sand O. and Helden J. (2008) reported that the significance of DNA motifs discovered in Saccharomyces cerevisiae promoters is much higher for regulons than for random gene sets of the same sizes, but for human promoters the signal-to-noise ratio is almost null, because random gene sets give highly significant motifs due to heterogeneities in promoter compositions and biases due to repetitive elements. For metazoans, it is thus a real challenge to distinguish bona fide motifs from noise (Sand O. and Helden J. 2008). These observations suggest that motif discovery on sequence clusters faces intrinsic properties of the genomes under study, regardless of the software used for the task.

Among plants, these strategies have so far been tested on the model Arabidopsis thaliana, and they have been successfully applied to the identification of novel cis-regulatory elements validated with synthetic promoters (Koschmann et al. 2012). Yet, with the exception of this model, these sorts of experiments have not been possible in other plants until recently. In spite of this, the growing list of available plant genomes encourages these analyses in combination with expression profiles obtained from either microarray or RNA-seq data sets, as in the work of Yu et al. (2015), provided that these factors are considered:

  • Plant genomes are rich in repetitive elements (RE) distributed along the genome (Schmidt and Heslop-Harrison 1998), which pose particular problems for motif discovery statistics (violation of the independence assumption).

  • Current genome assemblies range from 120Mb (A. thaliana) to 13,420Mb (Triticum aestivum). Brachypodium distachyon, a model species for grasses, is 272Mb. The quality of these assemblies and their RE content is also quite variable, as shown in Table 1.

  • Upstream regions, defined by annotated gene coordinates, are also of variable length, going from 1,151b on average in A. thaliana to 2,820b in Triticum_turgidum (see Table 1). Work by Ksouri et al. (2021) has shown that the choice of upstream region length impacts on the discovery of regulatory sequences.

Table 1. Features of some plant genomes in RSAT::Plants, taken from https://plants.rsat.eu/data/stats. Each ID concatenates the organism, the assembly version and the source. Most genome IDs add to the end the Ensembl Plants release number. For instance, Arabidopsis_thaliana.TAIR10.55, corresponds to A.thaliana assembly 10 from TAIR https://www.arabidopsis.org, annotated in release 55 of Ensembl Plants. “%Ns” are stretches of uncharacterized nucleotides which often connect assembled sequence contigs. “%repeat-masked” segments are sequences with significant similarity to plant repetitive DNA sequences, which are masked.

genome.stats <- read.delim(file="test_data/summary.tab", row.names = NULL, comment.char=";")
genome.stats[,2] = as.numeric(genome.stats[,2])/1e6 ## Convert genome sizes to Mb
genome.stats[,5] = as.numeric(genome.stats[,5]-genome.stats[,4]) ## raw %mask also includes %N
names(genome.stats) <- c("Organism / assembly ID", "Genome size (Mb)", "Contigs", "%N", "% repeat-masked", "Gene models", "Mean upstream length")
# names(genome.stats)

## Print the table
kable(genome.stats,format.args = list(big.mark=","),
      digits=c(0,0,0,1,1,0,0))
Organism / assembly ID Genome size (Mb) Contigs %N % repeat-masked Gene models Mean upstream length
Aegilops_tauschii.Aetv4.0.55 4,225 109,583 2.3 82.2 43,362 2,533
Amborella_trichopoda.AMTR1.0.55 706 5,745 5.4 54.9 28,353 2,682
Ananas_comosus.F153.55 316 25 1.5 16.2 26,730 2,367
Arabidopsis_halleri.Ahal2.2.55 196 2,239 14.8 24.0 33,223 1,758
Arabidopsis_lyrata.v.1.0.55 207 695 11.1 32.9 33,500 1,799
Arabidopsis_thaliana.Columbia.GCF_000001735.3_TAIR10.NCBIRefseq 120 7 0.2 21.8 38,161 638
Arabidopsis_thaliana.TAIR10.55 120 7 0.2 23.4 34,262 1,154
Beta_vulgaris.RefBeet-1.2.2.55 566 40,245 8.6 18.0 28,031 2,260
Brachypodium_distachyon.Brachypodium_distachyon_v3.0.55 271 10 0.2 4.2 34,310 1,971
Brachypodium_hybridum.463.v1.1.JGI 509 15 1.2 17.8 70,160 2,052
Brachypodium_stacei.316.v1.1.JGI 234 112 1.1 13.9 29,898 2,139
Brachypodium_sylvaticum.490.v1.1.JGI 358 629 0.0 41.6 36,927 2,070
Brassica_juncea.ASM1870372v1.55 934 1,544 4.8 50.1 92,887 2,045
Brassica_napus.AST_PRJEB5043_v1.55 848 20,899 13.0 20.1 105,276 1,862
Brassica_oleracea.BOL.55 489 32,928 8.8 21.7 60,694 2,195
Brassica_rapa_ro18.SCU_BraROA_2.3.55 346 295 0.4 52.4 43,129 2,263
Cannabis_sativa.cs10.GCF_900626175.2.NCBI 876 221 15.9 46.9 29,813 2,528
Capsicum_annuum.ASM51225v2.55 3,064 35,797 3.2 8.0 35,845 2,683
Chenopodium_quinoa.PI614886.392.v1.JGI 1,385 3,486 4.5 63.2 44,776 2,591
Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.55 111 53 3.6 7.1 17,743 536
Chondrus_crispus.ASM35022v2.55 105 926 0.9 68.4 9,857 674
Cicer_arietinum.ASM33114v1.GCF_000331145.1.RefSeq 531 7,127 9.1 42.4 28,772 2,484
Citrus_clementina.Citrus_clementina_v1.0.55 301 1,398 2.1 21.0 27,179 2,338
Coffea_canephora.AUK_PRJEB4211_v1.55 569 13,007 17.1 6.1 25,574 2,360
Corchorus_capsularis.CCACVL1_1.0.55 317 16,522 0.0 46.5 31,549 2,070
Corylus_avellana.CavTom2PMs-1.0.55 370 11 0.5 48.2 27,270 2,445
Cucumis_sativus.ASM407v2.55 194 186 1.8 13.8 24,079 2,218
Cyanidioschyzon_merolae.ASM9120v1.55 17 22 0.0 6.7 5,021 816
Cynara_cardunculus.CcrdV1.55 725 13,588 9.7 8.7 26,505 2,595
Daucus_carota.ASM162521v1.55 422 4,826 8.2 4.6 33,488 2,552
Digitaria_exilis.Fonio_CM05836.55 716 19 2.8 40.1 59,844 2,329
Dioscorea_rotundata.TDr96_F1_Pseudo_Chromosome_v1.0.55 457 21 11.1 17.3 19,565 2,441
Echinochloa_crusgalli.ec_v3.55 1,341 2,058 0.1 53.1 103,850 2,404
Eragrostis_tef.ASM97063v1.55 607 13,883 18.3 3.5 41,920 2,321
Ficus_carica.UNIPI_FiCari_1.0.55 333 511 0.0 47.2 37,832 2,275
Fragaria_vesca.Hawaii4.Fvb2.0.GDR 212 8 7.0 20.6 33,538 1,822
Galdieria_sulphuraria.ASM34128v1.55 14 433 2.1 2.7 6,736 215
Glycine_max.Glycine_max_v2.1.55 978 1,190 2.4 48.9 57,147 2,521
Gossypium_raimondii.Graimondii2_0.55 761 1,033 1.8 57.6 38,719 2,492
Helianthus_annuus.HanXRQr2.0-SUNRISE.55 3,010 332 2.1 81.3 83,177 2,501
Hordeum_vulgare.IBSCv2.55 4,834 10 5.4 52.3 42,858 2,636
Hordeum_vulgare.MorexV3_pseudomolecules_assembly.55 4,226 290 0.0 85.3 35,826 2,779
Ipomoea_triloba.ASM357664v1.55 462 16 5.3 19.2 31,418 2,405
Klebsormidium_nitens.NIES2285.171026.v11.TITECH 104 1,814 1.0 0.0 17,153 865
Lactuca_sativa.Lsat_Salinas_v7.55 2,391 8,323 7.6 68.5 38,693 2,784
Leersia_perrieri.Lperr_V1.4.55 267 12 0.4 32.2 31,516 2,211
Lolium_perenne.MPB_Lper_Kyuss_1697.55 2,277 1,677 0.0 79.0 38,868 2,706
Lupinus_albus.CNRS_Lalb_10.GCA_009771035.1.NCBI 451 89 0.5 44.8 41,359 2,086
Lupinus_albus.La_Amiga31.GCA_010261695.1.NCBI 559 1,580 0.3 42.8 48,718 2,203
Lupinus_angustifolius.LupAngTanjil_v1.0.55 609 13,573 0.7 24.0 35,170 2,629
Manihot_esculenta.Manihotesculentav6.55 582 2,019 14.9 12.2 33,516 2,363
Marchantia_polymorpha.Marchanta_polymorpha_v1.55 226 2,957 6.7 4.2 19,287 2,174
Medicago_truncatula.MedtrA17_4.0.55 413 2,186 5.5 33.9 52,918 2,298
Musa_acuminata.ASM31385v1.55 473 12 17.4 15.8 38,658 1,968
Nicotiana_attenuata.NIATTr2.55 1,492 35,134 11.7 24.4 21,344 2,765
Nicotiana_tabacum.TN90.GCF_000715135.1.NCBI 3,643 168,247 2.9 51.4 74,273 2,432
Olea_europaea.OLEA9.55 1,316 9,751 4.1 54.3 55,595 2,563
Olea_europaea.sylvestris.451.v1.JGI 1,142 41,256 9.7 56.3 50,684 2,695
Oryza_barthii.O.barthii_v1.55 308 12 0.8 43.9 37,849 2,217
Oryza_brachyantha.Oryza_brachyantha.v1.4b.55 261 7,485 6.7 28.4 34,598 2,246
Oryza_glaberrima.Oryza_glaberrima_V1.55 316 1,951 4.1 40.3 74,579 2,028
Oryza_glumipatula.Oryza_glumaepatula_v1.5.55 373 12 19.2 35.1 39,702 2,165
Oryza_indica.ASM465v1.55 427 10,490 3.8 44.2 89,723 2,005
Oryza_longistaminata.O_longistaminata_v1.0.55 326 60,198 9.8 31.0 32,868 2,159
Oryza_meridionalis.Oryza_meridionalis_v1.3.55 336 12 23.8 29.9 32,549 2,204
Oryza_nivara.Oryza_nivara_v1.0.55 338 12 8.3 40.5 38,741 2,179
Oryza_punctata.Oryza_punctata_v1.2.55 394 12 0.4 53.1 37,791 2,218
Oryza_rufipogon.OR_W1943.55 338 12 1.9 46.1 40,584 2,171
Oryza_sativa.IRGSP-1.0.55 375 63 0.0 51.0 38,978 2,204
Ostreococcus_lucimarinus.ASM9206v1.55 13 21 0.0 15.0 7,664 521
Panicum_hallii_fil2.PHallii_v3.1.55 536 291 1.4 5.1 33,805 2,027
Phaseolus_vulgaris.PhaVulg1_0.55 521 708 9.3 20.1 28,472 2,596
Physcomitrium_patens.PhypaV3.55 472 357 1.3 58.6 32,674 2,044
Populus_trichocarpa.Pop_tri_v3.55 434 1,446 2.6 5.5 41,335 2,451
Prunus_avium.PAV_r1.0.55 272 10,148 9.4 16.0 43,349 1,868
Prunus_avium.Satonishiki.PAV_r1.0.GDR 374 9 34.0 18.9 43,673 2,094
Prunus_mume.BJFU1210120008.GCF_000346735.1.NCBIRefseq 234 8,164 7.2 22.5 26,524 2,047
Prunus_persica.Prunus_persica_NCBIv2.38 227 191 1.2 21.3 27,107 1,984
Prunus_persica.Prunus_persica_NCBIv2.55 227 191 1.2 19.2 27,107 1,984
Prunus_persica.Prupe1_0.37 227 202 1.2 16.1 28,898 2,217
Saccharum_spontaneum.Sspon.HiC_chr_asm.55 2,900 32 0.3 36.6 83,815 2,706
Selaginella_moellendorffii.v1.0.55 213 759 1.9 37.1 36,200 1,436
Setaria_italica.Setaria_italica_v2.0.55 406 336 1.2 2.8 35,831 2,300
Solanum_lycopersicum.SL3.0.55 828 3,148 9.8 51.8 35,092 2,300
Solanum_tuberosum.DM.v4.03.PGSC 725 12 12.5 0.0 39,028 2,401
Solanum_tuberosum.SolTub_3.0.55 811 13 15.8 49.3 40,430 2,527
Sorghum_bicolor.Sorghum_bicolor_NCBIv3.55 709 867 4.7 64.7 35,479 2,276
Trifolium_pratense.Trpr.55 305 8,595 11.7 19.0 41,184 1,849
Triticum_aestivum.IWGSC.55 14,550 22 1.9 85.1 120,744 2,605
Triticum_dicoccoides.WEWSeqv.1.0.55 10,080 14 1.5 86.1 67,375 2,656
Triticum_turgidum.svevo.55 10,460 15 1.5 4.9 66,559 2,820
Triticum_urartu.Tu2.0.55 4,852 10,284 0.6 89.7 41,492 2,613
Vigna_angularis.Vigan1.1.55 467 37,373 3.6 13.5 34,819 2,719
Vigna_radiata.Vradiata_ver6.55 463 2,497 7.2 20.2 23,164 2,643
Vigna_unguiculata.ASM411807v1.55 519 682 0.5 51.0 31,814 2,496
Vitis_vinifera.PN40024.v4.55 476 22 0.4 57.2 35,134 2,266
Zea_mays.Zm-B73-REFERENCE-NAM-5.0.55 2,182 685 0.2 84.5 44,303 2,502

The step-by-step protocol below has been tested for the task of discovering and annotating DNA motifs in clusters of upstream sequences for species supported by RSAT::Plants, which have been obtained mostly from Ensembl Plants (https://plants.ensembl.org), but also include data from the JGI Genome Portal (http://genome.jgi.doe.gov) and other sources. In addition, RSAT::Plants integrates footprintDB (https://footprintdb.eead.csic.es), a collection of position-specific scoring matrices (PSSM) representing transcription factor binding motifs (TFBM), as well as their cognate binding proteins (Sebastian and Contreras Moreira 2014), which can be used to annotate discovered motifs and to predict potentially binding transcription factors.

Discovering regulatory elements within natural genomic sequences is certainly an important scientific goal on its own, but can also be part of the design and validation of synthetic promoters. We envisage at least two applications in this context:

  • The characterization of promoters of genes with known expression properties, which can then be used to engineer the expression of genes of interest.

  • The validation of engineered promoters in order to make sure that they contain the expected regulatory elements which might be natural or engineered depending on the application.

4.1.2 Materials

This protocol requires disposing of:

  • A computer with a Web browser

  • A set of gene clusters from any of the species currently supported at RSAT::Plants ( http://plants.rsat.eu, see Note 1). Here we will use three example clusters of co-expressed maize genes, shown on Table 2 (see Note 2). First, they discovered potential regulatory motifs within their upstream sequences, and then they performed electrophoretic mobility shift assays (EMSA) to confirm them. Table 2 shows three of those clusters which are used in this protocol. For each cluster a list of gene identifiers is given next to the EMSA-confirmed motifs. The remaining nine clusters were left out for being too small, as the statistical approaches in this protocol require at least 10-15 genes.

4.1.3 Methods

The following protocol enumerates the steps required to discover DNA motifs, based on the over-representation of \(k\)-mers (oligonucleotides) and dyads (spaced pairs of oligonucleotides), in clusters of upstream sequences. The protocol comprises two stages, analyzing first co-expressed genes and then random clusters as a negative control (see Note 3). Only after both stages have been completed it is possible to objectively estimate the relevance of the results.

4.1.3.1 Collecting the full set of promoters for the genome of interest

Before the proper analysis of the gene cluster, we will retrieve the promoter sequences of all the genes of the organism of interest, which will serve below to estimate the background model. Table 2 shows three clusters of maize (Zea mays) genes used along the protocol, extracted from the published work of Yu et al. (2015). The original gene identifiers have been translated to current gene models in the Zm-B73-REFERENCE-NAM-5.0.55 annotation using maizeGDB (Woodhouse et al. 2021).

Experimentally verified regulatory motifs of these clusters are indicated:

Cluster name Confirmed motif #sequences Original gene IDs Current gene IDs
ABI4 GCGCRSGCGGSC 16 GRMZM2G025062
GRMZM2G053503
GRMZM2G069082
GRMZM2G069126
GRMZM2G069146
GRMZM2G076896
GRMZM2G081892
GRMZM2G124011
GRMZM2G129674
GRMZM2G142179
GRMZM2G169654
GRMZM2G172936
GRMZM2G173771
GRMZM2G174347
GRMZM2G175525
GRMZM2G421033
Zm00001eb330490
Zm00001eb340140
Zm00001eb318890
Zm00001eb318900
Zm00001eb318910
Zm00001eb202570
Zm00001eb335190
Zm00001eb335200
Zm00001eb103110
Zm00001eb006160
Zm00001eb119540
Zm00001eb156040
Zm00001eb074730
Zm00001eb193670
Zm00001eb352580
Zm00001eb422470
Zm00001eb432090
E2F TTCCCGCCA 18 AC197146.3_FG001
GRMZM2G017081
GRMZM2G021069
GRMZM2G037700
GRMZM2G057571
GRMZM2G062333
GRMZM2G065205
GRMZM2G066101
GRMZM2G075978
GRMZM2G100639
GRMZM2G112074
GRMZM2G117238
GRMZM2G130351
GRMZM2G139894
GRMZM2G154267
GRMZM2G162445
GRMZM2G327032
GRMZM2G450055
Zm00001eb405460
Zm00001eb227730
Zm00001eb350340
Zm00001eb228900
Zm00001eb284360
Zm00001eb028400
Zm00001eb162700
Zm00001eb257900
Zm00001eb290510
Zm00001eb137000
Zm00001eb248940
Zm00001eb421110
Zm00001eb169600
Zm00001eb169610
Zm00001eb037700
Zm00001eb351990
Zm00001eb342080
Zm00001eb192270
Zm00001eb192280
Zm00001eb243240
Zm00001eb243250
WRI1 CGGCGGCGS 56 AC210013.4_FG019
GRMZM2G008430
GRMZM2G009968
GRMZM2G010435
GRMZM2G010599
GRMZM2G014444
GRMZM2G015097
GRMZM2G017966
GRMZM2G022019
GRMZM2G027232
GRMZM2G028110
GRMZM2G035017
GRMZM2G041238
GRMZM2G045818
GRMZM2G047727
GRMZM2G048703
GRMZM2G064807
GRMZM2G068745
GRMZM2G074300
GRMZM2G076435
GRMZM2G078779
GRMZM2G078985
GRMZM2G080608
GRMZM2G092663
GRMZM2G096165
GRMZM2G098957
GRMZM2G107336
GRMZM2G108348
GRMZM2G111987
GRMZM2G115265
GRMZM2G119865
GRMZM2G122871
GRMZM2G126603
GRMZM2G126928
GRMZM2G132095
GRMZM2G140799
GRMZM2G148744
GRMZM2G150434
GRMZM2G151252
GRMZM2G152599
GRMZM2G170262
GRMZM2G181336
GRMZM2G311914
GRMZM2G312521
GRMZM2G322413
GRMZM2G325606
GRMZM2G343543
GRMZM2G353785
GRMZM2G409407
GRMZM2G439201
GRMZM5G823135
GRMZM5G827266
GRMZM5G831142
GRMZM5G835323
GRMZM5G870606
GRMZM5G882378
Zm00001eb212470
Zm00001eb347390
Zm00001eb351870
Zm00001eb066830
Zm00001eb420570
Zm00001eb238930
Zm00001eb031300
Zm00001eb267190
Zm00001eb222060
Zm00001eb222210
Zm00001eb283120
Zm00001eb063980
Zm00001eb335620
Zm00001eb342930
Zm00001eb095960
Zm00001eb011490
Zm00001eb174060
Zm00001eb405750
Zm00001eb172130
Zm00001eb166570
Zm00001eb316270
Zm00001eb285610
Zm00001eb061760
Zm00001eb212180
Zm00001eb134780
Zm00001eb037330
Zm00001eb138230
Zm00001eb235710
Zm00001eb271520
Zm00001eb277320
Zm00001eb160070
Zm00001eb359780
Zm00001eb235710
Zm00001eb399010
Zm00001eb036530
Zm00001eb354150
Zm00001eb361700
Zm00001eb317180
Zm00001eb173470
Zm00001eb192860
Zm00001eb205850
Zm00001eb192680
Zm00001eb330480
Zm00001eb372580
Zm00001eb276860
Zm00001eb320920
Zm00001eb282960
Zm00001eb329960
Zm00001eb227700
Zm00001eb227710
Zm00001eb352030
Zm00001eb158110
Zm00001eb323110
Zm00001eb225190
Zm00001eb225200
Zm00001eb121240
  1. Open a connection to the RSAT::Plants server. It can be reached at http://plants.rsat.eu. On the left-side menu, select ‘Sequence tools -> retrieve sequence’.

  2. Choose ‘Single organism -> Zea mays’ for the examples of this protocol (see Note 1), which corresponds to ‘Zea_mays.Zm-B73-REFERENCE-NAM-5.0.55’ in Ensembl Plants release 55 (Yates et al. 2022).

  3. Choose ‘Genes -> all’; this will retrieve all upstream sequences of the maize genome.

  4. Set appropriate upstream bounds. To replicate the work of Yu et al. (2015) these should be set to ‘From’ -1000 ‘To’ +200, with position 0 corresponding to transcriptional start sites (TSS); see Note 4. Beware that TSS positions in plant genomes often correspond to start codons, probably due to incomplete annotations.

  5. We recommend to tick the option ‘Mask repeats’, as plant genomes are frequently repeat-rich. This option should NOT be used if you suspect the transcription factors of interest bind to repeated sequences.

  6. Press ‘GO’ and wait until the retrieve-seq result page is displayed (see Note 5). The results include the executed command and a URL to the ‘sequences’ file, which must be saved. We will refer to this URL as ‘all.fasta.URL’. This FASTA-format file can also be stored as a local file on your computer, but note it can be rather large and might fail to upload later on if its size is larger than the server’s limit.

  7. Press ‘GO’ and wait until the retrieve-seq result page is displayed (see Note 4).

4.1.3.2 Analyzing upstream sequences of co-expressed genes

We will now retrieve the upstream sequences of a cluster of co-expressed genes, and use peak-motifs to discover exceptional motifs in their promoters. The tool peak-motifs was initially conceived to discover motifs in ChIP-seq peaks, but it can also be used to analyze other sequence types, as illustrated here.

  1. Choose cluster E2F from Table 2, copy the current gene IDs (last column). The IDs should be saved in a new text file, that you will store on your computer. Insert newline characters between genes6.

  2. In the left menu of the RSAT server, click on ‘retrieve sequence’ to get a fresh form. Make sure that the option ‘Genes -> selection’ is activated and that the right organism, in this case ‘Zea_mays.Zm-B73-REFERENCE-NAM-5.0.55’, is selected. Tick ‘Mask repeats’, and set the same size limits as for the whole collection of promoters: from -1000 to +200. Paste the list of IDs of your gene cluster (one gene ID per row).

  3. Press ‘GO’ and wait a few seconds until the result page is displayed. Inspection of these sequences might reveal N-masked sequence stretches, which correspond to annotated repeats. Save both ‘query genes’ and ‘sequences’ files to local files on your computer, we will refer to them as ‘cluster.genes’ and ‘cluster.fasta’ later on this protocol.

  4. Press the ‘peak-motifs’ button. The peak sequences section is automatically filled with a link to the selected cluster sequences.

  5. Add a title for this job, such as ‘E2F cluster’.

  6. On the right side of ‘Peak sequences’, under Control sequences, paste the ‘all.fasta.URL’ on the ‘URL of a sequence file available on a Web server’ entry.

  7. Click on ‘Reduce peak sequences’ and leave both fields (‘number of top sequences to retain’ and ‘cut peak sequences’) blank to avoid having the sequences clipped.

  8. Click on ‘Motif discovery parameters’. Select two algorithms: ‘Discover over-represented words’ (oligo-analysis) and ‘Discover over-represented spaced word pairs’ (dyad-analysis). Uncheck the program position-analysis (see Note 7).

  9. Click on ‘Compare discovered motifs with databases’ and select appropriate databases which will be used to annotate any found motifs. For plant promoters, we recommend to check ‘footprintDB-plants’, but you can also check other databases such as ‘JASPAR plants’ (see Note 8).

You can also upload your own collection of DNA motifs in TRANSFAC format.

  1. Click on ‘Reporting Options’. Set ‘Origin’ to ‘end’ and ‘Offset’ to -200 (see Note 9).

  2. Select output delivery and press ‘GO’.

  3. After few seconds the server should have uploaded the sequences and display a page with the URL of the future result page. You can already click on this link: the result page will be periodically updated to show the progress of the analysis. At the end of the processing, a box will appear at the top of the result page, with a short summary of the discovered motifs, and links to different sections of the results.

Once the job is complete click on the link [Download all results (peak-motifs_archive.zip)] to save the results on your computer. You will later be able to uncompress this archive in order to check the result after its removal from the server (results are only available on the server for 7 days after job completion). We also recommend downloading the full set of discovered motifs, by clicking on the link [Download all matrices (transfac format)] and saving a local file named ‘cluster.motifs.tf’. This file contains all motifs in the form of position-weight matrices (PWMs) in TRANSFAC format.

On the result page, the section entitled ‘Discovered motifs (with motif comparison)’ lists the discovered motifs, displays their sequence logos and their distribution along clustered sequences, in addition to top matches with the motif databases selected on step 9.

4.1.3.3 Negative control: random groups of genes

In this section, we propose a procedure to obtain an empirical estimation of the rate of false positives, by discovering motifs in the promoters of genes picked up at random.

  1. On the left-side menu of RSAT::Plants select ‘Build control sets -> random gene selection’.

  2. Choose ‘Organism -> Zea_mays.Zm-B73-REFERENCE-NAM-5.0.55’ for the examples of this protocol.

  3. Set ‘Number of genes’ to the size of one of the sample clusters on Table 2. For instance, the size of the negative control sets would be 18 for cluster E2F, 16 for cluster ABI4, and 56 for cluster WRI1. For convenience, in this tutorial only one random group is generated (the default), but this utility can generate several random groups in one go (see Note 10).

  4. Press ‘GO’ and click the ‘Next step’ button ‘retrieve sequences’ at the bottom of the result page. In the retrieve-seq form, set the other parameters as above: from -1000 to +200, check the ‘Mask repeats’ option and press ‘GO’.

  5. Save ‘query genes’ and ‘sequences’ files to local ‘random.genes’ and ‘random.fasta’ files and repeat steps 4-11 of section 4.1.3.2.

4.1.3.4 Validating motifs by scanning promoter sequences

This part of the protocol is devoted to validating sequence motifs discovered by their over-representation, which are scanned against the original sequences from which they were discovered, plus, optionally, orthologous sequences from a related species (see Note 11).

The first goal of this section is to check whether the discovered motifs show patterns of occurrence along promoter sequences, and to see how many cluster sequences actually harbor them. This can be done empirically by comparing the results of expression-based motifs with those of shuffled motifs, with columns permuted, which play the role of negative controls.

A second goal is to investigate whether these regulatory motifs are conserved on orthologous promoters of a related plant, Sorghum bicolor in this case study.

  1. On the left-side menu select ‘Comparative genomics -> get orthologs-compara’.

  2. Choose ‘Reference organism -> Sorghum bicolor’ for the maize example.

  3. Upload file ‘cluster.genes’ generated in step 3 of section 4.1.3.2. Press ‘GO’ and finally press ‘retrieve sequences’ on the next screen.

  4. Repeat steps 4-6 of section 4.1.3.1 but now select Sorghum bicolor as organism. Save ‘sequences’ to local file ‘cluster_orths.fasta’.

  5. On the left-side menu select ‘Build control sets -> permute-matrix’.

  6. Upload ‘cluster.motifs.tf’ (obtained in step 12 of section 4.1.3.2) and press ‘GO’. Save the results file as ‘cluster.motifs.perm1.tf’ (see Note 12).

  7. Select ‘Pattern matching -> matrix scan (full options)’.

  8. In the sequence box paste the contents of ‘cluster.fasta’ and, optionally, ‘cluster_orths.fasta’, if you wish to assess motif conservation. Alternatively, steps 7-12 can be performed separately with maize and S.bicolor sequences.

  9. Upload file ‘cluster.motifs.tf’ and select ‘transfac’ format.

  10. In the ‘Background model’ section select Markov order 2 and choose ‘Organism-specific -> Zea_mays.Zm-B73-REFERENCE-NAM-5.0.55’. Press ‘GO’.

  11. Save the ‘Scan result’ file as ‘cluster.scan.ft’ and press the ‘feature map’ button to draw a map of the matched motif instances.

  12. Repeat steps 6-11 using the set of permuted PWMs ‘cluster.motifs.perm1.tf’ and save the results as ‘cluster.perm1.scan.tf’.

4.1.3.5 Interpretation of results

The last stage of the protocol is the interpretation of results, which is which requires having at hand results of both clusters of co-expressed genes and random clusters, which play the role of negative controls. Figure 1 summarizes the results of cluster ABI4 compared to 50 random clusters of the same size. There are three types of evidence to look at, which will be discussed with the examples on this figure:

Figure 1. Summary of motif discovery results for cluster ABI4. Dark bars correspond to clusters of co-expressed genes, grey bars to 50 random clusters of genes drawn from the maize genome. Maximum significance of oligo-analysis (A) and dyad-analysis (B) motifs. The sequence logo of motifs reported by each algorithm is shown on top, indicating the number of sites used to compute it and the \(\text{Ncor}\) score of the comparison to the expected motif (bottom, see Note 11). Panel C shows the scores of discovered motifs when scanned back to the original maize upstream sequences and sequences from orthologous genes in Sorghum bicolor. Here dark bars are the reported PWMs, while the grey bars correspond to permuted PWMs. Panel D show the \(\text{Ncor}\) scores of discovered motifs when compared to annotated PWMs in footprintDB.

  • The distribution of motif significance yielded by oligo-analysis (A) and dyad-analysis (B). Motifs discovered in random clusters (grey bars) typically have significances below 4. The motifs found in cluster ABI4 (black bars) are not more significant than those of random gene sets of the same sizes. The reason for having significant motifs in the random gene sets may result from the occasional presence of low complexity motifs, which should not be considered as reliable predictions.

  • Panel A, E and I also show the comparisons between some motifs returned by peak-motifs and those reported by the authors of the reference experimental study (Yu et al. 2015); they used MEME as motif discovery tools).

  • The distribution of scanning scores (C) show to which extent motif matches in upstream sequences of both maize and their S. bicolor orthologues (dark boxes) depart from matches of permuted matrices (lighter boxes), used here as negative controls. On these boxplots, the horizontal bars indicate the median score of all the predicted sites in a given set of promoter sequences, and the shaded rectangles show the interquartile range, i.e. the extent between the 25% and 75% percentiles. For the ABI4 cluster there is a noticeable overlap between the interquartile boxes of the cluster and the random gene selections. Besides, the random selections show several “outliers” (circles) indicating sites predicted with high matching scores. Even though the mean scores are clearly higher for the actual cluster, the results may thus not be considered very significant.

  • The distribution of scores in footprintDB (D) describe how similar the discovered motifs are when compared to motifs (PWMs) annotated in footprintDB. Similarities are measured by normalized correlation score (\(\text{Ncor}\), see Note 13). In each example 50 random sets of promoters were analyzed with peak-motifs, and the discovered motifs compared to footprintDB. The black bar indicates the best matching score for the original, expression-based gene clusters, and the corresponding logo is overlaid on the histogram. For ABI4, the best-scoring matches correspond to phytochrome interacting factors. These proteins belong to the bHLH family of transcription factors and there are many annotated motifs for them in databases such as footprintDB.

In summary, motifs discovered in promoters of co-expressed genes should always be evaluated based on a combination of complementary criteria:

  1. The primary key of interpretation is the significance reported by the motif discovery algorithms. This significance has to be interpreted by comparison with the results obtained in random promoter sets of the same size as the gene cluster of interest (negative controls).
  2. Sequence scanning permits to predict putative binding sites, but the matching scores should be evaluated relative to randomized motifs (column-permuted).
  3. Comparison between discovered motifs and databases of known TF-binding motifs suggests candidate transcription factors which could intervene in the co-regulation of the co-expressed cluster.

4.1.3.6 Your report

Your work is to run the protocol with cluster E2F and write a brief report with:

  • a table of the top hexamers and dyads enriched on the E2F cluster of maize usptream sequences and at least a random cluster of the same size. The table shall include the following columns: exp_freq=expected relative frequency, occ=observed occurrences, exp_occ=expected occurrences, occ_P=occurrence probability (binomial), occ_E=E-value for occurrences, occ_sig=occurrence significance.

  • a table with similar DNA motifs in footprintDB

  • an overall interpretation of results

References and notes

Amselem, J., G. Cornut, N. Choisne, M. Alaux, F. Alfama-Depauw, V. Jamilloux, F. Maumus, et al. 2019. RepetDB: a unified resource for transposable element references.” Mob DNA 10: 6. https://doi.org/10.1186/s13100-019-0150-y.
Beer, Michael A, and Saeed Tavazoie. 2004. Predicting gene expression from sequence.” Cell 117 (2): 185–98. https://doi.org/10.1016/S0092-8674(04)00304-6.
Blanc-Mathieu, R., R. Dumas, L. Turchi, J. Lucas, and F. Parcy. 2023. Plant-TFClass: a structural classification for plant transcription factors.” Trends Plant Sci, July. https://doi.org/10.1016/j.tplants.2023.06.023.
Castro-Mondragon, J. A., C. Rioualen, B. Contreras-Moreira, and J. van Helden. 2016. RSAT::Plants: Motif Discovery in ChIP-Seq Peaks of Plant Genomes.” Methods Mol Biol 1482: 297–322. https://digital.csic.es/handle/10261/136254.
Contreras-Moreira, B. 2010. “3D-Footprint: A Database for the Structural Analysis of Protein-DNA Complexes.” Nucleic Acids Research 38: D91–97. http://nar.oxfordjournals.org/content/38/suppl\_1/D91.
Contreras-Moreira, B., J. A. Castro-Mondragon, C. Rioualen, C. P. Cantalapiedra, and J. van Helden. 2016. RSAT::Plants: Motif Discovery Within Clusters of Upstream Sequences in Plant Genomes.” Methods Mol Biol 1482: 279–95. https://digital.csic.es/handle/10261/136253.
Contreras-Moreira, B., C. V. Filippi, G. Naamati, C. n, J. E. Allen, and P. Flicek. 2021. K-mer counting and curated libraries drive efficient annotation of repeats in plant genomes.” Plant Genome 14 (3): e20143. https://doi.org/10.1002/tpg2.20143.
Contreras-Moreira, B., J. Sancho, and V. E. Angarica. 2010. Comparison of DNA binding across protein superfamilies.” Proteins 78 (1): 52–62. http://www.ncbi.nlm.nih.gov/pubmed/19731374.
Contreras-Moreira, B., and A. Sebastian. 2016. FootprintDB: Analysis of Plant Cis-Regulatory Elements, Transcription Factors, and Binding Interfaces.” Methods Mol Biol 1482: 259–77. https://digital.csic.es/handle/10261/136262.
Ensembl Compara. 2020. “Ensembl Help & Documentation.” http://plants.ensembl.org/info/genome/compara/homology_method.html.
Girgis, H. Z. 2015. Red: an intelligent, rapid, accurate tool for detecting repeats de-novo on the genomic scale.” BMC Bioinformatics 16 (July): 227. https://pubmed.ncbi.nlm.nih.gov/26206263.
Helden, J. van, B. Andre, and J. Collado-Vides. 1998. Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies.” J. Mol. Biol. 281 (5): 827–42. https://www.ncbi.nlm.nih.gov/pubmed/9719638.
Koschmann, J., F. Machens, M. Becker, J. Niemeyer, J. Schulze, L. Bulow, D. J. Stahl, and R. Hehl. 2012. Integration of bioinformatics and synthetic promoters leads to the discovery of novel elicitor-responsive cis-regulatory sequences in Arabidopsis.” Plant Physiol. 160 (1): 178–91. https://doi.org/10.1104/pp.112.198259.
Ksouri, N., J. A. n, F. Montardit-Tarda, J. van Helden, B. Contreras-Moreira, and Y. Gogorcena. 2021. Tuning promoter boundaries improves regulatory motif discovery in nonmodel plants: the peach example.” Plant Physiol 185 (3): 1242–58. https://doi.org/10.1093/plphys/kiaa091.
Lewis, M., G. Chang, N. C. Horton, M. A. Kercher, H. C. Pace, M. A. Schumacher, R. G. Brennan, and P. Lu. 1996. Crystal structure of the lactose operon repressor and its complexes with DNA and inducer.” Science 271 (5253): 1247–54. http://www.ncbi.nlm.nih.gov/pubmed/8638105.
Lu, X. J., and W. K. Olson. 2008. 3DNA: a versatile, integrated software system for the analysis, rebuilding and visualization of three-dimensional nucleic-acid structures.” Nat Protoc 3: 1213–27. http://www.nature.com/nprot/journal/v3/n7/full/nprot.2008.104.html.
Luscombe, N. M., S. E. Austin, H. M. Berman, and J. M. Thornton. 2000. An overview of the structures of protein-DNA complexes.” Genome Biol. 1 (1): REVIEWS001. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC138832/.
Luscombe, N. M., R. A. Laskowski, and J. M. Thornton. 2001. Amino acid-base interactions: a three-dimensional analysis of protein-DNA interactions at an atomic level.” Nucleic Acids Res. 29 (13): 2860–74. http://nar.oxfordjournals.org/content/29/13/2860.long.
Mäkinen, Veli, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. 2015. Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press. http://www.genome-scale.info.
Noyes, M. B., R. G. Christensen, A. Wakabayashi, G. D. Stormo, M. H. Brodsky, and S. A. Wolfe. 2008. Analysis of homeodomain specificities allows the family-wide prediction of preferred recognition sites.” Cell 133 (7): 1277–89. https://www.ncbi.nlm.nih.gov/pubmed/18585360.
Olson, W. K., A. A. Gorin, X. J. Lu, L. M. Hock, and V. B. Zhurkin. 1998. DNA sequence-dependent deformability deduced from protein-DNA crystal complexes.” Proc. Natl. Acad. Sci. U.S.A. 95 (19): 11163–68. http://www.pnas.org/content/95/19/11163.long.
Sand O., Turatsinze J. V., and van Helden J. 2008. “Evaluating the Prediction of Cis-Acting Regulatory Elements in Genome Sequences.” In Modern Genome Annotation, edited by D. Frishman and A. Valencia, 55–89. Springer. https://doi.org/10.1007/978-3-211-75123-7_4.
Sathyapriya, R., M. S. Vijayabaskar, and S. Vishveshwara. 2008. Insights into protein-DNA interactions through structure network analysis.” PLoS Comput. Biol. 4 (9): e1000170. http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000170.
Schmidt, T., and J. S. Heslop-Harrison. 1998. “Genomes, Genes and Junk: The Large-Scale Organization of Plant Chromosomes.” Trends in Plant Science 3 (5): 195–99. https://doi.org/http://dx.doi.org/10.1016/S1360-1385(98)01223-0.
Sebastian, A., and B. Contreras Moreira. 2014. footprintDB: a database of transcription factors with annotated cis elements and binding interfaces.” Bioinformatics 30 (2): 258–65. https://doi.org/10.1093/bioinformatics/btt663.
Sebastian, A., and B. Contreras-Moreira. 2013. The twilight zone of cis element alignments.” Nucleic Acids Res. 41 (3): 1438–49. http://nar.oxfordjournals.org/content/41/3/1438.long.
Siggers, T. W., A. Silkov, and B. Honig. 2005. Structural alignment of protein–DNA interfaces: insights into the determinants of binding specificity.” J. Mol. Biol. 345 (5): 1027–45. http://www.ncbi.nlm.nih.gov/pubmed/15644202.
Wicker, T., F. Sabot, A. Hua-Van, J. L. Bennetzen, P. Capy, B. Chalhoub, A. Flavell, et al. 2007. A unified classification system for eukaryotic transposable elements.” Nat Rev Genet 8 (12): 973–82. https://pubmed.ncbi.nlm.nih.gov/17984973.
Woodhouse, M. R., E. K. Cannon, J. L. Portwood, L. C. Harper, J. M. Gardiner, M. L. Schaeffer, and C. M. Andorf. 2021. A pan-genomic approach to genome databases using maize as a model system.” BMC Plant Biol 21 (1): 385. https://doi.org/10.1186/s12870-021-03173-5.
Yates, A. D., J. Allen, R. M. Amode, A. G. Azov, M. Barba, A. Becerra, J. Bhai, et al. 2022. Ensembl Genomes 2022: an expanding genome resource for non-vertebrates.” Nucleic Acids Res 50 (D1): D996–1003. https://doi.org/10.1093/nar/gkab1007.
Yu, C. P., S. C. Chen, Y. M. Chang, W. Y. Liu, H. H. Lin, J. J. Lin, H. J. Chen, et al. 2015. Transcriptome dynamics of developing maize leaves and genomewide prediction of cis elements and their cognate transcription factors.” Proc. Natl. Acad. Sci. U.S.A. 112 (19): E2477–2486. https://doi.org/10.1073/pnas.1500605112.
Zhou, T., L. Yang, Y. Lu, I. Dror, A. C. Dantas Machado, T. Ghane, R. Di Felice, and R. Rohs. 2013. DNAshape: a method for the high-throughput prediction of DNA structural features on a genomic scale.” Nucleic Acids Res. 41 (Web Server issue): 56–62. http://www.ncbi.nlm.nih.gov/pubmed/23703209.

  1. As gene models can change from one assembly to another it is important to use the right assembly version, which is indicated for each genome in Table 1↩︎

  2. Twelve clusters of maize genes, found to be co-expressed in 22 transcriptomes and enriched on Gene Ontoloy terms were analyzed in detail by Yu et al. (2015)↩︎

  3. A crucial parameter to evaluate the results of motif discovery is to estimate the rate of false positives (FP). RSAT programs compute a significance score, which is the minus log of the expected number of false positives (\(\text{e-value} = 10^{-\text{sig}}\)). For example, a motif associated with a significance of 1 should be considered as poorly significant, since on average we would expect \(10^{-1} = 0.1\) false positives, i.e. one FP every 10 random trials. In contrast, a significance of e.g. 16 is very promising, since on average such a result would be expected every \(10^{-16}\) random trials. However, the theoretical significance relies on the correctness of the background model (computed here as k-mer and dyad frequencies in the whole set of promoters). In some cases, sets of plant promoters can discard from the theoretical model, due to heterogeneity of the input (e.g. inclusion of repetitive sequences). The negative control consists in measuring the significance obtained by submitting a random selection of promoters from the organism of interest (maize in the example). Although each of these genes is likely to be regulated by one or more transcription factors (and its promoter should contain corresponding binding sites), in principle the random set as a whole should not be co-regulated, so that the elements would differ from gene to gene, and there should thus be no over-represented motif in their promoters.↩︎

  4. According to Ksouri et al. (2021) the range -500,+200 performs better↩︎

  5. Should the connection to the server interrupt it might be safer to go back and choose ‘email’ as delivery option. The mail message provides a link to the data, which is actually stored at the server.↩︎

  6. It is crucial to have one gene ID per row for submitting queries to retrieve-seq, because only the first word of each row is considered as a query. Note that any transcript isoform numbers .1, .2 should be removed↩︎

  7. This program is generally relevant when analyzing sets containing a large number of sequences such as ChIP-seq peaks or genome-wide promoter sets.↩︎

  8. Plant transcription databases are unfortunately still very fragmentary, so one might be tempted to check more complete collections such as footprintDB or JASPAR. However, the results should be interpreted with caution, because there is little conservation of cis-regulation between plants and other kingdoms of the tree of life. ↩︎

  9. The option ‘Origin’ indicates the reference position relative to each sequence (start, center or end). When this option is set to ‘end’, the coordinates are computed relative to the end of the sequence, with negative values indicating upstream location. The option ‘Offset’ enables to shift the reference point by a given number. For the current example, setting the offset to -200 will give coordinates from -1000 to +200, the 0 corresponding to the TSS.-↩︎

  10. Clearly, more than one random cluster should be evaluated, as suggested in Figure 1, where the results of up to 50 random groups are displayed next to the clusters of Yu et al. (2015).↩︎

  11. Orthologues reported are annotated in Ensembl Compara, generated by a pipeline where maximum likelihood phylogenetic gene trees play a central role. These gene trees, reconciled with their species tree, have their internal nodes annotated to distinguish duplication or speciation events, and thus support the annotation of orthologous and paralogous genes, which can be part of complex one-to-many and many-to-many relations Ensembl Compara (2020)↩︎

  12. This will permute the columns of the input PWMs producing matrices with different consensus. Column-permuted matrices are used as negative controls because they conserve the information content and nucleotide frequencies of the original motifs, but at the same time alter the sequence of nucleotides captured by the original motif, which is not recognized anymore.↩︎

  13. ‘Ncor’ is the relative width-normalized Pearson correlation of two PWMs aligned with matrix-scan. This normalized score prevents spurious matches that would cover only a subset of the aligned matrices (e.g. matches between the last column of the query matrix and the first column of the reference matrix, or matches of a very small motif against a large one).↩︎