1 Introduction

In this session we will continue annotating sequences, but now with a focus on non-coding sequences. If we revisit the next figure, I hope you’ll see that non-coding sequences are usually the largest fraction of genomes (Mäkinen et al. 2015):

For instance, the genome of the model grass Brachypodium distachyon is about 270Mbp long, but its genes take only about 122Mbp. The difference increases as the genomes grow larger, as you can see in the next table:

annot.stats <- read.csv(file="test_data/Ensembl_stats.tsv", sep="\t", comment.char=";")
kable(annot.stats,format.args = list(big.mark=","))

genome	genome.size	gene.space	percentage
Arabidopsis thaliana	119,667,750	77,640,474	64.9
Arabidopsis halleri	196,243,198	67,469,378	34.4
Prunus dulcis	227,498,357	94,209,302	41.4
Brachypodium distachyon	271,163,419	122,468,185	45.2
Brassica rapa	283,822,783	83,199,064	29.3
Trifolium pratense	304,842,038	133,448,980	43.8
Arabis alpina	308,032,609	45,780,039	14.9
Cucumis melo	357,857,370	99,736,823	27.9
Citrullus lanatus	365,450,462	81,219,268	22.2
Oryza sativa	375,049,285	130,066,257	34.7
Setaria viridis	395,731,502	123,250,596	31.1
Vitis vinifera	486,265,422	153,535,016	31.6
Rosa chinensis	515,588,973	117,296,819	22.8
Camelina sativa	641,356,059	214,928,593	33.5
Malus domestica	702,961,352	145,783,855	20.7
Olea europaea	1,140,987,834	153,461,985	13.4
Zea mays	2,135,083,061	168,014,276	7.9
Helianthus annuus	3,027,844,945	199,817,746	6.6
Aegilops tauschii	4,224,915,394	348,412,472	8.2
Triticum turgidum	10,463,058,104	475,613,274	4.5

Therefore, when dealing with genome variation, polymorphisms have a greater chance to occur in non-coding regions. Among these, probably the most interesting are regulatory sequences and repeated elements.

The goal of this session is to learn how regulatory sequences can be discovered in promoter sequences using statistical tests and aligned to build DNA motifs.

2 Genomic repeated sequences

The annotation of Transposable Elements (TEs) within plant genomes can help in the interpretation of observed phenotypes, as sometimes TEs affect the expression of neighbor genes, and in computational tasks such as promoter whole genome alignment, promoter or pan-genome analyses.

Usually TEs are annotated by alignment to curated libraries of repeated elements such as RepetDB (Amselem et al. 2019), where each sequence or element is classified according to the Wicker classification (Wicker et al. 2007). Class I elements are “copy and paste”, while Class II are “cut and paste”. The next figure summarizes this taxonomy of TEs, which resembles that of protein domains:

More recently, there are other approaches which do not require any curation; instead, these simply identify repeated elements by counting words (\(K\)-mers) along the genome. An example of such tools is the Repeat Detector (Girgis 2015), which has been used to routinely mask plant genomes (Contreras-Moreira et al. 2021).

The next table summarizes the fraction of repeated sequences in diverse plant genomes, as annotated with REdat, Red and the original papers describing those genomes:

annot.stats <- read.csv(file="test_data/Ensembl_repeats.tsv", sep="\t", comment.char=";")
annot.stats = annot.stats[,c(1,2,6,8,10)]
kable(annot.stats,format.args = list(big.mark=","))

genome	size	perc_REdat	perc_Red	perc_literature
Arabidopsis thaliana	119,667,750	14.2	36.7	19.0
Arabidopsis halleri	196,243,198	15.5	31.1	32.7
Prunus dulcis	227,498,357	6.5	33.4	37.6
Brachypodium distachyon	271,163,419	27.4	31.1	21.4
Brassica rapa	283,822,783	8.6	32.8	32.3
Trifolium pratense	304,842,038	11.0	30.0	41.8
Arabis alpina	308,032,609	15.1	37.7	47.9
Cucumis melo	357,857,370	8.2	39.9	44.0
Citrullus lanatus	365,450,462	6.8	40.8	45.2
Oryza sativa	375,049,285	32.3	37.0	35.0
Setaria viridis	395,731,502	18.5	40.8	46.0
Vitis vinifera	486,265,422	9.0	40.0	41.4
Rosa chinensis	515,588,973	8.4	48.1	67.9
Camelina sativa	641,356,059	15.8	36.0	28.0
Malus domestica	702,961,352	9.1	41.6	59.5
Olea europaea	1,140,989,389	18.0	45.2	43.0
Zea mays	2,135,083,061	59.5	79.0	85.0
Helianthus annuus	3,027,844,945	10.0	73.6	74.7
Aegilops tauschii	4,224,915,394	68.7	81.8	85.9
Triticum turgidum	10,463,058,104	71.7	82.2	82.2

When the repeats themselves are not of interest for subsequent analyses, they are said to be masked out. This means that they are marked so that they can be avoided. Hard-masking means replacing the sequences of repeats with polyN oligonucleotides. Soft-masking means leaving the repeated sequences in lower-case:

ACTAGACTACGNNNNNNNNNNNATATATCA  # hard-masked
ACTAGACTACGtttttttttttATATATCA  # soft-masked

3 DNA-binding proteins and regulatory sequences

The regulation of gene expression is one of the fundamental topics in Genetics. Here you will learn about transcription factors (TFs) and cis-regulatory elements (CREs). TFs are proteins that bind specifically to DNA sequences called CREs and affect the expression of nearby genes.

3.1 Protein-DNA recognition

DNA-binding proteins contain DNA-binding domains and have a specific or general affinity for either single or double stranded DNA. Here we will concentrate mostly on transcription factors, which generally recognize cis-regulatory elements in double-stranded DNA molecules.

3.1.1 Dissecting a protein-DNA interface

Transcription factors recognize target DNA sequences through a binding interface, composed of protein residues and DNA stretches in intimate contact. The best descriptions of protein-DNA interfaces are provided by structural biology, usually by X-ray or NMR experiments.

Details of the interface of Lac repressor and its operator site ( Lewis et al. (1996)) Cartoon dissection of the interface of a NarL dimer (PDB entry 1JE8)

3.1.2 Atomic interactions between protein and DNA residues

The process of recognition of DNA sequences by proteins involves readout mechanisms, and also stabilizing atomic interactions that do not confer specificity.

Direct readout

Hydrogen bonds: direct + water-mediated
Hydrophobic interactions

Indirect readout

Sequence-specific deformation of DNA base steps

Stabilizing interactions

Not sequence-specific, involving DNA backbone

3.1.3 Direct readout: hydrogen bonds

Donor and acceptor hydrogen bond atoms in nitrogen bases, from Luscombe, Laskowski, and Thornton (2001)

3.1.4 Direct readout: Van der Waals interactions

DNA bases display VdW interactions involving mainly C atoms, from Luscombe, Laskowski, and Thornton (2001)

Sequence-specific VdW interactions involve fundamentally thymine-C5M group, adapted from Luscombe, Laskowski, and Thornton (2001)

3.1.5 Indirect/shape readout

Besides atomic interactions between protein and DNA, sequence-dependent deformability of duplexes, deduced from crystal complexes, implies that sequence recognition also involves DNA shape.

DNA step geometry descriptors, from Lu and Olson (2008)

DNA deformation is described by the increase in energy brought about by instantaneous fluctuations of the step parameters from their equilibrium values:

\(deformation = \displaystyle\sum_{i=0}^6 \displaystyle\sum_{j=0}^6 spring_{ij} \Delta\theta_{i,st} \Delta\theta_{j,st}\) (Olson et al. (1998))

Roll and twist of DNA steps are different for purine and pyrimidine steps, from Olson et al. (1998)

The accumulation of experimental and molecular dynamics data of DNA molecules currently supports predictive algorithms, such as DNAshape, which predict the geometry of DNA sequences:

Geometry descriptors of a DNA duplex by sliding a sequence window, from Zhou et al. (2013). MGW=Major Groove Width.

3.1.6 Protein-DNA interface graphs

Interfaces can be explored as generic bipartite graphs (Sathyapriya, Vijayabaskar, and Vishveshwara (2008)) or with sub-graphs that focus only on specific sequence recognition:

Interface graph of NarL, where solid bases indicate indirect readout and arrows direct readout interactions. Adapted from 3d-footprint (Contreras-Moreira (2010))

3.2 Comparison of DNA-binding proteins

A great variety of DNA-binding proteins have been observed in nature, which can be analyzed and compared in terms of the features introduced above, such as readout, or instead with an evolutionary or topological perspective.

Example HTH transcription factors from the work of Luscombe et al. (2000)

The Structural Classification of Proteins (SCOP), or more recently the Encyclopedia of Domains (TED) (Lau et al. 2024), systematically group protein folds in superfamilies, including DNA-binding proteins. The next table shows superfamilies with more than 20 non-redundant complexes in the Protein Data Bank as of Jun2023, as annotated in the database 3d-footprint:

SCOP superfamily	Number of complexes
Winged helix (WH)	115
Homeodomain-like (H)	79
Glucocorticoid-receptor-like (GR)	40
Restriction endonuclease-like (RE)	21
Homing endonuclease (HE)	38
p53-like (P53)	24
Lambda-repressor-like (LR)	37

DNA-binding proteins can also be grouped in terms of the organisms in which they are found. For instance, the Plant-TFClass structural classification currently accepts 56 plant transcription factor types, which include up to 37 DNA-binding domains absent in mammals (Blanc-Mathieu et al. 2023).

Examples of plant transcription factor DNA binding domains in the Plant-TFClass structural classification, from Blanc-Mathieu et al. (2023)

Number of hydrogen bonds and VdW interactions across some SCOP superfamilies of DNA-binding proteins (right) and total atomic contacts per monomer (left). ZF=C2H2 Zinc Finger, RHH=Ribbon-helix-helix, from Contreras-Moreira, Sancho, and Angarica (2010)

Frequency of distorted DNA steps (indirect readout) of target DNA sites of some SCOP superfamilies of DNA-binding proteins, from Contreras-Moreira, Sancho, and Angarica (2010)

3.2.1 Analysis of protein-DNA interfaces in transcription factors families

The available experimental structures of protein-DNA complexes in the PDB support the annotation of interface residues, those involved directly in sequence recognition, within protein families. The database footprintDB does exactly this.

Annotation of interface residues by transference from PDB complexes in the footprintDB database

Several examples in the literature have demonstrated the correlation between interface patterns and the bound DNA motifs within large transcription factor families, such as the work of Noyes et al. (2008):

Relation between (annotated) interface residues and cognate DNA motifs or homeobox transcription factors, from Noyes et al. (2008).

The study of interfaces must be done in the appropriate biological context, for instance considering the oligomerization state of TFs in vivo, as each family of transcription factors has singularities, such as these (compiled by Álvaro Sebastián):

Family	Motifs	Multimeric	Multidomain
Homeodomain	TAATkr,TGAyA	Sometimes	Unusual
Basic helix-loop-helix (bHLH)	CACGTG,CAsshG	Always (homodimers, heterodimers)	Never
Basic leucine zipper (bZIP)	CACGTG,-ACGT-,TGAGTC	Always (homodimers, heterodimers)	Never
MYB	GkTwGkTr	Common (multimers)	Common
High mobility group (HMG)	mTT(T)GwT,TTATC,ATTCA	Sometimes	Unusual
GAGA	GAGA	Never	Never
Fork head	TrTTTr	Unusual	Never
Fungal Zn(2)-Cys(6) binuclear cluster	CGG	Common (homodimers)	Never
Ets	GGAw	Common (homodimers, heterodimers, multimers)	Never
Rel homology domain (RHD)	GGnnwTyCC	Always (homodimers, heterodimers)	Never
Interferon regulatory factor	AAnnGAAA	Always (homodimers, heterodimers, multimers)	Never

3.2.2 Aligning cis elements by superposition of atomic coordinates of interfaces

Here I briefly describe a structural alignment approach for the comparison of DNA-binding proteins and their interfaces, as discussed in Siggers, Silkov, and Honig (2005) and Sebastian and Contreras-Moreira (2013). In this context superpositions might guide the correct alignment of cis-elements bound by homologous proteins, as illustrated in the figure.

Sequence and structure alignment of cis-elements bound by two homologous PDB complexes: 3A01_A1 and 1FJL_B1. Note that this optimal sequence alignment does not match equivalent interface positions, adapted from Sebastian and Contreras-Moreira (2013).

3.2.3 Specificity-based classification of DNA-binding proteins

DNA motifs inferred from protein-complexes can be analyzed in terms of their information content:

Information content of structure-derived DNA motifs of proteins belonging to several SCOP superfamilies as of Dec2024, after Contreras-Moreira (2010)

3.2.4 Modelling DNA-binding proteins

The release of AlphaFold3 means that now it is possible to model a great variety of DNA-binding proteins (Abramson et al. 2024). This allows you to explore how these complexes look like, which is often useful to design further experiments, and also to infer the specificity of transcription factors, as explained in our blog.

4 Discovering cis elements in genomes by counting words (\(k\)-mers)

Now that we know how transcription factors recognize their target DNA sites, we will how learn how enrichment of DNA words in upstream gene promoter sequences can be computed in order to discover putative regulatory elements. We will follow the method of Helden, Andre, and Collado-Vides (1998).

Let’s have a look at the sample data:

# a set of upstream sequences of a group of co-regulated genes on Brachypodium distachyon
# see that they have been cut with coordinates -1000,+200 with respect to 
# transcription start site (TSS)
head test_data/regulon.raw.fna
echo

# same sequences but repeat-masked
head test_data/regulon.rm.fna
echo

# frequency of hexamers in those raw upstream sequences computed with RSAT  
# oligo-analysis -v 1 -seqtype dna -l 6 -return occ,freq -quick -str 2 -ovlp -sort
head -35 test_data/6nt_upstream_regulon.raw.tab | grep -v ";"
echo

# frequency of hexamers in the repeat-masked sequences 
head -35 test_data/6nt_upstream_regulon.rm.tab | grep -v ";"
echo

# frequency of hexamers in all (rm) upstream sequences of that genome
head -35 test_data/6nt_upstream_genome.tab | grep -v ";"

>BRADI1G02630   Bradi1g02630; upstream from -1000 to +200; size: 1201; feature type:gene; location: Brachypodium_distachyon.v1.0.37:1:1782739:1783939:R; upstream neighbour: BRADI1G02640 (distance: 5670)
TAAGTATTATCTTTTCAAGGGAAGGAACCTCCTGGCCTTCGCATCCTCGGATGCACACAA
CCATGTTTTTATTTGTAAGAAAAATACCACACACCAAGATCGAACACATGAGTTTACAAA
AAAAAACAACAACTAGAGAACATCTTTCTAAAACTGGTCTCTCGGCATCCATCTTCTCCA
TAACCTACAAAGGTGACAATCTTACCCATGGTTAACAACCCCCATGGATTTTTGATTCAA
ATGGTATGGCATAAGTTGCTATTTTCGATTTTCGCCCACCAGTGGTAATCCATATGTAAA
ACCTATTAGTCATGAGTAGAGCAGGATGCAAGTTAAATTAATGGAAAAAGATTAGATGTG
ATGTGTTTATATGAGTTTAATTTGACTAAACCCTGTCTCTAGCTACCAATCCTCAATTCC
TCATAGCGTGTCGTCGTCTCTGGCATCAGAGCAAATTCAAATCCTGAGACGATTGTGACT
AAGGCCTCATTCGTTCCATAGGAATTTCACGGGAAACTCAAAGGATTCAGGATTCCATAA

>BRADI1G02630   Bradi1g02630; upstream from -1000 to +200; size: 1201; feature type:gene; location: Brachypodium_distachyon.v1.0.37:1:1782739:1783939:R; upstream neighbour: BRADI1G02640 (distance: 5670)
TAAGTATTATCTTTTCAAGGGAAGGAACCTCCTGGCCTTCGCATCCTCGGATGCACACAA
CCATGTTTTTATTTGTAAGAAAAATACCACACACCAAGATCGAACACATGAGTTTACNNN
NNNNNNCAACAACTAGAGAACATCTTTCTAAAACTGGTCTCTCGGCATCCATCTTCTCCA
TAACCTACAAAGGTGACAATCTTACCCATGGTTAACAACCCCCATGGATTTTTGATTCAA
ATGGTATGGCATAAGTTGCTATTTTCGATTTTCGCCCACCAGTGGTAATCCATATGTAAA
ACCTATTAGTCATGAGTAGAGCAGGATGCAAGTTAAATTAATGGAAAAAGATTAGATGTG
ATGTGTTTATATGAGTTTAATTTGACTAAACCCTGTCTCTAGCTACCAATCCTCAATTCC
TCATAGCGTGTCGTCGTCTCTGGCATCAGAGCAAATTCAAATCCTGAGACGATTGTGACT
AAGGCCTCATTCGTTCCATAGGAATTTCACGGGAAACTCAAAGGATTCAGGATTCCATAA

#seq    id  obs_freq    occ
aaaaaa  aaaaaa|tttttt   0.0044876667803 1712
cgccgc  cgccgc|gcggcg   0.0024666439487 941
ccgccg  ccgccg|cggcgg   0.0023172298094 884
gccgcc  gccgcc|ggcggc   0.0023067446067 880
aaaaat  aaaaat|attttt   0.0020288867336 774

#seq    id  obs_freq    occ
cgccgc  cgccgc|gcggcg   0.0019319011803 693
gccgcc  gccgcc|ggcggc   0.0018761464565 673
ccgccg  ccgccg|cggcgg   0.0018677832479 670
aagaaa  aagaaa|tttctt   0.0016196747269 581
ctcctc  ctcctc|gaggag   0.0015806464203 567

#seq    id  obs_freq    occ
aaaaaa  aaaaaa|tttttt   0.0048170011903 307261
aaaaac  aaaaac|gttttt   0.0014555523659 92845
aaaaag  aaaaag|cttttt   0.0016166515755 103121
aaaaat  aaaaat|attttt   0.0023315959462 148725
aaaaca  aaaaca|tgtttt   0.0015835569442 101010

Now let’s inspect the R code which will be used in this demo. We’ll start by checking the function that computes the significance of hexamers (words of length \(K=6\)):

#' @author Najla Nksouri <najlaksouri@gmail.com> 
#' Revised by Jacques van Helden <Jacques.van-Helden@univ-amu.fr>
#' @description Compare the statistics between query and background k_mer counts
#' @param query.oligo.file file containing k-mer counts computed from the query sequences, as produced by RSAT oligo-analysis.
#' This file must contain 
#' @param background.oligo.file file containing background k-mer occurrences

kmer.stats <- function(query.oligo.file,
                       background.oligo.file) {
  
  ## Check existence of input files
  if (!file.exists(query.oligo.file)) {
    stop("kmer.stats() error: query.oligo.file does not exist: ", query.oligo.file)
  }
  if (!file.exists(background.oligo.file)) {
    stop("kmer.stats() error: background.oligo.file does not exist: ", ... = background.oligo.file)
  }
  
  ## Read tables and rename first column
  oligo.query.freq.table <- 
    read.table(query.oligo.file, sep = "\t", comment.char = ";", header = TRUE)
  names(oligo.query.freq.table)[1] <- "seq"
  
  oligo.genome.freq.table <- 
    read.table(background.oligo.file, sep = "\t", comment.char = ";", header = TRUE)
  names(oligo.genome.freq.table)[1] <- "seq"
  
  oligo.table <- merge(
    oligo.query.freq.table,
    oligo.genome.freq.table,
    by=c("seq", "id"), suffixes=c("_query", "_background")
  )
  
  ## Compute the p-value of the observed occurrences in query frequencies
  nb.oligos <- nrow(oligo.table)
  N <- sum(oligo.table$occ_query) 
  
  ## Compute th number of occurrences expected in random background fragments
  ## of the same size as the query sequences.
  oligo.table$exp.occ <- oligo.table$obs_freq_background * N
  
  ### Compute the nominal (unadjusted) p-value of each oligonucleotide
  
  ## Test the over-representation: righ-tailed test
  oligo.table$pval.over <- pbinom(q = oligo.table$occ_query -1, size = N, 
                                  prob = oligo.table$obs_freq_background, lower.tail=FALSE)
  
  ## Compute log p-value to avoid the floating point limitation to 1e-320
  oligo.table$log.pval.over <- pbinom(q = oligo.table$occ_query -1, size = N, 
                                      prob = oligo.table$obs_freq_background, lower.tail=FALSE, log=TRUE)
  
  ## Test under-representation: left-tailed test
  oligo.table$pval.under <- pbinom(q = oligo.table$occ_query, size = N, 
                                   prob = oligo.table$obs_freq_background, lower.tail=TRUE)
  
  oligo.table$log.pval.under <- pbinom(q = oligo.table$occ_query, size = N, 
                                       prob = oligo.table$obs_freq_background, lower.tail=TRUE, log=TRUE)
  
  ## Two-sided test
  oligo.table$pval <- pmin(oligo.table$pval.under, oligo.table$pval.over) * 2
  oligo.table$log.pval <- pmin(oligo.table$log.pval.under, oligo.table$log.pval.over) * 2
  
  ## Compute the E-value (Bonferroni multiple testing correction).`
  ##  E-value indicates the expected number of false positives 
  ## (i.e. how many oligos would be expected to be be declared positive under 
  ## the null hypothesis)
  oligo.table$eval <- oligo.table$pval * nb.oligos
  
  ## Compute the significance
  ## sig = -log10(eval)
  oligo.table$sig <- -log10(nb.oligos) - oligo.table$log.pval/log(10)
  
  ## Compute maximal frequency for both files
  max.axis <- max(c(oligo.table$obs_freq_query, oligo.table$obs_freq_background))
  
  ## Log2 ratio observed/expected
  oligo.table$freq.ratio <- oligo.table$obs_freq_query/oligo.table$obs_freq_background
  oligo.table$log2.ratio <- log2(oligo.table$freq.ratio)
  
  return(oligo.table)
}

Now we shall look at the plotting function:

#' @title Scatter plot of background versus query k-mer frequencies. 
#' @author Najla Nksouri <najlaksouri@gmail.com> 
#' Revised by Jacques van Helden <Jacques.van-Helden@univ-amu.fr>
#' @description Draw a scatter plot comparing background and query k-mer frequencies for a given size. 
#' @param kmer.stat.table a table with k-mer statistics produced by kmer.stats()
#' @param k oligonucleotide (k-mer) size
#' @param sig.threshold=2 Threshold on significance for the volcano plot. 
#' @param volcano.ymax=350
#' @return a data.frame with merged query and background k-mer frequencies + derived statistics
#' @examples
#' ## Plot query versus background 4-mer frequencies
#' plot.kmers(k=8)
#' 
plot.kmers <- function(kmer.stat.table,
                       k,
                       sig.threshold = 2,
                       log2FC.threshold = 1, ## Enrichment threshold
                       volcano.ymax = 350) {
  message(k, "-mer frequency plot")

  ## Define dot colors  
  dot.colors <- ifelse(
    (kmer.stat.table$sig > sig.threshold & kmer.stat.table$log2.ratio > 0.1), "darkred",
         ifelse((kmer.stat.table$sig > sig.threshold & kmer.stat.table$log2.ratio < -0.1),
                "darkblue", "darkgray"))

  ## Draw a scatter plot to compare k-mer frequencies
  max.freq <- max(kmer.stat.table[, c("obs_freq_query","obs_freq_background")])
  plot(x = kmer.stat.table$obs_freq_background,
       y = kmer.stat.table$obs_freq_query,
       
       ## Set axis limits
       xlim = c(0, max.freq),
       ylim = c(0, max.freq),
       
       ## Set labels
       xlab = paste("background ", k,"-mer frequencies", sep = ""),
       ylab = paste("query ", k,"-mer frequencies", sep = ""),
       main = paste("background vs query ", k,"-mer frequencies", sep = ""),
       
       col = dot.colors,
       
       ## Insert grid
       panel.first=grid(),
       
       ## Change symbol
       pch=1
  )
  ## Insert diagonal
  abline(0, 1, col="black", lwd=1, lty = 2)
  
  ## MA plot : log-ratio as a function of the log counts
  ## Note: usually, the MA plot represent the log-ratio versus log(mean) of the two measurements.
  ## However, in our case the query sequences are a subset of the background sequences. 
  ## We will thus plot the log(background freq) on the X axis
  plot(x = log2(kmer.stat.table$obs_freq_background),
       y = kmer.stat.table$log2.ratio,
       
       ## Set labels
       xlab = paste("log2(background)"),
       ylab = paste("log2(query/background)"),
       main = paste(k,"-mer frequencies", sep = ""),
       
       col = dot.colors,
       
       ## Insert grid
       panel.first=grid(),
       
       ## Change symbol
       pch=1
  )
  ## Insert diagonal
  abline(h=0, col="black", lwd=1, lty = 2)
  
  
  ## Draw a volcano plot
  ##
  ## Abcsissa indicates the effect,i.e. the log2-ratio query versus background. 
  ## Ordinate indicates the significance as computed above. 
  ##
  ## Mark in red the over-represented k-mers with sig > given threshold.
  ## and in blue the under-represented k-mers with sig > given threshold.
  
  ## Set axis limit 
  x.lim <- abs(max(kmer.stat.table$log2.ratio))
  ## Plotting
  with(kmer.stat.table,plot(x=log2.ratio, 
                        y=sig, 
                        pch = 1,
                        xlim = c(-x.lim, x.lim),
                        ylim = c (0, volcano.ymax),
                        
                        ## Set labels
                        xlab = paste("Log2.ratio ", k,"-mer frequencies", sep = ""),
                        ylab = paste("Significance ", k,"-mer frequencies", sep = ""),
                        main = paste("Volcano Plot ", k,"-mer frequencies", sep = ""),
                        # Insert grid
                        panel.first=grid(),
                        
                        #set the colors 
                        col = dot.colors))
  
  ## Denote significance exceeding the threshold by 
  above.ymax <- kmer.stat.table$sig > volcano.ymax
  with(kmer.stat.table[above.ymax,],
       points(x=log2.ratio, y=rep(x = volcano.ymax, length.out=sum(above.ymax)), pch=17, col="red",type="p"))
  
  ## insert diagonals
  abline(h=sig.threshold, col= "black", lty = 2)
  abline(v=0, col= "black", lty = 2)
  abline(v=-log2FC.threshold, col= "black", lty = 2)
  abline(v=log2FC.threshold, col= "black", lty = 2)
}

With this data and functions we can now compute \(k\)-mer enrichment in the promoters of a sample set of co-regulated genes or regulon:

Cluster of co-expressed genes

# adjust as required 
dir_data = "test_data/" # where pre-computed k-mer freq files are 
dir_code = "test_code/" 

source(file.path(dir_code, "kmer_stats.R"))
source(file.path(dir_code, "plot_kmers.R"))

# set word/k-mer size
k = 6 

# set files with pre-computed words/k-mer frequencies
# raw sequences
#oligo.upstream.freq.file = file.path(dir_data,"6nt_upstream_regulon.raw.tab")
# recommended repeat-masked sequences
oligo.upstream.freq.file = file.path(dir_data,"6nt_upstream_regulon.rm.tab")
  
oligo.genome.freq.file = file.path(dir_data, "6nt_upstream_genome.tab")
 
# check the code of this function to see how the probabilities of word occurrences 
# are computed with the binomial function
kmer.stat.table <- kmer.stats(query.oligo.file = oligo.upstream.freq.file, 
                               background.oligo.file = oligo.genome.freq.file)
#View(kmer.stat.table)
  
plot.kmers(kmer.stat.table, k, volcano.ymax = 60)

4.1 Exercises

The plant-dedicated mirror of the Regulatory Sequence Analysis Tools (RSAT, http://plants.rsat.eu) offers specialized options for researchers dealing with plant transcriptional regulation. The web site contains whole-sequenced species regularly updated from Ensembl Plants and other sources, and supports an array of tasks frequently required for the analysis of regulatory sequences, such as retrieving upstream sequences, motif discovery, motif comparison, pattern matching. RSAT::Plants also integrates the footprintDB collection of DNA motifs. This exercise consists on following a step-by-step protocol on how to discover DNA motifs in regulatory regions of clusters of co-expressed genes in plants. It also explains how to empirically control the significance of the result, and how to annotate any discovered motis.

This protocol has been published at (Contreras-Moreira et al. 2016) and updated at (Ksouri et al. 2021). Other related protocols can be found at (Castro-Mondragon et al. 2016; Contreras-Moreira and Sebastian 2016).

4.1.1 Background

Transcriptome data (microarrays, RNA-seq) have been extensively used as a proxy for genetic regulation in many organisms, as the analysis of genome-wide profiles of gene transcription under different treatments uncovers clusters of genes with correlated behaviors, which may result from direct or indirect co-regulation. A classical application of this approach was done by Beer and Tavazoie (2004) with yeast microarray data sets obtained in a variety of experimental conditions. In that experiment, expression data-mining was demonstrated to be an effective strategy for finding regulons, groups of genes that share regulatory mechanisms and functional annotations.

Other studies have unveiled that the outcome of these approaches largely depends on the genomic background of the species under study. For instance, Sand O. and Helden J. (2008) reported that the significance of DNA motifs discovered in Saccharomyces cerevisiae promoters is much higher for regulons than for random gene sets of the same sizes, but for human promoters the signal-to-noise ratio is almost null, because random gene sets give highly significant motifs due to heterogeneities in promoter compositions and biases due to repetitive elements. For metazoans, it is thus a real challenge to distinguish bona fide motifs from noise (Sand O. and Helden J. 2008). These observations suggest that motif discovery on sequence clusters faces intrinsic properties of the genomes under study, regardless of the software used for the task.

Among plants, these strategies have so far been tested on the model Arabidopsis thaliana, and they have been successfully applied to the identification of novel cis-regulatory elements validated with synthetic promoters (Koschmann et al. 2012). Yet, with the exception of this model, these sorts of experiments have not been possible in other plants until recently. In spite of this, the growing list of available plant genomes encourages these analyses in combination with expression profiles obtained from either microarray or RNA-seq data sets, as in the work of Yu et al. (2015), provided that these factors are considered:

Plant genomes are rich in repetitive elements (RE) distributed along the genome (Schmidt and Heslop-Harrison 1998), which pose particular problems for motif discovery statistics (violation of the independence assumption).
Current genome assemblies range from 120Mb (A. thaliana) to 13,420Mb (Triticum aestivum). Brachypodium distachyon, a model species for grasses, is 272Mb. The quality of these assemblies and their RE content is also quite variable, as shown in Table 1.
Upstream regions, defined by annotated gene coordinates, are also of variable length, going from 1,151b on average in A. thaliana to 2,820b in Triticum_turgidum (see Table 1). Work by Ksouri et al. (2021) has shown that the choice of upstream region length impacts on the discovery of regulatory sequences.

Table 1. Features of some plant genomes in RSAT::Plants, taken from https://rsat.eead.csic.es/plants/htmllink.cgi?title=RSAT-Data&file=data. Each ID concatenates the organism, the assembly version and the source. Most genome IDs add to the end the Ensembl Plants release number. For instance, Arabidopsis_thaliana.TAIR10.55, corresponds to A.thaliana assembly 10 from TAIR https://www.arabidopsis.org, annotated in release 55 of Ensembl Plants. “%Ns” are stretches of uncharacterized nucleotides which often connect assembled sequence contigs. “%repeat-masked” segments are sequences with significant similarity to plant repetitive DNA sequences, which are masked.

genome.stats <- read.delim(file="test_data/summary.tab", row.names = NULL, comment.char=";")
genome.stats[,2] = as.numeric(genome.stats[,2])/1e6 ## Convert genome sizes to Mb
genome.stats[,5] = as.numeric(genome.stats[,5]-genome.stats[,4]) ## raw %mask also includes %N
names(genome.stats) <- c("Organism / assembly ID", "Genome size (Mb)", "Contigs", "%N", "% repeat-masked", "Gene models", "Mean upstream length")
# names(genome.stats)

## Print the table
kable(genome.stats,format.args = list(big.mark=","),
      digits=c(0,0,0,1,1,0,0))

Organism / assembly ID	Genome size (Mb)	Contigs	%N	% repeat-masked	Gene models	Mean upstream length
Aegilops_tauschii.Aetv4.0.55	4,225	109,583	2.3	82.2	43,362	2,533
Amborella_trichopoda.AMTR1.0.55	706	5,745	5.4	54.9	28,353	2,682
Ananas_comosus.F153.55	316	25	1.5	16.2	26,730	2,367
Arabidopsis_halleri.Ahal2.2.55	196	2,239	14.8	24.0	33,223	1,758
Arabidopsis_lyrata.v.1.0.55	207	695	11.1	32.9	33,500	1,799
Arabidopsis_thaliana.Columbia.GCF_000001735.3_TAIR10.NCBIRefseq	120	7	0.2	21.8	38,161	638
Arabidopsis_thaliana.TAIR10.55	120	7	0.2	23.4	34,262	1,154
Beta_vulgaris.RefBeet-1.2.2.55	566	40,245	8.6	18.0	28,031	2,260
Brachypodium_distachyon.Brachypodium_distachyon_v3.0.55	271	10	0.2	4.2	34,310	1,971
Brachypodium_hybridum.463.v1.1.JGI	509	15	1.2	17.8	70,160	2,052
Brachypodium_stacei.316.v1.1.JGI	234	112	1.1	13.9	29,898	2,139
Brachypodium_sylvaticum.490.v1.1.JGI	358	629	0.0	41.6	36,927	2,070
Brassica_juncea.ASM1870372v1.55	934	1,544	4.8	50.1	92,887	2,045
Brassica_napus.AST_PRJEB5043_v1.55	848	20,899	13.0	20.1	105,276	1,862
Brassica_oleracea.BOL.55	489	32,928	8.8	21.7	60,694	2,195
Brassica_rapa_ro18.SCU_BraROA_2.3.55	346	295	0.4	52.4	43,129	2,263
Cannabis_sativa.cs10.GCF_900626175.2.NCBI	876	221	15.9	46.9	29,813	2,528
Capsicum_annuum.ASM51225v2.55	3,064	35,797	3.2	8.0	35,845	2,683
Chenopodium_quinoa.PI614886.392.v1.JGI	1,385	3,486	4.5	63.2	44,776	2,591
Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.55	111	53	3.6	7.1	17,743	536
Chondrus_crispus.ASM35022v2.55	105	926	0.9	68.4	9,857	674
Cicer_arietinum.ASM33114v1.GCF_000331145.1.RefSeq	531	7,127	9.1	42.4	28,772	2,484
Citrus_clementina.Citrus_clementina_v1.0.55	301	1,398	2.1	21.0	27,179	2,338
Coffea_canephora.AUK_PRJEB4211_v1.55	569	13,007	17.1	6.1	25,574	2,360
Corchorus_capsularis.CCACVL1_1.0.55	317	16,522	0.0	46.5	31,549	2,070
Corylus_avellana.CavTom2PMs-1.0.55	370	11	0.5	48.2	27,270	2,445
Cucumis_sativus.ASM407v2.55	194	186	1.8	13.8	24,079	2,218
Cyanidioschyzon_merolae.ASM9120v1.55	17	22	0.0	6.7	5,021	816
Cynara_cardunculus.CcrdV1.55	725	13,588	9.7	8.7	26,505	2,595
Daucus_carota.ASM162521v1.55	422	4,826	8.2	4.6	33,488	2,552
Digitaria_exilis.Fonio_CM05836.55	716	19	2.8	40.1	59,844	2,329
Dioscorea_rotundata.TDr96_F1_Pseudo_Chromosome_v1.0.55	457	21	11.1	17.3	19,565	2,441
Echinochloa_crusgalli.ec_v3.55	1,341	2,058	0.1	53.1	103,850	2,404
Eragrostis_tef.ASM97063v1.55	607	13,883	18.3	3.5	41,920	2,321
Ficus_carica.UNIPI_FiCari_1.0.55	333	511	0.0	47.2	37,832	2,275
Fragaria_vesca.Hawaii4.Fvb2.0.GDR	212	8	7.0	20.6	33,538	1,822
Galdieria_sulphuraria.ASM34128v1.55	14	433	2.1	2.7	6,736	215
Glycine_max.Glycine_max_v2.1.55	978	1,190	2.4	48.9	57,147	2,521
Gossypium_raimondii.Graimondii2_0.55	761	1,033	1.8	57.6	38,719	2,492
Helianthus_annuus.HanXRQr2.0-SUNRISE.55	3,010	332	2.1	81.3	83,177	2,501
Hordeum_vulgare.IBSCv2.55	4,834	10	5.4	52.3	42,858	2,636
Hordeum_vulgare.MorexV3_pseudomolecules_assembly.55	4,226	290	0.0	85.3	35,826	2,779
Ipomoea_triloba.ASM357664v1.55	462	16	5.3	19.2	31,418	2,405
Klebsormidium_nitens.NIES2285.171026.v11.TITECH	104	1,814	1.0	0.0	17,153	865
Lactuca_sativa.Lsat_Salinas_v7.55	2,391	8,323	7.6	68.5	38,693	2,784
Leersia_perrieri.Lperr_V1.4.55	267	12	0.4	32.2	31,516	2,211
Lolium_perenne.MPB_Lper_Kyuss_1697.55	2,277	1,677	0.0	79.0	38,868	2,706
Lupinus_albus.CNRS_Lalb_10.GCA_009771035.1.NCBI	451	89	0.5	44.8	41,359	2,086
Lupinus_albus.La_Amiga31.GCA_010261695.1.NCBI	559	1,580	0.3	42.8	48,718	2,203
Lupinus_angustifolius.LupAngTanjil_v1.0.55	609	13,573	0.7	24.0	35,170	2,629
Manihot_esculenta.Manihotesculentav6.55	582	2,019	14.9	12.2	33,516	2,363
Marchantia_polymorpha.Marchanta_polymorpha_v1.55	226	2,957	6.7	4.2	19,287	2,174
Medicago_truncatula.MedtrA17_4.0.55	413	2,186	5.5	33.9	52,918	2,298
Musa_acuminata.ASM31385v1.55	473	12	17.4	15.8	38,658	1,968
Nicotiana_attenuata.NIATTr2.55	1,492	35,134	11.7	24.4	21,344	2,765
Nicotiana_tabacum.TN90.GCF_000715135.1.NCBI	3,643	168,247	2.9	51.4	74,273	2,432
Olea_europaea.OLEA9.55	1,316	9,751	4.1	54.3	55,595	2,563
Olea_europaea.sylvestris.451.v1.JGI	1,142	41,256	9.7	56.3	50,684	2,695
Oryza_barthii.O.barthii_v1.55	308	12	0.8	43.9	37,849	2,217
Oryza_brachyantha.Oryza_brachyantha.v1.4b.55	261	7,485	6.7	28.4	34,598	2,246
Oryza_glaberrima.Oryza_glaberrima_V1.55	316	1,951	4.1	40.3	74,579	2,028
Oryza_glumipatula.Oryza_glumaepatula_v1.5.55	373	12	19.2	35.1	39,702	2,165
Oryza_indica.ASM465v1.55	427	10,490	3.8	44.2	89,723	2,005
Oryza_longistaminata.O_longistaminata_v1.0.55	326	60,198	9.8	31.0	32,868	2,159
Oryza_meridionalis.Oryza_meridionalis_v1.3.55	336	12	23.8	29.9	32,549	2,204
Oryza_nivara.Oryza_nivara_v1.0.55	338	12	8.3	40.5	38,741	2,179
Oryza_punctata.Oryza_punctata_v1.2.55	394	12	0.4	53.1	37,791	2,218
Oryza_rufipogon.OR_W1943.55	338	12	1.9	46.1	40,584	2,171
Oryza_sativa.IRGSP-1.0.55	375	63	0.0	51.0	38,978	2,204
Ostreococcus_lucimarinus.ASM9206v1.55	13	21	0.0	15.0	7,664	521
Panicum_hallii_fil2.PHallii_v3.1.55	536	291	1.4	5.1	33,805	2,027
Phaseolus_vulgaris.PhaVulg1_0.55	521	708	9.3	20.1	28,472	2,596
Physcomitrium_patens.PhypaV3.55	472	357	1.3	58.6	32,674	2,044
Populus_trichocarpa.Pop_tri_v3.55	434	1,446	2.6	5.5	41,335	2,451
Prunus_avium.PAV_r1.0.55	272	10,148	9.4	16.0	43,349	1,868
Prunus_avium.Satonishiki.PAV_r1.0.GDR	374	9	34.0	18.9	43,673	2,094
Prunus_mume.BJFU1210120008.GCF_000346735.1.NCBIRefseq	234	8,164	7.2	22.5	26,524	2,047
Prunus_persica.Prunus_persica_NCBIv2.38	227	191	1.2	21.3	27,107	1,984
Prunus_persica.Prunus_persica_NCBIv2.55	227	191	1.2	19.2	27,107	1,984
Prunus_persica.Prupe1_0.37	227	202	1.2	16.1	28,898	2,217
Saccharum_spontaneum.Sspon.HiC_chr_asm.55	2,900	32	0.3	36.6	83,815	2,706
Selaginella_moellendorffii.v1.0.55	213	759	1.9	37.1	36,200	1,436
Setaria_italica.Setaria_italica_v2.0.55	406	336	1.2	2.8	35,831	2,300
Solanum_lycopersicum.SL3.0.55	828	3,148	9.8	51.8	35,092	2,300
Solanum_tuberosum.DM.v4.03.PGSC	725	12	12.5	0.0	39,028	2,401
Solanum_tuberosum.SolTub_3.0.55	811	13	15.8	49.3	40,430	2,527
Sorghum_bicolor.Sorghum_bicolor_NCBIv3.55	709	867	4.7	64.7	35,479	2,276
Trifolium_pratense.Trpr.55	305	8,595	11.7	19.0	41,184	1,849
Triticum_aestivum.IWGSC.55	14,550	22	1.9	85.1	120,744	2,605
Triticum_dicoccoides.WEWSeqv.1.0.55	10,080	14	1.5	86.1	67,375	2,656
Triticum_turgidum.svevo.55	10,460	15	1.5	4.9	66,559	2,820
Triticum_urartu.Tu2.0.55	4,852	10,284	0.6	89.7	41,492	2,613
Vigna_angularis.Vigan1.1.55	467	37,373	3.6	13.5	34,819	2,719
Vigna_radiata.Vradiata_ver6.55	463	2,497	7.2	20.2	23,164	2,643
Vigna_unguiculata.ASM411807v1.55	519	682	0.5	51.0	31,814	2,496
Vitis_vinifera.PN40024.v4.55	476	22	0.4	57.2	35,134	2,266
Zea_mays.Zm-B73-REFERENCE-NAM-5.0.55	2,182	685	0.2	84.5	44,303	2,502

The step-by-step protocol below has been tested for the task of discovering and annotating DNA motifs in clusters of upstream sequences for species supported by RSAT::Plants, which have been obtained mostly from Ensembl Plants (https://plants.ensembl.org), but also include data from the JGI Genome Portal (http://genome.jgi.doe.gov) and other sources. In addition, RSAT::Plants integrates footprintDB (https://footprintdb.eead.csic.es), a collection of position-specific scoring matrices (PSSM) representing transcription factor binding motifs (TFBM), as well as their cognate binding proteins (Sebastian and Contreras Moreira 2014), which can be used to annotate discovered motifs and to predict potentially binding transcription factors.

Discovering regulatory elements within natural genomic sequences is certainly an important scientific goal on its own, but can also be part of the design and validation of synthetic promoters. We envisage at least two applications in this context:

The characterization of promoters of genes with known expression properties, which can then be used to engineer the expression of genes of interest.
The validation of engineered promoters in order to make sure that they contain the expected regulatory elements which might be natural or engineered depending on the application.

4.1.2 Materials

This protocol requires disposing of:

A computer with a Web browser
A set of gene clusters from any of the species currently supported at RSAT::Plants ( http://plants.rsat.eu, see Note ¹). Here we will use three example clusters of co-expressed maize genes, shown on Table 2 (see Note ²). First, they discovered potential regulatory motifs within their upstream sequences, and then they performed electrophoretic mobility shift assays (EMSA) to confirm them. Table 2 shows three of those clusters which are used in this protocol. For each cluster a list of gene identifiers is given next to the EMSA-confirmed motifs. The remaining nine clusters were left out for being too small, as the statistical approaches in this protocol require at least 10-15 genes.

4.1.3 Methods

The following protocol enumerates the steps required to discover DNA motifs, based on the over-representation of \(k\)-mers (oligonucleotides) and dyads (spaced pairs of oligonucleotides), in clusters of upstream sequences. The protocol comprises two stages, analyzing first co-expressed genes and then random clusters as a negative control (see Note ³). Only after both stages have been completed it is possible to objectively estimate the relevance of the results.

4.1.3.1 Collecting the full set of promoters for the genome of interest

Before the proper analysis of the gene cluster, we will retrieve the promoter sequences of all the genes of the organism of interest, which will serve below to estimate the background model. Table 2 shows three clusters of maize (Zea mays) genes used along the protocol, extracted from the published work of Yu et al. (2015). The original gene identifiers have been translated to current gene models in the Zm-B73-REFERENCE-NAM-5.0.55 annotation using maizeGDB (Woodhouse et al. 2021).

Experimentally verified regulatory motifs of these clusters are indicated:

Cluster name	Confirmed motif	#sequences	Original gene IDs	Current gene IDs
ABI4	GCGCRSGCGGSC	16	GRMZM2G025062 GRMZM2G053503 GRMZM2G069082 GRMZM2G069126 GRMZM2G069146 GRMZM2G076896 GRMZM2G081892 GRMZM2G124011 GRMZM2G129674 GRMZM2G142179 GRMZM2G169654 GRMZM2G172936 GRMZM2G173771 GRMZM2G174347 GRMZM2G175525 GRMZM2G421033	Zm00001eb330490 Zm00001eb340140 Zm00001eb318890 Zm00001eb318900 Zm00001eb318910 Zm00001eb202570 Zm00001eb335190 Zm00001eb335200 Zm00001eb103110 Zm00001eb006160 Zm00001eb119540 Zm00001eb156040 Zm00001eb074730 Zm00001eb193670 Zm00001eb352580 Zm00001eb422470 Zm00001eb432090
E2F	TTCCCGCCA	18	AC197146.3_FG001 GRMZM2G017081 GRMZM2G021069 GRMZM2G037700 GRMZM2G057571 GRMZM2G062333 GRMZM2G065205 GRMZM2G066101 GRMZM2G075978 GRMZM2G100639 GRMZM2G112074 GRMZM2G117238 GRMZM2G130351 GRMZM2G139894 GRMZM2G154267 GRMZM2G162445 GRMZM2G327032 GRMZM2G450055	Zm00001eb405460 Zm00001eb227730 Zm00001eb350340 Zm00001eb228900 Zm00001eb284360 Zm00001eb028400 Zm00001eb162700 Zm00001eb257900 Zm00001eb290510 Zm00001eb137000 Zm00001eb248940 Zm00001eb421110 Zm00001eb169600 Zm00001eb169610 Zm00001eb037700 Zm00001eb351990 Zm00001eb342080 Zm00001eb192270 Zm00001eb192280 Zm00001eb243240 Zm00001eb243250
WRI1	CGGCGGCGS	56	AC210013.4_FG019 GRMZM2G008430 GRMZM2G009968 GRMZM2G010435 GRMZM2G010599 GRMZM2G014444 GRMZM2G015097 GRMZM2G017966 GRMZM2G022019 GRMZM2G027232 GRMZM2G028110 GRMZM2G035017 GRMZM2G041238 GRMZM2G045818 GRMZM2G047727 GRMZM2G048703 GRMZM2G064807 GRMZM2G068745 GRMZM2G074300 GRMZM2G076435 GRMZM2G078779 GRMZM2G078985 GRMZM2G080608 GRMZM2G092663 GRMZM2G096165 GRMZM2G098957 GRMZM2G107336 GRMZM2G108348 GRMZM2G111987 GRMZM2G115265 GRMZM2G119865 GRMZM2G122871 GRMZM2G126603 GRMZM2G126928 GRMZM2G132095 GRMZM2G140799 GRMZM2G148744 GRMZM2G150434 GRMZM2G151252 GRMZM2G152599 GRMZM2G170262 GRMZM2G181336 GRMZM2G311914 GRMZM2G312521 GRMZM2G322413 GRMZM2G325606 GRMZM2G343543 GRMZM2G353785 GRMZM2G409407 GRMZM2G439201 GRMZM5G823135 GRMZM5G827266 GRMZM5G831142 GRMZM5G835323 GRMZM5G870606 GRMZM5G882378	Zm00001eb212470 Zm00001eb347390 Zm00001eb351870 Zm00001eb066830 Zm00001eb420570 Zm00001eb238930 Zm00001eb031300 Zm00001eb267190 Zm00001eb222060 Zm00001eb222210 Zm00001eb283120 Zm00001eb063980 Zm00001eb335620 Zm00001eb342930 Zm00001eb095960 Zm00001eb011490 Zm00001eb174060 Zm00001eb405750 Zm00001eb172130 Zm00001eb166570 Zm00001eb316270 Zm00001eb285610 Zm00001eb061760 Zm00001eb212180 Zm00001eb134780 Zm00001eb037330 Zm00001eb138230 Zm00001eb235710 Zm00001eb271520 Zm00001eb277320 Zm00001eb160070 Zm00001eb359780 Zm00001eb235710 Zm00001eb399010 Zm00001eb036530 Zm00001eb354150 Zm00001eb361700 Zm00001eb317180 Zm00001eb173470 Zm00001eb192860 Zm00001eb205850 Zm00001eb192680 Zm00001eb330480 Zm00001eb372580 Zm00001eb276860 Zm00001eb320920 Zm00001eb282960 Zm00001eb329960 Zm00001eb227700 Zm00001eb227710 Zm00001eb352030 Zm00001eb158110 Zm00001eb323110 Zm00001eb225190 Zm00001eb225200 Zm00001eb121240

Open a connection to the RSAT::Plants server. It can be reached at http://plants.rsat.eu. On the left-side menu, select ‘Sequence tools -> retrieve sequence’.
Choose ‘Single organism -> Zea mays’ for the examples of this protocol (see Note 1), which corresponds to ‘Zea_mays.Zm-B73-REFERENCE-NAM-5.0.55’ in Ensembl Plants release 55 (Yates et al. 2022).
Choose ‘Genes -> all’; this will retrieve all upstream sequences of the maize genome.
Set appropriate upstream bounds. To replicate the work of Yu et al. (2015) these should be set to ‘From’ -1000 ‘To’ +200, with position 0 corresponding to transcriptional start sites (TSS); see Note ⁴. Beware that TSS positions in plant genomes often correspond to start codons, probably due to incomplete annotations.
We recommend to tick the option ‘Mask repeats’, as plant genomes are frequently repeat-rich. This option should NOT be used if you suspect the transcription factors of interest bind to repeated sequences.
Press ‘GO’ and wait until the retrieve-seq result page is displayed (see Note ⁵). The results include the executed command and a URL to the ‘sequences’ file, which must be saved. We will refer to this URL as ‘all.fasta.URL’. This FASTA-format file can also be stored as a local file on your computer, but note it can be rather large and might fail to upload later on if its size is larger than the server’s limit.
Press ‘GO’ and wait until the retrieve-seq result page is displayed (see Note 4).

4.1.3.2 Analyzing upstream sequences of co-expressed genes

We will now retrieve the upstream sequences of a cluster of co-expressed genes, and use peak-motifs to discover exceptional motifs in their promoters. The tool peak-motifs was initially conceived to discover motifs in ChIP-seq peaks, but it can also be used to analyze other sequence types, as illustrated here.

Choose cluster E2F from Table 2, copy the current gene IDs (last column). The IDs should be saved in a new text file, that you will store on your computer. Insert newline characters between genes⁶.
In the left menu of the RSAT server, click on ‘retrieve sequence’ to get a fresh form. Make sure that the option ‘Genes -> selection’ is activated and that the right organism, in this case ‘Zea_mays.Zm-B73-REFERENCE-NAM-5.0.55’, is selected. Tick ‘Mask repeats’, and set the same size limits as for the whole collection of promoters: from -1000 to +200. Paste the list of IDs of your gene cluster (one gene ID per row).
Press ‘GO’ and wait a few seconds until the result page is displayed. Inspection of these sequences might reveal N-masked sequence stretches, which correspond to annotated repeats. Save both ‘query genes’ and ‘sequences’ files to local files on your computer, we will refer to them as ‘cluster.genes’ and ‘cluster.fasta’ later on this protocol.
Press the ‘peak-motifs’ button. The peak sequences section is automatically filled with a link to the selected cluster sequences.
Add a title for this job, such as ‘E2F cluster’.
On the right side of ‘Peak sequences’, under Control sequences, paste the ‘all.fasta.URL’ on the ‘URL of a sequence file available on a Web server’ entry.
Click on ‘Reduce peak sequences’ and leave both fields (‘number of top sequences to retain’ and ‘cut peak sequences’) blank to avoid having the sequences clipped.
Click on ‘Motif discovery parameters’. Select two algorithms: ‘Discover over-represented words’ (oligo-analysis) and ‘Discover over-represented spaced word pairs’ (dyad-analysis). Uncheck the program position-analysis (see Note ⁷).
Click on ‘Compare discovered motifs with databases’ and select appropriate databases which will be used to annotate any found motifs. For plant promoters, we recommend to check ‘footprintDB-plants’, but you can also check other databases such as ‘JASPAR plants’ (see Note ⁸).

You can also upload your own collection of DNA motifs in TRANSFAC format.

Click on ‘Reporting Options’. Set ‘Origin’ to ‘end’ and ‘Offset’ to -200 (see Note ⁹).
Select output delivery and press ‘GO’.
After few seconds the server should have uploaded the sequences and display a page with the URL of the future result page. You can already click on this link: the result page will be periodically updated to show the progress of the analysis. At the end of the processing, a box will appear at the top of the result page, with a short summary of the discovered motifs, and links to different sections of the results.

Once the job is complete click on the link [Download all results (peak-motifs_archive.zip)] to save the results on your computer. You will later be able to uncompress this archive in order to check the result after its removal from the server (results are only available on the server for 7 days after job completion). We also recommend downloading the full set of discovered motifs, by clicking on the link [Download all matrices (transfac format)] and saving a local file named ‘cluster.motifs.tf’. This file contains all motifs in the form of position-weight matrices (PWMs) in TRANSFAC format.

On the result page, the section entitled ‘Discovered motifs (with motif comparison)’ lists the discovered motifs, displays their sequence logos and their distribution along clustered sequences, in addition to top matches with the motif databases selected on step 9.

4.1.3.3 Negative control: random groups of genes

In this section, we propose a procedure to obtain an empirical estimation of the rate of false positives, by discovering motifs in the promoters of genes picked up at random.

On the left-side menu of RSAT::Plants select ‘Build control sets -> random gene selection’.
Choose ‘Organism -> Zea_mays.Zm-B73-REFERENCE-NAM-5.0.55’ for the examples of this protocol.
Set ‘Number of genes’ to the size of one of the sample clusters on Table 2. For instance, the size of the negative control sets would be 18 for cluster E2F, 16 for cluster ABI4, and 56 for cluster WRI1. For convenience, in this tutorial only one random group is generated (the default), but this utility can generate several random groups in one go (see Note ¹⁰).
Press ‘GO’ and click the ‘Next step’ button ‘retrieve sequences’ at the bottom of the result page. In the retrieve-seq form, set the other parameters as above: from -1000 to +200, check the ‘Mask repeats’ option and press ‘GO’.
Save ‘query genes’ and ‘sequences’ files to local ‘random.genes’ and ‘random.fasta’ files and repeat steps 4-11 of section 4.1.3.2.

4.1.3.4 Validating motifs by scanning promoter sequences

This part of the protocol is devoted to validating sequence motifs discovered by their over-representation, which are scanned against the original sequences from which they were discovered, plus, optionally, orthologous sequences from a related species (see Note ¹¹).

The first goal of this section is to check whether the discovered motifs show patterns of occurrence along promoter sequences, and to see how many cluster sequences actually harbor them. This can be done empirically by comparing the results of expression-based motifs with those of shuffled motifs, with columns permuted, which play the role of negative controls.

A second goal is to investigate whether these regulatory motifs are conserved on orthologous promoters of a related plant, Sorghum bicolor in this case study.

On the left-side menu select ‘Comparative genomics -> get orthologs-compara’.
Choose ‘Reference organism -> Sorghum bicolor’ for the maize example.
Upload file ‘cluster.genes’ generated in step 3 of section 4.1.3.2. Press ‘GO’ and finally press ‘retrieve sequences’ on the next screen.
Repeat steps 4-6 of section 4.1.3.1 but now select Sorghum bicolor as organism. Save ‘sequences’ to local file ‘cluster_orths.fasta’.
On the left-side menu select ‘Build control sets -> permute-matrix’.
Upload ‘cluster.motifs.tf’ (obtained in step 12 of section 4.1.3.2) and press ‘GO’. Save the results file as ‘cluster.motifs.perm1.tf’ (see Note ¹²).
Select ‘Pattern matching -> matrix scan (full options)’.
In the sequence box paste the contents of ‘cluster.fasta’ and, optionally, ‘cluster_orths.fasta’, if you wish to assess motif conservation. Alternatively, steps 7-12 can be performed separately with maize and S.bicolor sequences.
Upload file ‘cluster.motifs.tf’ and select ‘transfac’ format.
In the ‘Background model’ section select Markov order 2 and choose ‘Organism-specific -> Zea_mays.Zm-B73-REFERENCE-NAM-5.0.55’. Press ‘GO’.
Save the ‘Scan result’ file as ‘cluster.scan.ft’ and press the ‘feature map’ button to draw a map of the matched motif instances.
Repeat steps 6-11 using the set of permuted PWMs ‘cluster.motifs.perm1.tf’ and save the results as ‘cluster.perm1.scan.tf’.

4.1.3.5 Interpretation of results

The last stage of the protocol is the interpretation of results, which is which requires having at hand results of both clusters of co-expressed genes and random clusters, which play the role of negative controls. Figure 1 summarizes the results of cluster ABI4 compared to 50 random clusters of the same size. There are three types of evidence to look at, which will be discussed with the examples on this figure:

Figure 1. Summary of motif discovery results for cluster ABI4. Dark bars correspond to clusters of co-expressed genes, grey bars to 50 random clusters of genes drawn from the maize genome. Maximum significance of oligo-analysis (A) and dyad-analysis (B) motifs. The sequence logo of motifs reported by each algorithm is shown on top, indicating the number of sites used to compute it and the \(\text{Ncor}\) score of the comparison to the expected motif (bottom, see Note 11). Panel C shows the scores of discovered motifs when scanned back to the original maize upstream sequences and sequences from orthologous genes in Sorghum bicolor. Here dark bars are the reported PWMs, while the grey bars correspond to permuted PWMs. Panel D show the \(\text{Ncor}\) scores of discovered motifs when compared to annotated PWMs in footprintDB.

The distribution of motif significance yielded by oligo-analysis (A) and dyad-analysis (B). Motifs discovered in random clusters (grey bars) typically have significances below 4. The motifs found in cluster ABI4 (black bars) are not more significant than those of random gene sets of the same sizes. The reason for having significant motifs in the random gene sets may result from the occasional presence of low complexity motifs, which should not be considered as reliable predictions.
Panel A, E and I also show the comparisons between some motifs returned by peak-motifs and those reported by the authors of the reference experimental study (Yu et al. 2015); they used MEME as motif discovery tools).
The distribution of scanning scores (C) show to which extent motif matches in upstream sequences of both maize and their S. bicolor orthologues (dark boxes) depart from matches of permuted matrices (lighter boxes), used here as negative controls. On these boxplots, the horizontal bars indicate the median score of all the predicted sites in a given set of promoter sequences, and the shaded rectangles show the interquartile range, i.e. the extent between the 25% and 75% percentiles. For the ABI4 cluster there is a noticeable overlap between the interquartile boxes of the cluster and the random gene selections. Besides, the random selections show several “outliers” (circles) indicating sites predicted with high matching scores. Even though the mean scores are clearly higher for the actual cluster, the results may thus not be considered very significant.
The distribution of scores in footprintDB (D) describe how similar the discovered motifs are when compared to motifs (PWMs) annotated in footprintDB. Similarities are measured by normalized correlation score (\(\text{Ncor}\), see Note ¹³). In each example 50 random sets of promoters were analyzed with peak-motifs, and the discovered motifs compared to footprintDB. The black bar indicates the best matching score for the original, expression-based gene clusters, and the corresponding logo is overlaid on the histogram. For ABI4, the best-scoring matches correspond to phytochrome interacting factors. These proteins belong to the bHLH family of transcription factors and there are many annotated motifs for them in databases such as footprintDB.

In summary, motifs discovered in promoters of co-expressed genes should always be evaluated based on a combination of complementary criteria:

The primary key of interpretation is the significance reported by the motif discovery algorithms. This significance has to be interpreted by comparison with the results obtained in random promoter sets of the same size as the gene cluster of interest (negative controls).
Sequence scanning permits to predict putative binding sites, but the matching scores should be evaluated relative to randomized motifs (column-permuted).
Comparison between discovered motifs and databases of known TF-binding motifs suggests candidate transcription factors which could intervene in the co-regulation of the co-expressed cluster.

4.1.3.6 Your report

Your work is to run the protocol with cluster E2F and write a brief report with:

a table of the top hexamers and dyads enriched on the E2F cluster of maize usptream sequences and at least a random cluster of the same size. The table shall include the following columns: exp_freq=expected relative frequency, occ=observed occurrences, exp_occ=expected occurrences, occ_P=occurrence probability (binomial), occ_E=E-value for occurrences, occ_sig=occurrence significance.
a table with similar DNA motifs in footprintDB
an overall interpretation of results

References and notes

Abramson, Josh, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, et al. 2024. “Accurate Structure Prediction of Biomolecular Interactions with AlphaFold 3.” Nature 630 (8016): 493–500. https://doi.org/10.1038/s41586-024-07487-w.

Amselem, J., G. Cornut, N. Choisne, M. Alaux, F. Alfama-Depauw, V. Jamilloux, F. Maumus, et al. 2019. “RepetDB: a unified resource for transposable element references.” Mob DNA 10: 6. https://doi.org/10.1186/s13100-019-0150-y.

Beer, Michael A, and Saeed Tavazoie. 2004. “Predicting gene expression from sequence.” Cell 117 (2): 185–98. https://doi.org/10.1016/S0092-8674(04)00304-6.

Blanc-Mathieu, R., R. Dumas, L. Turchi, J. Lucas, and F. Parcy. 2023. “Plant-TFClass: a structural classification for plant transcription factors.” Trends Plant Sci, July. https://doi.org/10.1016/j.tplants.2023.06.023.

Castro-Mondragon, J. A., C. Rioualen, B. Contreras-Moreira, and J. van Helden. 2016. “RSAT::Plants: Motif Discovery in ChIP-Seq Peaks of Plant Genomes.” Methods Mol Biol 1482: 297–322. https://digital.csic.es/handle/10261/136254.

Contreras-Moreira, B. 2010. “3D-Footprint: A Database for the Structural Analysis of Protein-DNA Complexes.” Nucleic Acids Research 38: D91–97. http://nar.oxfordjournals.org/content/38/suppl\_1/D91.

Contreras-Moreira, B., J. A. Castro-Mondragon, C. Rioualen, C. P. Cantalapiedra, and J. van Helden. 2016. “RSAT::Plants: Motif Discovery Within Clusters of Upstream Sequences in Plant Genomes.” Methods Mol Biol 1482: 279–95. https://digital.csic.es/handle/10261/136253.

Contreras-Moreira, B., C. V. Filippi, G. Naamati, C. n, J. E. Allen, and P. Flicek. 2021. “K-mer counting and curated libraries drive efficient annotation of repeats in plant genomes.” Plant Genome 14 (3): e20143. https://doi.org/10.1002/tpg2.20143.

Contreras-Moreira, B., J. Sancho, and V. E. Angarica. 2010. “Comparison of DNA binding across protein superfamilies.” Proteins 78 (1): 52–62. http://www.ncbi.nlm.nih.gov/pubmed/19731374.

Contreras-Moreira, B., and A. Sebastian. 2016. “FootprintDB: Analysis of Plant Cis-Regulatory Elements, Transcription Factors, and Binding Interfaces.” Methods Mol Biol 1482: 259–77. https://digital.csic.es/handle/10261/136262.

Ensembl Compara. 2020. “Ensembl Help & Documentation.” http://plants.ensembl.org/info/genome/compara/homology_method.html.

Girgis, H. Z. 2015. “Red: an intelligent, rapid, accurate tool for detecting repeats de-novo on the genomic scale.” BMC Bioinformatics 16 (July): 227. https://pubmed.ncbi.nlm.nih.gov/26206263.

Helden, J. van, B. Andre, and J. Collado-Vides. 1998. “Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies.” J. Mol. Biol. 281 (5): 827–42. https://www.ncbi.nlm.nih.gov/pubmed/9719638.

Koschmann, J., F. Machens, M. Becker, J. Niemeyer, J. Schulze, L. Bulow, D. J. Stahl, and R. Hehl. 2012. “Integration of bioinformatics and synthetic promoters leads to the discovery of novel elicitor-responsive cis-regulatory sequences in Arabidopsis.” Plant Physiol. 160 (1): 178–91. https://doi.org/10.1104/pp.112.198259.

Ksouri, N., J. A. n, F. Montardit-Tarda, J. van Helden, B. Contreras-Moreira, and Y. Gogorcena. 2021. “Tuning promoter boundaries improves regulatory motif discovery in nonmodel plants: the peach example.” Plant Physiol 185 (3): 1242–58. https://doi.org/10.1093/plphys/kiaa091.

Lau, Andy M., Nicola Bordin, Shaun M. Kandathil, Ian Sillitoe, Vaishali P. Waman, Jude Wells, Christine A. Orengo, and David T. Jones. 2024. “Exploring Structural Diversity Across the Protein Universe with The Encyclopedia of Domains.” Science (New York, N.Y.) 386 (6721): eadq4946. https://doi.org/10.1126/science.adq4946.

Lewis, M., G. Chang, N. C. Horton, M. A. Kercher, H. C. Pace, M. A. Schumacher, R. G. Brennan, and P. Lu. 1996. “Crystal structure of the lactose operon repressor and its complexes with DNA and inducer.” Science 271 (5253): 1247–54. http://www.ncbi.nlm.nih.gov/pubmed/8638105.

Lu, X. J., and W. K. Olson. 2008. “3DNA: a versatile, integrated software system for the analysis, rebuilding and visualization of three-dimensional nucleic-acid structures.” Nat Protoc 3: 1213–27. http://www.nature.com/nprot/journal/v3/n7/full/nprot.2008.104.html.

Luscombe, N. M., S. E. Austin, H. M. Berman, and J. M. Thornton. 2000. “An overview of the structures of protein-DNA complexes.” Genome Biol. 1 (1): REVIEWS001. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC138832/.

Luscombe, N. M., R. A. Laskowski, and J. M. Thornton. 2001. “Amino acid-base interactions: a three-dimensional analysis of protein-DNA interactions at an atomic level.” Nucleic Acids Res. 29 (13): 2860–74. http://nar.oxfordjournals.org/content/29/13/2860.long.

Mäkinen, Veli, Djamal Belazzougui, Fabio Cunial, and Alexandru I. Tomescu. 2015. Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing. Cambridge University Press. http://www.genome-scale.info.

Noyes, M. B., R. G. Christensen, A. Wakabayashi, G. D. Stormo, M. H. Brodsky, and S. A. Wolfe. 2008. “Analysis of homeodomain specificities allows the family-wide prediction of preferred recognition sites.” Cell 133 (7): 1277–89. https://www.ncbi.nlm.nih.gov/pubmed/18585360.

Olson, W. K., A. A. Gorin, X. J. Lu, L. M. Hock, and V. B. Zhurkin. 1998. “DNA sequence-dependent deformability deduced from protein-DNA crystal complexes.” Proc. Natl. Acad. Sci. U.S.A. 95 (19): 11163–68. http://www.pnas.org/content/95/19/11163.long.

Sand O., Turatsinze J. V., and van Helden J. 2008. “Evaluating the Prediction of Cis-Acting Regulatory Elements in Genome Sequences.” In Modern Genome Annotation, edited by D. Frishman and A. Valencia, 55–89. Springer. https://doi.org/10.1007/978-3-211-75123-7_4.

Sathyapriya, R., M. S. Vijayabaskar, and S. Vishveshwara. 2008. “Insights into protein-DNA interactions through structure network analysis.” PLoS Comput. Biol. 4 (9): e1000170. http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000170.

Schmidt, T., and J. S. Heslop-Harrison. 1998. “Genomes, Genes and Junk: The Large-Scale Organization of Plant Chromosomes.” Trends in Plant Science 3 (5): 195–99. https://doi.org/http://dx.doi.org/10.1016/S1360-1385(98)01223-0.

Sebastian, A., and B. Contreras Moreira. 2014. “footprintDB: a database of transcription factors with annotated cis elements and binding interfaces.” Bioinformatics 30 (2): 258–65. https://doi.org/10.1093/bioinformatics/btt663.

Sebastian, A., and B. Contreras-Moreira. 2013. “The twilight zone of cis element alignments.” Nucleic Acids Res. 41 (3): 1438–49. http://nar.oxfordjournals.org/content/41/3/1438.long.

Siggers, T. W., A. Silkov, and B. Honig. 2005. “Structural alignment of protein–DNA interfaces: insights into the determinants of binding specificity.” J. Mol. Biol. 345 (5): 1027–45. http://www.ncbi.nlm.nih.gov/pubmed/15644202.

Wicker, T., F. Sabot, A. Hua-Van, J. L. Bennetzen, P. Capy, B. Chalhoub, A. Flavell, et al. 2007. “A unified classification system for eukaryotic transposable elements.” Nat Rev Genet 8 (12): 973–82. https://pubmed.ncbi.nlm.nih.gov/17984973.

Woodhouse, M. R., E. K. Cannon, J. L. Portwood, L. C. Harper, J. M. Gardiner, M. L. Schaeffer, and C. M. Andorf. 2021. “A pan-genomic approach to genome databases using maize as a model system.” BMC Plant Biol 21 (1): 385. https://doi.org/10.1186/s12870-021-03173-5.

Yates, A. D., J. Allen, R. M. Amode, A. G. Azov, M. Barba, A. Becerra, J. Bhai, et al. 2022. “Ensembl Genomes 2022: an expanding genome resource for non-vertebrates.” Nucleic Acids Res 50 (D1): D996–1003. https://doi.org/10.1093/nar/gkab1007.

Yu, C. P., S. C. Chen, Y. M. Chang, W. Y. Liu, H. H. Lin, J. J. Lin, H. J. Chen, et al. 2015. “Transcriptome dynamics of developing maize leaves and genomewide prediction of cis elements and their cognate transcription factors.” Proc. Natl. Acad. Sci. U.S.A. 112 (19): E2477–2486. https://doi.org/10.1073/pnas.1500605112.

Zhou, T., L. Yang, Y. Lu, I. Dror, A. C. Dantas Machado, T. Ghane, R. Di Felice, and R. Rohs. 2013. “DNAshape: a method for the high-throughput prediction of DNA structural features on a genomic scale.” Nucleic Acids Res. 41 (Web Server issue): 56–62. http://www.ncbi.nlm.nih.gov/pubmed/23703209.

As gene models can change from one assembly to another it is important to use the right assembly version, which is indicated for each genome in Table 1↩︎
Twelve clusters of maize genes, found to be co-expressed in 22 transcriptomes and enriched on Gene Ontoloy terms were analyzed in detail by Yu et al. (2015)↩︎
A crucial parameter to evaluate the results of motif discovery is to estimate the rate of false positives (FP). RSAT programs compute a significance score, which is the minus log of the expected number of false positives (\(\text{e-value} = 10^{-\text{sig}}\)). For example, a motif associated with a significance of 1 should be considered as poorly significant, since on average we would expect \(10^{-1} = 0.1\) false positives, i.e. one FP every 10 random trials. In contrast, a significance of e.g. 16 is very promising, since on average such a result would be expected every \(10^{-16}\) random trials. However, the theoretical significance relies on the correctness of the background model (computed here as k-mer and dyad frequencies in the whole set of promoters). In some cases, sets of plant promoters can discard from the theoretical model, due to heterogeneity of the input (e.g. inclusion of repetitive sequences). The negative control consists in measuring the significance obtained by submitting a random selection of promoters from the organism of interest (maize in the example). Although each of these genes is likely to be regulated by one or more transcription factors (and its promoter should contain corresponding binding sites), in principle the random set as a whole should not be co-regulated, so that the elements would differ from gene to gene, and there should thus be no over-represented motif in their promoters.↩︎
According to Ksouri et al. (2021) the range -500,+200 performs better↩︎
Should the connection to the server interrupt it might be safer to go back and choose ‘email’ as delivery option. The mail message provides a link to the data, which is actually stored at the server.↩︎
It is crucial to have one gene ID per row for submitting queries to retrieve-seq, because only the first word of each row is considered as a query. Note that any transcript isoform numbers .1, .2 should be removed↩︎
This program is generally relevant when analyzing sets containing a large number of sequences such as ChIP-seq peaks or genome-wide promoter sets.↩︎
Plant transcription databases are unfortunately still very fragmentary, so one might be tempted to check more complete collections such as footprintDB or JASPAR. However, the results should be interpreted with caution, because there is little conservation of cis-regulation between plants and other kingdoms of the tree of life. ↩︎
The option ‘Origin’ indicates the reference position relative to each sequence (start, center or end). When this option is set to ‘end’, the coordinates are computed relative to the end of the sequence, with negative values indicating upstream location. The option ‘Offset’ enables to shift the reference point by a given number. For the current example, setting the offset to -200 will give coordinates from -1000 to +200, the 0 corresponding to the TSS.-↩︎
Clearly, more than one random cluster should be evaluated, as suggested in Figure 1, where the results of up to 50 random groups are displayed next to the clusters of Yu et al. (2015).↩︎
Orthologues reported are annotated in Ensembl Compara, generated by a pipeline where maximum likelihood phylogenetic gene trees play a central role. These gene trees, reconciled with their species tree, have their internal nodes annotated to distinguish duplication or speciation events, and thus support the annotation of orthologous and paralogous genes, which can be part of complex one-to-many and many-to-many relations Ensembl Compara (2020)↩︎
This will permute the columns of the input PWMs producing matrices with different consensus. Column-permuted matrices are used as negative controls because they conserve the information content and nucleotide frequencies of the original motifs, but at the same time alter the sequence of nucleotides captured by the original motif, which is not recognized anymore.↩︎
‘Ncor’ is the relative width-normalized Pearson correlation of two PWMs aligned with matrix-scan. This normalized score prevents spurious matches that would cover only a subset of the aligned matrices (e.g. matches between the last column of the query matrix and the first column of the reference matrix, or matches of a very small motif against a large one).↩︎

Session 2: Analysis of non-coding sequences

Bruno Contreras Moreira, Najla Ksouri