In this session we will continue annotating sequences, but now with a focus on non-coding sequences. If we revisit the next figure, I hope you’ll see that non-coding sequences are usually the largest fraction of genomes (Mäkinen et al. 2015):
For instance, the genome of the model grass Brachypodium distachyon is about 270Mbp long, but its genes take only about 122Mbp. The difference increases as the genomes grow larger, as you can see in the next table:
annot.stats <- read.csv(file="test_data/Ensembl_stats.tsv", sep="\t", comment.char=";")
kable(annot.stats,format.args = list(big.mark=","))
genome | genome.size | gene.space | percentage |
---|---|---|---|
Arabidopsis thaliana | 119,667,750 | 77,640,474 | 64.9 |
Arabidopsis halleri | 196,243,198 | 67,469,378 | 34.4 |
Prunus dulcis | 227,498,357 | 94,209,302 | 41.4 |
Brachypodium distachyon | 271,163,419 | 122,468,185 | 45.2 |
Brassica rapa | 283,822,783 | 83,199,064 | 29.3 |
Trifolium pratense | 304,842,038 | 133,448,980 | 43.8 |
Arabis alpina | 308,032,609 | 45,780,039 | 14.9 |
Cucumis melo | 357,857,370 | 99,736,823 | 27.9 |
Citrullus lanatus | 365,450,462 | 81,219,268 | 22.2 |
Oryza sativa | 375,049,285 | 130,066,257 | 34.7 |
Setaria viridis | 395,731,502 | 123,250,596 | 31.1 |
Vitis vinifera | 486,265,422 | 153,535,016 | 31.6 |
Rosa chinensis | 515,588,973 | 117,296,819 | 22.8 |
Camelina sativa | 641,356,059 | 214,928,593 | 33.5 |
Malus domestica | 702,961,352 | 145,783,855 | 20.7 |
Olea europaea | 1,140,987,834 | 153,461,985 | 13.4 |
Zea mays | 2,135,083,061 | 168,014,276 | 7.9 |
Helianthus annuus | 3,027,844,945 | 199,817,746 | 6.6 |
Aegilops tauschii | 4,224,915,394 | 348,412,472 | 8.2 |
Triticum turgidum | 10,463,058,104 | 475,613,274 | 4.5 |
Therefore, when dealing with genome variation, polymorphisms have a greater chance to occur in non-coding regions. Among these, probably the most interesting are regulatory sequences and repeated elements.
The goal of this session is to learn how regulatory sequences can be discovered in promoter sequences using statistical tests and aligned to build DNA motifs.
The annotation of Transposable Elements (TEs) within plant genomes can help in the interpretation of observed phenotypes, as sometimes TEs affect the expression of neighbor genes, and in computational tasks such as promoter whole genome alignment, promoter or pan-genome analyses.
Usually TEs are annotated by alignment to curated libraries of repeated elements such as RepetDB (Amselem et al. 2019), where each sequence or element is classified according to the Wicker classification (Wicker et al. 2007). Class I elements are “copy and paste”, while Class II are “cut and paste”. The next figure summarizes this taxonomy of TEs, which ressembles that of protein domains:
More recently, there are other approaches which don’t require any curation; instead, these simply identify repeated elements by counting words (\(K\)-mers) along the genome. An example of such tools is the Repeat Detector (Girgis 2015), which has been used to routinely mask plant genomes (Contreras-Moreira et al. 2021).
The next table summarizes the fraction of repeated sequences in diverse plant genomes, as annotated with REdat, Red and the original papers describing those genomes:
annot.stats <- read.csv(file="test_data/Ensembl_repeats.tsv", sep="\t", comment.char=";")
annot.stats = annot.stats[,c(1,2,6,8,10)]
kable(annot.stats,format.args = list(big.mark=","))
genome | size | perc_REdat | perc_Red | perc_literature |
---|---|---|---|---|
Arabidopsis thaliana | 119,667,750 | 14.2 | 36.7 | 19.0 |
Arabidopsis halleri | 196,243,198 | 15.5 | 31.1 | 32.7 |
Prunus dulcis | 227,498,357 | 6.5 | 33.4 | 37.6 |
Brachypodium distachyon | 271,163,419 | 27.4 | 31.1 | 21.4 |
Brassica rapa | 283,822,783 | 8.6 | 32.8 | 32.3 |
Trifolium pratense | 304,842,038 | 11.0 | 30.0 | 41.8 |
Arabis alpina | 308,032,609 | 15.1 | 37.7 | 47.9 |
Cucumis melo | 357,857,370 | 8.2 | 39.9 | 44.0 |
Citrullus lanatus | 365,450,462 | 6.8 | 40.8 | 45.2 |
Oryza sativa | 375,049,285 | 32.3 | 37.0 | 35.0 |
Setaria viridis | 395,731,502 | 18.5 | 40.8 | 46.0 |
Vitis vinifera | 486,265,422 | 9.0 | 40.0 | 41.4 |
Rosa chinensis | 515,588,973 | 8.4 | 48.1 | 67.9 |
Camelina sativa | 641,356,059 | 15.8 | 36.0 | 28.0 |
Malus domestica | 702,961,352 | 9.1 | 41.6 | 59.5 |
Olea europaea | 1,140,989,389 | 18.0 | 45.2 | 43.0 |
Zea mays | 2,135,083,061 | 59.5 | 79.0 | 85.0 |
Helianthus annuus | 3,027,844,945 | 10.0 | 73.6 | 74.7 |
Aegilops tauschii | 4,224,915,394 | 68.7 | 81.8 | 85.9 |
Triticum turgidum | 10,463,058,104 | 71.7 | 82.2 | 82.2 |
When the repeats themsleves are not of interest for subsequent analyses, they are said to be masked out. This means that they are marked so that they can be avoided. Hard-masking means replacing the sequences of repeats with polyN oligonucleotides. Soft-masking means leaving the repeated sequences in lower-case.
The regulation of gene expression is one of the fundamental topics in Genetics. Here you will learn about transcription factors (TFs) and cis-regulatory elements (CREs). TFs are proteins that bind specifically to DNA sequences called CREs and affect the expression of nearby genes.
DNA-binding proteins contain DNA-binding domains and have a specific or general affinity for either single or double stranded DNA. Here we will concentrate mostly on transcription factors, which generally recognize cis-regulatory elements in double-stranded DNA molecules.
Transcription factors recognize target DNA sequences through a binding interface, composed of protein residues and DNA stretches in intimate contact. The best descriptions of protein-DNA interfaces are provided by structural biology, usually by X-ray or NMR experiments.
The process of recognition of DNA sequences by proteins involves readout mechanisms, and also stabilizing atomic interactions that do not confer specificity.
Direct readout
Indirect readout
Stabilizing interactions
Besides atomic interactions between protein and DNA, sequence-dependent deformability of duplexes, deduced from crystal complexes, implies that sequence recognition also involves DNA shape.
DNA deformation is described by the increase in energy brought about by instantaneous fluctuations of the step parameters from their equilibrium values:
\(deformation = \displaystyle\sum_{i=0}^6 \displaystyle\sum_{j=0}^6 spring_{ij} \Delta\theta_{i,st} \Delta\theta_{j,st}\) (Olson et al. (1998))
The accumulation of experimental and molecular dynamics data of DNA molecules currently supports predictive algorithms, such as DNAshape, which predict the geometry of DNA sequences:
Interfaces can be explored as generic bipartite graphs (Sathyapriya, Vijayabaskar, and Vishveshwara (2008)) or with sub-graphs that focus only on specific sequence recognition:
A great variety of DNA-binding proteins have been observed in nature, which can be analyzed and compared in terms of the features introduced above, such as readout, or instead with an evolutionary or topological perspective.
The Structural Classification of Proteins (SCOP) systematically groups protein folds in superfamilies, including DNA-binding proteins. The next table shows superfamilies with more than 20 non-redundant complexes in the Protein Data Bank as of Jun2023, as annotated in the database 3d-footprint:
SCOP superfamily | Number of complexes |
---|---|
Winged helix (WH) | 115 |
Homeodomain-like (H) | 79 |
Glucocorticoid-receptor-like (GR) | 40 |
Restriction endonuclease-like (RE) | 21 |
Homing endonuclease (HE) | 38 |
p53-like (P53) | 24 |
Lambda-repressor-like (LR) | 37 |
DNA-binding proteins can also be grouped in terms of the organisms in which they are found. For instance, the Plant-TFClass structural classification currently accepts 56 plant transcription factor types, which include up to 37 DNA-binding domains absent in mammals (Blanc-Mathieu et al. 2023).
The available experimental structures of protein-DNA complexes in the PDB support the annotation of interface residues, those involved directly in sequence recognition, within protein families. The database footprintDB does exactly this.
Several examples in the literature have demonstrated the correlation between interface patterns and the bound DNA motifs within large transcription factor families, such as the work of Noyes et al. (2008):
The study of interfaces must be done in the appropriate biological context, for instance considering the oligomerization state of TFs in vivo, as each family of transcription factors has singularities, such as these (compiled by Álvaro Sebastián):
Family | Motifs | Multimeric | Multidomain |
---|---|---|---|
Homeodomain | TAATkr,TGAyA | Sometimes | Unusual |
Basic helix-loop-helix (bHLH) | CACGTG,CAsshG | Always (homodimers, heterodimers) | Never |
Basic leucine zipper (bZIP) | CACGTG,-ACGT-,TGAGTC | Always (homodimers, heterodimers) | Never |
MYB | GkTwGkTr | Common (multimers) | Common |
High mobility group (HMG) | mTT(T)GwT,TTATC,ATTCA | Sometimes | Unusual |
GAGA | GAGA | Never | Never |
Fork head | TrTTTr | Unusual | Never |
Fungal Zn(2)-Cys(6) binuclear cluster | CGG | Common (homodimers) | Never |
Ets | GGAw | Common (homodimers, heterodimers, multimers) | Never |
Rel homology domain (RHD) | GGnnwTyCC | Always (homodimers, heterodimers) | Never |
Interferon regulatory factor | AAnnGAAA | Always (homodimers, heterodimers, multimers) | Never |
Here I briefly describe a structural alignment approach for the comparison of DNA-binding proteins and their interfaces, as discussed in Siggers, Silkov, and Honig (2005) and Sebastian and Contreras-Moreira (2013). In this context superpositions might guide the correct alignment of cis elements bound by homologous proteins, as illustrated in the figure.
DNA motifs inferred from protein-complexes can be analyzed in terms of their information content:
Now that we know how transcription factors recognize their target DNA sites, we will how learn how enrichment of DNA words in upstream gene promoter sequences can be computed in order to discover putative regulatory elements. We will follow the method of Helden, Andre, and Collado-Vides (1998).
Let’s have a look at the sample data:
# a set of upstream sequences of a group of co-regulated genes on Brachypodium distachyon
# see that they have been cut with coordinates -1000,+200 with respect to
# transcription start site (TSS)
head test_data/regulon.raw.fna
echo
# same sequences but repeat-masked
head test_data/regulon.rm.fna
echo
# frequency of hexamers in those raw upstream sequences computed with RSAT
# oligo-analysis -v 1 -seqtype dna -l 6 -return occ,freq -quick -str 2 -ovlp -sort
head -35 test_data/6nt_upstream_regulon.raw.tab | grep -v ";"
echo
# frequency of hexamers in the repeat-masked sequences
head -35 test_data/6nt_upstream_regulon.rm.tab | grep -v ";"
echo
# frequency of hexamers in all (rm) upstream sequences of that genome
head -35 test_data/6nt_upstream_genome.tab | grep -v ";"
>BRADI1G02630 Bradi1g02630; upstream from -1000 to +200; size: 1201; feature type:gene; location: Brachypodium_distachyon.v1.0.37:1:1782739:1783939:R; upstream neighbour: BRADI1G02640 (distance: 5670)
TAAGTATTATCTTTTCAAGGGAAGGAACCTCCTGGCCTTCGCATCCTCGGATGCACACAA
CCATGTTTTTATTTGTAAGAAAAATACCACACACCAAGATCGAACACATGAGTTTACAAA
AAAAAACAACAACTAGAGAACATCTTTCTAAAACTGGTCTCTCGGCATCCATCTTCTCCA
TAACCTACAAAGGTGACAATCTTACCCATGGTTAACAACCCCCATGGATTTTTGATTCAA
ATGGTATGGCATAAGTTGCTATTTTCGATTTTCGCCCACCAGTGGTAATCCATATGTAAA
ACCTATTAGTCATGAGTAGAGCAGGATGCAAGTTAAATTAATGGAAAAAGATTAGATGTG
ATGTGTTTATATGAGTTTAATTTGACTAAACCCTGTCTCTAGCTACCAATCCTCAATTCC
TCATAGCGTGTCGTCGTCTCTGGCATCAGAGCAAATTCAAATCCTGAGACGATTGTGACT
AAGGCCTCATTCGTTCCATAGGAATTTCACGGGAAACTCAAAGGATTCAGGATTCCATAA
>BRADI1G02630 Bradi1g02630; upstream from -1000 to +200; size: 1201; feature type:gene; location: Brachypodium_distachyon.v1.0.37:1:1782739:1783939:R; upstream neighbour: BRADI1G02640 (distance: 5670)
TAAGTATTATCTTTTCAAGGGAAGGAACCTCCTGGCCTTCGCATCCTCGGATGCACACAA
CCATGTTTTTATTTGTAAGAAAAATACCACACACCAAGATCGAACACATGAGTTTACNNN
NNNNNNCAACAACTAGAGAACATCTTTCTAAAACTGGTCTCTCGGCATCCATCTTCTCCA
TAACCTACAAAGGTGACAATCTTACCCATGGTTAACAACCCCCATGGATTTTTGATTCAA
ATGGTATGGCATAAGTTGCTATTTTCGATTTTCGCCCACCAGTGGTAATCCATATGTAAA
ACCTATTAGTCATGAGTAGAGCAGGATGCAAGTTAAATTAATGGAAAAAGATTAGATGTG
ATGTGTTTATATGAGTTTAATTTGACTAAACCCTGTCTCTAGCTACCAATCCTCAATTCC
TCATAGCGTGTCGTCGTCTCTGGCATCAGAGCAAATTCAAATCCTGAGACGATTGTGACT
AAGGCCTCATTCGTTCCATAGGAATTTCACGGGAAACTCAAAGGATTCAGGATTCCATAA
#seq id obs_freq occ
aaaaaa aaaaaa|tttttt 0.0044876667803 1712
cgccgc cgccgc|gcggcg 0.0024666439487 941
ccgccg ccgccg|cggcgg 0.0023172298094 884
gccgcc gccgcc|ggcggc 0.0023067446067 880
aaaaat aaaaat|attttt 0.0020288867336 774
#seq id obs_freq occ
cgccgc cgccgc|gcggcg 0.0019319011803 693
gccgcc gccgcc|ggcggc 0.0018761464565 673
ccgccg ccgccg|cggcgg 0.0018677832479 670
aagaaa aagaaa|tttctt 0.0016196747269 581
ctcctc ctcctc|gaggag 0.0015806464203 567
#seq id obs_freq occ
aaaaaa aaaaaa|tttttt 0.0048170011903 307261
aaaaac aaaaac|gttttt 0.0014555523659 92845
aaaaag aaaaag|cttttt 0.0016166515755 103121
aaaaat aaaaat|attttt 0.0023315959462 148725
aaaaca aaaaca|tgtttt 0.0015835569442 101010
Now let’s inspect the R code which will be used in this demo. We’ll start by checking the function that computes the significance of hexamers (words of length \(K=6\)):
#' @author Najla Nksouri <najlaksouri@gmail.com>
#' Revised by Jacques van Helden <Jacques.van-Helden@univ-amu.fr>
#' @description Compare the statistics between query and background k_mer counts
#' @param query.oligo.file file containing k-mer counts computed from the query sequences, as produced by RSAT oligo-analysis.
#' This file must contain
#' @param background.oligo.file file containing background k-mer occurrences
kmer.stats <- function(query.oligo.file,
background.oligo.file) {
## Check existence of input files
if (!file.exists(query.oligo.file)) {
stop("kmer.stats() error: query.oligo.file does not exist: ", query.oligo.file)
}
if (!file.exists(background.oligo.file)) {
stop("kmer.stats() error: background.oligo.file does not exist: ", ... = background.oligo.file)
}
## Read tables and rename first column
oligo.query.freq.table <-
read.table(query.oligo.file, sep = "\t", comment.char = ";", header = TRUE)
names(oligo.query.freq.table)[1] <- "seq"
oligo.genome.freq.table <-
read.table(background.oligo.file, sep = "\t", comment.char = ";", header = TRUE)
names(oligo.genome.freq.table)[1] <- "seq"
oligo.table <- merge(
oligo.query.freq.table,
oligo.genome.freq.table,
by=c("seq", "id"), suffixes=c("_query", "_background")
)
## Compute the p-value of the observed occurrences in query frequencies
nb.oligos <- nrow(oligo.table)
N <- sum(oligo.table$occ_query)
## Compute th number of occurrences expected in random background fragments
## of the same size as the query sequences.
oligo.table$exp.occ <- oligo.table$obs_freq_background * N
### Compute the nominal (unadjusted) p-value of each oligonucleotide
## Test the over-representation: righ-tailed test
oligo.table$pval.over <- pbinom(q = oligo.table$occ_query -1, size = N,
prob = oligo.table$obs_freq_background, lower.tail=FALSE)
## Compute log p-value to avoid the floating point limitation to 1e-320
oligo.table$log.pval.over <- pbinom(q = oligo.table$occ_query -1, size = N,
prob = oligo.table$obs_freq_background, lower.tail=FALSE, log=TRUE)
## Test under-representation: left-tailed test
oligo.table$pval.under <- pbinom(q = oligo.table$occ_query, size = N,
prob = oligo.table$obs_freq_background, lower.tail=TRUE)
oligo.table$log.pval.under <- pbinom(q = oligo.table$occ_query, size = N,
prob = oligo.table$obs_freq_background, lower.tail=TRUE, log=TRUE)
## Two-sided test
oligo.table$pval <- pmin(oligo.table$pval.under, oligo.table$pval.over) * 2
oligo.table$log.pval <- pmin(oligo.table$log.pval.under, oligo.table$log.pval.over) * 2
## Compute the E-value (Bonferroni multiple testing correction).`
## E-value indicates the expected number of false positives
## (i.e. how many oligos would be expected to be be declared positive under
## the null hypothesis)
oligo.table$eval <- oligo.table$pval * nb.oligos
## Compute the significance
## sig = -log10(eval)
oligo.table$sig <- -log10(nb.oligos) - oligo.table$log.pval/log(10)
## Compute maximal frequency for both files
max.axis <- max(c(oligo.table$obs_freq_query, oligo.table$obs_freq_background))
## Log2 ratio observed/expected
oligo.table$freq.ratio <- oligo.table$obs_freq_query/oligo.table$obs_freq_background
oligo.table$log2.ratio <- log2(oligo.table$freq.ratio)
return(oligo.table)
}
Now we shall look at the plotting function:
#' @title Scatter plot of background versus query k-mer frequencies.
#' @author Najla Nksouri <najlaksouri@gmail.com>
#' Revised by Jacques van Helden <Jacques.van-Helden@univ-amu.fr>
#' @description Draw a scatter plot comparing background and query k-mer frequencies for a given size.
#' @param kmer.stat.table a table with k-mer statistics produced by kmer.stats()
#' @param k oligonucleotide (k-mer) size
#' @param sig.threshold=2 Threshold on significance for the volcano plot.
#' @param volcano.ymax=350
#' @return a data.frame with merged query and background k-mer frequencies + derived statistics
#' @examples
#' ## Plot query versus background 4-mer frequencies
#' plot.kmers(k=8)
#'
plot.kmers <- function(kmer.stat.table,
k,
sig.threshold = 2,
log2FC.threshold = 1, ## Enrichment threshold
volcano.ymax = 350) {
message(k, "-mer frequency plot")
## Define dot colors
dot.colors <- ifelse(
(kmer.stat.table$sig > sig.threshold & kmer.stat.table$log2.ratio > 0.1), "darkred",
ifelse((kmer.stat.table$sig > sig.threshold & kmer.stat.table$log2.ratio < -0.1),
"darkblue", "darkgray"))
## Draw a scatter plot to compare k-mer frequencies
max.freq <- max(kmer.stat.table[, c("obs_freq_query","obs_freq_background")])
plot(x = kmer.stat.table$obs_freq_background,
y = kmer.stat.table$obs_freq_query,
## Set axis limits
xlim = c(0, max.freq),
ylim = c(0, max.freq),
## Set labels
xlab = paste("background ", k,"-mer frequencies", sep = ""),
ylab = paste("query ", k,"-mer frequencies", sep = ""),
main = paste("background vs query ", k,"-mer frequencies", sep = ""),
col = dot.colors,
## Insert grid
panel.first=grid(),
## Change symbol
pch=1
)
## Insert diagonal
abline(0, 1, col="black", lwd=1, lty = 2)
## MA plot : log-ratio as a function of the log counts
## Note: usually, the MA plot represent the log-ratio versus log(mean) of the two measurements.
## However, in our case the query sequences are a subset of the background sequences.
## We will thus plot the log(background freq) on the X axis
plot(x = log2(kmer.stat.table$obs_freq_background),
y = kmer.stat.table$log2.ratio,
## Set labels
xlab = paste("log2(background)"),
ylab = paste("log2(query/background)"),
main = paste(k,"-mer frequencies", sep = ""),
col = dot.colors,
## Insert grid
panel.first=grid(),
## Change symbol
pch=1
)
## Insert diagonal
abline(h=0, col="black", lwd=1, lty = 2)
## Draw a volcano plot
##
## Abcsissa indicates the effect,i.e. the log2-ratio query versus background.
## Ordinate indicates the significance as computed above.
##
## Mark in red the over-represented k-mers with sig > given threshold.
## and in blue the under-represented k-mers with sig > given threshold.
## Set axis limit
x.lim <- abs(max(kmer.stat.table$log2.ratio))
## Plotting
with(kmer.stat.table,plot(x=log2.ratio,
y=sig,
pch = 1,
xlim = c(-x.lim, x.lim),
ylim = c (0, volcano.ymax),
## Set labels
xlab = paste("Log2.ratio ", k,"-mer frequencies", sep = ""),
ylab = paste("Significance ", k,"-mer frequencies", sep = ""),
main = paste("Volcano Plot ", k,"-mer frequencies", sep = ""),
# Insert grid
panel.first=grid(),
#set the colors
col = dot.colors))
## Denote significance exceeding the threshold by
above.ymax <- kmer.stat.table$sig > volcano.ymax
with(kmer.stat.table[above.ymax,],
points(x=log2.ratio, y=rep(x = volcano.ymax, length.out=sum(above.ymax)), pch=17, col="red",type="p"))
## insert diagonals
abline(h=sig.threshold, col= "black", lty = 2)
abline(v=0, col= "black", lty = 2)
abline(v=-log2FC.threshold, col= "black", lty = 2)
abline(v=log2FC.threshold, col= "black", lty = 2)
}
With this data and functions we can now compute \(k\)-mer enrichment in the promoters of a sample set of co-regulated genes or regulon:
# adjust as required
dir_data = "test_data/" # where pre-computed k-mer freq files are
dir_code = "test_code/"
source(file.path(dir_code, "kmer_stats.R"))
source(file.path(dir_code, "plot_kmers.R"))
# set word/k-mer size
k = 6
# set files with pre-computed words/k-mer frequencies
# raw sequences
#oligo.upstream.freq.file = file.path(dir_data,"6nt_upstream_regulon.raw.tab")
# recommended repeat-masked sequences
oligo.upstream.freq.file = file.path(dir_data,"6nt_upstream_regulon.rm.tab")
oligo.genome.freq.file = file.path(dir_data, "6nt_upstream_genome.tab")
# check the code of this function to see how the probabilities of word occurrences
# are computed with the binomial function
kmer.stat.table <- kmer.stats(query.oligo.file = oligo.upstream.freq.file,
background.oligo.file = oligo.genome.freq.file)
#View(kmer.stat.table)
plot.kmers(kmer.stat.table, k, volcano.ymax = 60)
The plant-dedicated mirror of the Regulatory Sequence Analysis Tools (RSAT, http://plants.rsat.eu) offers specialized options for researchers dealing with plant transcriptional regulation. The web site contains whole-sequenced species regularly updated from Ensembl Plants and other sources, and supports an array of tasks frequently required for the analysis of regulatory sequences, such as retrieving upstream sequences, motif discovery, motif comparison, pattern matching. RSAT::Plants also integrates the footprintDB collection of DNA motifs. This exercise consists on following a step-by-step protocol on how to discover DNA motifs in regulatory regions of clusters of co-expressed genes in plants. It also explains how to empirically control the significance of the result, and how to annotate any discovered motis.
This protocol has been published at (Contreras-Moreira et al. 2016) and updated at (Ksouri et al. 2021). Other related protocols can be found at (Castro-Mondragon et al. 2016; Contreras-Moreira and Sebastian 2016).
Transcriptome data (microarrays, RNA-seq) have been extensively used as a proxy for genetic regulation in many organisms, as the analysis of genome-wide profiles of gene transcription under different treatments uncovers clusters of genes with correlated behaviors, which may result from direct or indirect co-regulation. A classical application of this approach was done by Beer and Tavazoie (2004) with yeast microarray data sets obtained in a variety of experimental conditions. In that experiment, expression data-mining was demonstrated to be an effective strategy for finding regulons, groups of genes that share regulatory mechanisms and functional annotations.
Other studies have unveiled that the outcome of these approaches largely depends on the genomic background of the species under study. For instance, Sand O. and Helden J. (2008) reported that the significance of DNA motifs discovered in Saccharomyces cerevisiae promoters is much higher for regulons than for random gene sets of the same sizes, but for human promoters the signal-to-noise ratio is almost null, because random gene sets give highly significant motifs due to heterogeneities in promoter compositions and biases due to repetitive elements. For metazoans, it is thus a real challenge to distinguish bona fide motifs from noise (Sand O. and Helden J. 2008). These observations suggest that motif discovery on sequence clusters faces intrinsic properties of the genomes under study, regardless of the software used for the task.
Among plants, these strategies have so far been tested on the model Arabidopsis thaliana, and they have been successfully applied to the identification of novel cis-regulatory elements validated with synthetic promoters (Koschmann et al. 2012). Yet, with the exception of this model, these sorts of experiments have not been possible in other plants until recently. In spite of this, the growing list of available plant genomes encourages these analyses in combination with expression profiles obtained from either microarray or RNA-seq data sets, as in the work of Yu et al. (2015), provided that these factors are considered:
Plant genomes are rich in repetitive elements (RE) distributed along the genome (Schmidt and Heslop-Harrison 1998), which pose particular problems for motif discovery statistics (violation of the independence assumption).
Current genome assemblies range from 120Mb (A. thaliana) to 13,420Mb (Triticum aestivum). Brachypodium distachyon, a model species for grasses, is 272Mb. The quality of these assemblies and their RE content is also quite variable, as shown in Table 1.
Upstream regions, defined by annotated gene coordinates, are also of variable length, going from 1,151b on average in A. thaliana to 2,820b in Triticum_turgidum (see Table 1). Work by Ksouri et al. (2021) has shown that the choice of upstream region length impacts on the discovery of regulatory sequences.
Table 1. Features of some plant genomes in RSAT::Plants, taken from https://plants.rsat.eu/data/stats. Each ID concatenates the organism, the assembly version and the source. Most genome IDs add to the end the Ensembl Plants release number. For instance, Arabidopsis_thaliana.TAIR10.55, corresponds to A.thaliana assembly 10 from TAIR https://www.arabidopsis.org, annotated in release 55 of Ensembl Plants. “%Ns” are stretches of uncharacterized nucleotides which often connect assembled sequence contigs. “%repeat-masked” segments are sequences with significant similarity to plant repetitive DNA sequences, which are masked.
genome.stats <- read.delim(file="test_data/summary.tab", row.names = NULL, comment.char=";")
genome.stats[,2] = as.numeric(genome.stats[,2])/1e6 ## Convert genome sizes to Mb
genome.stats[,5] = as.numeric(genome.stats[,5]-genome.stats[,4]) ## raw %mask also includes %N
names(genome.stats) <- c("Organism / assembly ID", "Genome size (Mb)", "Contigs", "%N", "% repeat-masked", "Gene models", "Mean upstream length")
# names(genome.stats)
## Print the table
kable(genome.stats,format.args = list(big.mark=","),
digits=c(0,0,0,1,1,0,0))
Organism / assembly ID | Genome size (Mb) | Contigs | %N | % repeat-masked | Gene models | Mean upstream length |
---|---|---|---|---|---|---|
Aegilops_tauschii.Aetv4.0.55 | 4,225 | 109,583 | 2.3 | 82.2 | 43,362 | 2,533 |
Amborella_trichopoda.AMTR1.0.55 | 706 | 5,745 | 5.4 | 54.9 | 28,353 | 2,682 |
Ananas_comosus.F153.55 | 316 | 25 | 1.5 | 16.2 | 26,730 | 2,367 |
Arabidopsis_halleri.Ahal2.2.55 | 196 | 2,239 | 14.8 | 24.0 | 33,223 | 1,758 |
Arabidopsis_lyrata.v.1.0.55 | 207 | 695 | 11.1 | 32.9 | 33,500 | 1,799 |
Arabidopsis_thaliana.Columbia.GCF_000001735.3_TAIR10.NCBIRefseq | 120 | 7 | 0.2 | 21.8 | 38,161 | 638 |
Arabidopsis_thaliana.TAIR10.55 | 120 | 7 | 0.2 | 23.4 | 34,262 | 1,154 |
Beta_vulgaris.RefBeet-1.2.2.55 | 566 | 40,245 | 8.6 | 18.0 | 28,031 | 2,260 |
Brachypodium_distachyon.Brachypodium_distachyon_v3.0.55 | 271 | 10 | 0.2 | 4.2 | 34,310 | 1,971 |
Brachypodium_hybridum.463.v1.1.JGI | 509 | 15 | 1.2 | 17.8 | 70,160 | 2,052 |
Brachypodium_stacei.316.v1.1.JGI | 234 | 112 | 1.1 | 13.9 | 29,898 | 2,139 |
Brachypodium_sylvaticum.490.v1.1.JGI | 358 | 629 | 0.0 | 41.6 | 36,927 | 2,070 |
Brassica_juncea.ASM1870372v1.55 | 934 | 1,544 | 4.8 | 50.1 | 92,887 | 2,045 |
Brassica_napus.AST_PRJEB5043_v1.55 | 848 | 20,899 | 13.0 | 20.1 | 105,276 | 1,862 |
Brassica_oleracea.BOL.55 | 489 | 32,928 | 8.8 | 21.7 | 60,694 | 2,195 |
Brassica_rapa_ro18.SCU_BraROA_2.3.55 | 346 | 295 | 0.4 | 52.4 | 43,129 | 2,263 |
Cannabis_sativa.cs10.GCF_900626175.2.NCBI | 876 | 221 | 15.9 | 46.9 | 29,813 | 2,528 |
Capsicum_annuum.ASM51225v2.55 | 3,064 | 35,797 | 3.2 | 8.0 | 35,845 | 2,683 |
Chenopodium_quinoa.PI614886.392.v1.JGI | 1,385 | 3,486 | 4.5 | 63.2 | 44,776 | 2,591 |
Chlamydomonas_reinhardtii.Chlamydomonas_reinhardtii_v5.5.55 | 111 | 53 | 3.6 | 7.1 | 17,743 | 536 |
Chondrus_crispus.ASM35022v2.55 | 105 | 926 | 0.9 | 68.4 | 9,857 | 674 |
Cicer_arietinum.ASM33114v1.GCF_000331145.1.RefSeq | 531 | 7,127 | 9.1 | 42.4 | 28,772 | 2,484 |
Citrus_clementina.Citrus_clementina_v1.0.55 | 301 | 1,398 | 2.1 | 21.0 | 27,179 | 2,338 |
Coffea_canephora.AUK_PRJEB4211_v1.55 | 569 | 13,007 | 17.1 | 6.1 | 25,574 | 2,360 |
Corchorus_capsularis.CCACVL1_1.0.55 | 317 | 16,522 | 0.0 | 46.5 | 31,549 | 2,070 |
Corylus_avellana.CavTom2PMs-1.0.55 | 370 | 11 | 0.5 | 48.2 | 27,270 | 2,445 |
Cucumis_sativus.ASM407v2.55 | 194 | 186 | 1.8 | 13.8 | 24,079 | 2,218 |
Cyanidioschyzon_merolae.ASM9120v1.55 | 17 | 22 | 0.0 | 6.7 | 5,021 | 816 |
Cynara_cardunculus.CcrdV1.55 | 725 | 13,588 | 9.7 | 8.7 | 26,505 | 2,595 |
Daucus_carota.ASM162521v1.55 | 422 | 4,826 | 8.2 | 4.6 | 33,488 | 2,552 |
Digitaria_exilis.Fonio_CM05836.55 | 716 | 19 | 2.8 | 40.1 | 59,844 | 2,329 |
Dioscorea_rotundata.TDr96_F1_Pseudo_Chromosome_v1.0.55 | 457 | 21 | 11.1 | 17.3 | 19,565 | 2,441 |
Echinochloa_crusgalli.ec_v3.55 | 1,341 | 2,058 | 0.1 | 53.1 | 103,850 | 2,404 |
Eragrostis_tef.ASM97063v1.55 | 607 | 13,883 | 18.3 | 3.5 | 41,920 | 2,321 |
Ficus_carica.UNIPI_FiCari_1.0.55 | 333 | 511 | 0.0 | 47.2 | 37,832 | 2,275 |
Fragaria_vesca.Hawaii4.Fvb2.0.GDR | 212 | 8 | 7.0 | 20.6 | 33,538 | 1,822 |
Galdieria_sulphuraria.ASM34128v1.55 | 14 | 433 | 2.1 | 2.7 | 6,736 | 215 |
Glycine_max.Glycine_max_v2.1.55 | 978 | 1,190 | 2.4 | 48.9 | 57,147 | 2,521 |
Gossypium_raimondii.Graimondii2_0.55 | 761 | 1,033 | 1.8 | 57.6 | 38,719 | 2,492 |
Helianthus_annuus.HanXRQr2.0-SUNRISE.55 | 3,010 | 332 | 2.1 | 81.3 | 83,177 | 2,501 |
Hordeum_vulgare.IBSCv2.55 | 4,834 | 10 | 5.4 | 52.3 | 42,858 | 2,636 |
Hordeum_vulgare.MorexV3_pseudomolecules_assembly.55 | 4,226 | 290 | 0.0 | 85.3 | 35,826 | 2,779 |
Ipomoea_triloba.ASM357664v1.55 | 462 | 16 | 5.3 | 19.2 | 31,418 | 2,405 |
Klebsormidium_nitens.NIES2285.171026.v11.TITECH | 104 | 1,814 | 1.0 | 0.0 | 17,153 | 865 |
Lactuca_sativa.Lsat_Salinas_v7.55 | 2,391 | 8,323 | 7.6 | 68.5 | 38,693 | 2,784 |
Leersia_perrieri.Lperr_V1.4.55 | 267 | 12 | 0.4 | 32.2 | 31,516 | 2,211 |
Lolium_perenne.MPB_Lper_Kyuss_1697.55 | 2,277 | 1,677 | 0.0 | 79.0 | 38,868 | 2,706 |
Lupinus_albus.CNRS_Lalb_10.GCA_009771035.1.NCBI | 451 | 89 | 0.5 | 44.8 | 41,359 | 2,086 |
Lupinus_albus.La_Amiga31.GCA_010261695.1.NCBI | 559 | 1,580 | 0.3 | 42.8 | 48,718 | 2,203 |
Lupinus_angustifolius.LupAngTanjil_v1.0.55 | 609 | 13,573 | 0.7 | 24.0 | 35,170 | 2,629 |
Manihot_esculenta.Manihotesculentav6.55 | 582 | 2,019 | 14.9 | 12.2 | 33,516 | 2,363 |
Marchantia_polymorpha.Marchanta_polymorpha_v1.55 | 226 | 2,957 | 6.7 | 4.2 | 19,287 | 2,174 |
Medicago_truncatula.MedtrA17_4.0.55 | 413 | 2,186 | 5.5 | 33.9 | 52,918 | 2,298 |
Musa_acuminata.ASM31385v1.55 | 473 | 12 | 17.4 | 15.8 | 38,658 | 1,968 |
Nicotiana_attenuata.NIATTr2.55 | 1,492 | 35,134 | 11.7 | 24.4 | 21,344 | 2,765 |
Nicotiana_tabacum.TN90.GCF_000715135.1.NCBI | 3,643 | 168,247 | 2.9 | 51.4 | 74,273 | 2,432 |
Olea_europaea.OLEA9.55 | 1,316 | 9,751 | 4.1 | 54.3 | 55,595 | 2,563 |
Olea_europaea.sylvestris.451.v1.JGI | 1,142 | 41,256 | 9.7 | 56.3 | 50,684 | 2,695 |
Oryza_barthii.O.barthii_v1.55 | 308 | 12 | 0.8 | 43.9 | 37,849 | 2,217 |
Oryza_brachyantha.Oryza_brachyantha.v1.4b.55 | 261 | 7,485 | 6.7 | 28.4 | 34,598 | 2,246 |
Oryza_glaberrima.Oryza_glaberrima_V1.55 | 316 | 1,951 | 4.1 | 40.3 | 74,579 | 2,028 |
Oryza_glumipatula.Oryza_glumaepatula_v1.5.55 | 373 | 12 | 19.2 | 35.1 | 39,702 | 2,165 |
Oryza_indica.ASM465v1.55 | 427 | 10,490 | 3.8 | 44.2 | 89,723 | 2,005 |
Oryza_longistaminata.O_longistaminata_v1.0.55 | 326 | 60,198 | 9.8 | 31.0 | 32,868 | 2,159 |
Oryza_meridionalis.Oryza_meridionalis_v1.3.55 | 336 | 12 | 23.8 | 29.9 | 32,549 | 2,204 |
Oryza_nivara.Oryza_nivara_v1.0.55 | 338 | 12 | 8.3 | 40.5 | 38,741 | 2,179 |
Oryza_punctata.Oryza_punctata_v1.2.55 | 394 | 12 | 0.4 | 53.1 | 37,791 | 2,218 |
Oryza_rufipogon.OR_W1943.55 | 338 | 12 | 1.9 | 46.1 | 40,584 | 2,171 |
Oryza_sativa.IRGSP-1.0.55 | 375 | 63 | 0.0 | 51.0 | 38,978 | 2,204 |
Ostreococcus_lucimarinus.ASM9206v1.55 | 13 | 21 | 0.0 | 15.0 | 7,664 | 521 |
Panicum_hallii_fil2.PHallii_v3.1.55 | 536 | 291 | 1.4 | 5.1 | 33,805 | 2,027 |
Phaseolus_vulgaris.PhaVulg1_0.55 | 521 | 708 | 9.3 | 20.1 | 28,472 | 2,596 |
Physcomitrium_patens.PhypaV3.55 | 472 | 357 | 1.3 | 58.6 | 32,674 | 2,044 |
Populus_trichocarpa.Pop_tri_v3.55 | 434 | 1,446 | 2.6 | 5.5 | 41,335 | 2,451 |
Prunus_avium.PAV_r1.0.55 | 272 | 10,148 | 9.4 | 16.0 | 43,349 | 1,868 |
Prunus_avium.Satonishiki.PAV_r1.0.GDR | 374 | 9 | 34.0 | 18.9 | 43,673 | 2,094 |
Prunus_mume.BJFU1210120008.GCF_000346735.1.NCBIRefseq | 234 | 8,164 | 7.2 | 22.5 | 26,524 | 2,047 |
Prunus_persica.Prunus_persica_NCBIv2.38 | 227 | 191 | 1.2 | 21.3 | 27,107 | 1,984 |
Prunus_persica.Prunus_persica_NCBIv2.55 | 227 | 191 | 1.2 | 19.2 | 27,107 | 1,984 |
Prunus_persica.Prupe1_0.37 | 227 | 202 | 1.2 | 16.1 | 28,898 | 2,217 |
Saccharum_spontaneum.Sspon.HiC_chr_asm.55 | 2,900 | 32 | 0.3 | 36.6 | 83,815 | 2,706 |
Selaginella_moellendorffii.v1.0.55 | 213 | 759 | 1.9 | 37.1 | 36,200 | 1,436 |
Setaria_italica.Setaria_italica_v2.0.55 | 406 | 336 | 1.2 | 2.8 | 35,831 | 2,300 |
Solanum_lycopersicum.SL3.0.55 | 828 | 3,148 | 9.8 | 51.8 | 35,092 | 2,300 |
Solanum_tuberosum.DM.v4.03.PGSC | 725 | 12 | 12.5 | 0.0 | 39,028 | 2,401 |
Solanum_tuberosum.SolTub_3.0.55 | 811 | 13 | 15.8 | 49.3 | 40,430 | 2,527 |
Sorghum_bicolor.Sorghum_bicolor_NCBIv3.55 | 709 | 867 | 4.7 | 64.7 | 35,479 | 2,276 |
Trifolium_pratense.Trpr.55 | 305 | 8,595 | 11.7 | 19.0 | 41,184 | 1,849 |
Triticum_aestivum.IWGSC.55 | 14,550 | 22 | 1.9 | 85.1 | 120,744 | 2,605 |
Triticum_dicoccoides.WEWSeqv.1.0.55 | 10,080 | 14 | 1.5 | 86.1 | 67,375 | 2,656 |
Triticum_turgidum.svevo.55 | 10,460 | 15 | 1.5 | 4.9 | 66,559 | 2,820 |
Triticum_urartu.Tu2.0.55 | 4,852 | 10,284 | 0.6 | 89.7 | 41,492 | 2,613 |
Vigna_angularis.Vigan1.1.55 | 467 | 37,373 | 3.6 | 13.5 | 34,819 | 2,719 |
Vigna_radiata.Vradiata_ver6.55 | 463 | 2,497 | 7.2 | 20.2 | 23,164 | 2,643 |
Vigna_unguiculata.ASM411807v1.55 | 519 | 682 | 0.5 | 51.0 | 31,814 | 2,496 |
Vitis_vinifera.PN40024.v4.55 | 476 | 22 | 0.4 | 57.2 | 35,134 | 2,266 |
Zea_mays.Zm-B73-REFERENCE-NAM-5.0.55 | 2,182 | 685 | 0.2 | 84.5 | 44,303 | 2,502 |
The step-by-step protocol below has been tested for the task of discovering and annotating DNA motifs in clusters of upstream sequences for species supported by RSAT::Plants, which have been obtained mostly from Ensembl Plants (https://plants.ensembl.org), but also include data from the JGI Genome Portal (http://genome.jgi.doe.gov) and other sources. In addition, RSAT::Plants integrates footprintDB (https://footprintdb.eead.csic.es), a collection of position-specific scoring matrices (PSSM) representing transcription factor binding motifs (TFBM), as well as their cognate binding proteins (Sebastian and Contreras Moreira 2014), which can be used to annotate discovered motifs and to predict potentially binding transcription factors.
Discovering regulatory elements within natural genomic sequences is certainly an important scientific goal on its own, but can also be part of the design and validation of synthetic promoters. We envisage at least two applications in this context:
The characterization of promoters of genes with known expression properties, which can then be used to engineer the expression of genes of interest.
The validation of engineered promoters in order to make sure that they contain the expected regulatory elements which might be natural or engineered depending on the application.
This protocol requires disposing of:
A computer with a Web browser
A set of gene clusters from any of the species currently supported at RSAT::Plants ( http://plants.rsat.eu, see Note 1). Here we will use three example clusters of co-expressed maize genes, shown on Table 2 (see Note 2). First, they discovered potential regulatory motifs within their upstream sequences, and then they performed electrophoretic mobility shift assays (EMSA) to confirm them. Table 2 shows three of those clusters which are used in this protocol. For each cluster a list of gene identifiers is given next to the EMSA-confirmed motifs. The remaining nine clusters were left out for being too small, as the statistical approaches in this protocol require at least 10-15 genes.
The following protocol enumerates the steps required to discover DNA motifs, based on the over-representation of \(k\)-mers (oligonucleotides) and dyads (spaced pairs of oligonucleotides), in clusters of upstream sequences. The protocol comprises two stages, analyzing first co-expressed genes and then random clusters as a negative control (see Note 3). Only after both stages have been completed it is possible to objectively estimate the relevance of the results.
Before the proper analysis of the gene cluster, we will retrieve the promoter sequences of all the genes of the organism of interest, which will serve below to estimate the background model. Table 2 shows three clusters of maize (Zea mays) genes used along the protocol, extracted from the published work of Yu et al. (2015). The original gene identifiers have been translated to current gene models in the Zm-B73-REFERENCE-NAM-5.0.55 annotation using maizeGDB (Woodhouse et al. 2021).
Experimentally verified regulatory motifs of these clusters are indicated:
Cluster name | Confirmed motif | #sequences | Original gene IDs | Current gene IDs |
---|---|---|---|---|
ABI4 | GCGCRSGCGGSC | 16 | GRMZM2G025062 GRMZM2G053503 GRMZM2G069082 GRMZM2G069126 GRMZM2G069146 GRMZM2G076896 GRMZM2G081892 GRMZM2G124011 GRMZM2G129674 GRMZM2G142179 GRMZM2G169654 GRMZM2G172936 GRMZM2G173771 GRMZM2G174347 GRMZM2G175525 GRMZM2G421033 |
Zm00001eb330490 Zm00001eb340140 Zm00001eb318890 Zm00001eb318900 Zm00001eb318910 Zm00001eb202570 Zm00001eb335190 Zm00001eb335200 Zm00001eb103110 Zm00001eb006160 Zm00001eb119540 Zm00001eb156040 Zm00001eb074730 Zm00001eb193670 Zm00001eb352580 Zm00001eb422470 Zm00001eb432090 |
E2F | TTCCCGCCA | 18 | AC197146.3_FG001 GRMZM2G017081 GRMZM2G021069 GRMZM2G037700 GRMZM2G057571 GRMZM2G062333 GRMZM2G065205 GRMZM2G066101 GRMZM2G075978 GRMZM2G100639 GRMZM2G112074 GRMZM2G117238 GRMZM2G130351 GRMZM2G139894 GRMZM2G154267 GRMZM2G162445 GRMZM2G327032 GRMZM2G450055 |
Zm00001eb405460
Zm00001eb227730 Zm00001eb350340 Zm00001eb228900 Zm00001eb284360 Zm00001eb028400 Zm00001eb162700 Zm00001eb257900 Zm00001eb290510 Zm00001eb137000 Zm00001eb248940 Zm00001eb421110 Zm00001eb169600 Zm00001eb169610 Zm00001eb037700 Zm00001eb351990 Zm00001eb342080 Zm00001eb192270 Zm00001eb192280 Zm00001eb243240 Zm00001eb243250 |
WRI1 | CGGCGGCGS | 56 | AC210013.4_FG019 GRMZM2G008430 GRMZM2G009968 GRMZM2G010435 GRMZM2G010599 GRMZM2G014444 GRMZM2G015097 GRMZM2G017966 GRMZM2G022019 GRMZM2G027232 GRMZM2G028110 GRMZM2G035017 GRMZM2G041238 GRMZM2G045818 GRMZM2G047727 GRMZM2G048703 GRMZM2G064807 GRMZM2G068745 GRMZM2G074300 GRMZM2G076435 GRMZM2G078779 GRMZM2G078985 GRMZM2G080608 GRMZM2G092663 GRMZM2G096165 GRMZM2G098957 GRMZM2G107336 GRMZM2G108348 GRMZM2G111987 GRMZM2G115265 GRMZM2G119865 GRMZM2G122871 GRMZM2G126603 GRMZM2G126928 GRMZM2G132095 GRMZM2G140799 GRMZM2G148744 GRMZM2G150434 GRMZM2G151252 GRMZM2G152599 GRMZM2G170262 GRMZM2G181336 GRMZM2G311914 GRMZM2G312521 GRMZM2G322413 GRMZM2G325606 GRMZM2G343543 GRMZM2G353785 GRMZM2G409407 GRMZM2G439201 GRMZM5G823135 GRMZM5G827266 GRMZM5G831142 GRMZM5G835323 GRMZM5G870606 GRMZM5G882378 |
Zm00001eb212470 Zm00001eb347390 Zm00001eb351870 Zm00001eb066830 Zm00001eb420570 Zm00001eb238930 Zm00001eb031300 Zm00001eb267190 Zm00001eb222060 Zm00001eb222210 Zm00001eb283120 Zm00001eb063980 Zm00001eb335620 Zm00001eb342930 Zm00001eb095960 Zm00001eb011490 Zm00001eb174060 Zm00001eb405750 Zm00001eb172130 Zm00001eb166570 Zm00001eb316270 Zm00001eb285610 Zm00001eb061760 Zm00001eb212180 Zm00001eb134780 Zm00001eb037330 Zm00001eb138230 Zm00001eb235710 Zm00001eb271520 Zm00001eb277320 Zm00001eb160070 Zm00001eb359780 Zm00001eb235710 Zm00001eb399010 Zm00001eb036530 Zm00001eb354150 Zm00001eb361700 Zm00001eb317180 Zm00001eb173470 Zm00001eb192860 Zm00001eb205850 Zm00001eb192680 Zm00001eb330480 Zm00001eb372580 Zm00001eb276860 Zm00001eb320920 Zm00001eb282960 Zm00001eb329960 Zm00001eb227700 Zm00001eb227710 Zm00001eb352030 Zm00001eb158110 Zm00001eb323110 Zm00001eb225190 Zm00001eb225200 Zm00001eb121240 |
Open a connection to the RSAT::Plants server. It can be reached at http://plants.rsat.eu. On the left-side menu, select ‘Sequence tools -> retrieve sequence’.
Choose ‘Single organism -> Zea mays’ for the examples of this protocol (see Note 1), which corresponds to ‘Zea_mays.Zm-B73-REFERENCE-NAM-5.0.55’ in Ensembl Plants release 55 (Yates et al. 2022).
Choose ‘Genes -> all’; this will retrieve all upstream sequences of the maize genome.
Set appropriate upstream bounds. To replicate the work of Yu et al. (2015) these should be set to ‘From’ -1000 ‘To’ +200, with position 0 corresponding to transcriptional start sites (TSS); see Note 4. Beware that TSS positions in plant genomes often correspond to start codons, probably due to incomplete annotations.
We recommend to tick the option ‘Mask repeats’, as plant genomes are frequently repeat-rich. This option should NOT be used if you suspect the transcription factors of interest bind to repeated sequences.
Press ‘GO’ and wait until the retrieve-seq result page is displayed (see Note 5). The results include the executed command and a URL to the ‘sequences’ file, which must be saved. We will refer to this URL as ‘all.fasta.URL’. This FASTA-format file can also be stored as a local file on your computer, but note it can be rather large and might fail to upload later on if its size is larger than the server’s limit.
Press ‘GO’ and wait until the retrieve-seq result page is displayed (see Note 4).
We will now retrieve the upstream sequences of a cluster of co-expressed genes, and use peak-motifs to discover exceptional motifs in their promoters. The tool peak-motifs was initially conceived to discover motifs in ChIP-seq peaks, but it can also be used to analyze other sequence types, as illustrated here.
Choose cluster E2F from Table 2, copy the current gene IDs (last column). The IDs should be saved in a new text file, that you will store on your computer. Insert newline characters between genes6.
In the left menu of the RSAT server, click on ‘retrieve sequence’ to get a fresh form. Make sure that the option ‘Genes -> selection’ is activated and that the right organism, in this case ‘Zea_mays.Zm-B73-REFERENCE-NAM-5.0.55’, is selected. Tick ‘Mask repeats’, and set the same size limits as for the whole collection of promoters: from -1000 to +200. Paste the list of IDs of your gene cluster (one gene ID per row).
Press ‘GO’ and wait a few seconds until the result page is displayed. Inspection of these sequences might reveal N-masked sequence stretches, which correspond to annotated repeats. Save both ‘query genes’ and ‘sequences’ files to local files on your computer, we will refer to them as ‘cluster.genes’ and ‘cluster.fasta’ later on this protocol.
Press the ‘peak-motifs’ button. The peak sequences section is automatically filled with a link to the selected cluster sequences.
Add a title for this job, such as ‘E2F cluster’.
On the right side of ‘Peak sequences’, under Control sequences, paste the ‘all.fasta.URL’ on the ‘URL of a sequence file available on a Web server’ entry.
Click on ‘Reduce peak sequences’ and leave both fields (‘number of top sequences to retain’ and ‘cut peak sequences’) blank to avoid having the sequences clipped.
Click on ‘Motif discovery parameters’. Select two algorithms: ‘Discover over-represented words’ (oligo-analysis) and ‘Discover over-represented spaced word pairs’ (dyad-analysis). Uncheck the program position-analysis (see Note 7).
Click on ‘Compare discovered motifs with databases’ and select appropriate databases which will be used to annotate any found motifs. For plant promoters, we recommend to check ‘footprintDB-plants’, but you can also check other databases such as ‘JASPAR plants’ (see Note 8).
You can also upload your own collection of DNA motifs in TRANSFAC format.
Click on ‘Reporting Options’. Set ‘Origin’ to ‘end’ and ‘Offset’ to -200 (see Note 9).
Select output delivery and press ‘GO’.
After few seconds the server should have uploaded the sequences and display a page with the URL of the future result page. You can already click on this link: the result page will be periodically updated to show the progress of the analysis. At the end of the processing, a box will appear at the top of the result page, with a short summary of the discovered motifs, and links to different sections of the results.
Once the job is complete click on the link [Download all results (peak-motifs_archive.zip)] to save the results on your computer. You will later be able to uncompress this archive in order to check the result after its removal from the server (results are only available on the server for 7 days after job completion). We also recommend downloading the full set of discovered motifs, by clicking on the link [Download all matrices (transfac format)] and saving a local file named ‘cluster.motifs.tf’. This file contains all motifs in the form of position-weight matrices (PWMs) in TRANSFAC format.
On the result page, the section entitled ‘Discovered motifs (with motif comparison)’ lists the discovered motifs, displays their sequence logos and their distribution along clustered sequences, in addition to top matches with the motif databases selected on step 9.
In this section, we propose a procedure to obtain an empirical estimation of the rate of false positives, by discovering motifs in the promoters of genes picked up at random.
On the left-side menu of RSAT::Plants select ‘Build control sets -> random gene selection’.
Choose ‘Organism -> Zea_mays.Zm-B73-REFERENCE-NAM-5.0.55’ for the examples of this protocol.
Set ‘Number of genes’ to the size of one of the sample clusters on Table 2. For instance, the size of the negative control sets would be 18 for cluster E2F, 16 for cluster ABI4, and 56 for cluster WRI1. For convenience, in this tutorial only one random group is generated (the default), but this utility can generate several random groups in one go (see Note 10).
Press ‘GO’ and click the ‘Next step’ button ‘retrieve sequences’ at the bottom of the result page. In the retrieve-seq form, set the other parameters as above: from -1000 to +200, check the ‘Mask repeats’ option and press ‘GO’.
Save ‘query genes’ and ‘sequences’ files to local ‘random.genes’ and ‘random.fasta’ files and repeat steps 4-11 of section 4.1.3.2.
This part of the protocol is devoted to validating sequence motifs discovered by their over-representation, which are scanned against the original sequences from which they were discovered, plus, optionally, orthologous sequences from a related species (see Note 11).
The first goal of this section is to check whether the discovered motifs show patterns of occurrence along promoter sequences, and to see how many cluster sequences actually harbor them. This can be done empirically by comparing the results of expression-based motifs with those of shuffled motifs, with columns permuted, which play the role of negative controls.
A second goal is to investigate whether these regulatory motifs are conserved on orthologous promoters of a related plant, Sorghum bicolor in this case study.
On the left-side menu select ‘Comparative genomics -> get orthologs-compara’.
Choose ‘Reference organism -> Sorghum bicolor’ for the maize example.
Upload file ‘cluster.genes’ generated in step 3 of section 4.1.3.2. Press ‘GO’ and finally press ‘retrieve sequences’ on the next screen.
Repeat steps 4-6 of section 4.1.3.1 but now select Sorghum bicolor as organism. Save ‘sequences’ to local file ‘cluster_orths.fasta’.
On the left-side menu select ‘Build control sets -> permute-matrix’.
Upload ‘cluster.motifs.tf’ (obtained in step 12 of section 4.1.3.2) and press ‘GO’. Save the results file as ‘cluster.motifs.perm1.tf’ (see Note 12).
Select ‘Pattern matching -> matrix scan (full options)’.
In the sequence box paste the contents of ‘cluster.fasta’ and, optionally, ‘cluster_orths.fasta’, if you wish to assess motif conservation. Alternatively, steps 7-12 can be performed separately with maize and S.bicolor sequences.
Upload file ‘cluster.motifs.tf’ and select ‘transfac’ format.
In the ‘Background model’ section select Markov order 2 and choose ‘Organism-specific -> Zea_mays.Zm-B73-REFERENCE-NAM-5.0.55’. Press ‘GO’.
Save the ‘Scan result’ file as ‘cluster.scan.ft’ and press the ‘feature map’ button to draw a map of the matched motif instances.
Repeat steps 6-11 using the set of permuted PWMs ‘cluster.motifs.perm1.tf’ and save the results as ‘cluster.perm1.scan.tf’.
The last stage of the protocol is the interpretation of results, which is which requires having at hand results of both clusters of co-expressed genes and random clusters, which play the role of negative controls. Figure 1 summarizes the results of cluster ABI4 compared to 50 random clusters of the same size. There are three types of evidence to look at, which will be discussed with the examples on this figure:
Figure 1. Summary of motif discovery results for cluster ABI4. Dark bars correspond to clusters of co-expressed genes, grey bars to 50 random clusters of genes drawn from the maize genome. Maximum significance of oligo-analysis (A) and dyad-analysis (B) motifs. The sequence logo of motifs reported by each algorithm is shown on top, indicating the number of sites used to compute it and the \(\text{Ncor}\) score of the comparison to the expected motif (bottom, see Note 11). Panel C shows the scores of discovered motifs when scanned back to the original maize upstream sequences and sequences from orthologous genes in Sorghum bicolor. Here dark bars are the reported PWMs, while the grey bars correspond to permuted PWMs. Panel D show the \(\text{Ncor}\) scores of discovered motifs when compared to annotated PWMs in footprintDB.
The distribution of motif significance yielded by oligo-analysis (A) and dyad-analysis (B). Motifs discovered in random clusters (grey bars) typically have significances below 4. The motifs found in cluster ABI4 (black bars) are not more significant than those of random gene sets of the same sizes. The reason for having significant motifs in the random gene sets may result from the occasional presence of low complexity motifs, which should not be considered as reliable predictions.
Panel A, E and I also show the comparisons between some motifs returned by peak-motifs and those reported by the authors of the reference experimental study (Yu et al. 2015); they used MEME as motif discovery tools).
The distribution of scanning scores (C) show to which extent motif matches in upstream sequences of both maize and their S. bicolor orthologues (dark boxes) depart from matches of permuted matrices (lighter boxes), used here as negative controls. On these boxplots, the horizontal bars indicate the median score of all the predicted sites in a given set of promoter sequences, and the shaded rectangles show the interquartile range, i.e. the extent between the 25% and 75% percentiles. For the ABI4 cluster there is a noticeable overlap between the interquartile boxes of the cluster and the random gene selections. Besides, the random selections show several “outliers” (circles) indicating sites predicted with high matching scores. Even though the mean scores are clearly higher for the actual cluster, the results may thus not be considered very significant.
The distribution of scores in footprintDB (D) describe how similar the discovered motifs are when compared to motifs (PWMs) annotated in footprintDB. Similarities are measured by normalized correlation score (\(\text{Ncor}\), see Note 13). In each example 50 random sets of promoters were analyzed with peak-motifs, and the discovered motifs compared to footprintDB. The black bar indicates the best matching score for the original, expression-based gene clusters, and the corresponding logo is overlaid on the histogram. For ABI4, the best-scoring matches correspond to phytochrome interacting factors. These proteins belong to the bHLH family of transcription factors and there are many annotated motifs for them in databases such as footprintDB.
In summary, motifs discovered in promoters of co-expressed genes should always be evaluated based on a combination of complementary criteria:
Your work is to run the protocol with cluster E2F and write a brief report with:
a table of the top hexamers and dyads enriched on the E2F cluster of maize usptream sequences and at least a random cluster of the same size. The table shall include the following columns: exp_freq=expected relative frequency, occ=observed occurrences, exp_occ=expected occurrences, occ_P=occurrence probability (binomial), occ_E=E-value for occurrences, occ_sig=occurrence significance.
a table with similar DNA motifs in footprintDB
an overall interpretation of results
As gene models can change from one assembly to another it is important to use the right assembly version, which is indicated for each genome in Table 1↩︎
Twelve clusters of maize genes, found to be co-expressed in 22 transcriptomes and enriched on Gene Ontoloy terms were analyzed in detail by Yu et al. (2015)↩︎
A crucial parameter to evaluate the results of motif discovery is to estimate the rate of false positives (FP). RSAT programs compute a significance score, which is the minus log of the expected number of false positives (\(\text{e-value} = 10^{-\text{sig}}\)). For example, a motif associated with a significance of 1 should be considered as poorly significant, since on average we would expect \(10^{-1} = 0.1\) false positives, i.e. one FP every 10 random trials. In contrast, a significance of e.g. 16 is very promising, since on average such a result would be expected every \(10^{-16}\) random trials. However, the theoretical significance relies on the correctness of the background model (computed here as k-mer and dyad frequencies in the whole set of promoters). In some cases, sets of plant promoters can discard from the theoretical model, due to heterogeneity of the input (e.g. inclusion of repetitive sequences). The negative control consists in measuring the significance obtained by submitting a random selection of promoters from the organism of interest (maize in the example). Although each of these genes is likely to be regulated by one or more transcription factors (and its promoter should contain corresponding binding sites), in principle the random set as a whole should not be co-regulated, so that the elements would differ from gene to gene, and there should thus be no over-represented motif in their promoters.↩︎
According to Ksouri et al. (2021) the range -500,+200 performs better↩︎
Should the connection to the server interrupt it might be safer to go back and choose ‘email’ as delivery option. The mail message provides a link to the data, which is actually stored at the server.↩︎
It is crucial to have one gene ID per row for submitting queries to retrieve-seq, because only the first word of each row is considered as a query. Note that any transcript isoform numbers .1, .2 should be removed↩︎
This program is generally relevant when analyzing sets containing a large number of sequences such as ChIP-seq peaks or genome-wide promoter sets.↩︎
Plant transcription databases are unfortunately still very fragmentary, so one might be tempted to check more complete collections such as footprintDB or JASPAR. However, the results should be interpreted with caution, because there is little conservation of cis-regulation between plants and other kingdoms of the tree of life. ↩︎
The option ‘Origin’ indicates the reference position relative to each sequence (start, center or end). When this option is set to ‘end’, the coordinates are computed relative to the end of the sequence, with negative values indicating upstream location. The option ‘Offset’ enables to shift the reference point by a given number. For the current example, setting the offset to -200 will give coordinates from -1000 to +200, the 0 corresponding to the TSS.-↩︎
Clearly, more than one random cluster should be evaluated, as suggested in Figure 1, where the results of up to 50 random groups are displayed next to the clusters of Yu et al. (2015).↩︎
Orthologues reported are annotated in Ensembl Compara, generated by a pipeline where maximum likelihood phylogenetic gene trees play a central role. These gene trees, reconciled with their species tree, have their internal nodes annotated to distinguish duplication or speciation events, and thus support the annotation of orthologous and paralogous genes, which can be part of complex one-to-many and many-to-many relations Ensembl Compara (2020)↩︎
This will permute the columns of the input PWMs producing matrices with different consensus. Column-permuted matrices are used as negative controls because they conserve the information content and nucleotide frequencies of the original motifs, but at the same time alter the sequence of nucleotides captured by the original motif, which is not recognized anymore.↩︎
‘Ncor’ is the relative width-normalized Pearson correlation of two PWMs aligned with matrix-scan. This normalized score prevents spurious matches that would cover only a subset of the aligned matrices (e.g. matches between the last column of the query matrix and the first column of the reference matrix, or matches of a very small motif against a large one).↩︎