Author: Najla Ksouri nksouri@eead.csic.es
Revised by: Bruno Contreras-Moreira bcontreras@eead.csic.es
Last updated: 15/05/2024
1 Description
This hands-on tutorial serves as a supplement to our article entitled “Motif analysis in co-expression networks reveals regulatory elements in plants: The peach as a model”. It offers step-by-step instructions to reproduce the analysis and test it on the user’s data using the Regulatory Sequence Analysis Tools (RSAT) directly from the Unix shell.
Doing so, users are able to execute larger jobs and exert finer control over the tool parameters.
1.1 Installation of RSAT
As the installation of RSAT can be somehow complex due to multiple tools, dependencies, and configuration steps, we highly recommend using its Docker image.
Please be aware that this repository hosts a deprecated Docker image for Regulatory sequence analysis tools (RSAT).
For the current and updated ready-to-use image, visit BioContainer. Detailed information about the Dockerfile is available on GitHub.
If you haven’t set up a Docker engine on your machine yet, please refer to the instructions provided by the Docker community for a simplified installation procedure.
# 1. download the container (from the Linux/WSL terminal)
docker pull biocontainers/rsat:20240507_cv1
# 2. Create local folders for input data and results named respectively as "rsat_data/" and "rsat_results/"
# In addition, a subfolder (rsat_data/genomes) should be created too
# assuming you are at $HOME
mkdir -p rsat_data/genomes rsat_results
chmod -R a+w rsat_data rsat_results # Change the permissions
# 3. Launch the Docker RSAT container
docker run --rm -v $HOME/rsat_data:/packages/rsat/public_html/data/ -v $HOME/rsat_results:/home/rsat_user/rsat_results -it biocontainers/rsat:20240507_cv1
# Ps: You should see a warning that can be safely ignored:
# * Starting Apache httpd web server apache2
# (13)Permission denied: AH00091: apache2: could not open error log file /var/log/apache2/error.log.
# AH00015: Unable to open logs
# Action 'start' failed.
# The Apache error log may have more information
# 4. Download organism from public RSAT plant server (in this case Prunus persica)
# Ps: Other available servers are http://fungi.rsat.eu, http://metazoa.rsat.eu, http://protists.rsat.eu and http://teaching.rsat.eu
download-organism -v 2 -org Prunus_persica.Prunus_persica_NCBIv2.38 -server https://rsat.eead.csic.es/plants
# 5. Test the container
cd rsat_results
make -f ../test_data/peak-motifs.mk RNDSAMPLES=2 all
Once the session is launched, you will be logged as an “rsat_user” user in “$HOME/rsat_user”. Docker container are supposed to be light-weighted, so that any created files are immediately deleted once you exit.
To solve this issue, we have mounted two local directories “rsat_results” and “rsat_data” in the host machine serving as a persistent storage volume inside the RSAT docker container.
Installed genomes will be saved in “rsat_data/genomes” folder and results from this tutorial will be saved in “rsat_results”.
1.2 Launch Docker Web server
To connect to RSAT Web server running from Docker container (Linux only):
# Retrieve your IP adress
hostname -I
# Launch Docker Web server
# Password: rsat_2020
sudo service apache2 restart
# Launch the following URL in your browser, using the IP address, ie http://172.17.0.2/rsat
In addition to logging into the Docker container as explained in the previous section, you can also call individual tools from the terminal non-interactively. See the example
2 Let´s get started
We will use one showcase module (Module 11) produced from our analysis (see article). Module 11 can be found within the docker container in the sub-directory “$HOME/rsat_user/test_data”.
- It contains a list of 269 gene IDs hereafter called “test genes”.
- Each gene ID is listed on a separate line within the module.
- These gene IDs need to follow the format accepted by the RSAT
server, which in this case starts with “PRUPE_”.
Our approach of de novo motif discovery relies on the combination of different steps:
- Retrieving the upstream sequences from the test genes
- Creating the background model
- Building the negative control clusters and retrieving their upstream sequences
-
Running the peak-motifs tool under the differential mode englobing
various programs of RSAT suite:
- Motif discovery
- Motif comparison
- Matrix scanning
2.1 Retrieve upstream sequences from the test genes
As we aim to test the effect of four upstream tracts on DNA-motifs
discovery, four different upstream sequences need to be retrieved from
each gene in module M11.
Up 1: [-1500 bp ; +200 bp], Up 2: [-500 bp ; +200 bp], Up 3: [-500 bp ;
0 bp], Up 4: [0 bp ; +200 bp].
Note 3: By
default, the reference point is the first DNA base of an annotated gene,
which usually corresponds to the beginning of a transcript. In cases
where there is no untranslated region or where genes have been predicted
de novo, position 0 would be the A of the start codon (ATG).
Negative coordinates indicate sequences upstream the reference point
(toward the 5’end on the sense strand) while positive ones correspond to
downstream sequences (towards the 3’end).
Generally, regulatory signals are located upstream the promoter (in the non-coding sequence). However, sometimes they may overlap the coding sequence. For this reason we analyze here both non-coding and coding regions.
We recommend to generate a repeat-masked output file (-rm) as the presence of repetitive elements in the genome can hamper the detection of motifs. Indeed they have a distinct composition than the rest of the genome.
2.2 Create the background model
We define the background as the set of upstream sequences collected from all genes of Prunus persica genome. The resulting background will be used to estimate the random expectation of each oligonucleotide (Contreras-Moreira et al., 2016).
2.3 Build the negative control clusters and retrieve their upstream sequences
Bioinformatic approaches for de novo motif discovery are by essence predictive. Thus the need to evaluate the rate of false positive predictions. To do so, our strategy proceeds to build 50 sets of random genes called “negative control clusters” (Contreras-Moreira et al., 2016).
RSAT suite offers a variety of tools to build the sets of random control genes.
2.4 Run the peak-motifs pipeline in differential mode
To discover exceptional motifs in the promoters of M11, we employ the “Peak-motifs” tool in differential mode. This differential analysis requires two data sets as input (test versus control) and will be run in two separated phases:
Phase 1: Considering the promoter sequences of “test genes” as the test set and the promoter sequences of background as the control.
Phase 2: Considering the promoter sequences of each random cluster as
test and keeping the background sequences as control.
Note 4: Peak-motifs gathers three essential components for putative motif detection. Here, we describe the parameters set for each component.
- Motif discovery
- Motif comparison
- Matrix scanning
– Relies on a serie of complementary algorithms() oligo analysis, dyad analysis, position analysis and local word analysis). However, when executing the differential mode, position and local word will be automatically ignored.
– In differential mode, the markov order of the background model is ignored.
– For the oligo-analysis, we set the oligomer length from 5 to 8 (-minol 5 -maxol 8) because it gave us better results but can be setted as (-minol 6 -maxol 8)
– The number of motifs returned for each algorithm was set to 5 (-nmotifs 5). A higher number of motifs may increase the computing time.
Discovered patteerns are compared to one or several collections of known motifs to identify transcription factors (TFs) that may recognize and bind the discovered motifs. Here we choose “footprintDB plants database” in TRANSFAC format.
Peak sequences are scanned to predict their binding sites positions using the default Markov order m=1 (-scan_markov 1). Predicted sites are then plotted for better visualization of the binding preferences.
Here, we attach a detailed makefile including all the steps.
Click here to download the makefile
MAKEFILE=/rsat_results/peak-motifs.mk
# Define the directories
DIR_DATA=/home/rsat_user/test_data
UP1=~/rsat_results/upstream1
UP2=~/rsat_results/upstream2
UP3=~/rsat_results/upstream3
UP4=~/rsat_results/upstream4
# Define the species of interest and the four upstream boundaries
SPECIES=Prunus_persica.Prunus_persica_NCBIv2.38
REGULON=M11
FROM1=-1500
TO1=+200
FROM2=-500
TO2=+200
FROM3=-500
TO3=0
FROM4=0
TO4=+200
# Define the corresponding background for each upstream interval
BGMSK1=${UP1}/${SPECIES}-up1.rm.fna
BGMSK2=${UP2}/${SPECIES}-up2.rm.fna
BGMSK3=${UP3}/${SPECIES}-up3.rm.fna
BGMSK4=${UP4}/${SPECIES}-up4.rm.fna
# Precise the number of random clusters
RNDSAMPLES=50
# Collect the different upstream sequences from module M11 and from your reference genome.
retrieve:
@echo "Creating the result directories"
@mkdir -p ${UP1}
@mkdir -p ${UP2}
@mkdir -p ${UP3}
@mkdir -p ${UP4}
@echo "Retrieving 4 upstream sequences boundaries from the genes of module M11"
@retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM1} -to ${TO1} -noorf -i ${DIR_DATA}/${REGULON}.txt -label id -rm -o ${UP1}/regulon${REGULON}_up1.rm.fna
@echo "${UP1}/regulon${REGULON}_up1.rm.fna"
@retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM2} -to ${TO2} -noorf -i ${DIR_DATA}/${REGULON}.txt -label id -rm -o ${UP2}/regulon${REGULON}_up2.rm.fna
@echo "${UP2}/regulon${REGULON}_up2.rm.fna"
@retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM3} -to ${TO3} -noorf -i ${DIR_DATA}/${REGULON}.txt -label id -rm -o ${UP3}/regulon${REGULON}_up3.rm.fna
@echo "${UP3}/regulon${REGULON}_up3.rm.fna"
@retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM4} -to ${TO4} -noorf -i ${DIR_DATA}/${REGULON}.txt -label id -rm -o ${UP4}/regulon${REGULON}_up4.rm.fna
@echo "${UP4}/regulon${REGULON}_up4.rm.fna"
@echo "Retrieving 4 upstream sequences boundaries from all Prunus persica genome"
@retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM1} -to ${TO1} -noorf -all -label id -rm -o ${BGMSK1}
@retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM2} -to ${TO2} -noorf -all -label id -rm -o ${BGMSK2}
@retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM3} -to ${TO3} -noorf -all -label id -rm -o ${BGMSK3}
@retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM4} -to ${TO4} -noorf -all -label id -rm -o ${BGMSK4}
# Generate 50 random clusters
random:
@echo "Creating ${RNDSAMPLES} random clusters"
@for r in `seq 1 ${RNDSAMPLES}`; do \
echo Replicate $$r; \
NSEQS=`wc -l ${DIR_DATA}/${REGULON}.txt`; \
RNDREGULON=random$$r.txt; \
random-genes -n $$NSEQS} -org ${SPECIES} -feattype gene -g 1 -o $$RNDREGULON; \
done
@echo "retrieve the different upstream sequence lengths from the random clusters"
@for r in `seq 1 ${RNDSAMPLES}`; do \
echo Replicate $$r; \
RNDREGULON=random$$r.txt; \
RNDMSK=random$${r}.rm.fna; \
retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM1} -to ${TO1} -noorf -rm -i $$RNDREGULON -label id -o ${UP1}/$$RNDMSK; \
retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM2} -to ${TO2} -noorf -rm -i $$RNDREGULON -label id -o ${UP2}/$$RNDMSK;\
retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM3} -to ${TO3} -noorf -rm -i $$RNDREGULON -label id -o ${UP3}/$$RNDMSK;\
retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM4} -to ${TO4} -noorf -rm -i $$RNDREGULON -label id -o ${UP4}/$$RNDMSK;\
done
####################################################################################
# THIS SECOND PART IS FOR PEAK-MOTIFS ANALYSIS IN DIFFERENTIAL MODE
####################################################################################
# Define the directories
INPUT1=${UP1}/regulon${REGULON}_up1.rm.fna
INPUT2=${UP2}/regulon${REGULON}_up2.rm.fna
INPUT3=${UP3}/regulon${REGULON}_up3.rm.fna
INPUT4=${UP4}/regulon${REGULON}_up4.rm.fna
FOOTDBFILE=/packages/rsat/public_html/motif_databases/footprintDB/footprintDB.plants.motif.tf
DISCO=oligos,dyads
PM_TASKS=purge,seqlen,composition,disco,merge_motifs,split_motifs,timelog,motifs_vs_db,synthesis,small_summary,clean_seq
# Run peak-motifs pipeline
peakmotifs:
@echo "Running peak-motifs for M11 within upstream1"
@peak-motifs -i ${INPUT1} -ctrl ${BGMSK1} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP1}/regulon${REGULON}.rm.fna.peaks-rm -title analysis_M11 -origin end \
-disco ${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS}
@rm -rf ${UP1}/regulon${REGULON}.rm.fna.peaks-rm/data
@echo "Running peak-motifs for random clusters within upstream1"
for r in `seq 1 ${RNDSAMPLES}`; do \
echo Replicate $$r; \
RNDMSK=random$${r}.rm.fna; \
peak-motifs -i ${UP1}/$${RNDMSK} -ctrl ${BGMSK1} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP1}/random$$r.rm.fna.peaks-rm -title analysis_random$$r -origin end -disco \
${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS};\
rm -rf ${UP1}/random$$r.rm.fna.peaks-rm/data;\
done
@echo "Running peak-motifs for M11 within upstream2"
@peak-motifs -i ${INPUT2} -ctrl ${BGMSK2} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP2}/regulon${REGULON}.rm.fna.peaks-rm -title analysis_M11 -origin end \
-disco ${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS}
@rm -rf ${UP2}/regulon${REGULON}.rm.fna.peaks-rm/data
@echo "Running peak-motifs for random clusters within upstream2"
for r in `seq 1 ${RNDSAMPLES}`; do \
echo Replicate $$r; \
RNDMSK=random$${r}.rm.fna; \
peak-motifs -i ${UP2}/$${RNDMSK} -ctrl ${BGMSK2} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP2}/random$$r.rm.fna.peaks-rm -title analysis_random$$r -origin end -disco \
${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS};\
rm -rf ${UP2}/random$$r.rm.fna.peaks-rm/data;\
done
@echo "Running peak-motifs for M11 within upstream3"
@peak-motifs -i ${INPUT3} -ctrl ${BGMSK3} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP3}/regulon${REGULON}.rm.fna.peaks-rm -title analysis_M11 -origin end \
-disco ${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS}
@rm -rf ${UP3}/regulon${REGULON}.rm.fna.peaks-rm/data
@echo "Running peak-motifs for random clusters within upstream3"
for r in `seq 1 ${RNDSAMPLES}`; do \
echo Replicate $$r; \
RNDMSK=random$${r}.rm.fna; \
peak-motifs -i ${UP3}/$${RNDMSK} -ctrl ${BGMSK3} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP3}/random$$r.rm.fna.peaks-rm -title analysis_random$$r -origin end -disco \
${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS};\
rm -rf ${UP3}/random$$r.rm.fna.peaks-rm/data;\
done
@echo "Running peak-motifs for M11 within upstream4"
@peak-motifs -i ${INPUT4} -ctrl ${BGMSK4} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP4}/regulon${REGULON}.rm.fna.peaks-rm -title analysis_M11 -origin end \
-disco ${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS}
@rm -rf ${UP4}/regulon${REGULON}.rm.fna.peaks-rm/data
@echo "Running peak-motifs for random clusters within upstream4"
for r in `seq 1 ${RNDSAMPLES}`; do \
echo Replicate $$r; \
RNDMSK=random$${r}.rm.fna; \
peak-motifs -i ${UP4}/$${RNDMSK} -ctrl ${BGMSK4} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP4}/random$$r.rm.fna.peaks-rm -title analysis_random$$r -origin end -disco \
${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS}; \
rm -rf ${UP4}/random$$r.rm.fna.peaks-rm/data; \
done
####################################################################################
# THIS THIRD PART IS FOR MATRIX-SCANNING
####################################################################################
SCAN_INPUT=${UP1}/regulon${REGULON}.rm.fna.peaks-rm/results/discovered_motifs/
matrix=m1 #here we are going to scan only the first matrix
SCAN_OLIGO_OUTPUT=${UP1}/scan_results/oligo/
SCAN_DYAD_OUTPUT=${UP1}/scan_results/dyad/
SCANMAXPVALUE=1e-4
scan:
@mkdir -p ${SCAN_OLIGO_OUTPUT}
@mkdir -p ${SCAN_DYAD_OUTPUT}
@echo "Run the matrix-scan within long upstream region -1500 +200 for oligo-motif"
@matrix-scan -v 1 -matrix_format transfac -m ${SCAN_INPUT}/oligos_5-8nt_m1/peaks_oligos_5-8nt_m1.tf -i ${UP1}/regulon${REGULON}_up1.rm.fna \
-seq_format fasta \
-pseudo 1 -decimals 1 -2str -origin end -bginput -markov 1 -bg_pseudo 0.01 -return limits -return sites -return pval -lth score 1 -uth pval ${SCANMAXPVALUE} \
-n score -o ${SCAN_OLIGO_OUTPUT}/scan_oligo_up1.tab
@echo "Run the matrix-scan within long upstream region -1500 +200 for dyad-motif"
@matrix-scan -v 1 -matrix_format transfac -m ${SCAN_INPUT}/dyads_test_vs_ctrl_m1/peaks_dyads_test_vs_ctrl_m1.tf -i ${UP1}/regulon${REGULON}_up1.rm.fna -seq_format fasta \
-pseudo 1 -decimals 1 -2str -origin end -bginput -markov 1 -bg_pseudo 0.01 -return limits -return sites -return pval -lth score 1 -uth pval ${SCANMAXPVALUE} \
-n score -o ${SCAN_DYAD_OUTPUT}/scan_dyad_up1.tab
# Run all target at once
all: retrieve random peakmotifs scan
3 Interpretation of the results
3.1 Sequence analysis
To ensure clarity, the interpretation of the results will be confined to the boundary Up 1: [-1500 bp; +200 bp]. The same reasoning will be applied to other upstream regions.
Let’s begin by examining the length distribution of the returned promoter sequences. For instance, we can easily execute the following command:
Figure 1. Sequences length distribution.
Note 5: As
illustrated, some sequences are shorter than the expected length of 1701
bp when the boundaries are set from -1500 bp to +200 bp.
This discrepancy can be explained by the activation of the “-noorf” option in the retrieve-seq tool. When this option is enabled, upstream sequences are automatically clipped to prevent overlap with neighboring genes. For instance, in Figure 2, if gene 1 is the query gene, it is common to have a close upstream neighbor, such as gene 2. The “-noorf” option ensures that the upstream sequences do not extend into the neighboring gene’s region.
Figure 2. Schematic representation of overlapping neighbor genes.
3.2 Analysis of the predicted sites
The complete results are available in an HTML report called “peak_synthesis.html.” This report provides detailed information for each algorithm, including five motifs with their corresponding logos in both direct and reverse orientations. Additionally, the report includes data on statistical significance, E-value, annotation, Ncor, matrix, and other relevant parameters for each motif.
In the present case, Motif 2 and Motif 6 are considered the best-scored potential motifs, detected respectively by oligo and dyad analysis. These motifs have been identified as having the highest significance and Ncor score, as illustrated in Figure 3.
Figure 3. Top significant motifs discovered by aligo and dyad algorithms.
However, to ensure a reliable motif comparison, we recommend re-comparing these two selected motifs against the FootprintDB Plants database, using stringent values for the “cor” and “Ncor” parameters.
As depicted in Figure 4, both oligo-analysis (top) and dyad analysis (bottom) reveal that the best match among plant motifs in FootprintDB corresponds to the CAMTA transcription factor. The Ncor scores for oligo-analysis and dyad analysis are 0.64 and 0.879, respectively.
Figure 4: Annotation of the discovered motifs by comparison against FootprintDB collection of plant curated motifs
3.3 Estimating the relevance of the results.
While identifying CAMTA-associated motifs is a valuable first step, caution is necessary before considering them as reliable predictions. To assess their reliability and rule out chance identification, we must compare their significance scores with those of top-ranking motifs found in random control clusters (negative controls).
Note 6: Information regarding the statistical significance of the detected patterns is stored in the file “{UP1}/random$r.rm.fna.peaks-rm/results/oligos_5-8nt/peaks_oligos-2str-noov_5-8nt_mvk-2_-2.tab”, where the variable r ranges from 1 to 50.
Doing this manually may be time-consuming and cumbersome for the user. Above, we propose a simple R script to generate a barplot with the best-scored motifs.
Upon the successful execution of the makefile, the results of the co-expressed gene modules and their corresponding random control clusters are systematically stored within a parent directory named “rsat_results.” This directory is structured as follows:
├── rsat_results
│ ├── upstream1
│ ├── upstream2
│ ├── upstream3
│ ├── upstream4
│ │ ├── M11
│ │ │ │──RegulonM11.rm.fna.peaks-rm
│ │ │ ├── results
│ │ │ │ ├── oligos_5-8nt
│ │ │ │ │ ├── peaks_oligos-2str-noov_5-8nt_mvk-2_-2.tab
│ │ ├── random1.rm.fna.peaks-rm
│ │ ├── results
│ │ │ │ ├── oligos_5-8nt
│ │ │ │ │ ├── peaks_oligos-2str-noov_5-8nt_mvk-2_-2.tab
│ │ ├── random2.rm.fna.peaks-rm
│ │ ├── ...
│ │ │ ├── ...
│ │ │ │ ├── ...
│ │ ├── random3.rm.fna.peaks-rm
│ │ ├── random4.rm.fna.peaks-rm
│ │ ├── random5.rm.fna.peaks-rm
│ │ ├── random6.rm.fna.peaks-rm
│ .
│ .
│ .
│ .
│ │ ├── random50.rm.fna.peaks-rm
To generate the barplot graph of motifs significance, execute the following script:
# Set the working directory
parent.dir <- ("~/rsat_results/upstream4/M11")
# Load required libraries
required.libs <- c("gtools", "ggplot2", "ggridges")
for (lib in required.libs) {
if (!require(lib, character.only = TRUE)) {
install.packages(lib)
library(lib, character.only = TRUE)
}
}
# List all sub-directories
subdirs <- list.dirs(parent.dir, recursive = TRUE, full.names = TRUE)
# Initialize a vector to store the file paths of random clusters
random.list <- c()
# Loop through each sub-directory
for (i in subdirs) {
# Check if the sub-directory contains the pattern "random"
if (grepl("random", i, ignore.case = TRUE)) {
# List all files inside the current sub-directory
files <- list.files(i, full.names = TRUE)
# Filter files containing the pattern "5-8nt_mvk-2_-2.tab"
matching_files <- files[grep("5-8nt_mvk-2_-2.tab", files, ignore.case = TRUE)]
# Add the matching file paths to the file list vector
random.list <- c(random.list, matching_files)
}
}
# Sort files
random.list <- mixedsort(random.list, decreasing = TRUE)
# Create an empty data frame where new values will be assigned
results <- as.matrix(data.frame(matrix(ncol = 1, nrow = 0)))
# Name the column
col.name <- c("significance")
colnames(results) <- col.name
# Loop to read files and retrieve the significance values
for (j in 1:length(random.list)) {
# Read the files
random_oligo <- read.table(random.list[j], header = TRUE, comment.char = ";")
# Transform the data to numeric class
random_oligo$occ_sig <- as.numeric(as.character(random_oligo$occ_sig))
# Replace NA by 0
random_oligo[is.na(random_oligo)] <- 0
# Get the maximum value of the significance
max_sig <- max(random_oligo$occ_sig)
# Add new row to data frame
results <- rbind(results, data.frame(clusters = paste("r", j, sep = ""), significance = max_sig))
}
# List files from regulon directories
regulon_dir <- subdirs[grep("regulon", subdirs, ignore.case = TRUE)]
regulon_list <- list.files(regulon_dir, full.names = TRUE, pattern = "5-8nt_mvk-2_-2.tab", recursive = TRUE)
# Read files
regulon_oligo <- lapply(regulon_list, function(file) {
data <- read.table(file, header = TRUE, comment.char = ";")
data$occ_sig <- as.numeric(as.character(data$occ_sig))
data[is.na(data)] <- 0
max(data$occ_sig)
})
# Get maximum value of significance
max_sig_regulon <- max(unlist(regulon_oligo))
# Get maximum value of significance
max_sig_regulon <- max(unlist(regulon_oligo))
# Add new row for M11
results <- rbind(results, data.frame(clusters = "M11", significance = max_sig_regulon))
# Order the data frame by significance
results <- results[order(results$significance, decreasing = TRUE), ]
# Assign color to bars
results$color <- ifelse(results$clusters == "M11", "black", "grey")
# Create bar plot
#Create the bar plot
ggplot(results, aes(x = reorder(clusters, -significance), y = significance, fill = color)) +
geom_bar(stat = "identity") +
scale_fill_identity() +
labs(x = "", y = "Significance", title = "Bar Plot M11: Oligo motifs") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
As shown in the barplot above, the CAMTA-corresponding motif detected in module M11 is the most significant pattern when compared to those found in the negative control clusters. This indicates that it may be considered a potential candidate.
Note 7: To reproduce the barplot with dyad results, just you need to create a new list files of all peaks_dyads-2str-noov_3nt_sp0-20_bg_monads.tab.
Please note that in the barplot, we only consider the most significant motif from the module test and from the negative control clusters. If you want to assess the significance of all predicted motifs, we propose the following steps to generate a significance boxplot:
- Generate a file containing all significance values of predicted oligo-motifs in module M11.
M11_results=~/rsat_results/upstream4/M11/Results
mkdir -p ${M11_results}
M11_oligo_regulon=~/rsat_results/upstream4/M11/regulonM11.rm.fna.peaks-rm/results/oligos_5-8nt/peaks_oligos-2str-noov_5-8nt_mvk-2_-2.tab
grep -v "^;" ${M11_oligo_regulon} | cut -f 8 | grep -v occ_sig > ${M11_results}/M11.oligo.regulon.tsv
- Generate a file containing all significance values of predicted oligo-motifs in random clusters.
M11_oligo_random=~/rsat_results/upstream4/M11/random*.rm.fna.peaks-rm/results/oligos_5-8nt/peaks_oligos-2str-noov_5-8nt_mvk-2_-2.tab
grep -v "^;" ${M11_oligo_random} | cut -f 8 | grep -v occ_sig > ${M11_results}/M11.oligo.random.tsv
- Repeat steps 1 and 2 for dyad motifs
M11_dyad_regulon=~/rsat_results/upstream4/M11/regulonM11.rm.fna.peaks-rm/results/dyads_test_vs_ctrl/peaks_dyads-2str-noov_3nt_sp0-20_bg_monads.tab
M11_dyad_random=~/rsat_results/upstream4/M11/random*.rm.fna.peaks-rm/results/dyads_test_vs_ctrl/peaks_dyads-2str-noov_3nt_sp0-20_bg_monads.tab
grep -v "^;" ${M11_dyad_regulon} | cut -f 8 | grep -v occ_sig > ${M11_results}/M11.dyad.regulon.tsv
grep -v "^;" ${M11_dyad_random} | cut -f 8 | grep -v occ_sig > ${M11_results}/M11.dyad.random.tsv
Once you have these files, you can create significance boxplots using R. Here’s a simplified outline of the process:
Attaching package: 'dplyr'
The following objects are masked from 'package:data.table':
between, first, last
The following objects are masked from 'package:plyr':
arrange, count, desc, failwith, id, mutate, rename, summarise,
summarize
The following object is masked from 'package:gridExtra':
combine
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
# -------------- #
# Required input #
# -------------- #
analysis.name <- "M11"
rsat.results.folder <- "/home/sie/rsat_results/upstream4/M11/Results"
analysis.results.folder <- file.path(rsat.results.folder)
result.files <- list.files(analysis.results.folder, full.names = T)
regulon.name <- paste0(analysis.name, "_regulon")
result.agg.df <- data.table()
for (rf in result.files) {
sequence.type <- ""
program.type <- ""
shape.type <- ""
# Detect program from file name
if (grepl("oligo", rf)) {
program.type <- "oligo"
# shape.type <- "1"
shape.type <- 1
} else if (grepl("dyad", rf)) {
program.type <- "dyad"
# shape.type <- "2"
shape.type <- 2
}
# Detect sequence type from file name
if (grepl("regulon", rf)) {
sequence.type <- regulon.name
} else if (grepl("random", rf)) {
sequence.type <- "random"
}
results.df <- read.table(rf) |>
rename("Significance" = "V1") |>
mutate(Program = program.type,
Analysis = sequence.type,
Label = shape.type)
# mutate( toupper(substr(Program, 1, 1)))
result.agg.df <- rbind(result.agg.df, results.df)
}
result.agg.df$Analysis <- factor(result.agg.df$Analysis, levels = c(regulon.name, "random"))
result.agg.df
Significance Program Analysis Label
1: 0.06 dyad random 2
2: 0.72 dyad random 2
3: 0.24 dyad random 2
4: 0.11 dyad random 2
5: 0.02 dyad random 2
---
316: 1.33 oligo M11_regulon 1
317: 0.96 oligo M11_regulon 1
318: 0.92 oligo M11_regulon 1
319: 0.46 oligo M11_regulon 1
320: 0.37 oligo M11_regulon 1
random.color <- "#878787"
regulon.color <- "#b2182b"
max.random.signif <- max(subset(result.agg.df, Analysis == "random")$Significance)
median.random.signif <- median(subset(result.agg.df, Analysis == "random")$Significance)
ggplot(result.agg.df, aes(x = Analysis, y = Significance, fill = Analysis, colour = Analysis, group = Analysis)) +
geom_hline(yintercept = max.random.signif, linetype="dashed", color = random.color, size = 0.75) +
geom_hline(yintercept = median.random.signif, color = random.color, size = 0.75) +
geom_boxplot(outlier.shape = NA, alpha = 0.25, show.legend = F) +
geom_point(aes(shape = Program), size = 3, position = position_jitter(width = 0.2), show.legend = T) +
scale_shape_manual(values = c(1, 2),
breaks = c("oligo", "dyad")) +
theme_classic() +
scale_color_manual(breaks = c(regulon.name, "Random"),
values = c(regulon.color, random.color)) +
scale_fill_manual(breaks = c(regulon.name, "Random"),
values = c(regulon.color, random.color)) +
labs(x = "")
3.4 Positional distribution of transcrption factor binding sites
The final step in the makefile involves scanning the position weight matrix of the discovered motifs across the long upstream promoter region Up1:[-1500 bp, +200 bp].
To better visualize the density distribution of the TF binding sites, we offer a simple R script to generate a ridgeline plot.
# Set the working directory
working.dir <- ("~/rsat_results/upstream1/")
# Load required libraries
required.libs <- c("ggplot2","gridExtra","ggridges")
for (lib in required.libs){
if (!require(lib, character.only = T)){
install.packages(lib)
library(lib, character.only = T)
}
}
# Read the files, remove the rows with missing data and add new column called algorithm
scan.oligo <- read.csv("~/rsat_results/upstream1/scan_results/oligo/scan_oligo_up1.tab", header=T, dec = ".", sep = "",comment.char = ";")
scan.oligo <- scan.oligo[complete.cases(scan.oligo),]
scan.oligo$algorithm <- c("oligo")
scan.dyad <- read.csv("~/rsat_results/upstream1/scan_results/dyad/scan_dyad_up1.tab", header=T, dec = ".", sep = "",comment.char = ";")
scan.dyad <- scan.dyad[complete.cases(scan.dyad),]
scan.dyad$algorithm <- c("dyad")
# Join the 2 files vertically
scan.data <- rbind(scan.oligo,scan.dyad)
scan.data <- na.omit(scan.data)
# Ridgeline plots
plot_M11_up1 <- ggplot(scan.data,aes(start, algorithm, fill = algorithm))+
geom_density_ridges_gradient(scale = 1, show.legend = T,rel_min_height = 0.0011) +
theme_ridges() + theme(legend.position = "none") +
labs(x = "",y = "") + ggtitle("") + theme(plot.title = element_text(hjust = 0.4)) +
theme( axis.text.x = element_text( angle = 90, hjust = 1, size = 10)) +
scale_color_manual(values = c("#868686FF","#fdef64"))+
scale_fill_manual(values = c("#868686FF","#fdef64"))+
theme(strip.text = element_text(size=12),strip.background = element_rect(fill="gray", colour="black",size=4)) + scale_x_continuous (limit = c(-1500, +200),breaks = c(-1500,-500,0,+200)) + annotate("text", x = c(-250,-250), y = c(2.5,1.5), label = c("CAMTA", "CAMTA"), color="black", size=4, fontface="bold")
plot_M11_up1
Picking joint bandwidth of 116
4 References
Contreras-Moreira B, Castro-Mondragon J.A, Rioualen C, Cantalapiedra C.P, van Helden J. (2016) RSAT::Plants: Motif Discovery Within Clusters of Upstream Sequences in Plant Genomes. In: Hehl R. (eds) Plant Synthetic Promoters. Methods in Molecular Biology, vol 1482. Humana Press, New York, NY
Thomas-Chollier M, Herrmann C, Defrance M, Sand O, Thieffry D, van Helden J. (2012). RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Res 40(4): e31.
Thomas-Chollier M, Darbo E, Herrmann C, Defrance M, Thieffry D, van Helden J. (2012). A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs. Nat Protoc 7(8): 1551-1568.