Author: Najla Ksouri

Revised by: Bruno Contreras-Moreira

Last updated: 15/05/2024


1 Description

This hands-on tutorial serves as a supplement to our article entitled “Motif analysis in co-expression networks reveals regulatory elements in plants: The peach as a model”. It offers step-by-step instructions to reproduce the analysis and test it on the user’s data using the Regulatory Sequence Analysis Tools (RSAT) directly from the Unix shell.

Doing so, users are able to execute larger jobs and exert finer control over the tool parameters.

1.1 Installation of RSAT

As the installation of RSAT can be somehow complex due to multiple tools, dependencies, and configuration steps, we highly recommend using its Docker image.

Please be aware that this repository hosts a deprecated Docker image for Regulatory sequence analysis tools (RSAT).

For the current and updated ready-to-use image, visit BioContainer. Detailed information about the Dockerfile is available on GitHub.

If you haven’t set up a Docker engine on your machine yet, please refer to the instructions provided by the Docker community for a simplified installation procedure.


# 1. download the container (from the Linux/WSL terminal)
docker pull biocontainers/rsat:20240507_cv1

# 2. Create local folders for input data and results named respectively as "rsat_data/" and "rsat_results/"
# In addition, a subfolder (rsat_data/genomes) should be created too

# assuming you are at $HOME 
mkdir -p rsat_data/genomes rsat_results
chmod -R a+w rsat_data rsat_results   # Change the permissions


# 3. Launch the Docker RSAT container
docker run --rm -v $HOME/rsat_data:/packages/rsat/public_html/data/ -v $HOME/rsat_results:/home/rsat_user/rsat_results -it biocontainers/rsat:20240507_cv1

# Ps: You should see a warning that can be safely ignored: 
# * Starting Apache httpd web server apache2
# (13)Permission denied: AH00091: apache2: could not open error log file /var/log/apache2/error.log.
# AH00015: Unable to open logs
# Action 'start' failed.
# The Apache error log may have more information

# 4. Download organism from public RSAT plant server (in this case Prunus persica)
# Ps: Other available servers are http://fungi.rsat.eu, http://metazoa.rsat.eu, http://protists.rsat.eu and http://teaching.rsat.eu

download-organism -v 2 -org Prunus_persica.Prunus_persica_NCBIv2.38 -server https://rsat.eead.csic.es/plants

# 5. Test the container
cd rsat_results 
make -f ../test_data/peak-motifs.mk RNDSAMPLES=2 all

Once the session is launched, you will be logged as an “rsat_user” user in “$HOME/rsat_user”. Docker container are supposed to be light-weighted, so that any created files are immediately deleted once you exit.

To solve this issue, we have mounted two local directories “rsat_results” and “rsat_data” in the host machine serving as a persistent storage volume inside the RSAT docker container.

Installed genomes will be saved in “rsat_data/genomes” folder and results from this tutorial will be saved in “rsat_results”.

1.2 Launch Docker Web server

To connect to RSAT Web server running from Docker container (Linux only):


# Retrieve your IP adress
hostname -I

# Launch Docker Web server
# Password: rsat_2020
sudo service apache2 restart

# Launch the following URL in your browser, using the IP address, ie http://172.17.0.2/rsat

In addition to logging into the Docker container as explained in the previous section, you can also call individual tools from the terminal non-interactively. See the example


2 Let´s get started

We will use one showcase module (Module 11) produced from our analysis (see article). Module 11 can be found within the docker container in the sub-directory “$HOME/rsat_user/test_data”.

Our approach of de novo motif discovery relies on the combination of different steps:

  1. Retrieving the upstream sequences from the test genes
  2. Creating the background model
  3. Building the negative control clusters and retrieving their upstream sequences
  4. Running the peak-motifs tool under the differential mode englobing various programs of RSAT suite:
    • Motif discovery
    • Motif comparison
    • Matrix scanning

2.1 Retrieve upstream sequences from the test genes

As we aim to test the effect of four upstream tracts on DNA-motifs discovery, four different upstream sequences need to be retrieved from each gene in module M11.
Up 1: [-1500 bp ; +200 bp], Up 2: [-500 bp ; +200 bp], Up 3: [-500 bp ; 0 bp], Up 4: [0 bp ; +200 bp].


Note 3: By default, the reference point is the first DNA base of an annotated gene, which usually corresponds to the beginning of a transcript. In cases where there is no untranslated region or where genes have been predicted de novo, position 0 would be the A of the start codon (ATG). Negative coordinates indicate sequences upstream the reference point (toward the 5’end on the sense strand) while positive ones correspond to downstream sequences (towards the 3’end).

Generally, regulatory signals are located upstream the promoter (in the non-coding sequence). However, sometimes they may overlap the coding sequence. For this reason we analyze here both non-coding and coding regions.

We recommend to generate a repeat-masked output file (-rm) as the presence of repetitive elements in the genome can hamper the detection of motifs. Indeed they have a distinct composition than the rest of the genome.

2.2 Create the background model

We define the background as the set of upstream sequences collected from all genes of Prunus persica genome. The resulting background will be used to estimate the random expectation of each oligonucleotide (Contreras-Moreira et al., 2016).

2.3 Build the negative control clusters and retrieve their upstream sequences

Bioinformatic approaches for de novo motif discovery are by essence predictive. Thus the need to evaluate the rate of false positive predictions. To do so, our strategy proceeds to build 50 sets of random genes called “negative control clusters” (Contreras-Moreira et al., 2016).

RSAT suite offers a variety of tools to build the sets of random control genes.

2.4 Run the peak-motifs pipeline in differential mode

To discover exceptional motifs in the promoters of M11, we employ the “Peak-motifs” tool in differential mode. This differential analysis requires two data sets as input (test versus control) and will be run in two separated phases:

Phase 1: Considering the promoter sequences of “test genes” as the test set and the promoter sequences of background as the control.

Phase 2: Considering the promoter sequences of each random cluster as test and keeping the background sequences as control.

Note 4: Peak-motifs gathers three essential components for putative motif detection. Here, we describe the parameters set for each component.

  • Motif discovery
  • – Relies on a serie of complementary algorithms() oligo analysis, dyad analysis, position analysis and local word analysis). However, when executing the differential mode, position and local word will be automatically ignored.

    – In differential mode, the markov order of the background model is ignored.

    – For the oligo-analysis, we set the oligomer length from 5 to 8 (-minol 5 -maxol 8) because it gave us better results but can be setted as (-minol 6 -maxol 8)

    – The number of motifs returned for each algorithm was set to 5 (-nmotifs 5). A higher number of motifs may increase the computing time.

  • Motif comparison
  • Discovered patteerns are compared to one or several collections of known motifs to identify transcription factors (TFs) that may recognize and bind the discovered motifs. Here we choose “footprintDB plants database” in TRANSFAC format.

  • Matrix scanning
  • Peak sequences are scanned to predict their binding sites positions using the default Markov order m=1 (-scan_markov 1). Predicted sites are then plotted for better visualization of the binding preferences.

    Here, we attach a detailed makefile including all the steps.

    Click here to download the makefile

    MAKEFILE=/rsat_results/peak-motifs.mk
    
    # Define the directories
    DIR_DATA=/home/rsat_user/test_data
    UP1=~/rsat_results/upstream1
    UP2=~/rsat_results/upstream2
    UP3=~/rsat_results/upstream3
    UP4=~/rsat_results/upstream4
    
    # Define the species of interest and the four upstream boundaries
    SPECIES=Prunus_persica.Prunus_persica_NCBIv2.38
    REGULON=M11
    FROM1=-1500
    TO1=+200
    FROM2=-500
    TO2=+200
    FROM3=-500
    TO3=0
    FROM4=0
    TO4=+200
    
    # Define the corresponding background for each upstream interval
    BGMSK1=${UP1}/${SPECIES}-up1.rm.fna
    BGMSK2=${UP2}/${SPECIES}-up2.rm.fna
    BGMSK3=${UP3}/${SPECIES}-up3.rm.fna
    BGMSK4=${UP4}/${SPECIES}-up4.rm.fna
    
    # Precise the number of random clusters
    RNDSAMPLES=50 
    
    # Collect the different upstream sequences from module M11 and from your reference genome.
    retrieve:
        @echo "Creating the result directories"
        @mkdir -p ${UP1}
        @mkdir -p ${UP2}
        @mkdir -p ${UP3}
        @mkdir -p ${UP4}
    
        @echo "Retrieving 4 upstream sequences boundaries from the genes of module M11"
        @retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM1} -to ${TO1} -noorf -i ${DIR_DATA}/${REGULON}.txt -label id -rm -o ${UP1}/regulon${REGULON}_up1.rm.fna
        @echo "${UP1}/regulon${REGULON}_up1.rm.fna"
        
        @retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM2} -to ${TO2} -noorf -i ${DIR_DATA}/${REGULON}.txt -label id -rm -o ${UP2}/regulon${REGULON}_up2.rm.fna
        @echo "${UP2}/regulon${REGULON}_up2.rm.fna"
        
        @retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM3} -to ${TO3} -noorf -i ${DIR_DATA}/${REGULON}.txt -label id -rm -o ${UP3}/regulon${REGULON}_up3.rm.fna
        @echo "${UP3}/regulon${REGULON}_up3.rm.fna"
        
        @retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM4} -to ${TO4} -noorf -i ${DIR_DATA}/${REGULON}.txt -label id -rm -o ${UP4}/regulon${REGULON}_up4.rm.fna
        @echo "${UP4}/regulon${REGULON}_up4.rm.fna"
    
        @echo "Retrieving 4 upstream sequences boundaries from all Prunus persica genome"
        @retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM1} -to ${TO1} -noorf -all -label id -rm -o ${BGMSK1}
        @retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM2} -to ${TO2} -noorf -all -label id -rm -o ${BGMSK2}
        @retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM3} -to ${TO3} -noorf -all -label id -rm -o ${BGMSK3}
        @retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM4} -to ${TO4} -noorf -all -label id -rm -o ${BGMSK4}
    
    # Generate 50 random clusters 
    random:
        @echo "Creating ${RNDSAMPLES} random clusters"
        @for r in `seq 1 ${RNDSAMPLES}`; do \
            echo Replicate $$r; \
            NSEQS=`wc -l ${DIR_DATA}/${REGULON}.txt`; \
            RNDREGULON=random$$r.txt; \
            random-genes -n $$NSEQS} -org ${SPECIES} -feattype gene -g 1 -o $$RNDREGULON; \
        done
        
        @echo "retrieve the different upstream sequence lengths from the random clusters"
        @for r in `seq 1 ${RNDSAMPLES}`; do \
            echo Replicate $$r; \
            RNDREGULON=random$$r.txt; \
            RNDMSK=random$${r}.rm.fna; \
            retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM1} -to ${TO1} -noorf -rm -i $$RNDREGULON -label id -o ${UP1}/$$RNDMSK; \
            retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM2} -to ${TO2} -noorf -rm -i $$RNDREGULON -label id -o ${UP2}/$$RNDMSK;\
            retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM3} -to ${TO3} -noorf -rm -i $$RNDREGULON -label id -o ${UP3}/$$RNDMSK;\
            retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM4} -to ${TO4} -noorf -rm -i $$RNDREGULON -label id -o ${UP4}/$$RNDMSK;\
        done
    
    ####################################################################################
    # THIS SECOND PART IS FOR PEAK-MOTIFS ANALYSIS IN DIFFERENTIAL MODE
    ####################################################################################
    
    # Define the directories
    INPUT1=${UP1}/regulon${REGULON}_up1.rm.fna
    INPUT2=${UP2}/regulon${REGULON}_up2.rm.fna
    INPUT3=${UP3}/regulon${REGULON}_up3.rm.fna
    INPUT4=${UP4}/regulon${REGULON}_up4.rm.fna
    FOOTDBFILE=/packages/rsat/public_html/motif_databases/footprintDB/footprintDB.plants.motif.tf
    DISCO=oligos,dyads
    PM_TASKS=purge,seqlen,composition,disco,merge_motifs,split_motifs,timelog,motifs_vs_db,synthesis,small_summary,clean_seq
    
    # Run peak-motifs pipeline
    peakmotifs: 
        @echo "Running peak-motifs for M11 within upstream1"
        @peak-motifs -i ${INPUT1} -ctrl ${BGMSK1} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP1}/regulon${REGULON}.rm.fna.peaks-rm -title analysis_M11 -origin end \
        -disco ${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS}
        @rm -rf ${UP1}/regulon${REGULON}.rm.fna.peaks-rm/data
        
        @echo "Running peak-motifs for random clusters within upstream1"
        for r in `seq 1 ${RNDSAMPLES}`; do \
            echo Replicate $$r; \
            RNDMSK=random$${r}.rm.fna; \
        peak-motifs -i ${UP1}/$${RNDMSK} -ctrl ${BGMSK1} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP1}/random$$r.rm.fna.peaks-rm -title analysis_random$$r -origin end -disco \
        ${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS};\
        rm -rf ${UP1}/random$$r.rm.fna.peaks-rm/data;\
        done
    
        @echo "Running peak-motifs for M11 within upstream2"
        @peak-motifs -i ${INPUT2} -ctrl ${BGMSK2} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP2}/regulon${REGULON}.rm.fna.peaks-rm -title analysis_M11 -origin end \
        -disco ${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS}
        @rm -rf ${UP2}/regulon${REGULON}.rm.fna.peaks-rm/data
        
        @echo "Running peak-motifs for random clusters within upstream2"
        for r in `seq 1 ${RNDSAMPLES}`; do \
            echo Replicate $$r; \
            RNDMSK=random$${r}.rm.fna; \
        peak-motifs -i ${UP2}/$${RNDMSK} -ctrl ${BGMSK2} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP2}/random$$r.rm.fna.peaks-rm -title analysis_random$$r -origin end -disco \
        ${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS};\
        rm -rf ${UP2}/random$$r.rm.fna.peaks-rm/data;\
        done
    
        @echo "Running peak-motifs for M11 within upstream3"
        @peak-motifs -i ${INPUT3} -ctrl ${BGMSK3} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP3}/regulon${REGULON}.rm.fna.peaks-rm -title analysis_M11 -origin end \
        -disco ${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS}
        @rm -rf ${UP3}/regulon${REGULON}.rm.fna.peaks-rm/data
        
        @echo "Running peak-motifs for random clusters within upstream3"
        for r in `seq 1 ${RNDSAMPLES}`; do \
            echo Replicate $$r; \
            RNDMSK=random$${r}.rm.fna; \
        peak-motifs -i ${UP3}/$${RNDMSK} -ctrl ${BGMSK3} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP3}/random$$r.rm.fna.peaks-rm -title analysis_random$$r -origin end -disco \
        ${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS};\
        rm -rf ${UP3}/random$$r.rm.fna.peaks-rm/data;\
        done
    
        @echo "Running peak-motifs for M11 within upstream4"
        @peak-motifs -i ${INPUT4} -ctrl ${BGMSK4} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP4}/regulon${REGULON}.rm.fna.peaks-rm -title analysis_M11 -origin end \
        -disco ${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS}
        @rm -rf ${UP4}/regulon${REGULON}.rm.fna.peaks-rm/data
        
        @echo "Running peak-motifs for random clusters within upstream4"
        for r in `seq 1 ${RNDSAMPLES}`; do \
            echo Replicate $$r; \
            RNDMSK=random$${r}.rm.fna; \
        peak-motifs -i ${UP4}/$${RNDMSK} -ctrl ${BGMSK4} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP4}/random$$r.rm.fna.peaks-rm -title analysis_random$$r -origin end -disco \
        ${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS}; \
        rm -rf ${UP4}/random$$r.rm.fna.peaks-rm/data; \
        done
        
    
    ####################################################################################
    # THIS THIRD PART IS FOR MATRIX-SCANNING
    ####################################################################################
    
    SCAN_INPUT=${UP1}/regulon${REGULON}.rm.fna.peaks-rm/results/discovered_motifs/
    matrix=m1 #here we are going to scan only the first matrix
    SCAN_OLIGO_OUTPUT=${UP1}/scan_results/oligo/
    SCAN_DYAD_OUTPUT=${UP1}/scan_results/dyad/
    SCANMAXPVALUE=1e-4
    
    scan:
        @mkdir -p ${SCAN_OLIGO_OUTPUT}
        @mkdir -p ${SCAN_DYAD_OUTPUT}
        @echo "Run the matrix-scan within long upstream region -1500 +200 for oligo-motif"
        @matrix-scan -v 1 -matrix_format transfac -m ${SCAN_INPUT}/oligos_5-8nt_m1/peaks_oligos_5-8nt_m1.tf -i ${UP1}/regulon${REGULON}_up1.rm.fna \
        -seq_format fasta \
        -pseudo 1 -decimals 1 -2str -origin end -bginput -markov 1 -bg_pseudo 0.01 -return limits -return sites -return pval -lth score 1 -uth pval ${SCANMAXPVALUE} \
        -n score -o ${SCAN_OLIGO_OUTPUT}/scan_oligo_up1.tab
        
        @echo "Run the matrix-scan within long upstream region -1500 +200 for dyad-motif"
        @matrix-scan -v 1 -matrix_format transfac -m ${SCAN_INPUT}/dyads_test_vs_ctrl_m1/peaks_dyads_test_vs_ctrl_m1.tf -i ${UP1}/regulon${REGULON}_up1.rm.fna -seq_format fasta \
        -pseudo 1 -decimals 1 -2str -origin end -bginput -markov 1 -bg_pseudo 0.01 -return limits -return sites -return pval -lth score 1 -uth pval ${SCANMAXPVALUE} \
        -n score -o ${SCAN_DYAD_OUTPUT}/scan_dyad_up1.tab
    
    # Run all target at once
    all: retrieve random peakmotifs scan

3 Interpretation of the results

3.1 Sequence analysis

To ensure clarity, the interpretation of the results will be confined to the boundary Up 1: [-1500 bp; +200 bp]. The same reasoning will be applied to other upstream regions.

Let’s begin by examining the length distribution of the returned promoter sequences. For instance, we can easily execute the following command:

# Verify sequence lengths of Upstream 1
sequence-lengths -i ${UP1}/regulon${REGULON}_up1.rm.fna

Figure 1. Sequences length distribution.

Note 5: As illustrated, some sequences are shorter than the expected length of 1701 bp when the boundaries are set from -1500 bp to +200 bp.

This discrepancy can be explained by the activation of the “-noorf” option in the retrieve-seq tool. When this option is enabled, upstream sequences are automatically clipped to prevent overlap with neighboring genes. For instance, in Figure 2, if gene 1 is the query gene, it is common to have a close upstream neighbor, such as gene 2. The “-noorf” option ensures that the upstream sequences do not extend into the neighboring gene’s region.


Figure 2. Schematic representation of overlapping neighbor genes.

3.2 Analysis of the predicted sites

The complete results are available in an HTML report called “peak_synthesis.html.” This report provides detailed information for each algorithm, including five motifs with their corresponding logos in both direct and reverse orientations. Additionally, the report includes data on statistical significance, E-value, annotation, Ncor, matrix, and other relevant parameters for each motif.

In the present case, Motif 2 and Motif 6 are considered the best-scored potential motifs, detected respectively by oligo and dyad analysis. These motifs have been identified as having the highest significance and Ncor score, as illustrated in Figure 3.

Figure 3. Top significant motifs discovered by aligo and dyad algorithms.


However, to ensure a reliable motif comparison, we recommend re-comparing these two selected motifs against the FootprintDB Plants database, using stringent values for the “cor” and “Ncor” parameters.

As depicted in Figure 4, both oligo-analysis (top) and dyad analysis (bottom) reveal that the best match among plant motifs in FootprintDB corresponds to the CAMTA transcription factor. The Ncor scores for oligo-analysis and dyad analysis are 0.64 and 0.879, respectively.

Figure 4: Annotation of the discovered motifs by comparison against FootprintDB collection of plant curated motifs


3.3 Estimating the relevance of the results.

While identifying CAMTA-associated motifs is a valuable first step, caution is necessary before considering them as reliable predictions. To assess their reliability and rule out chance identification, we must compare their significance scores with those of top-ranking motifs found in random control clusters (negative controls).

Note 6: Information regarding the statistical significance of the detected patterns is stored in the file “{UP1}/random$r.rm.fna.peaks-rm/results/oligos_5-8nt/peaks_oligos-2str-noov_5-8nt_mvk-2_-2.tab”, where the variable r ranges from 1 to 50.

Doing this manually may be time-consuming and cumbersome for the user. Above, we propose a simple R script to generate a barplot with the best-scored motifs.

Upon the successful execution of the makefile, the results of the co-expressed gene modules and their corresponding random control clusters are systematically stored within a parent directory named “rsat_results.” This directory is structured as follows:

├── rsat_results
   ├── upstream1
   ├── upstream2
   ├── upstream3
   ├── upstream4
   │   ├──  M11
   │   │   │──RegulonM11.rm.fna.peaks-rm
   │   │   ├── results
   │   │   │   ├── oligos_5-8nt
   │   │   │   │   ├── peaks_oligos-2str-noov_5-8nt_mvk-2_-2.tab
   │   ├── random1.rm.fna.peaks-rm
   │   ├── results
   │   │   │   ├── oligos_5-8nt
   │   │   │   │   ├── peaks_oligos-2str-noov_5-8nt_mvk-2_-2.tab
   │   ├── random2.rm.fna.peaks-rm
   │   ├── ...
   │   │   ├── ...
   │   │   │   ├── ...
   │   ├── random3.rm.fna.peaks-rm
   │   ├── random4.rm.fna.peaks-rm
   │   ├── random5.rm.fna.peaks-rm
   │   ├── random6.rm.fna.peaks-rm
   .
   .
   .
   .
   │   ├── random50.rm.fna.peaks-rm

To generate the barplot graph of motifs significance, execute the following script:

# Set the working directory
parent.dir <- ("~/rsat_results/upstream4/M11")

# Load required libraries
required.libs <- c("gtools", "ggplot2", "ggridges")

for (lib in required.libs) {
  if (!require(lib, character.only = TRUE)) {
    install.packages(lib)
    library(lib, character.only = TRUE)
  }
}


# List all sub-directories
subdirs <- list.dirs(parent.dir, recursive = TRUE, full.names = TRUE)

# Initialize a vector to store the file paths of random clusters
random.list <- c()

# Loop through each sub-directory
for (i in subdirs) {
  # Check if the sub-directory contains the pattern "random"
  if (grepl("random", i, ignore.case = TRUE)) {
    # List all files inside the current sub-directory
    files <- list.files(i, full.names = TRUE)
    # Filter files containing the pattern "5-8nt_mvk-2_-2.tab"
    matching_files <- files[grep("5-8nt_mvk-2_-2.tab", files, ignore.case = TRUE)] 
    # Add the matching file paths to the file list vector
    random.list <- c(random.list, matching_files)
  }
}

# Sort files
random.list <- mixedsort(random.list, decreasing = TRUE)

# Create an empty data frame where new values will be assigned
results <- as.matrix(data.frame(matrix(ncol = 1, nrow = 0)))

# Name the column
col.name <- c("significance")
colnames(results) <- col.name


# Loop to read files and retrieve the significance values
for (j in 1:length(random.list)) {
  # Read the files
  random_oligo <- read.table(random.list[j], header = TRUE, comment.char = ";")
  
  # Transform the data to numeric class
  random_oligo$occ_sig <- as.numeric(as.character(random_oligo$occ_sig))
  
  # Replace NA by 0
  random_oligo[is.na(random_oligo)] <- 0
  
  # Get the maximum value of the significance
  max_sig <- max(random_oligo$occ_sig)
  
  # Add new row to data frame
  results <- rbind(results, data.frame(clusters = paste("r", j, sep = ""), significance = max_sig))
}



# List files from regulon directories
regulon_dir <- subdirs[grep("regulon", subdirs, ignore.case = TRUE)]
regulon_list <- list.files(regulon_dir, full.names = TRUE, pattern = "5-8nt_mvk-2_-2.tab", recursive = TRUE)

# Read files
regulon_oligo <- lapply(regulon_list, function(file) {
  data <- read.table(file, header = TRUE, comment.char = ";")
  data$occ_sig <- as.numeric(as.character(data$occ_sig))
  data[is.na(data)] <- 0
  max(data$occ_sig)
})


# Get maximum value of significance
max_sig_regulon <- max(unlist(regulon_oligo))

# Get maximum value of significance
max_sig_regulon <- max(unlist(regulon_oligo))

# Add new row for M11
results <- rbind(results, data.frame(clusters = "M11", significance = max_sig_regulon))

# Order the data frame by significance
results <- results[order(results$significance, decreasing = TRUE), ]

# Assign color to bars
results$color <- ifelse(results$clusters == "M11", "black", "grey")

# Create bar plot
#Create the bar plot
ggplot(results, aes(x = reorder(clusters, -significance), y = significance, fill = color)) +
  geom_bar(stat = "identity") +
  scale_fill_identity() +
  labs(x = "", y = "Significance", title = "Bar Plot M11: Oligo motifs") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

As shown in the barplot above, the CAMTA-corresponding motif detected in module M11 is the most significant pattern when compared to those found in the negative control clusters. This indicates that it may be considered a potential candidate.

Note 7: To reproduce the barplot with dyad results, just you need to create a new list files of all peaks_dyads-2str-noov_3nt_sp0-20_bg_monads.tab.


Please note that in the barplot, we only consider the most significant motif from the module test and from the negative control clusters. If you want to assess the significance of all predicted motifs, we propose the following steps to generate a significance boxplot:

  1. Generate a file containing all significance values of predicted oligo-motifs in module M11.

M11_results=~/rsat_results/upstream4/M11/Results
mkdir -p ${M11_results}

M11_oligo_regulon=~/rsat_results/upstream4/M11/regulonM11.rm.fna.peaks-rm/results/oligos_5-8nt/peaks_oligos-2str-noov_5-8nt_mvk-2_-2.tab

grep -v "^;" ${M11_oligo_regulon} | cut -f 8 | grep -v occ_sig > ${M11_results}/M11.oligo.regulon.tsv
  1. Generate a file containing all significance values of predicted oligo-motifs in random clusters.
M11_oligo_random=~/rsat_results/upstream4/M11/random*.rm.fna.peaks-rm/results/oligos_5-8nt/peaks_oligos-2str-noov_5-8nt_mvk-2_-2.tab

grep -v "^;" ${M11_oligo_random} | cut -f 8 | grep -v occ_sig > ${M11_results}/M11.oligo.random.tsv
  1. Repeat steps 1 and 2 for dyad motifs
M11_dyad_regulon=~/rsat_results/upstream4/M11/regulonM11.rm.fna.peaks-rm/results/dyads_test_vs_ctrl/peaks_dyads-2str-noov_3nt_sp0-20_bg_monads.tab 

M11_dyad_random=~/rsat_results/upstream4/M11/random*.rm.fna.peaks-rm/results/dyads_test_vs_ctrl/peaks_dyads-2str-noov_3nt_sp0-20_bg_monads.tab 

grep -v "^;" ${M11_dyad_regulon} | cut -f 8 | grep -v occ_sig > ${M11_results}/M11.dyad.regulon.tsv
grep -v "^;" ${M11_dyad_random} | cut -f 8 | grep -v occ_sig > ${M11_results}/M11.dyad.random.tsv

Once you have these files, you can create significance boxplots using R. Here’s a simplified outline of the process:

library(data.table)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:data.table':

    between, first, last
The following objects are masked from 'package:plyr':

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize
The following object is masked from 'package:gridExtra':

    combine
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)


# -------------- #
# Required input #
# -------------- #

analysis.name <- "M11"
rsat.results.folder  <- "/home/sie/rsat_results/upstream4/M11/Results"

analysis.results.folder <- file.path(rsat.results.folder)
result.files            <- list.files(analysis.results.folder, full.names = T)
regulon.name            <- paste0(analysis.name, "_regulon")


result.agg.df <- data.table()
for (rf in result.files) {
  
  sequence.type <- ""
  program.type  <- ""
  shape.type    <- ""
  
  # Detect program from file name
  if (grepl("oligo", rf)) {
    program.type <- "oligo"
    # shape.type  <- "1"
    shape.type  <- 1
    
  } else if (grepl("dyad", rf)) {
    program.type <- "dyad"
    # shape.type   <- "2"
    shape.type  <- 2
  }
  
  
  # Detect sequence type from file name
  if (grepl("regulon", rf)) {
    sequence.type <- regulon.name
    
  } else if (grepl("random", rf)) {
    sequence.type <- "random"
  }
  
  
  results.df <- read.table(rf) |> 
                rename("Significance" = "V1") |> 
                mutate(Program  = program.type,
                       Analysis = sequence.type,
                       Label    = shape.type) 
                # mutate( toupper(substr(Program, 1, 1)))
  
  result.agg.df <- rbind(result.agg.df, results.df)
  
}
result.agg.df$Analysis <- factor(result.agg.df$Analysis, levels = c(regulon.name, "random"))
 
result.agg.df
     Significance Program    Analysis Label
  1:         0.06    dyad      random     2
  2:         0.72    dyad      random     2
  3:         0.24    dyad      random     2
  4:         0.11    dyad      random     2
  5:         0.02    dyad      random     2
 ---                                       
316:         1.33   oligo M11_regulon     1
317:         0.96   oligo M11_regulon     1
318:         0.92   oligo M11_regulon     1
319:         0.46   oligo M11_regulon     1
320:         0.37   oligo M11_regulon     1
random.color  <- "#878787"
regulon.color <- "#b2182b"

max.random.signif    <- max(subset(result.agg.df, Analysis == "random")$Significance)
median.random.signif <- median(subset(result.agg.df, Analysis == "random")$Significance)


ggplot(result.agg.df, aes(x = Analysis, y = Significance, fill = Analysis, colour = Analysis, group = Analysis)) +
  geom_hline(yintercept = max.random.signif, linetype="dashed", color = random.color, size = 0.75) +
  geom_hline(yintercept = median.random.signif, color = random.color, size = 0.75) +
  geom_boxplot(outlier.shape = NA, alpha = 0.25, show.legend = F) +
  geom_point(aes(shape = Program), size = 3, position = position_jitter(width = 0.2), show.legend = T) +
  scale_shape_manual(values = c(1, 2),
                     breaks = c("oligo", "dyad")) +
  theme_classic() +
  scale_color_manual(breaks = c(regulon.name, "Random"),
                     values = c(regulon.color, random.color)) +
  scale_fill_manual(breaks = c(regulon.name, "Random"),
                    values = c(regulon.color, random.color)) +
  labs(x = "")

3.4 Positional distribution of transcrption factor binding sites

The final step in the makefile involves scanning the position weight matrix of the discovered motifs across the long upstream promoter region Up1:[-1500 bp, +200 bp].

To better visualize the density distribution of the TF binding sites, we offer a simple R script to generate a ridgeline plot.

# Set the working directory
working.dir <- ("~/rsat_results/upstream1/")

# Load required libraries
required.libs <- c("ggplot2","gridExtra","ggridges")

for (lib in required.libs){
if (!require(lib, character.only = T)){
  install.packages(lib)
  library(lib, character.only = T)
}
}

# Read the files, remove the rows with missing data and add new column called algorithm
scan.oligo <- read.csv("~/rsat_results/upstream1/scan_results/oligo/scan_oligo_up1.tab", header=T, dec = ".", sep = "",comment.char = ";")
scan.oligo <- scan.oligo[complete.cases(scan.oligo),]
scan.oligo$algorithm <- c("oligo")

scan.dyad <- read.csv("~/rsat_results/upstream1/scan_results/dyad/scan_dyad_up1.tab", header=T, dec = ".", sep = "",comment.char = ";")
scan.dyad <- scan.dyad[complete.cases(scan.dyad),]
scan.dyad$algorithm <- c("dyad")

# Join the 2 files vertically
scan.data <- rbind(scan.oligo,scan.dyad)
scan.data <- na.omit(scan.data)



# Ridgeline plots
plot_M11_up1 <- ggplot(scan.data,aes(start, algorithm, fill = algorithm))+
  geom_density_ridges_gradient(scale = 1, show.legend = T,rel_min_height = 0.0011) + 
  theme_ridges() + theme(legend.position = "none") +
  labs(x = "",y = "") + ggtitle("") + theme(plot.title = element_text(hjust = 0.4)) + 
  theme( axis.text.x = element_text( angle = 90,  hjust = 1, size = 10)) +
  scale_color_manual(values = c("#868686FF","#fdef64"))+ 
  scale_fill_manual(values = c("#868686FF","#fdef64"))+
  theme(strip.text = element_text(size=12),strip.background = element_rect(fill="gray", colour="black",size=4)) + scale_x_continuous (limit = c(-1500, +200),breaks = c(-1500,-500,0,+200)) + annotate("text", x = c(-250,-250), y = c(2.5,1.5), label = c("CAMTA", "CAMTA"), color="black", size=4, fontface="bold")
plot_M11_up1 
Picking joint bandwidth of 116

4 References

Contreras-Moreira B, Castro-Mondragon J.A, Rioualen C, Cantalapiedra C.P, van Helden J. (2016) RSAT::Plants: Motif Discovery Within Clusters of Upstream Sequences in Plant Genomes. In: Hehl R. (eds) Plant Synthetic Promoters. Methods in Molecular Biology, vol 1482. Humana Press, New York, NY

Thomas-Chollier M, Herrmann C, Defrance M, Sand O, Thieffry D, van Helden J. (2012). RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets. Nucleic Acids Res 40(4): e31.

Thomas-Chollier M, Darbo E, Herrmann C, Defrance M, Thieffry D, van Helden J. (2012). A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs. Nat Protoc 7(8): 1551-1568.