Author: Najla Ksouri

Revised by: Bruno Contreras-Moreira

Last updated: 15/05/2024


1 Description

This hands-on tutorial serves as a supplement to our article entitled “Motif analysis in co-expression networks reveals regulatory elements in plants: The peach as a model”. It offers step-by-step instructions to reproduce the analysis and test it on the user’s data using the Regulatory Sequence Analysis Tools (RSAT) directly from the Unix shell.

Doing so, users are able to execute larger jobs and exert finer control over the tool parameters.

1.1 Installation of RSAT

As the installation of RSAT can be somehow complex due to multiple tools, dependencies, and configuration steps, we highly recommend using its Docker image.

Please be aware that this repository hosts a deprecated Docker image for Regulatory sequence analysis tools (RSAT).

For the current and updated ready-to-use image, visit BioContainer. Detailed information about the Dockerfile is available on GitHub.

If you haven’t set up a Docker engine on your machine yet, please refer to the instructions provided by the Docker community for a simplified installation procedure.


# 1. download the container (from the Linux/WSL terminal)
docker pull biocontainers/rsat:20240507_cv1

# 2. Create local folders for input data and results named respectively as "rsat_data/" and "rsat_results/"
# In addition, a subfolder (rsat_data/genomes) should be created too

# assuming you are at $HOME 
mkdir -p rsat_data/genomes rsat_results
chmod -R a+w rsat_data rsat_results   # Change the permissions


# 3. Launch the Docker RSAT container
docker run --rm -v $HOME/rsat_data:/packages/rsat/public_html/data/ -v $HOME/rsat_results:/home/rsat_user/rsat_results -it biocontainers/rsat:20240507_cv1

# Ps: You should see a warning that can be safely ignored: 
# * Starting Apache httpd web server apache2
# (13)Permission denied: AH00091: apache2: could not open error log file /var/log/apache2/error.log.
# AH00015: Unable to open logs
# Action 'start' failed.
# The Apache error log may have more information

# 4. Download organism from public RSAT plant server (in this case Prunus persica)
# Ps: Other available servers are http://fungi.rsat.eu, http://metazoa.rsat.eu, http://protists.rsat.eu and http://teaching.rsat.eu

download-organism -v 2 -org Prunus_persica.Prunus_persica_NCBIv2.38 -server https://rsat.eead.csic.es/plants

# 5. Test the container
cd rsat_results 
make -f ../test_data/peak-motifs.mk RNDSAMPLES=2 all

Once the session is launched, you will be logged as an “rsat_user” user in “$HOME/rsat_user”. Docker container are supposed to be light-weighted, so that any created files are immediately deleted once you exit.

To solve this issue, we have mounted two local directories “rsat_results” and “rsat_data” in the host machine serving as a persistent storage volume inside the RSAT docker container.

Installed genomes will be saved in “rsat_data/genomes” folder and results from this tutorial will be saved in “rsat_results”.

1.2 Launch Docker Web server

To connect to RSAT Web server running from Docker container (Linux only):


# Retrieve your IP adress
hostname -I

# Launch Docker Web server
# Password: rsat_2020
sudo service apache2 restart

# Launch the following URL in your browser, using the IP address, ie http://172.17.0.2/rsat

In addition to logging into the Docker container as explained in the previous section, you can also call individual tools from the terminal non-interactively. See the example


2 Let´s get started

We will use one showcase module (Module 11) produced from our analysis (see article). Module 11 can be found within the docker container in the sub-directory “$HOME/rsat_user/test_data”.

Our approach of de novo motif discovery relies on the combination of different steps:

  1. Retrieving the upstream sequences from the test genes
  2. Creating the background model
  3. Building the negative control clusters and retrieving their upstream sequences
  4. Running the peak-motifs tool under the differential mode englobing various programs of RSAT suite:
    • Motif discovery
    • Motif comparison
    • Matrix scanning

2.1 Retrieve upstream sequences from the test genes

As we aim to test the effect of four upstream tracts on DNA-motifs discovery, four different upstream sequences need to be retrieved from each gene in module M11.
Up 1: [-1500 bp ; +200 bp], Up 2: [-500 bp ; +200 bp], Up 3: [-500 bp ; 0 bp], Up 4: [0 bp ; +200 bp].


Note 3: By default, the reference point is the first DNA base of an annotated gene, which usually corresponds to the beginning of a transcript. In cases where there is no untranslated region or where genes have been predicted de novo, position 0 would be the A of the start codon (ATG). Negative coordinates indicate sequences upstream the reference point (toward the 5’end on the sense strand) while positive ones correspond to downstream sequences (towards the 3’end).

Generally, regulatory signals are located upstream the promoter (in the non-coding sequence). However, sometimes they may overlap the coding sequence. For this reason we analyze here both non-coding and coding regions.

We recommend to generate a repeat-masked output file (-rm) as the presence of repetitive elements in the genome can hamper the detection of motifs. Indeed they have a distinct composition than the rest of the genome.

2.2 Create the background model

We define the background as the set of upstream sequences collected from all genes of Prunus persica genome. The resulting background will be used to estimate the random expectation of each oligonucleotide (Contreras-Moreira et al., 2016).

2.3 Build the negative control clusters and retrieve their upstream sequences

Bioinformatic approaches for de novo motif discovery are by essence predictive. Thus the need to evaluate the rate of false positive predictions. To do so, our strategy proceeds to build 50 sets of random genes called “negative control clusters” (Contreras-Moreira et al., 2016).

RSAT suite offers a variety of tools to build the sets of random control genes.

2.4 Run the peak-motifs pipeline in differential mode

To discover exceptional motifs in the promoters of M11, we employ the “Peak-motifs” tool in differential mode. This differential analysis requires two data sets as input (test versus control) and will be run in two separated phases:

Phase 1: Considering the promoter sequences of “test genes” as the test set and the promoter sequences of background as the control.

Phase 2: Considering the promoter sequences of each random cluster as test and keeping the background sequences as control.

Note 4: Peak-motifs gathers three essential components for putative motif detection. Here, we describe the parameters set for each component.

  • Motif discovery
  • – Relies on a serie of complementary algorithms() oligo analysis, dyad analysis, position analysis and local word analysis). However, when executing the differential mode, position and local word will be automatically ignored.

    – In differential mode, the markov order of the background model is ignored.

    – For the oligo-analysis, we set the oligomer length from 5 to 8 (-minol 5 -maxol 8) because it gave us better results but can be setted as (-minol 6 -maxol 8)

    – The number of motifs returned for each algorithm was set to 5 (-nmotifs 5). A higher number of motifs may increase the computing time.

  • Motif comparison
  • Discovered patteerns are compared to one or several collections of known motifs to identify transcription factors (TFs) that may recognize and bind the discovered motifs. Here we choose “footprintDB plants database” in TRANSFAC format.

  • Matrix scanning
  • Peak sequences are scanned to predict their binding sites positions using the default Markov order m=1 (-scan_markov 1). Predicted sites are then plotted for better visualization of the binding preferences.

    Here, we attach a detailed makefile including all the steps.

    Click here to download the makefile

    MAKEFILE=/rsat_results/peak-motifs.mk
    
    # Define the directories
    DIR_DATA=/home/rsat_user/test_data
    UP1=~/rsat_results/upstream1
    UP2=~/rsat_results/upstream2
    UP3=~/rsat_results/upstream3
    UP4=~/rsat_results/upstream4
    
    # Define the species of interest and the four upstream boundaries
    SPECIES=Prunus_persica.Prunus_persica_NCBIv2.38
    REGULON=M11
    FROM1=-1500
    TO1=+200
    FROM2=-500
    TO2=+200
    FROM3=-500
    TO3=0
    FROM4=0
    TO4=+200
    
    # Define the corresponding background for each upstream interval
    BGMSK1=${UP1}/${SPECIES}-up1.rm.fna
    BGMSK2=${UP2}/${SPECIES}-up2.rm.fna
    BGMSK3=${UP3}/${SPECIES}-up3.rm.fna
    BGMSK4=${UP4}/${SPECIES}-up4.rm.fna
    
    # Precise the number of random clusters
    RNDSAMPLES=50 
    
    # Collect the different upstream sequences from module M11 and from your reference genome.
    retrieve:
        @echo "Creating the result directories"
        @mkdir -p ${UP1}
        @mkdir -p ${UP2}
        @mkdir -p ${UP3}
        @mkdir -p ${UP4}
    
        @echo "Retrieving 4 upstream sequences boundaries from the genes of module M11"
        @retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM1} -to ${TO1} -noorf -i ${DIR_DATA}/${REGULON}.txt -label id -rm -o ${UP1}/regulon${REGULON}_up1.rm.fna
        @echo "${UP1}/regulon${REGULON}_up1.rm.fna"
        
        @retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM2} -to ${TO2} -noorf -i ${DIR_DATA}/${REGULON}.txt -label id -rm -o ${UP2}/regulon${REGULON}_up2.rm.fna
        @echo "${UP2}/regulon${REGULON}_up2.rm.fna"
        
        @retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM3} -to ${TO3} -noorf -i ${DIR_DATA}/${REGULON}.txt -label id -rm -o ${UP3}/regulon${REGULON}_up3.rm.fna
        @echo "${UP3}/regulon${REGULON}_up3.rm.fna"
        
        @retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM4} -to ${TO4} -noorf -i ${DIR_DATA}/${REGULON}.txt -label id -rm -o ${UP4}/regulon${REGULON}_up4.rm.fna
        @echo "${UP4}/regulon${REGULON}_up4.rm.fna"
    
        @echo "Retrieving 4 upstream sequences boundaries from all Prunus persica genome"
        @retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM1} -to ${TO1} -noorf -all -label id -rm -o ${BGMSK1}
        @retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM2} -to ${TO2} -noorf -all -label id -rm -o ${BGMSK2}
        @retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM3} -to ${TO3} -noorf -all -label id -rm -o ${BGMSK3}
        @retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM4} -to ${TO4} -noorf -all -label id -rm -o ${BGMSK4}
    
    # Generate 50 random clusters 
    random:
        @echo "Creating ${RNDSAMPLES} random clusters"
        @for r in `seq 1 ${RNDSAMPLES}`; do \
            echo Replicate $$r; \
            NSEQS=`wc -l ${DIR_DATA}/${REGULON}.txt`; \
            RNDREGULON=random$$r.txt; \
            random-genes -n $$NSEQS} -org ${SPECIES} -feattype gene -g 1 -o $$RNDREGULON; \
        done
        
        @echo "retrieve the different upstream sequence lengths from the random clusters"
        @for r in `seq 1 ${RNDSAMPLES}`; do \
            echo Replicate $$r; \
            RNDREGULON=random$$r.txt; \
            RNDMSK=random$${r}.rm.fna; \
            retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM1} -to ${TO1} -noorf -rm -i $$RNDREGULON -label id -o ${UP1}/$$RNDMSK; \
            retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM2} -to ${TO2} -noorf -rm -i $$RNDREGULON -label id -o ${UP2}/$$RNDMSK;\
            retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM3} -to ${TO3} -noorf -rm -i $$RNDREGULON -label id -o ${UP3}/$$RNDMSK;\
            retrieve-seq -org ${SPECIES} -feattype gene -from ${FROM4} -to ${TO4} -noorf -rm -i $$RNDREGULON -label id -o ${UP4}/$$RNDMSK;\
        done
    
    ####################################################################################
    # THIS SECOND PART IS FOR PEAK-MOTIFS ANALYSIS IN DIFFERENTIAL MODE
    ####################################################################################
    
    # Define the directories
    INPUT1=${UP1}/regulon${REGULON}_up1.rm.fna
    INPUT2=${UP2}/regulon${REGULON}_up2.rm.fna
    INPUT3=${UP3}/regulon${REGULON}_up3.rm.fna
    INPUT4=${UP4}/regulon${REGULON}_up4.rm.fna
    FOOTDBFILE=/packages/rsat/public_html/motif_databases/footprintDB/footprintDB.plants.motif.tf
    DISCO=oligos,dyads
    PM_TASKS=purge,seqlen,composition,disco,merge_motifs,split_motifs,timelog,motifs_vs_db,synthesis,small_summary,clean_seq
    
    # Run peak-motifs pipeline
    peakmotifs: 
        @echo "Running peak-motifs for M11 within upstream1"
        @peak-motifs -i ${INPUT1} -ctrl ${BGMSK1} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP1}/regulon${REGULON}.rm.fna.peaks-rm -title analysis_M11 -origin end \
        -disco ${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS}
        @rm -rf ${UP1}/regulon${REGULON}.rm.fna.peaks-rm/data
        
        @echo "Running peak-motifs for random clusters within upstream1"
        for r in `seq 1 ${RNDSAMPLES}`; do \
            echo Replicate $$r; \
            RNDMSK=random$${r}.rm.fna; \
        peak-motifs -i ${UP1}/$${RNDMSK} -ctrl ${BGMSK1} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP1}/random$$r.rm.fna.peaks-rm -title analysis_random$$r -origin end -disco \
        ${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS};\
        rm -rf ${UP1}/random$$r.rm.fna.peaks-rm/data;\
        done
    
        @echo "Running peak-motifs for M11 within upstream2"
        @peak-motifs -i ${INPUT2} -ctrl ${BGMSK2} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP2}/regulon${REGULON}.rm.fna.peaks-rm -title analysis_M11 -origin end \
        -disco ${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS}
        @rm -rf ${UP2}/regulon${REGULON}.rm.fna.peaks-rm/data
        
        @echo "Running peak-motifs for random clusters within upstream2"
        for r in `seq 1 ${RNDSAMPLES}`; do \
            echo Replicate $$r; \
            RNDMSK=random$${r}.rm.fna; \
        peak-motifs -i ${UP2}/$${RNDMSK} -ctrl ${BGMSK2} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP2}/random$$r.rm.fna.peaks-rm -title analysis_random$$r -origin end -disco \
        ${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS};\
        rm -rf ${UP2}/random$$r.rm.fna.peaks-rm/data;\
        done
    
        @echo "Running peak-motifs for M11 within upstream3"
        @peak-motifs -i ${INPUT3} -ctrl ${BGMSK3} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP3}/regulon${REGULON}.rm.fna.peaks-rm -title analysis_M11 -origin end \
        -disco ${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS}
        @rm -rf ${UP3}/regulon${REGULON}.rm.fna.peaks-rm/data
        
        @echo "Running peak-motifs for random clusters within upstream3"
        for r in `seq 1 ${RNDSAMPLES}`; do \
            echo Replicate $$r; \
            RNDMSK=random$${r}.rm.fna; \
        peak-motifs -i ${UP3}/$${RNDMSK} -ctrl ${BGMSK3} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP3}/random$$r.rm.fna.peaks-rm -title analysis_random$$r -origin end -disco \
        ${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS};\
        rm -rf ${UP3}/random$$r.rm.fna.peaks-rm/data;\
        done
    
        @echo "Running peak-motifs for M11 within upstream4"
        @peak-motifs -i ${INPUT4} -ctrl ${BGMSK4} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP4}/regulon${REGULON}.rm.fna.peaks-rm -title analysis_M11 -origin end \
        -disco ${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS}
        @rm -rf ${UP4}/regulon${REGULON}.rm.fna.peaks-rm/data
        
        @echo "Running peak-motifs for random clusters within upstream4"
        for r in `seq 1 ${RNDSAMPLES}`; do \
            echo Replicate $$r; \
            RNDMSK=random$${r}.rm.fna; \
        peak-motifs -i ${UP4}/$${RNDMSK} -ctrl ${BGMSK4} -motif_db footDB transfac ${FOOTDBFILE} -prefix peaks -outdir ${UP4}/random$$r.rm.fna.peaks-rm -title analysis_random$$r -origin end -disco \
        ${DISCO} -nmotifs 5 -minol 5 -maxol 8 -scan_markov 1 -noov -img_format png -task ${PM_TASKS}; \
        rm -rf ${UP4}/random$$r.rm.fna.peaks-rm/data; \
        done
        
    
    ####################################################################################
    # THIS THIRD PART IS FOR MATRIX-SCANNING
    ####################################################################################
    
    SCAN_INPUT=${UP1}/regulon${REGULON}.rm.fna.peaks-rm/results/discovered_motifs/
    matrix=m1 #here we are going to scan only the first matrix
    SCAN_OLIGO_OUTPUT=${UP1}/scan_results/oligo/
    SCAN_DYAD_OUTPUT=${UP1}/scan_results/dyad/
    SCANMAXPVALUE=1e-4
    
    scan:
        @mkdir -p ${SCAN_OLIGO_OUTPUT}
        @mkdir -p ${SCAN_DYAD_OUTPUT}
        @echo "Run the matrix-scan within long upstream region -1500 +200 for oligo-motif"
        @matrix-scan -v 1 -matrix_format transfac -m ${SCAN_INPUT}/oligos_5-8nt_m1/peaks_oligos_5-8nt_m1.tf -i ${UP1}/regulon${REGULON}_up1.rm.fna \
        -seq_format fasta \
        -pseudo 1 -decimals 1 -2str -origin end -bginput -markov 1 -bg_pseudo 0.01 -return limits -return sites -return pval -lth score 1 -uth pval ${SCANMAXPVALUE} \
        -n score -o ${SCAN_OLIGO_OUTPUT}/scan_oligo_up1.tab
        
        @echo "Run the matrix-scan within long upstream region -1500 +200 for dyad-motif"
        @matrix-scan -v 1 -matrix_format transfac -m ${SCAN_INPUT}/dyads_test_vs_ctrl_m1/peaks_dyads_test_vs_ctrl_m1.tf -i ${UP1}/regulon${REGULON}_up1.rm.fna -seq_format fasta \
        -pseudo 1 -decimals 1 -2str -origin end -bginput -markov 1 -bg_pseudo 0.01 -return limits -return sites -return pval -lth score 1 -uth pval ${SCANMAXPVALUE} \
        -n score -o ${SCAN_DYAD_OUTPUT}/scan_dyad_up1.tab
    
    # Run all target at once
    all: retrieve random peakmotifs scan

3 Interpretation of the results

3.1 Sequence analysis

To ensure clarity, the interpretation of the results will be confined to the boundary Up 1: [-1500 bp; +200 bp]. The same reasoning will be applied to other upstream regions.

Let’s begin by examining the length distribution of the returned promoter sequences. For instance, we can easily execute the following command:

# Verify sequence lengths of Upstream 1
sequence-lengths -i ${UP1}/regulon${REGULON}_up1.rm.fna

Figure 1. Sequences length distribution.

Note 5: As illustrated, some sequences are shorter than the expected length of 1701 bp when the boundaries are set from -1500 bp to +200 bp.

This discrepancy can be explained by the activation of the “-noorf” option in the retrieve-seq tool. When this option is enabled, upstream sequences are automatically clipped to prevent overlap with neighboring genes. For instance, in Figure 2, if gene 1 is the query gene, it is common to have a close upstream neighbor, such as gene 2. The “-noorf” option ensures that the upstream sequences do not extend into the neighboring gene’s region.


Figure 2. Schematic representation of overlapping neighbor genes.

3.2 Analysis of the predicted sites

The complete results are available in an HTML report called “peak_synthesis.html.” This report provides detailed information for each algorithm, including five motifs with their corresponding logos in both direct and reverse orientations. Additionally, the report includes data on statistical significance, E-value, annotation, Ncor, matrix, and other relevant parameters for each motif.

In the present case, Motif 2 and Motif 6 are considered the best-scored potential motifs, detected respectively by oligo and dyad analysis. These motifs have been identified as having the highest significance and Ncor score, as illustrated in Figure 3.

Figure 3. Top significant motifs discovered by aligo and dyad algorithms.


However, to ensure a reliable motif comparison, we recommend re-comparing these two selected motifs against the FootprintDB Plants database, using stringent values for the “cor” and “Ncor” parameters.

As depicted in Figure 4, both oligo-analysis (top) and dyad analysis (bottom) reveal that the best match among plant motifs in FootprintDB corresponds to the CAMTA transcription factor. The Ncor scores for oligo-analysis and dyad analysis are 0.64 and 0.879, respectively.