About MEBS

The main goal of MEBS is capture with a single value the importance of complex metabolic pathways or biogeochemical cycles in a large omic datasets (either genomes or metagenomes). The algortithm has been thoroughly tested with the sulfur cycle, but N, O, CH4, and Fe cycles are also supported.

Use the script mebs.pl to avoid reading the manual and score your own genomes/metagnomes in terms of biogeochemical cycles. All that is required is a directory containing peptide FASTA files of encoded proteins/fragments with .faa extension.

Installation

The MEBS software is available as an open-source package distributed from a GitHub repository. Thus, the natural way of installing it is by cloning the repository via the following commands:

git clone https://github.com/eead-csic-compbio/metagenome_Pfam_score

#Alternatively, a ZIP file can be downloaded and then unpacked:

unzip metagenome_Pfam_score-master.zip

Requisites

Before you start, make sure you have hmmserch installed, v3.1b1 or greater, otherwise the program will generate errors, see issue 1

Cycles

The following biogeochemical cycles are ready to use with MEBS:

  1. sulfur: Includes the mobilization of inorganic and inorganic sulfur compounds
  2. carbon:Usage of CH4 compounds by methanotrophs, methanogens, and methylotrophs
  3. oxygen: Represented by oxygenic photosynthesis
  4. iron: The Fe reduction and oxidation including also siderophores uptake
  5. nitrogen: We included the pathways involved in the reduction and oxidation of both inorganic (nitrate(+5) to ammonia(-3) ) and organic nitrogen compounds (i.e taurine, urea, and choline degradation)
perl mebs.pl -cycles

# Available cycles:
sulfur
carbon
oxygen
iron
nitrogen

Modalities

The main modalities of MEBS are

  1. Score: Capturing the metabolic machinery of your genome or metagenome in terms of single scores (Basic Mode)
  2. Markers: Detect possible marker genes according to their informational content (Advance mode)
  3. Completeness: Evaluate the metabolic completeness of metabolic pathways (Basic Mode)
  4. Kegg visualization: Visualizate the protein domains in your genome or metagenome using KEGG visualization (Basic mode)

Running MEBS

Starting point

MEBS assumes that your sequencing data is in fasta format allocated in an especific directory. Example of the input data can be foun in test_genomes/ or test_metagenomes/ directories

MEBS can only accept fasta files if they have .faa extension

To run MEBS you only need to specifyt the input folder and the type of data (either genomic or metagenomic). If you are working with MAGs (metagenome-assembled genomes), use the option ‘genomic’. The type of file is required for MEBS to allocate the pre-computed entropies for each type of data considering the fragmentary nature of the metagenomic sequences.

Genomic data

perl mebs.pl  -input test_genomes/ -type genomic 
# mebs.pl -input test_genomes/ -type genomic -fdr 0.01 -comp 0


    sulfur  carbon  oxygen  iron    nitrogen
Synechococcus_sp_KORDI-52.faa   2.425   4.877   9.746*  3.460   13.434
Desulfohalobium_retbaense_DSM_5692.faa  11.511* 9.170   3.521   4.749   9.052
Methanosarcina_barkeriMS.faa    5.930   79.606* 5.469   0.208   10.704
Archaeoglobus_profundus_DSM_5631.faa    12.024* 24.795* 2.036   0.911   6.928
Candidatus_Nitrosopumilus_sp_AR2.faa    3.256   5.377   3.229   2.624   9.807
Bradyrhizobium_oligotrophicum_S58.faa   1.211   6.366   8.871*  7.482   19.970*
Dechloromonas_aromatica_RCB.faa 3.941   10.369  8.004*  10.100* 17.793*
Enterococcus_durans.faa -0.194  0.349   1.433   0.337   2.652

The scores that meets the criteria of specific FDR are shown with an asterisc, yet the score will be the same regardless of the FDR that is used. If the Score if greater or equal to the FDR of choice, then an asterisc will be shown in the output. In the case of using the default FDR (0.01), more false positive will be obtained, for example the genome Archaeoglobus profundus a well known microorgnism involved in the S-cycle, could seem to have a CH4 metabolism by using a default FDR, however if we increase to FDR 0.001, the C cycle asterisk disappear and only the * S-cycle asterisc* ramain. Therefore, we recomend a more restrictive FDR in order to eliminate false positives.

If you attempt to benchmark your own metabolism, we recomend to add your own FDR values in this config file at the end of this file.

Metagenomic data

In the case of using metagenomic data metagenomic data the mean size length (MSL) and the size of the allocated entropies (MSLbin) is especify as warning. If you redirect the output of the script to a file this information will not be printed.

perl mebs.pl -input test_metagenomes -type metagenomic
# mebs.pl -input test_metagenomes -type metagenomic -fdr 0.01 -comp 0

# Computing Mean Size Length (MSL) ...
# 4511045.3_metagenome.faa MSL=32 MSLbin=30
# 4440966.3_metagenome.faa MSL=175 MSLbin=150

    sulfur  carbon  oxygen  iron    nitrogen
4511045.3_metagenome.faa    -2.295  1.790   5.412   2.745   13.024
4440966.3_metagenome.faa    5.817   8.804   1.178   4.579   11.697

As MEBS needs to compute the MSL of each metagenome the time will depend on the size of the input sample. In the test_metagenomes/ directory there are only two metagenomes with size of 2,8M and 6,5M respectively. The computation time will depend on the size and the number of metagenomes in the input forlder. I

Ploting your results

The output from MEBS is a tabular file text, where the columns indicate the scores (sulfur, carbon, oxygen iron, nitrogen), and the completeness of the pathways related to S, C,N and Fe cycles.

To plot MEBS results, redirect the output of MEBS to a file and make sure that you are using the -comp option (pathway completeness)

perl mebs.pl  -input test_genomes/ -type genomic -comp > test.genomes.tsv 
# mebs.pl -input test_genomes/ -type genomic -fdr 0.01 -comp 1

Eliminate the asterisks in the file In order to plot the results

 sed -i 's/\*//g' test.genomes.tsv 

Make sure that you have the following libraries installed.

  1. Numpy >= 1.14.5
  2. Matplotlib => 2.2.2
  3. Pandas => 0.23.3
  4. seaborn => 0.9.0
  5. scipy => 1.1.0

Run the script within MEBS /home directory. The script requires the file dataset_heatmap_template.txt in order to create an ad hoc file to mapp the scores and the completeness of the pathways into itol.

python3 mebs_output.py test.genomes.tsv 

No handles with labels found to put in legend.
Done........................
Please check the following files:
 1. Heatmap displaying the metabolic completeness of N,Fe,S and CH4 pathways: test.genomes.tsvcomp_heatmap.png
 2. Barplot with normalized MEBS score values: test.genomes.tsv_barplot.png
 3. Heatmap with normalized MEBS score values: test.genomes.tsv_mebs_heatmap.png
 4. Dotplot with normalized MEBS score values: test.genomes.tsv_mebs_dotplot.png
 5. Completeness file with description of the columns: test.genomes.tsv_completenes.tab
 6. Mapping file to itol with normalized MEBS scores: test.genomes.tsv_itol_mebs.txt
 7. Mapping file to itol with metabolic completeness: test.genomes.tsv_itol_mebs_comp.txt
 .............................
  If you have a tree file loaded in  itol, you can drag directly the _itol.txt files into your tree
 and customize the colors of the pathways and the scores as in the following example
 https://itol.embl.de/tree/97981518041461538630153
Figure 1.Graphs derived from mebs_output.py script using the test.genomes.tsv file derived from mebs.pl output using the -comp option.

Figure 1.Graphs derived from mebs_output.py script using the test.genomes.tsv file derived from mebs.pl output using the -comp option.

MEBS scores

If you want to compare your MEBS scores with a dataset of 2107 non-redundant genomes plese click at the following TABLE. To compare your data with the maximum scores that you can obtain from the entropy data, have a look at the following data

If you are computing MEBS in genomes compare your results with the raw “Genomic data”. In the case that you are computing MEBS in metagenomes see the corresponding MSL and MSLbin you compare your results.

sulfur methane oxygen iron nitrogen
Genomic data 16.018 85.332 10.703 10.464 22.079
30 13.676 84.503 10.438 8.843 20.642
60 16.818 85.347 11.253 9.567 22.148
100 15.566 85.221 9.965 10.676 21.43
150 15.848 84.81 10.152 10.316 21.379
200 15.887 84.765 10.463 9.832 21.938
250 16.031 85.057 10.387 10.215 21.853
300 15.929 84.942 10.569 10.284 21.968

Download public available MG-RAST data

If you want to analyze public available data, or compare your results, we provide an example of how to download a and analyze available data from MG-RAST.

To do so, first generate a file with MG-RAST id’s of interest. As example, we provide a list of microbialites (microbial mats and stromatolites) and other environments publicly available from MG-RAST. Please have a look at the table containing metadata of the metagenomic dataset described in De Anda et al., 2017

MG-RAST id Type and origin Reference
4449591.3 4449590.3 Stromatolites from Highborne Cay, Bahamas Khodadad & Foster,2012
4445126.3 4445129.3 Polar microbial mats (Antartic) Varin et al., 2012
4442467.3 4442466.3 4441363.3 4441347.3 Freshwater microbial mats from Cuatro Cienegas Basin (CCB Bonilla-Rosso et al., 2012, Peimbert et al., 2012
4443746.3 4443747.3 4443762.3 4443749.3 4443750.3 Microbial mats from Yellowstone National Park Bhaya et al., 2007
4454153.3 Purple sulfur bacteria biofilm Wilbanks et al., 2014
4440060.4 4440067.3 Stromatolite and thrombolite from CCB Breitbart et al., 2009, Desnues et al., 2008
4487624.3 4487625.3 Hydrothermal vents Taiwan Tang et al., 2013
4441138.3 4441137.3 Acid Mine Dreinage (AMD) mats Jiao et al.,2011
4449206.3 Hydrothermal vent Andes Colombia Jiménez et al., 2012
4491734.3 Polar cryconite Edwards et al., 2014
4744007.3 4744008.3 4744009.3 4744010.3 4744011.3 4744012.3 4744013.3 4744014.3 4744015.3 4744016.3 4744017.3 4744018.3 Microbial mats from Churince CCB De Anda et al., 2018

You can also include freshwater microbrialites from Pavilion Lake, Clinton Creek described in Allen-White III et al., 2016, Allen-White III., et al., 2016, Respectively, hypersaline from marine environements described in Ruvindy et al., 2016

head IDmats.txt
4449591.3
4449590.3
4445126.3
4445129.3
4442467.3
4442466.3
4441363.3
4441347.3

Download 22 metagenomes using the MG-RAST API

This can take an hour depending on your download bandwith. We recomend you to have enough space drive to perform this action.

mkdir microbialites
cd microbialites 

time for line in `cat ../IDmats.txt`; do wget "http://api.metagenomics.anl.gov/1/download/mgm$line?file=350.1" -O $line.faa ; done

Compute MEBS

Use restrictive FDR and save the output in a tabular file

perl mebs.pl -input test_mats -input microbialites  -type metagenomic -fdr 0.0001 -comp > scores.tab
# mebs.pl -input test_mats -type metagenomic -fdr 0.0001 -comp 1

# Computing Mean Size Length (MSL) ...
# 4441588.3.faa MSL=146 MSLbin=150
# 4442467.3.faa MSL=99 MSLbin=100
# 4443749.3.faa MSL=210 MSLbin=200
# 4441137.3.faa MSL=209 MSLbin=200
# 4441571.3.faa MSL=213 MSLbin=200
# 4442466.3.faa MSL=75 MSLbin=60
# 4493725.3.faa MSL=223 MSLbin=200
........

Summary of MEBS basic mode

Below we summarize the internal steps performed with MEBS in the basic mode using the main script mebs.pl

MEBS flowchart basic mode

MEBS flowchart basic mode

Advance mode

Requisites

The following external packages are required if you want to benchmark your own metabolic pathway. Interproscan and hmmsearch are needed in order to annotate Pfam domains within peptide sequences. The rest of packages are needed to run the full pipeline, which comprises four steps.

  1. Interproscan
  2. Python3
  3. Matplotlib 1.4 or greater
  4. Numpy
  5. Pandas
  6. Scikit-learn
  7. Jupyter-notebook
  8. MPL_toolkits

Scoring your data: Train your own classifier. Advanced Mode

For more advanced uses a manual is provided. The required input data are:

  1. FASTA file with peptides sequences of proteins involved in the cycle/pathway of interest.
  2. List of RefSeq accesions of (curated) genomes known to be involved in the cycle/pathway of interest.

These inputs are processed in order to train a classifier which internally uses Pfam domains.

As seen above, genomes or metagenomes provided by the user can then be scored with the trained classifier. Once a classifier has been trained, such as the Sulfur cycle, steps 1 and 3 can be skipped.

MEBS flowchart advance mode

MEBS flowchart advance mode

Support and Development

Planned feature improvements are publicly catalogued at the main MEBS development site on github. Bug reports and problems using MEBS are welcome on the issues tracker. We prefer posting to the issue tracker over email as these posts are searchable by other users who may experience the same problems.

Talk to us

We’d love to hear from you. Any comments and suggestions can be sent to Valerie De Anda valdeanda at ciencias dot unam dot mx

Cite

If you find this software usefull please cite us as:

  • De Anda V, Zapata-Penasco I, Poot Hernandez AC, Fruns LE, Contreras Moreira B, Souza V (2017) MEBS, a software platform to evaluate large (meta)genomic collections according to their metabolic machinery: unraveling the sulfur cycle. doi:10.1093/gigascience/gigascience/gix096/4561660

2018-10-18