Pangenome analysis of plant transcripts and coding sequences

Bruno Contreras-Moreira (1)*, Alvaro Rodriguez del Rio (1), Carlos P. Cantalapiedra (1), Ruben Sancho (2) and Pablo Vinuesa (3)

  1. Estación Experimental de Aula Dei-CSIC, Av. Montañana 1.005, 50059 Zaragoza, Spain.
  2. Escuela Politécnica Superior, Universidad de Zaragoza, Huesca, Spain.
  3. Centro de Ciencias Genómicas, Universidad Nacional Autónoma de México, Cuernavaca, Mexico.

i. Summary/Abstract

The pangenome of a species is the sum of the genomes of its individuals. As coding sequences often represent only a small fraction of each genome, analyzing the pangene set can be a cost-effective strategy for plants with large genomes or highly heterozygous species. Here we describe a step-by-step protocol to analyze plant pangene sets with the software GET_HOMOLOGUES-EST. After a short introduction, where the main concepts are illustrated, the remaining sections cover the installation and typical operations required to analyze and annotate pantranscriptomes and gene sets of plants. The recipes include instructions on how to call core and accessory genes, how to compute a presence-absence pangenome matrix and how to identify and analyze private genes, present only in some genotypes. Downstream phylogenetic analyses are also discussed.

ii. Key Words

  • Pangenome, pangene set, crops, model plants, wild plants, polyploids, scripting

iv. Running head

Defining pangene sets with GET_HOMOLOGUES-EST

v. Citation

Contreras-Moreira, B., del Río, Á.R., Cantalapiedra, C.P., Sancho, R., Vinuesa, P. (2022). Pangenome Analysis of Plant Transcripts and Coding Sequences. In: Pereira-Santana, A., Gamboa-Tuz, S.D., Rodríguez-Zapata, L.C. (eds) Plant Comparative Genomics. Methods in Molecular Biology, vol 2512. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-2429-6_9

1. Introduction

1.1 Background

A pangenome can be defined as the sum of the core genome, shared by all individuals of a species, plus the dispensable genome, which includes partially shared and population-specific genes (1). Among plants, previous works have compared ecotypes of model species and cultivars of crops such as maize, barley, soybean, or rice, revealing that dispensable genes play important roles in evolution, and in the complex interplay between plants and the environment (reviewed at (2)). Moreover, accounting for accessory genomic features improves the association of genotypes to phenotypes beyond SNPs called on a single reference genome (3), (4). For these reasons we prefer to rename dispensable genes as “accessory” or “shell” genes, terms borrowed from microbiology (5). In summary, pangenomes are the new reference sequences for plant genomic studies (6).

The data structures used to represent pangenomes are evolving in parallel with the sequencing technologies, and the state-of-the-art are the so called pangenome graphs, which significantly improve variant calling by read mapping (4), (7) and have multiple implementations (see for instance (8)). However, there are currently no community accepted solutions for visualizing or setting coordinate systems in genome graphs. More importantly, the input genome sequences used to construct a graph need to be highly contiguous. This contrasts with most plant genome assemblies found in the public archives, which are often fragmented, particularly for species with large, highly repetitive, heterozygous and polyploid genomes. For these reasons, approaches that do not require a fully-assembled reference genome are of interest for plant breeding and ecology. Some of these strategies reduce the natural complexity of genomes by computing the frequency of nucleotide words and looking for enriched subsets of sequences (9), (10). Other strategies, like the one presented in this protocol around GET_HOMOLOGUES-EST (11), take transcripts or coding sequences (CDS) as the genomic unit of interest (see examples at (12), (13)). Therefore, a more appropriate name for this approach would be pangene set analysis.

Compared to whole genome sequencing (WGS) and assembling, pangene analyses have the advantage of sampling only the expressed part of the genome, the transcriptome, which is only a fraction of the complete genome. Recent technological advances, such as single-molecule long-read sequencing, are producing transcripts with unprecedented accuracy (14), even without a reference genome (15). Their main disadvantage is that genetic variation in non-expressed sequences cannot be sampled. Nevertheless, the definition of pangene sets is a necessary step for pangenome projects of crops and plants, as a way to deduce consistent gene collections with uniform nomenclature across genotypes.

1.2 Pangenome history and concepts

In 2002 Welch and colleagues (16) compared the genome sequences of three strains of the bacteria Escherichia coli, two of them pathogens, and found shared genes, which mostly conserve their positions in the genome, and also accessory genes, which are encoded only in some strains. Three years later, Tettelin and collaborators (1) analyzed 8 strains of a Gram+ bacteria and, for the first time, defined the pangenome as a core genome shared by all strains, plus a dispensable genome consisting of partially shared and strain-specific genes. In addition, a mathematical model suggested that the pangenome of Streptococcus agalactiae might be very large and open, as novel genes would continue to be identified in newly sequenced strains. In 2007 Morgante and co-workers (17) took these ideas and proposed a role for transposable elements (TEs) in generating the dispensable genomic components that differentiate maize inbred lines B73 and Mo17. A few years later they concluded that these components might not be dispensable after all (18). In order to make this classification problem more tractable, dispensability scores are now being computed (19).

Figure 1 summarizes the computational analysis of the pangene set of three toy genomes with GET_HOMOLOGUES-EST. Note that occupancy is defined as the number of genomes/cultivars/ecotypes present in a sequence cluster and in this example takes values from 1 to 3 (see also definitions in Table 1). Core loci are found in all three genomes and hence have occupancy=3 in the example. Accessory loci are allocated to shell and cloud compartments. Although not shown in the figure, it’s often convenient to define a fourth class, the soft-core, which is a relaxed core that tolerates assembly and annotation errors.

Figure 1. pangene set of three genomes (A, B and C). In this example there are 101 core loci and 44 (11+9+24) shell loci are found in two accessions. Moreover, 82 (35+25+22) cloud loci are annotated only in one genome. The pangene set size is 82+44+101=227 loci. The table below shows four example clusters and their occupancy, computed using the rules in Table 1.

class or compartment definition
core Genes contained in all considered genomes/taxa.
soft-core Genes contained in 95% of the considered genomes/taxa, as in (20).
cloud Genes present only in a few genomes/taxa, generally \(\leq 2\). The cutoff is defined as the class next to the most populated non-core cluster class.
shell Remaining genes, present in several genomes/taxa.

Table 1. Definition of occupancy-based classes used by the software GET_HOMOLOGUES-EST.

In addition to occupancy differences, previous benchmarks have shown that occupancy-based classes differ in their functional annotations. The examples in Figure 2 highlight the different functions of core and accessory genes in two bacterial clades; note that similar observations have been made in plants (11), (12).

Figure 2. Functional annotations (COGs) of protein-coding genes classified in occupancy classes in two bacterial sets.

2. Materials

The GET_HOMOLOGUES software was originally designed for the analysis of bacterial genomes, and has been described in (21), (22). That software was then adapted to the study of intra-specific eukaryotic pangene sets, as described in (11), taking the GET_HOMOLOGUES-EST name. Its source code and documentation can be found at https://github.com/eead-csic-compbio/get_homologues. Table 2 summarizes the main differences between the two flavours of the software. In this protocol we will use GET_HOMOLOGUES-EST (see Note 1).

version primary input primary engine align coverage COGS isoforms filtered
GET_HOMS peptides BLASTP / DIAMOND query sequence yes no
GET_HOMS-EST nucleotides BLASTN shortest sequence no yes

Table 2. Summary of features and differences of GET_HOMOLOGUES and GET_HOMOLOGUES-EST. Note that GET_HOMOLOGUES can also cluster user-selected nucleotide features in GenBank files, and in this case BLASTN is used as well.

2.1 Installation and up-to-date documentation

GET_HOMOLOGUES-EST is an open source software package, written in Perl, R and bash, available for Linux and MacOS systems. A manual is available at http://eead-csic-compbio.github.io/get_homologues/manual-est (see Note 2). Some of the scripts included in the package require additional software dependencies, as described in detail in the manual.

There are two ways to install GET_HOMOLOGUES-EST, as a bundled release or by cloning the GitHub repository. Both options, described below, will need the installation of a few dependencies in order to follow this tutorial. Assuming we are in Ubuntu, these can be installed as follows:

sudo apt-get -y install r-base r-base-dev parallel
sudo cpan -i Inline::C Inline::CPP

Instead, the Docker container described in section 2.1.3 is self-contained and does not require any extra installation steps.

2.1.1 Bundled release

The simplest way to get the current release of GET_HOMOLOGUES is to check https://github.com/eead-csic-compbio/get_homologues/releases, download it and extract it in your local filesystem. Let’s assume we have a dedicated folder called soft:

cd soft

# make sure you choose the right version
wget -c https://github.com/eead-csic-compbio/get_homologues/releases/download/v3.4.2/get_homologues-x86_64-20210305.tgz
# or
# wget -c https://github.com/eead-csic-compbio/get_homologues/releases/download/v3.4.2/get_homologues-macosx-20210305.tgz

tar xvfz get_homologues-x86_64-20210305.tgz

Releases include scripts (Perl, R and bash), documentation, sample data and the required binary dependencies, which are listed in Table 3 and discussed in the manual.

software source citation
mcl v14-137 http://micans.org/mcl (25)
NCBI Blast-2.8.1+ http://blast.ncbi.nlm.nih.gov (26)
BioPerl v1.5.2 http://www.bioperl.org (27)
HMMER 3.1b2 http://hmmer.org
Pfam http://pfam.xfam.org (28)
PHYLIP 3.695 http://evolution.genetics.washington.edu/phylip
Transdecoder r20140704 http://transdecoder.sf.net (29)
MVIEW 1.60.1 https://github.com/desmid/mview (30)
diamond 0.8.25 https://github.com/bbuchfink/diamond (31)

Table 3. Dependencies of GET_HOMOLOGUES-EST.

2.1.2 GitHub clone / pull

A more sustainable way of obtaining the software is to use the software git. For this you might need to install git in your system. Note also that the git protocol requires having network port 9418 open, which might be blocked by your firewall. This method does not include the binary dependencies, which must downloaded during installation:

cd soft
git clone https://github.com/eead-csic-compbio/get_homologues.git

# this creates a folder called get_homologues which does not 
# include the binary dependencies; these must be downloaded:

cd get_homologues
perl install.pl no_databases

This approach makes future updates very simple, after moving to the git repository:

cd soft/get_homologues
git pull

2.1.3 Bioconda install

You can also install the software from Bioconda as follows:

conda activate bioconda
$ conda create -n get_homologues -c conda-forge -c bioconda get_homologues
$ conda activate get_homologues

2.1.4 Environment and optional data (Pfam & SwissProt)

Whatever way we installed the software we can now define an environment variable pointing to our install location (note this won’t be required with Bioconda):

# set GET_HOMOLOGUES path, for example:
export GETHOMS=~/soft/get_homologues
# or
# export GETHOMS=~/soft/get_homologues-x86_64-20210305

In order to annotate protein domains and to translate open reading frames (ORFs) a couple of databases must be downloaded and formatted. These are Pfam (28) and SwissProt (32), which you can download and format it as follows in the terminal:

cd $GETHOMS
perl install.pl

You should be able to control the installation process by typing Y or N in the terminal. This script will also tell you of any missing dependencies.

2.1.5 Docker container

A way to use GET_HOMOLOGUES-EST with all dependencies pre-installed is the Docker image available at https://hub.docker.com/r/csicunam/get_homologues. This container also includes GET_PHYLOMARKERS (see section 3.3.6).

2.1.6 High performance cluster (HPC) configuration

In order to prepare your installation to run on a computer cluster please follow the instructions in section “Optional software dependencies” of the manual. Three job managers are currently supported: gridengine, LSF and Slurm. The default configuration is for gridengine, but this can be changed by creating a file named “HPC.conf” and setting appropriate values and paths for your HPC cluster (see Note 3). This file should be placed at $GETHOMS. The sample configuration file looks like this:

cat sample.HPC.conf 
# cluster/farm configuration file, edit as needed (use spaces or tabs)
# comment lines start with #
# PATH might be empty or set to a path/ ending with '/'
# QARGS might be empty or contain specific queue name, resources, etc 
#
# example configuration for LSF
#PATH   /lsf/10.1/linux3.10-glibc2.17-x86_64/bin/
TYPE    lsf
SUBEXE  bsub
CHKEXE  bjobs
DELEXE  bkill
ERROR   EXIT
# 
# example configuration for slurm
TYPE    slurm
SUBEXE  sbatch
CHKEXE  squeue
DELEXE  scancel
ERROR   F

Running GET_HOMOLOGUES-EST on a computer cluster distributes batches of BLASTN/HMMER jobs and the actual clustering tasks (isoforms, orthologues, inparalogues), which is recommended if you plan to analyze a large number of sequence sets.

3. Methods

3.1 Overview of the pipeline and the get_homologues-est.pl script

Unlike its predecessor, which was designed for the analysis of genes within fully sequenced genomes, GET_HOMOLOGUES-EST has been adapted to the large size of plant genomic data sets, and adds new features to adequately handle redundant and fragmented transcript sequences, as those usually obtained from transcript profiling experiments, as well as incomplete/fragmented gene models from WGS assemblies. Sequence clusters can be produced using the bidirectional best-hit (BDBH) or the OrthoMCL (33) (OMCL) algorithms. The granularity of the clusters can be tuned by a configurable filtering strategy based on a combination of BLAST pairwise alignment parameters and, optionally, hmmscan-based scanning of Pfam domain composition of the proteins encoded in each cluster. By default redundant sequences are removed from input sets if they overlap a longer identical sequence over a length \(\geq 40\) or when they are completely matched. This is to prevent BLAST results being biased by transcript isoforms (34). If your input are annotated CDS sequences this behavior can be disabled with option -i 0 .

As summarized in Figure 3, the main script get_homologues-est.pl can produce different outputs:

  • Sequence clusters based on nucleotide similarity, which provide higher resolution than peptide comparisons.
  • Presence-absence pangenome matrices in different formats, which can be easily transformed and exported for different applications.
  • Simulations on the growth of the pangenome at the gene level using different fits (1), (35) and mixture models (36).
  • Average nucleotide sequence identity (ANI) matrices, which contain average sequence identity values among pairs of genomes, computed from clustered sequences. These values are derived from BLASTN alignments.
  • Matrices of percentage of conserved sequence clusters (POCS), which contain the % of clusters that contain sequence from every two species being compared (37).

Figure 3. Features of GET_HOMOLOGUES-EST. Flowchart of the main tasks and deliverables. BLASTN and optional Pfam scans, as well as BDBH and OMCL clustering, can be run on a local computer, preferably multi-core, or over a HPC cluster. Resulting clusters are post-processed to produce pangenome or average nucleotide identity matrices, as well as to estimate pan-, soft-core-, and core-genomes. Note that both clustering algorithms can be fine-tuned by customizing an array of parameters, of which alignment coverage (-C) and same Pfam domain composition (-D) are perhaps the most important. While OMCL is adequate for most applications, BDBH is faster for the calculation of core sequences within large datasets.

As assembled transcripts are often incomplete, and gene models might contain errors, by default GET_HOMOLOGUES-EST computes alignment coverage with respect to the shortest sequence. This adds robustness against split genes, partial genes and genes or transcripts with retained introns (11).

3.2 Other scripts

A few scripts are bundled with GET_HOMOLOGUES-EST to assist in the interpretation of results. We will use some of them in this protocol:

  • compare_clusters.pl primarily calculates the intersection between cluster sets, which can be used to select clusters supported by different algorithms or settings. This script can also produce pangenome matrices and Venn diagrams.

  • parse_pangenome_matrix.pl can be used to analyze pangenome sets, in order to find transcripts/genes present in a group A of strains which are absent in set B. This script can also be used for calculating and plotting cloud, shell and core genome compartments with mixture models.

  • plot_pancore_matrix.pl plots pan/soft/core-genome sampling results and fits regression curves with help from R functions.

  • check_BDBHs.pl can be used, after a previous get_homologues-est.pl run, to find out the bidirectional best hits of a sequence identifier chosen by the user. It can also retrieve its Pfam annotations.

  • annotate_cluster.pl produces a multiple alignment view of the supporting local BLAST alignments of sequences in a cluster. It can also annotate Pfam domains and find private sequence variants to an arbitrary group of sequences.

  • plot_matrix_heatmap.sh calculates ordered heatmaps with attached row and column dendrograms from tab-separated numeric matrices, which can be presence/absence pangenomic matrices or identity matrices as those produced by get_homologues-est with flag -A. Requires R packages ape (38), dendextend, factoextra and gplots (see Note 4).

  • pfam_enrich.pl calculates the enrichment of a set of sequence clusters in terms of Pfam domains, by using Fisher’s exact test.

Apart from these, scripts transcripts2cds.pl and transcripts2cdsCPP.pl are also bundled to assist in the analysis of transcripts. They can be used to annotate potential ORFs contained within raw transcripts, which might be truncated or contain retained introns. The CPP version is faster but requires some optional Perl dependencies (which you should have installed earlier). These scripts use TransDecoder and BLASTX (see Table 3) to scan protein sequences in SWISSPROT (see Note 5).

3.3 Protocol for plant transcripts and CDS sequences

In this section we present a step-by-step protocol for the analysis of Hordeum vulgare WGS-based cDNA sequences (from reference cultivar Morex) and de novo assembled transcriptomes. These sequence sets were first used in (11). For convenience, these barley sequences can be obtained as explained below from files included in the test_barley folder, which should be bundled with your copy of GET_HOMOLOGUES (see Note 6).

3.3.1 Preparing input sequences, outgroups and extracting CDS from transcripts

The main script reads in sequences in FASTA format, which might be GZIP- or BZIP2-compressed, from a folder containing at least two files with extension .fna, one per cultivar/ecotype. You should consider adding sequences from one or more outgroup taxa if you plan to run downstream phylogenomic analyses with the resulting clusters (see Note 7).

By default only sequences with length \(\geq 20\) bases are considered (see Note 8).

Files with .fna extension allow clustering any kind of nucleotide sequences, such as conserved intergenic regions or transposons. However, if you are only interested in the analysis of CDS sequences, .fna files might optionally have twin .faa files with translated amino acid sequences in the same order. Note that you will need .faa files to run Pfam-based analyses and the phylogenomics recipes of GET_PHYLOMARKERS (40) (see section 3.3.6).

Peptide files can be downloaded from resources such as Ensembl Plants (41) or Phytozome (42), or deduced with script transcripts2cds.pl, as in the examples of this protocol:

$GETHOMS/transcripts2cds.pl -n 3 $GETHOMS/sample_transcripts.fna

# ./transcripts2cds.pl -p 0 -m  -d $GETHOMS/db/uniprot_sprot.fasta -E 1e-05 -l 50 -g 1 -n 3 -X 0
# input files(s):
# sample_transcripts.fna

## processing file sample_transcripts.fna ...
# parsing re-used transdecoder output (_sample_transcripts.fna_l50.transdecoder.cds.gz) ...
# running blastx...
# parsing blastx output (_sample_transcripts.fna_E1e-05.blastx.gz) ...
# calculating consensus sequences ...
# input transcripts = 9
# transcripts with ORFs = 7
# transcripts with no ORFs = 2
# output files: sample_transcripts.fna_l50_E1e-05.transcript.fna , sample_transcripts.fna_l50_E1e-05.cds.fna , sample_transcripts.fna_l50_E1e-05.cds.faa , sample_transcripts.fna_l50_E1e-05.noORF.fna

If some of your input files contain high-quality sequences, such as full-length cDNA sequences, you can add the tag ‘flcdna’ to their file names. By doing this you make sure that the length of these sequences is used downstream in order to estimate coverage alignment with high confidence. For example, if we wanted to add cultivar Haruna Nijo sequences (39) to our analysis, we could put them in a file named HarunaNijo.flcdna.fna.

Other analyses you might want to carry out on your input sequences include computing their coding potential or their gene completeness (see Note 9).

Let’s now get all the barley sequences and extract CDS from them, which takes several hours:


cd ${GETHOMS}/test_barley
cd seqs

# download all transcriptomes
wget -c -i wgetlist.txt

# Extract CDS sequences (run transcripts2cds.pl for each transcriptome).
# Choose cdsCPP.sh if dependency Inline::CPP is available in your system
# the script will use 20 CPU cores, please adapt it to your system
./cds.sh 

# clean and compress (we want only in the CDS files)
rm -f _* *noORF* *transcript* # might be useful, but not used
gzip *diamond*

# put cds sequences aside
mv *cds.f*gz ../cds && cd ..

Alternatively, it is possible to get precomputed CDS sequences (170MB) ready to go:


cd ${GETHOMS}/test_barley
wget http://floresta.eead.csic.es/plant-pan-genomes/barley_transcripts/cds.tgz .
tar xvfz cds.tgz

# Make sure files with lists of accessions are in place
ls cds/*list

Either way, make sure files with lists of accessions are in place; these will be used later on to compare input subsets:

ls cds/*list

#cds/leaf.list  
#cds/ref.list  
#...

3.3.2 Clustering sequences

In this step we will use the main script get_homologues-est.pl. The only required option is -d, which indicates an input folder or directory. It is important to remark that in principle only files with extensions .fna / .fa / .fasta and optionally .faa are considered when parsing the -d directory. The use of an input folder allows for new files to be added there in the future, reducing the computing required for updated analyses. For instance, if a user does a first analysis with 5 input genomes today, it is possible to check how the resulting clusters would change when adding 10 new genomes tomorrow, by copying the new .fna input files to the pre-existing -d folder, so that all previous BLASTN searches are re-used.

All remaining flags are options that can modify the default behavior of the program, which is to use the BDBH algorithm in order to compile core clusters of DNA sequences based on BLASTN megablast searches requiring 95% sequence identity (-S 95) and 75% alignment coverage (-C 75). The available clustering algorithms are:

  • BDBH (default) ; starting from a reference genome, keep adding genomes stepwise while storing the sequence clusters that result from merging the latest bidirectional best hits.
  • OMCL (-M) ; uses the Markov Cluster Algorithm to group sequences, with inflation (-F) controlling cluster granularity, as described in (33).

As a rule of thumb, use OMCL (-M) for pangenome analyses, as it can handle sequences missing from the reference, and the BDBH algorithm for rapid calculations of core gene/transcript sets.

Moreover, by default redundant isoforms are filtered out (-i 40). However, when working with curated CDS sequences, with one selected sequence per gene, this is not necessary. You can disable this by adding -i 0 to your command line.

By default the input set with least sequences is taken as a reference. However, if you have a previously annotated input set you might want to choose that as a reference, using option -r.

Option -t allows you to control the minimum occupancy of the output clusters. By default only core clusters are produced, while with -t 0 clusters of any size, including singletons and cloud clusters, are output. The latter are named ‘control’ clusters as they will be used to compute the background frequencies of Pfam domains in later sections.

Option -A, used in the last example below, requests an ANI matrix. In this case it is combined with flag -e to exclude clusters with inparalogues and therefore minimize biases when averaging observed identities. ANI matrices are useful to robustly compute the similarity among your input genotypes and to cluster them, so that redundancy can easily be managed. If you have outgroups among the input sequences you can also check them in the ANI matrix.


# precompute (-o) sequence similarities and protein domain frequencies (-D);
# clusters are not produced
# Creates cds_est_homologues/tmp/all.pfam and cds_est_homologues/tmp/all.bpo
$GETHOMS/get_homologues-est.pl -d cds -D -m cluster -o &> log.cds.pfam

# alternatively, if not running in a SGE cluster, taking for instance 4 CPUs 
# $GETHOMS/get_homologues-est.pl -d cds -D -n 4 -o &> log.cds.pfam

# calculate 'control' cds clusters (-t 0) using OMCL (-M)
# cluster_list = cds_est_homologues/Alexis_0taxa_algOMCL_e0_.cluster_list
# cluster_directory = cds_est_homologues/Alexis_0taxa_algOMCL_e0_
$GETHOMS/get_homologues-est.pl -d cds -M -t 0 -m cluster &> log.cds

# get non-cloud clusters (-t 3) and percentage of conserved sequence clusters (POCS) matrix (-P)
# cluster_list = cds_est_homologues/Alexis_3taxa_algOMCL_e0_.cluster_list
# cluster_directory = cds_est_homologues/Alexis_3taxa_algOMCL_e0_
# percent_conserved_sequences_file = cds_est_homologues/Alexis_3taxa_algOMCL_e0_POCS.tab
$GETHOMS/get_homologues-est.pl -d cds -M -t 3 -P -m cluster &> log.cds.t3

# single-copy (-e) clusters with high occupancy (-t 10) & Average Nucleotide Identity (-A)
# [Note that flag -e filters out clusters with inparalogues]
# Only high-confidence clusters, containing sequences from at least 10 barleys
# This produces cds_est_homologues/Alexis_10taxa_algOMCL_e1_Avg_identity.tab
# as a summary of 2254 clusters
$GETHOMS/get_homologues-est.pl -d cds -M -t 10 -m cluster -A -e &> log.cds.t10.e

# single-copy (-e) core clusters for phylogenomic analyses (see section 3.3.6)
# number_of_clusters = 883
# cluster_directory = cds_est_homologues/Alexis_alltaxa_algOMCL_e1_
$GETHOMS/get_homologues-est.pl -d cds -M -m cluster -e &> log.cds.core.e

# Make heatmap and dendrograms based on ANI

# This requires some R dependencies, see install instructions with 
$GETHOMS/plot_matrix_heatmap.sh -M

# You may also want to edit/shorten the labels in ANI tab file
# This produces several outfiles:
# 1) Alexis_10taxa_algOMCL_e1_Avg_identity_heatmap.pdf (Figure 4)
# 2) Alexis_10taxa_algOMCL_e1_Avg_identity_BioNJ.ph (Neighbor Joining Newick file)
# 3) ANDg_meand_silhouette_width_statistic_plot.pdf (silhouette width statistics to compute K)
# 4) ANDg_hc_plot_cut_at_mean_silhouette_width_k4.pdf (K optimal silhouette clusters)

# note that two decimals are used (-d 2)
$GETHOMS/plot_matrix_heatmap.sh -i cds_est_homologues/Alexis_10taxa_algOMCL_e1_Avg_identity.tab \
  -H 10 -W 15 -t "ANI of single-copy transcripts (occupancy > 9)" -N -o pdf -d 2

It’s a good idea to inspect the log files to check the version of the software and to make sure that no errors occurred. The logs contain useful information such as the output mask:

# mask=Alexis_0taxa_algOMCL_e0_ (_0taxa_algOMCL)

This mask is a prefix added to the output folder. In this case it tells that the reference genomes is ‘Alexis’, that no occupancy cutoff was set (0taxa), the clustering algorithm used was OMCL and that clusters containing inparalogues were included (e0, see Note 10).

Note that internally sequences are numbered with natural numbers, called sequence ids. These numbers start counting from the reference genome, in this case Alexis. This is important because the resulting clusters are also numbered accordingly.

The resulting heatmap summarizing the ANI matrix in the example is shown in Figure 4. In our experience, ANI matrices computed from core genes can recapitulate the expected phylogeny of your input set (11).