Synolog

Synolog Manual

Gio Madrigal¹ and Julian Catchen¹

¹Department of Evolution, Ecology, and Behavior
University of Illinois at Urbana-Champaign
Urbana, Illinois, 61820
USA

Introduction


Synolog is a set of software designed to identify orthologs across species using sequence similarity and blocks of conserved synteny. The core program, synolog, is implemented in C++ and parallelized using OpenMP. We also provide a suite of Python scripts to manage data input and export, as well as to visualize syntenic regions.


Installation


Prerequisites

Synolog should build on any standard UNIX-like environment (Apple OS X, Linux, etc.). The supporting scripts are written in Python v3.

The only dependencies required for the software are Python modules included in the standard library, as well as the Python Cairo library (PyCairo) if you want to generate visualizations.

Synolog uses the output of BLAST to determine sequence similarity between genes/genome objects, so in most cases you will want to make sure BLAST is installed and available on the executable path (or you can specify a local path to BLAST directly to Synolog when executing).

Build the software

Synolog uses the standard autotools install:

% tar xfvz synolog-1.xx.tar.gz
% cd synolog-1.xx
% ./configure
% make
(become root)
# make install
(or, use sudo)
% sudo make install

You can change the root of the install location (/usr/local/ on most operating systems) by specifying the --prefix command line option to the configure script.

% ./configure --prefix=/home/smith/local

Or, if you want to group the install in a single location, e.g.,

% ./configure --prefix=/projects/smithlab/local/opt/synolog-v1.0

You can speed up the build if you have more than one processor:

% make -j 8

A default install will install files in the following way:

/usr/local/bin Synolog executable and Python scripts.

Install Python Modules

Install the PyCairo Python module for visualizing synteny blocks.

The pipeline is now ready to run.


Running the pipeline


What input data do I need?

Synolog requires the following data for each genome you wish to compare:

  1. The sequence of the protein-coding genes in FASTA format, optionally gzipped. Synolog is designed to work with multiple transcripts/splice variants per gene.
  2. The location of each protein-coding gene, in either GTF or GFF format, optionally gzipped.
  3. A set of BLAST hits for each pairwise set of genomes. Given organisms A and B, this requires one BLAST output with A as the query and B as the hits, and one BLAST output with B as the query and A as the hits (sets of reciprocal hits). These outputs should be in TSV format, which is generated with the BLAST option -outfmt 6. If you are using the species cache, and using the synolog_blastctl.py program to execute BLAST on your behalf, these files will be generated automatically.
Synolog benefits from having the following, optional data for each genome you wish to compare:
  1. The sequence of the non-coding genes in FASTA format, optionally gzipped. Synolog is designed to work with multiple transcripts/splice variants per gene.
  2. An AGP file describing the structure of the genome, optionally gzipped. This file is used in visualizations to show where contig/scaffold boundaries occur relative to synteny blocks as well as to give accurate lengths of each chromosome.
  3. A set of non-coding BLAST hits for each pairwise set of genomes. Given organisms A and B, this requires one BLAST output with A as the query and B as the hits, and one BLAST output with B as the query and A as the hits. These outputs should be in TSV format, which is generated with the BLAST option -outfmt 6. If you are using the species cache, and using the synolog_blastctl.py program to execute BLAST on your behalf, these files will be generated automatically.
If you would like to use a tree to guide Synolog and produce sets of tree-spanning orthologs and/or enumerate segmental duplications:
  1. Include a phylogenetic tree describing your set of species in Newick format. The tree can be larger than your set of species of interest, and branch lengths are not required. Given a tree that is a superset of species, Synolog will prune the tree at runtime so that it contains only the species for which you want to calculate synteny/orthologs.
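
As a minimal sketch of what the tabular BLAST hit files listed above contain, the following Python reads -outfmt 6 data (plain or gzipped) into records. The twelve field names are the BLAST defaults for this format; the helper names (BlastHit, parse_blast_lines, read_blast_tsv) and the sample line are illustrative and are not part of Synolog.

```python
import csv
import gzip
import io
from typing import Iterator, NamedTuple


class BlastHit(NamedTuple):
    """One row of BLAST -outfmt 6 output, in the default column order."""
    qseqid: str
    sseqid: str
    pident: float
    length: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


def parse_blast_lines(lines) -> Iterator[BlastHit]:
    """Yield one BlastHit per non-empty, non-comment tab-separated line."""
    for row in csv.reader(lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        yield BlastHit(
            row[0], row[1], float(row[2]),
            *(int(x) for x in row[3:10]),   # length .. send (7 integer fields)
            float(row[10]), float(row[11]),
        )


def read_blast_tsv(path: str) -> Iterator[BlastHit]:
    """Open a plain or gzipped hit file (hypothetical helper)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as fh:
        yield from parse_blast_lines(fh)


# Made-up example line, only to show the layout.
sample = "geneA.t1\tgeneB.t1\t87.50\t200\t25\t0\t1\t200\t5\t204\t1e-100\t350\n"
hits = list(parse_blast_lines(io.StringIO(sample)))
print(hits[0].qseqid, hits[0].sseqid, hits[0].evalue)
```

Checking a few lines this way before starting an analysis is a quick way to confirm that a hit file really is in the expected tabular format.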

AGP structure files and GTF or GFF annotation files are generally available from repositories such as Ensembl or NCBI, distributed with a genome assembly. This is also true of FASTA files containing gene sequences. Alternatively, if you are generating your own assembly, these files are output by annotation pipelines such as Braker. If a genome is distributed without an AGP file, a default file can be generated from the FASTA file containing the genome assembly with several available scripts, such as agp2fasta.py distributed with Synolog.
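
One way a default AGP file could be derived from an assembly FASTA is sketched below in Python. It treats every sequence as a single ungapped component of itself, using the nine-column layout of AGP component lines (component type W). The function names are illustrative, and this is not the script distributed with Synolog.

```python
import io


def fasta_lengths(handle):
    """Yield (sequence_id, length) for each record in a FASTA stream."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            # Use the first word of the header as the sequence ID.
            name, length = line[1:].split()[0], 0
        elif line:
            length += len(line)
    if name is not None:
        yield name, length


def default_agp(handle):
    """Return one AGP component line per FASTA record.

    Columns: object, object_beg, object_end, part_number, component_type,
    component_id, component_beg, component_end, orientation.
    """
    lines = []
    for name, length in fasta_lengths(handle):
        fields = [name, 1, length, 1, "W", name, 1, length, "+"]
        lines.append("\t".join(str(f) for f in fields))
    return lines


# Tiny made-up assembly used only for illustration.
fasta = ">chr1 example\nACGT\nAC\n>chr2\nGGG\n"
agp = default_agp(io.StringIO(fasta))
print("\n".join(agp))
```

A file built this way carries no scaffold boundary information, so it only provides chromosome lengths to the visualizations.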

Running the pipeline directly in a simple case

In simple cases, we can directly specify the required and/or optional files to synolog and execute it. This requires us to have produced sets of reciprocal BLAST hits for each pair of genomes we want to examine.

For this example, we will consider three species: the gulf pipefish (Syngnathus scovelli), the threespine stickleback (Gasterosteus aculeatus), and the platyfish (Xiphophorus maculatus). In this hypothetical example, we have assembled and annotated the pipefish ourselves, which gave us a GFF annotation, and downloaded the other two species from Ensembl, which distributes GTF files.

To reference each genome in the analysis, we are going to assign it an organism ID that we create. We will use ssc for pipefish, gac for stickleback, and xma for platyfish.

Prior to running synolog we created BLAST databases from the FASTA files containing the set of genes from each species and then ran blastp pairwise (with the -outfmt 6 option to give us TSV files):

  • ssc as the query, gac as the database
  • gac as the query, ssc as the database
  • ssc as the query, xma as the database
  • xma as the query, ssc as the database
  • xma as the query, gac as the database
  • gac as the query, xma as the database
giving us six output files from BLAST.
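
The list above is every ordered pair of organisms, so n species need n(n-1) BLAST runs. A minimal Python sketch that enumerates the runs and builds a blastp argument list for each is shown below. The directory layout, the FASTA and database names, and the query_subject naming of the output files are assumptions made for illustration; only the blastp options -query, -db, -outfmt, and -out are standard BLAST+ options.

```python
from itertools import permutations

orgs = ["ssc", "gac", "xma"]


def blast_jobs(org_ids, fasta_dir="./genes", db_dir="./blastdb", out_dir="./blast"):
    """Build one blastp argument list per ordered (query, database) pair.

    All paths are hypothetical; a BLAST database for each organism is
    assumed to exist already under db_dir.
    """
    jobs = []
    for query, subject in permutations(org_ids, 2):
        jobs.append([
            "blastp",
            "-query", f"{fasta_dir}/{query}.fa",
            "-db", f"{db_dir}/{subject}",
            "-outfmt", "6",                      # tabular (TSV) output
            "-out", f"{out_dir}/{query}_{subject}.tsv",
        ])
    return jobs


jobs = blast_jobs(orgs)
for job in jobs:
    print(" ".join(job))
```

Each argument list could be passed to subprocess.run to execute the search; adding a fourth species would raise the number of runs from six to twelve.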

Now we are ready to run synolog. We use options to tell synolog what file formats to expect, and those options persist along the command line until a different option occurs. So, gzgff, gff, gzgtf, and gtf tell the software what format the succeeding annotation files are in, while gzblast or blast tell the software whether the succeeding BLAST hit files are compressed or not. Here is the final command (the \ characters just continue the command onto the next line):

synolog -o ./synolog/ \
    --ann_type gzgff \
    --org ssc --annotation ./gff/ssc_v1.gff.gz \
    --ann_type gzgtf \
    --org gac --annotation ./gtf/Gasterosteus_aculeatus.BROADS1.81.gtf.gz \
    --org xma --annotation ./gtf/Xiphophorus_maculatus.Xipmac4.4.2.81.gtf.gz \
    --hit_type gzblast \
    --hits ./blast/gac_ssc.tsv.gz \
    --hits ./blast/gac_xma.tsv.gz \
    --hits ./blast/ssc_gac.tsv.gz \
    --hits ./blast/ssc_xma.tsv.gz \
    --hits ./blast/xma_gac.tsv.gz \
    --hits ./blast/xma_ssc.tsv.gz
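
A command like this can also be assembled programmatically. The Python sketch below is one possible way to do so, using only the options shown in the command above; it emits --ann_type only when the annotation format changes, which relies on the persistence of type options described earlier. The function name and argument structure are our own invention for illustration.

```python
import shlex


def build_synolog_command(out_dir, annotations, hit_files, hit_type="gzblast"):
    """Assemble a synolog argument list.

    annotations: list of (org_id, ann_type, path) tuples
    hit_files:   list of BLAST hit file paths, all of the same hit_type
    """
    cmd = ["synolog", "-o", out_dir]
    current_type = None
    for org, ann_type, path in annotations:
        if ann_type != current_type:
            # Type options persist, so repeat them only when they change.
            cmd += ["--ann_type", ann_type]
            current_type = ann_type
        cmd += ["--org", org, "--annotation", path]
    cmd += ["--hit_type", hit_type]
    for path in hit_files:
        cmd += ["--hits", path]
    return cmd


annotations = [
    ("ssc", "gzgff", "./gff/ssc_v1.gff.gz"),
    ("gac", "gzgtf", "./gtf/Gasterosteus_aculeatus.BROADS1.81.gtf.gz"),
    ("xma", "gzgtf", "./gtf/Xiphophorus_maculatus.Xipmac4.4.2.81.gtf.gz"),
]
pairs = ["gac_ssc", "gac_xma", "ssc_gac", "ssc_xma", "xma_gac", "xma_ssc"]
hit_files = [f"./blast/{p}.tsv.gz" for p in pairs]

command = build_synolog_command("./synolog/", annotations, hit_files)
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run, which avoids shell quoting problems with unusual file names.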

It is easy to see that as more organisms are added to an analysis, keeping track of and specifying the correct files to synolog becomes more difficult. That is why we recommend using the species cache: this system, and the programs designed to interact with it, automate and/or order many of these steps.

Synolog output files

MetaData

org_list.tsv

This is a single column output file that lists all the organisms used in a Synolog analysis.

Column Name Description
1 Org Org identifier in the analysis
hitmap.tsv

This file is generated at the start of a Synolog analysis that notes all the BLAST hit files being used.

Column Name Description
1 Hit File Name of cached blast hits file between the Query and Subject
2 Query Org identifier that was set as the query in the BLAST step
3 Subject Org identifier that was set as the subject in the BLAST step
4 File Type A short tag denoting whether the hit file is compressed or uncompressed
org_annotation.tsv

For each organism in a Synolog analysis, information regarding the location of all the annotated genetic elements in a Synolog analysis.

Column Name Description
1 Org Org identifier
2 ID Genetic Element identifier
3 Name Genetic Element Name for ID
4 Chrom Chromosome identifier containing ID
5 StartBp Starting base pair for ID
6 EndBp Ending base pair for ID
7 Orientation Chromosomal orientation (+ or -) for ID
org_scaffold_bounds.tsv

For each organism in a Synolog analysis, the information denoting the structure of the chromosomal sequences is written to a list.

Column Name Description
1 Chromosome Chromosome identifier
2 Scaffold Scaffold identifier
3 StartBp Start base pair for Scaffold
4 EndBp End base pair for Scaffold
chromosomes.tsv

This file contains a list of all the chromosomal sequences and their corresponding lengths in a Synolog analysis.

Column Name Description
1 Org Org identifier
2 Chromosome Chromosome identifier
3 Length Chromosome length in base pairs
orgA-orgB_homologs.tsv

For each pairwise comparison, Synolog will write a homologs file containing the homolog pairings found between two organisms.

Column Name Description
1 GroupID Homolog group identifier
2 Type Type of genetic element (gene or ncGene)
3 OrgA First org identifier in the pairwise comparison
4 OrgB Second org identifier in the pairwise comparison
5 IdA Genetic element identifier for OrgA
6 NameA Genetic element name for IdA
7 Chr Chromosome identifier containing the genetic element IdA
8 StartBP Starting base pair for IdA
9 EndBp Ending base pair for IdA
10 IdB Genetic element identifier for OrgB
11 NameB Genetic element name for IdB
12 Chr Chromosome identifier containing the genetic element IdB
13 StartBP Starting base pair for IdB
14 EndBp Ending base pair for IdB
15 SyntenyCluster If not None, synteny cluster identifier if IdA and IdB are syntenic
orthologs.tsv

After all ortholog relationships have been determined, Synolog merges all the orthologs into orthogroups. If a phylogenetic tree was provided, the CopyNumState and ParalogCount columns are included in this file.

CopyNumState:

  • B: Basal – if the organism is the most basal member in the orthogroup or if the organism was found to have the same number of tandem duplicates as the most basal member in the orthogroup
  • E: Expansion – if the organism was found to have a greater number of tandem duplicates than the most basal member in the orthogroup
  • C: Contraction – if the organism was found to have a lower number of tandem duplicates than the most basal member in the orthogroup
  • T: Tandem – the specific element in the current record is labeled as a tandem duplicate
  • R: Retrocopy – the specific element in the current record is labeled as a retrocopy
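
The B, E, and C states amount to comparing an organism's local count with that of the most basal member. A minimal Python sketch of that comparison follows; it is an illustration of the definitions above, not Synolog's implementation, and it does not cover the T and R labels, which depend on Synolog's own detection of tandem duplicates and retrocopies. The function name is hypothetical.

```python
def copy_num_state(local_count: int, basal_count: int) -> str:
    """Classify an organism relative to the most basal orthogroup member.

    Returns "E" (expansion), "C" (contraction), or "B" (same as basal).
    """
    if local_count > basal_count:
        return "E"
    if local_count < basal_count:
        return "C"
    return "B"


# Made-up counts: the basal member carries two local copies.
for count in (1, 2, 3):
    print(count, copy_num_state(count, basal_count=2))
```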

Column Name Description
1 GroupID Orthogroup identifier
2 Type Type of genetic elements in the orthogroup (gene or ncGene)
3 ElementCount Total number of genetic elements in the orthogroup
4 OrgCount Total number of orgs in the orthogroup
5 Org Org identifier
6 ElementId Genetic element identifier for Org
7 Name Genetic element name for ElementId
8 CopyNumState If not marked as a duplicate: B if the element has the same local gene count (gene + tandem duplicate count) as the most basal org in the orthogroup, E if the element has a greater gene count than the most basal org in the orthogroup, C if the element has a lower gene count than the most basal org. If marked as a duplicate: T for tandem duplicate or R for retrogene duplicate.
9 ParalogCount If this element is not a duplicate (tandem or retro), the number of additional paralogs for Org in this orthogroup
10 Chr Chromosome identifier containing the genetic element ElementId
11 StartBp Starting base pair for ElementId
12 EndBp Ending base pair for ElementId
orthogroups.tsv

This file contains one record per orthogroup with a short description for each orthogroup. If a phylogenetic tree is provided, the number of duplicated elements (tandem + retro) is reported. Retrogene counts do not influence whether an orthogroup is considered SingleCopy.

Column Name Description
1 GroupID Orthogroup identifier
2 Type Type of genetic elements in the orthogroup (gene or ncGene)
3 OrgCount Total number of orgs in the orthogroup
4 ElementCount Total number of genetic elements in the orthogroup
5 DuplicateCount Total number of gene duplicates in the orthogroup
6 SingleCopy True if the local element count is 1 and if all orgs in the analysis are in the orthogroup
7 Orgs Comma-separated list of all orgs in the orthogroup
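
As a sketch of how this file might be used downstream, the Python below collects the identifiers of single-copy orthogroups. It assumes the file is tab-separated, that a tree was provided so all seven columns are present in the documented order, that any header line starts with # or with GroupID, and that SingleCopy is written as the text True; none of these details are confirmed here, so adjust them to the actual file. The sample rows are made up.

```python
import csv
import io

# Column order as documented for orthogroups.tsv.
ORTHOGROUP_COLUMNS = [
    "GroupID", "Type", "OrgCount", "ElementCount",
    "DuplicateCount", "SingleCopy", "Orgs",
]


def single_copy_groups(handle):
    """Return the GroupID of every record whose SingleCopy field is true."""
    ids = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and anything that looks like a header (assumed).
        if not row or row[0].startswith("#") or row[0] == "GroupID":
            continue
        record = dict(zip(ORTHOGROUP_COLUMNS, row))
        if record["SingleCopy"].strip().lower() == "true":
            ids.append(record["GroupID"])
    return ids


# Hypothetical records, for illustration only.
sample = (
    "GroupID\tType\tOrgCount\tElementCount\tDuplicateCount\tSingleCopy\tOrgs\n"
    "1\tgene\t3\t3\t0\tTrue\tgac,ssc,xma\n"
    "2\tgene\t3\t5\t2\tFalse\tgac,ssc,xma\n"
    "3\tgene\t2\t2\t0\tFalse\tgac,xma\n"
)
print(single_copy_groups(io.StringIO(sample)))
```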
orthogroups.gene.counts.tsv

This file is a matrix of gene counts for each of the orthogroups that is compatible with the software CAFE (Mendes et al. 2020).

Column Name Description
1 Desc Set as (null) as a placeholder for orthogroup description
2 Orthogroup.ID Orthogroup identifier
3 Org Number of gene copies in the orthogroup for Org
putative_orthologous_chromosomes.tsv

This file contains the pairwise predictions of orthologous chromosomes in a Synolog analysis.

Column Name Description
1 QueryOrg First org identifier in the pairwise comparison
2 QueryChr Chromosome identifier for QueryOrg
3 SubjOrg Second org identifier in the pairwise comparison
4 SubjChr Chromosome identifier for SubjOrg
5 OrthGeneCnt Number of syntenic ortholog pairings shared between the two chromosomes
6 ClusterCnt Number of synteny clusters shared between the two chromosomes
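
A common follow-up is to pick, for each query chromosome, the subject chromosome that shares the most syntenic orthologs. The Python sketch below does this on rows laid out in the six documented columns; breaking ties by ClusterCnt is our own choice, and the rows shown are made up. This is a downstream summary, not a description of how Synolog makes its predictions.

```python
def best_partner(rows):
    """Map (QueryOrg, QueryChr, SubjOrg) to the SubjChr with the top counts.

    rows: iterables of (QueryOrg, QueryChr, SubjOrg, SubjChr,
                        OrthGeneCnt, ClusterCnt)
    """
    best = {}
    for q_org, q_chr, s_org, s_chr, gene_cnt, cluster_cnt in rows:
        key = (q_org, q_chr, s_org)
        score = (int(gene_cnt), int(cluster_cnt))
        # Keep the candidate with the higher (gene count, cluster count).
        if key not in best or score > best[key][0]:
            best[key] = (score, s_chr)
    return {key: s_chr for key, (score, s_chr) in best.items()}


# Hypothetical records, for illustration only.
rows = [
    ("gac", "chr1", "xma", "chrA", "120", "4"),
    ("gac", "chr1", "xma", "chrB", "15", "1"),
    ("gac", "chr2", "xma", "chrB", "98", "3"),
]
partners = best_partner(rows)
print(partners)
```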
org_ortholog_stats.tsv

This file contains organism specific ortholog statistics across all the other organisms in a Synolog analysis. If a phylogenetic tree is provided, columns containing tree cluster information, segmental duplication information, one to many ortholog relationships, and local gene expansions/contractions are reported.

Tree spanning clusters/elements denote elements that span across all the organisms in the analysis

Tree partial clusters/elements denote elements that span at least 3 but not all the organisms in the analysis.
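
The two definitions above can be written as a small rule on the number of organisms a cluster covers. The Python function below is only a restatement of those definitions for illustration; its name and return values are hypothetical.

```python
from typing import Optional


def tree_cluster_kind(org_count: int, total_orgs: int) -> Optional[str]:
    """Classify a cluster by how many organisms in the analysis it spans."""
    if org_count == total_orgs:
        return "spanning"      # present in every organism in the analysis
    if org_count >= 3:
        return "partial"       # at least 3 organisms, but not all
    return None                # too few organisms to be a tree cluster


# Example with five organisms in the analysis.
for n in (2, 3, 5):
    print(n, tree_cluster_kind(n, total_orgs=5))
```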

Column Name Description
1 Org Org identifier
2 Elements Total number of genetic elements in the analysis
3 Orthologs Total number of Elements found orthologous with Org
4 %Orthologs Percent of Elements found orthologous
5 OrthoGroups Number of shared orthogroups with Org
6 SyntenicElements Total number of Elements found syntenic with Org
7 %SyntenicElements Percent of Elements found syntenic
8 One-to-One Total number of Elements found to be one-to-one with Org
9 %One-to-One Percent of Elements found to be one-to-one with Org
10 One-to-Many Total number of Elements found to be one-to-many with Org
11 %One-to-Many Percent of Elements found to be one-to-many with Org
12 Many-to-One Total number of Elements found to be many-to-one with Org
13 %Many-to-One Percent of Elements found to be many-to-one with Org
14 Many-to-Many Total number of Elements found to be many-to-many with Org
15 %Many-to-Many Percent of Elements found to be many-to-many with Org
16 Tree-Spanning-Clusters Total number of tree spanning clusters shared with Org
17 Tree-Spanning-Elements Total number of Elements found in Tree-Spanning-Clusters
18 %Tree-Spanning-Elements Percent of Elements found in Tree-Spanning-Clusters
19 Tree-Partial-Clusters Total number of tree partial clusters shared with Org
20 Tree-Partial-Elements Total number of Elements found in Tree-Partial-Clusters
21 %Tree-Partial-Elements Percent of Elements found in Tree-Partial-Clusters
22 Segmental Total number of segmental duplication clusters identified using Org
23 SegmentalElements Total number of segmental duplicated genetic elements identified using Org
24 %SegmentalElements Percent of Elements identified as segmental duplications using Org
25 #OrthoGroupCopyNumber List of Element counts found across orthogroups
26 NumOfOrthogroups List of the number of orthogroups containing a specific Element count
27 #Expansions Total number of local element expansions
28 %Expansions Percent of Elements found to be locally expanded
29 Contractions Total number of local element contractions
30 %Contractions Percent of Elements found to be locally contracted
overall_orthology_stats.tsv

This file contains the overall orthologs statistics in a Synolog analysis. If a phylogenetic tree is provided, information regarding tandem duplications, retrogenes, tree clusters, and local gene expansions/contractions are reported.

Column Name Description
1 Org Org identifier
2 Elements Total number of genetic elements for Org in the Synolog analysis
3 Orthologs Total number of Elements found to be orthologous across the Synolog analysis
4 %Orthologs Percent of Elements found to be orthologous across the Synolog analysis
5 Orthologs+Duplicates Total number of Elements found to be orthologous (including tandemly duplicated) across the Synolog analysis
6 %Orthologs+Duplicates Percent of Elements found to be orthologous (including tandemly duplicated) across the Synolog analysis
7 RetroCopies Total number of Elements predicted to be a retrocopy
8 %RetroCopies Percent of Elements predicted to be a retrocopy
9 OrthoGroups Total number of orthogroups
10 One-to-One-Orthogroups Total number of one-to-one orthogroups
11 SyntenicElements Total number of Elements found to be syntenic
12 %SyntenicElements Percent of Elements found to be syntenic
13 undetermined Total number of undetermined Elements
14 %Undetermined Percent of undetermined Elements
16 Elements-With-Tandem-Duplicates Total number of Elements with a tandem duplicate
17 %Elements-With-Tandem-Duplicates Percent of Elements with a tandem duplicate
18 Tandem-Duplicates Total number of Elements found to be a tandem duplicate
19 %Tandem-Duplicates Percent of Elements found to be a tandem duplicate
20 Local-Expansions Total number of Elements found to have a local expansion
21 %Local-Expansions Percent of Elements found to have a local expansion
22 Local-Contractions Total number of Elements found to have a local contraction
23 %Local-Contractions Percent of Elements found to have a local contraction
24 Elements-With-Retrocopies Total number of Elements found to have a retrocopy
25 %Elements-With-Retrocopies Percent of Elements found to have a retrocopy
26 Tree-Spanning-Clusters Total number of tree spanning clusters
27 Tree-Spanning-Elements Total number of Elements found in tree spanning clusters
28 %Tree-Spanning-Elements Percent of Elements found in tree spanning clusters
29 Tree-Partial-Clusters Total number of tree partial clusters
30 Tree-Partial-Elements Total number of Elements found in tree partial clusters
31 %Tree-Partial-Elements Percent of Elements found in tree partial clusters
orgA-orgB_consyn_cluster_membership.tsv

For each pairwise comparison, Synolog will report the syntenic orthologs alongside their locations.

Column Name Description
1 ClusterID Synteny cluster identifier
2 Type Type of genetic elements in the pairing (gene or ncGene)
3 OrgA First org identifier in the pairwise comparison
4 OrgB Second org identifier in the pairwise comparison
5 IdA Genetic element identifier for OrgA
6 NameA Genetic element name for IdA
7 Chr Chromosome identifier containing the genetic element IdA
8 Index 1-based index along Chr for IdA, sorted by chromosomal position
9 StartBP Starting base pair for IdA
10 EndBp Ending base pair for IdA
11 IdB Genetic element identifier for OrgB
12 NameB Genetic element name for IdB
13 Chr Chromosome identifier containing the genetic element IdB
14 Index 1-based index along Chr for IdB, sorted by chromosomal position
15 StartBP Starting base pair for IdB
16 EndBp Ending base pair for IdB
orgA-orgB_consyn_clusters.tsv

For each pairwise synteny cluster, Synolog will report a brief description of the cluster including its location relative to both organisms and its size.

Column Name Description
1 ID Synteny cluster identifier
2 ElementCnt Number of syntenic ortholog pairs in the cluster
3 SlidingWinSize Size of the sliding window used in the analysis
4 Orientation Whether the cluster was found in the forward or inverse orientation relative to both organisms
5 OrgA First org identifier in the pairwise comparison
6 Chr Chromosome identifier for OrgA that contains the synteny cluster
7 StartIndex 1-based index along Chr for the 5’ most element in the synteny cluster for OrgA, sorted by chromosomal position
8 EndIndex 1-based index along Chr for the 3’ most element in the synteny cluster for OrgA, sorted by chromosomal position
9 StartBp Start base pair for the synteny cluster on Chr in OrgA
10 EndBp End base pair for the synteny cluster on Chr in OrgA
11 OrgB Second org identifier in the pairwise comparison
12 Chr Chromosome identifier for OrgB that contains the synteny cluster
13 StartIndex 1-based index along Chr for the 5’ most element in the synteny cluster for OrgB, sorted by chromosomal position
14 EndIndex 1-based index along Chr for the 3’ most element in the synteny cluster for OrgB, sorted by chromosomal position
15 StartBp Start base pair for the synteny cluster on Chr in OrgB
16 EndBp End base pair for the synteny cluster on Chr in OrgB
duplicate_regions.tsv

If a phylogenetic tree is provided, Synolog will generate this file that contains the coordinates of loci that are predicted to harbor tandem duplicates.

Column Name Description
1 GroupID Anchor group identifier
2 Org Org identifier
3 Chr Chromosome identifier containing the tandem duplicates
4 Start Start base pair of the 5’ most gene copy
5 End End base pair of the 3’ most gene copy
6 TandemCount Number of tandem duplicates in the region
orgA-orgB_segmentalduplications.tsv

If a phylogenetic tree is provided, Synolog will report the segmental duplications detected for each pairwise combination of organisms.

Column Name Description
1 ID Segmental duplication identifier
2 ElementCount Total number of segmentally duplicated genetic elements
3 OrgA Org identifier for the duplicated genetic elements
4 DupID Genetic element identifier for the duplicated gene copy
5 DupName Genetic element name for DupID
6 DupChr Chromosome identifier containing DupID
7 DupIndex 1-based index along DupChr, sorted by chromosomal position
8 StartBP Start base pair for DupID
9 EndBP End base pair for DupID
10 ElementID Genetic element identifier for the syntenic gene copy
11 ElementName Genetic element name for ElementID
12 Chr Chromosome identifier containing ElementID
13 Index 1-based index along Chr, sorted by chromosomal position
14 StartBP Start base pair for ElementID
15 EndBP End base pair for ElementID
16 OrgB Org identifier used to identify the DupID
17 ElementID Genetic element identifier for the syntenic ortholog used to identify DupID
18 ElementName Genetic element name for ElementID
19 Chr Chromosome identifier containing ElementID
20 Index 1-based index along Chr, sorted by chromosomal position
21 StartBP Start base pair for ElementID
22 EndBP End base pair for ElementID
org_segmentalduplication_clusters.tsv

If a phylogenetic tree is provided, Synolog will generate a file containing clusters of genetic elements that were flagged as a segmental duplication in at least one of the pairwise segmental duplication comparisons.

Column Name Description
1 ClusterID Segmental duplication cluster identifier
2 ElementCount Total number of genetic elements in the segmental duplication cluster
3 Chrom Chromosome identifier containing the segmentally duplicated genetic elements
4 ElementID Genetic element identifier
5 Name Genetic element name for ElementID
6 Index 1-based index along Chrom, sorted by chromosomal position
7 StartBP Start base pair for ElementID
8 EndBP End base pair for ElementID
9 Source Comma separated list of all element IDs, org IDs, and
tree_clusters_membership.tsv

If a phylogenetic tree is provided, Synolog will generate this file detailing the syntenic genetic elements found to span three or more organisms.

Column Name Description
1 Org Number of orgs in the Synolog analysis
2 TreeClusterID Tree cluster identifier
3 NumElements Total number of genetic elements in the tree cluster
4 OrgCount Total number of orgs in the tree cluster
5 Org Org identifier
6 Chrom Chromosome identifier containing the genetic element
7 ElementType Type of genetic element (gene or ncGene)
8 ElementID Genetic element identifier for org
9 ElementName Genetic element name for ElementID
10 Index 1-based index along Chrom, sorted by chromosomal position
11 StartBP Starting base pair for ElementID
12 EndBP Ending base pair for ElementID
13 OrthologIDs A comma separated list of identifiers for syntenic orthologs from the other orgs and the corresponding genetic identifiers

Working with the species cache

The species cache is just a directory containing an ordered set of subdirectories and files. It is meant to hold annotation and structure information for each genome in an analysis, along with a set of gene models, the related BLAST databases, and the BLAST output files needed to complete a synteny analysis. One can interact with the files in the species cache using standard commands on a UNIX terminal, but it is designed to be managed by the synolog_cctl.py, synolog_blastctl.py, and synolog_linkint.py programs.

One can create as many independent species caches as desired; when running the various synolog programs you will always specify the location of the particular cache for that analysis.

A species cache is structured in the following way. There are two major types of data in the cache. The first is an annotation, which includes the location of genes and similar objects (encoded in a GTF or GFF file) and the structural definition of how contigs and scaffolds are joined together into chromosomes (encoded in an AGP file). The second is a set of gene models (the protein or nucleotide sequence of each transcript, encoded in a FASTA file). For each organism, the cache allows for there to be multiple annotations and sets of gene models, since the set of gene models may not change when the underlying assembly changes (say scaffolds are moved around in one version of the assembly versus another; or, as a new assembly is curated, it is improved in stages without re-annotating the genes). This allows one to update a genome without having to re-execute all the underlying BLAST analyses.

So, in the simple case, each organism has one annotation and one set of gene models. But in a more complex case there can be more than one of each. Each annotation and each set of gene models is assigned an ID for use in the cache.

Once these data are added to the cache, they are integrated, which tells Synolog to join one particular annotation with one set of gene models and the associated BLAST hits for those genes. Then, to run the Synolog pipeline, the user specifies the IDs of the genomes they want to include in the analysis, and the synolog_run.py program forms the proper command, similar to the one shown above, in order to execute the pipeline.
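
The relationship between annotations, gene models, and integrations can be pictured with a small in-memory model. The Python sketch below is purely conceptual: the class, its fields, and the second annotation (gannv2) are hypothetical, and it says nothing about how the cache is actually laid out on disk.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CachedSpecies:
    """Conceptual model of one species entry in a cache (hypothetical)."""
    species_id: str
    annotations: Dict[str, str] = field(default_factory=dict)  # ann ID -> GTF/GFF
    gene_sets: Dict[str, str] = field(default_factory=dict)    # genes ID -> FASTA
    integrations: List[Tuple[str, str]] = field(default_factory=list)

    def integrate(self, ann_id: str, genes_id: str) -> None:
        """Link one annotation to one set of gene models."""
        if ann_id not in self.annotations:
            raise KeyError(f"unknown annotation ID: {ann_id}")
        if genes_id not in self.gene_sets:
            raise KeyError(f"unknown gene model ID: {genes_id}")
        self.integrations.append((ann_id, genes_id))


gacu = CachedSpecies("gacu")
gacu.annotations["gannv1"] = "g.aculeatus.v1.gtf.gz"
gacu.annotations["gannv2"] = "g.aculeatus.v2.gtf.gz"   # e.g. a re-scaffolded assembly
gacu.gene_sets["gacaav1"] = "g.aculeatus.aa"

# Both assembly versions reuse the same gene models, and therefore the
# same BLAST results, which is the point of keeping the two apart.
gacu.integrate("gannv1", "gacaav1")
gacu.integrate("gannv2", "gacaav1")
print(gacu.integrations)
```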

The synolog_cctl.py program handles creating a cache, adding various data to a cache, and integrating different sets of data.

Adding data to the cache
Creating the cache

We first create a new cache. The cache consists of a set of defined but ordinary directories, and we can have as many distinct caches as we like, located wherever we like within our disk space.

synolog_cctl.py new -p ./path/to/species_cache

Add a species to the cache

After creating a new cache, we can add one or more species to that cache by choosing a species ID and specifying the species name and common name for that species. Whenever we refer to the species, with respect to Synolog, we will use the species ID we define at this step. (The \ characters just continue the command onto the next line and are not part of the actual command.)

synolog_cctl.py species -p ./path/to/species_cache --id gacu \
    --name "Gasterosteus aculeatus" --common "Threespine stickleback"

Add an annotation to the cache

The annotation file should be in GTF or GFF format and should describe the location of protein-coding (and optionally non-coding) genes in the genome. An annotation can also include an AGP file, which describes how contigs and/or scaffolds are put together into chromosomes. The files can optionally be compressed (gzipped). The user must also provide the species ID to which these annotation files belong. Finally, the user can specify an annotation ID as well. If one is not provided, the annotation is given the ID "def", for default. Typically, if this is the only genome from this species the user plans to add, then using the default ID is fine, but if the user wants to add two or more genomes from this species, then each annotation should have its own meaningful ID. (The \ characters just continue the command onto the next line and are not part of the actual command.)

Here is the simplest form of the command:

synolog_cctl.py annotation -p ./path/to/species_cache --id gacu --gtf ./data/g.aculeatus/reference/g.aculeatus.gtf

And here is a more complex form:

synolog_cctl.py annotation -p ./path/to/species_cache --id gacu --ann-id gannv1 \
    --gtf ./data/g.aculeatus/reference/g.aculeatus.gtf.gz \
    --agp ./data/g.aculeatus/reference/g.aculeatus.agp.gz

Add a set of genes to the cache

A set of gene models must be supplied for each species in FASTA format (optionally gzipped); these will be used to execute BLAST. The user can choose to supply protein-coding gene models as amino acid sequences (flag --aa); this is the most common way to run BLAST between species and will cause BLASTp to be used. The user can alternatively supply gene models as nucleotide sequences (flag --nuc), which will result in BLASTn being used. Finally, the user can also supply a set of non-coding genes (in nucleotide format) using the --non-coding flag. Non-coding genes will be stored separately, and BLASTn will be executed with a slightly modified set of matching parameters (to optimize for the difference in genetic distance between non-coding genes).

When this command is executed, the FASTA file will be stored in the cache with a standardized name and compressed, and a BLAST database will be generated from the FASTA file.

Here is the simplest form of the command supplying a default ID and amino acid genic sequences:

synolog_cctl.py genes -p ./path/to/species_cache --id gacu --aa ./data/g.aculeatus/reference/g.aculeatus.aa

Here is a form of the command supplying a specific ID and amino acid genic sequences:

synolog_cctl.py genes -p ./path/to/species_cache --id gacu --genes-id gacaav1 \
    --aa ./data/g.aculeatus/reference/g.aculeatus.aa

Finally, here is a form of the command supplying a specific ID and non-coding gene sequences:

synolog_cctl.py genes -p ./path/to/species_cache --id gacu --genes-id gacncv1 \
    --non-coding ./data/g.aculeatus/reference/g.aculeatus.noncoding.fa

Integrating an annotation with a set of genes

The final step is to integrate or link together a set of gene models and an underlying genome annotation. In the default case, there is only one of each, so the integration step does not add much. However, in a more complex case, there may be several versions of a genome assembly, with one or more set of gene models, depending on how/when the underlying assembly was annotated.

Here is the simplest case, integrating a default annotation and a default set of amino acid gene models:

synolog_cctl.py integration -p ./path/to/species_cache --id gacu

And here is a case specifying particular IDs for each annotation and set of gene models. Here we also can supply a description so the user can keep track of different versions of data in the cache.

synolog_cctl.py integration -p ./path/to/species_cache --id gacu --ann-id gannv1 --genes-id gacaav1 \
    --desc "This is the default Ensembl BROAD stickleback assembly"
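
When many species have to be added, the steps above (new, species, annotation, genes, integration) can be generated from a script. The Python sketch below only builds the argument lists, using the subcommands and options shown in this section; the function itself and its parameter names are hypothetical, and nothing is executed.

```python
import shlex


def cache_commands(cache, species_id, name, common, gtf, aa,
                   ann_id=None, genes_id=None, agp=None, desc=None):
    """Return the synolog_cctl.py argument lists for adding one species."""
    tool = "synolog_cctl.py"
    shared = ["-p", cache, "--id", species_id]

    annotation = [tool, "annotation", *shared]
    genes = [tool, "genes", *shared]
    integration = [tool, "integration", *shared]

    if ann_id:                       # otherwise the default annotation ID is used
        annotation += ["--ann-id", ann_id]
        integration += ["--ann-id", ann_id]
    if genes_id:
        genes += ["--genes-id", genes_id]
        integration += ["--genes-id", genes_id]

    annotation += ["--gtf", gtf]
    if agp:
        annotation += ["--agp", agp]
    genes += ["--aa", aa]
    if desc:
        integration += ["--desc", desc]

    return [
        [tool, "new", "-p", cache],  # only needed once per cache
        [tool, "species", *shared, "--name", name, "--common", common],
        annotation,
        genes,
        integration,
    ]


commands = cache_commands(
    "./path/to/species_cache", "gacu",
    "Gasterosteus aculeatus", "Threespine stickleback",
    gtf="./data/g.aculeatus/reference/g.aculeatus.gtf.gz",
    aa="./data/g.aculeatus/reference/g.aculeatus.aa",
    ann_id="gannv1", genes_id="gacaav1",
)
for cmd in commands:
    print(shlex.join(cmd))
```

Each list could then be passed to subprocess.run with check=True so that a failing step stops the script.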

Running BLAST
Executing BLAST runs

Linking BLAST results

Run the pipeline using the species cache

Visualizing the results

Synolog is able to readily generate figures using the synolog_plot.py utility. There are five types of figures that are supported.

  • genome – visualize synteny at the genome-wide level
  • region – visualize synteny at a specific region/locus
  • synteny – visualize a synteny block inferred by Synolog
  • segmental – visualize a pairwise segmental duplication inferred by Synolog
  • tree_cluster – visualize a tree cluster inferred by Synolog

Exporting data from Synolog

Beyond the output files produced in a Synolog analysis, Synolog can export orthogroups in FASTA format for downstream analysis. The utility synolog_fasta.py makes use of the species cache along with the output files generated by the analysis.

# all sequences for all orthogroups
% synolog_fasta.py --path /path/to/synolog/output_files/ \
    --cache /path/to/species_cache/ \
    --out-path /path/to/output_directory/ \
    --all

# longest sequence for all orthogroups
% synolog_fasta.py --path /path/to/synolog/output_files/ \
    --cache /path/to/species_cache/ \
    --out-path /path/to/output_directory/ \
    --all \
    --longest

# longest sequence for all single copy orthogroups
% synolog_fasta.py --path /path/to/synolog/output_files/ \
    --cache /path/to/species_cache/ \
    --out-path /path/to/output_directory/ \
    --single_copy \
    --longest

# longest sequence for a specific set of orthogroups
% synolog_fasta.py --path /path/to/synolog/output_files/ \
    --cache /path/to/species_cache/ \
    --out-path /path/to/output_directory/ \
    --longest \
    --orthogroups orthogroup_ids.txt # single column list of orthogroup IDs
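
The single-column list of orthogroup IDs used in the last example can be written by any script. A minimal Python sketch follows; the orthogroup IDs are made up, the helper names are hypothetical, and the command list uses only the synolog_fasta.py options shown above.

```python
import os
import tempfile


def write_orthogroup_ids(ids, path):
    """Write one orthogroup ID per line (single-column list)."""
    with open(path, "w") as fh:
        for group_id in ids:
            fh.write(f"{group_id}\n")


def export_command(run_dir, cache_dir, out_dir, ids_file):
    """Argument list for exporting the longest sequence per chosen orthogroup."""
    return [
        "synolog_fasta.py",
        "--path", run_dir,
        "--cache", cache_dir,
        "--out-path", out_dir,
        "--longest",
        "--orthogroups", ids_file,
    ]


# Write the list to a temporary directory for this demonstration.
ids_file = os.path.join(tempfile.mkdtemp(), "orthogroup_ids.txt")
write_orthogroup_ids(["12", "57", "103"], ids_file)   # hypothetical IDs

export_cmd = export_command(
    "/path/to/synolog/output_files/",
    "/path/to/species_cache/",
    "/path/to/output_directory/",
    ids_file,
)
print(" ".join(export_cmd))
```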


Pipeline Components


Core

Species Cache

Execution control

Utility programs