Synolog: Synolog Manual

Introduction

Synolog is a set of software designed to identify orthologs across species using sequence similarity and blocks of conserved synteny. The core program, synolog, is implemented in C++ and parallelized using OpenMP. We also provide a suite of Python scripts to manage data input and export, as well as to visualize syntenic regions.

Installation

Prerequisites

Synolog should build on any standard UNIX-like environment (Apple OS X, Linux, etc.). The supporting scripts are written in Python v3.

The only dependencies required for the software are Python modules included in the standard library, as well as the Python Cairo library (PyCairo) if you want to generate visualizations.

Synolog uses the output of BLAST to determine sequence similarity between genes/genome objects, so in most cases you will want to make sure BLAST is installed and available on the executable path (or you can specify a local path directly to Synolog when executing).

Build the software

Synolog uses the standard autotools install:

% tar xfvz synolog-1.xx.tar.gz
% cd synolog-1.xx
% ./configure
% make
(become root)
# make install
(or, use sudo)
% sudo make install

You can change the root of the install location (/usr/local/ on most operating systems) by specifying the --prefix command line option to the configure script.

% ./configure --prefix=/home/smith/local

Or, if you want to group the install in a single location, e.g.,

% ./configure --prefix=/projects/smithlab/local/opt/synolog-v1.0

You can speed up the build if you have more than one processor:

% make -j 8

A default install will install files in the following way:

/usr/local/bin

Synolog executable and Python scripts.

Install Python Modules

Install the PyCairo Python module for visualizing synteny blocks.

The pipeline is now ready to run.

Running the pipeline

What input data do I need?

Synolog requires the following data for each genome you wish to compare:

The sequence of the protein-coding genes in FASTA format, optionally gzipped. Synolog is designed to work with multiple transcripts/splice variants per gene.
The location of each protein-coding gene, in either GTF or GFF format, optionally gzipped.
A set of BLAST hits for each pairwise set of genomes. Given organisms A and B, this requires one BLAST output with A as the query and B as the hits, and one BLAST output with B as the query and A as the hits (sets of reciprocal hits). These outputs should be in TSV format, which is generated with the BLAST option -outfmt 6. If you are using the species cache and the synolog_blastctl.py program to execute BLAST on your behalf, these files will be generated automatically.
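
To make the idea of reciprocal hits concrete, here is a minimal Python sketch that reads two -outfmt 6 files and reports the pairs that are each other's best hit by bit score. It relies only on the default 12-column layout of -outfmt 6 (query ID in column 1, subject ID in column 2, bit score in column 12). The function names are hypothetical, and this is an illustration of what reciprocal hits are, not a description of how Synolog itself selects orthologs.

```python
import csv
import gzip


def open_maybe_gzip(path):
    """Open a plain or gzipped text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def best_hits(path):
    """Map each query ID to the subject ID with the highest bit score."""
    best = {}
    with open_maybe_gzip(path) as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#"):
                continue
            # Default -outfmt 6 columns: 0 = qseqid, 1 = sseqid, 11 = bitscore.
            query, subject, bitscore = row[0], row[1], float(row[11])
            if query not in best or bitscore > best[query][1]:
                best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (A gene, B gene) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)
```

Calling reciprocal_best_hits() with the A-versus-B file and the B-versus-A file returns a sorted list of gene ID pairs.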

Synolog benefits from having the following, optional data for each genome you wish to compare:

The sequence of the non-coding genes in FASTA format, optionally gzipped. Synolog is designed to work with multiple transcripts/splice variants per gene.
An AGP file describing the structure of the genome, optionally gzipped. This file is used in visualizations to show where contig/scaffold boundaries occur relative to synteny blocks as well as to give accurate lengths of each chromosome.
A set of non-coding BLAST hits for each pairwise set of genomes. Given organisms A and B, this requires one BLAST output with A as the query and B as the hits, and one BLAST output with B as the query and A as the hits. These outputs should be in TSV format, which is generated with the BLAST option -outfmt 6. If you are using the species cache and the synolog_blastctl.py program to execute BLAST on your behalf, these files will be generated automatically.

If you would like to use a tree to guide Synolog and produce sets of tree-spanning orthologs and/or enumerate segmental duplications:

Include a phylogenetic tree describing your set of species in Newick format. The tree can be larger than your set of species of interest, and branch lengths are not required. If you give a tree that is a superset of your species, Synolog will prune the tree at runtime for use with only the species you want to calculate the synteny/orthologs for.
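
Before running an analysis, it can be useful to confirm that every organism you plan to include actually appears in the tree. The Python sketch below pulls leaf names out of an unquoted Newick string and reports any that are missing. It assumes that the leaf labels in your tree match the organism IDs you give to Synolog, which you should verify for your own data. The helper names and the example tree (including the extra species dre) are hypothetical.

```python
import re


def newick_leaves(newick):
    """Return the leaf names of an unquoted Newick string, in order."""
    # A leaf name directly follows "(" or ","; labels that follow ")" are
    # internal node names, and anything after ":" is a branch length.
    return [name.strip() for name in re.findall(r"[(,]\s*([^(),:;]+)", newick)]


def missing_from_tree(newick, org_ids):
    """Return the organism IDs that do not occur as leaves of the tree."""
    leaves = set(newick_leaves(newick))
    return sorted(set(org_ids) - leaves)


if __name__ == "__main__":
    # Hypothetical tree that is a superset of the species of interest.
    tree = "((ssc:0.1,gac:0.2)anc1,(xma,dre));"
    print(newick_leaves(tree))
    print(missing_from_tree(tree, ["ssc", "gac", "xma"]))
```

This simple pattern does not handle quoted labels or bracketed comments; a dedicated phylogenetics library is a better choice for trees that use them.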

AGP structure files and GTF or GFF annotation files are generally available from repositories such as Ensembl or NCBI, distributed with a genome assembly. This is also true of FASTA files containing gene sequences. Alternatively, if you are generating your own assembly, these files are output by annotation pipelines such as Braker. If a genome is distributed without an AGP file, a default file can be generated from the FASTA file containing the genome assembly with several available scripts, such as agp2fasta.py distributed with Synolog.

Running the pipeline directly in a simple case

In simple cases, we can directly specify the required and/or optional files to synolog and execute it. This requires us to have produced sets of reciprocal BLAST hits for each pair of genomes we want to examine.

For this example, we will consider three species: the gulf pipefish (Syngnathus scovelli), the threespine stickleback (Gasterosteus aculeatus), and the platyfish (Xiphophorus maculatus). In this hypothetical example, we have assembled and annotated the pipefish ourselves, which gave us a GFF annotation, and downloaded the other two species from Ensembl, which distributes GTF files.

To reference each genome in the analysis, we are going to assign it an organism ID that we create. We will use ssc for pipefish, gac for stickleback, and xma for platyfish.

Prior to running synolog we created BLAST databases from the FASTA files containing the set of genes from each species and then ran blastp pairwise (with the -outfmt 6 option to give us TSV files):

ssc as the query, gac as the database
gac as the query, ssc as the database
ssc as the query, xma as the database
xma as the query, ssc as the database
xma as the query, gac as the database
gac as the query, xma as the database

giving us six output files from BLAST.
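
The list of runs above is simply every ordered pair of organisms. As a rough sketch, the Python snippet below builds the corresponding makeblastdb and blastp command lines for any number of organism IDs. The directory layout, the FASTA file names, and the query_subject.tsv output naming are assumptions made for illustration; only the standard BLAST+ options -in, -dbtype, -out, -query, -db, and -outfmt are used. Note that makeblastdb reads uncompressed FASTA.

```python
from itertools import permutations


def blast_commands(org_ids, fasta_dir="./fasta", out_dir="./blast"):
    """Return makeblastdb and blastp argument lists for every ordered pair."""
    commands = []
    # One protein database per organism (hypothetical file layout).
    for org in org_ids:
        commands.append([
            "makeblastdb", "-in", f"{fasta_dir}/{org}.aa.fa",
            "-dbtype", "prot", "-out", f"{out_dir}/db/{org}",
        ])
    # permutations() yields both (A, B) and (B, A), i.e. the reciprocal runs.
    for query, subject in permutations(org_ids, 2):
        commands.append([
            "blastp", "-query", f"{fasta_dir}/{query}.aa.fa",
            "-db", f"{out_dir}/db/{subject}",
            "-outfmt", "6", "-out", f"{out_dir}/{query}_{subject}.tsv",
        ])
    return commands


if __name__ == "__main__":
    for command in blast_commands(["ssc", "gac", "xma"]):
        print(" ".join(command))
```

For n organisms this produces n database builds and n × (n − 1) BLAST runs, which is why the number of files grows quickly as organisms are added.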

Now we are ready to run synolog. We use options to tell synolog what file formats to expect, and those options persist along the command line until a different option occurs. So, gzgff, gff, gzgtf, and gtf tell the software what format the succeeding annotation files are in, while gzblast or blast tell the software whether the succeeding BLAST hit files are compressed or not. Here is the final command (the \ characters just continue the command onto the next line):

synolog -o ./synolog/ \
    --ann_type gzgff \
    --org ssc --annotation ./gff/ssc_v1.gff.gz \
    --ann_type gzgtf \
    --org gac --annotation ./gtf/Gasterosteus_aculeatus.BROADS1.81.gtf.gz \
    --org xma --annotation ./gtf/Xiphophorus_maculatus.Xipmac4.4.2.81.gtf.gz \
    --hit_type gzblast \
    --hits ./blast/gac_ssc.tsv.gz \
    --hits ./blast/gac_xma.tsv.gz \
    --hits ./blast/ssc_gac.tsv.gz \
    --hits ./blast/ssc_xma.tsv.gz \
    --hits ./blast/xma_gac.tsv.gz \
    --hits ./blast/xma_ssc.tsv.gz
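
Because the file type options stay in effect until they are changed, a command like this one can be assembled mechanically. The following Python sketch builds the argument list from a list of annotations and hit files, emitting --ann_type only when the type changes from one annotation to the next. The function name and its input structure are hypothetical; the options themselves are the ones used in the command shown here.

```python
def synolog_command(out_dir, annotations, hits, hit_type="gzblast"):
    """Build a synolog argument list.

    annotations: list of (org_id, ann_type, path) tuples, in command-line order.
    hits: list of BLAST hit file paths, all of the same hit_type.
    """
    command = ["synolog", "-o", out_dir]
    current_type = None
    for org_id, ann_type, path in annotations:
        # The type option persists, so repeat it only when it changes.
        if ann_type != current_type:
            command += ["--ann_type", ann_type]
            current_type = ann_type
        command += ["--org", org_id, "--annotation", path]
    command += ["--hit_type", hit_type]
    for path in hits:
        command += ["--hits", path]
    return command


if __name__ == "__main__":
    example = synolog_command(
        "./synolog/",
        [("ssc", "gzgff", "./gff/ssc_v1.gff.gz"),
         ("gac", "gzgtf", "./gtf/Gasterosteus_aculeatus.BROADS1.81.gtf.gz"),
         ("xma", "gzgtf", "./gtf/Xiphophorus_maculatus.Xipmac4.4.2.81.gtf.gz")],
        ["./blast/gac_ssc.tsv.gz", "./blast/ssc_gac.tsv.gz"],
    )
    print(" ".join(example))
```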

It is easy to see that as more organisms are added to an analysis, keeping track of and specifying the correct files to synolog becomes more difficult. That is why we recommend using the species cache, as this system and the programs designed to interact with it automate and/or order many of these steps.

Synolog output files

MetaData

org_list.tsv

This is a single column output file that lists all the organisms used in a Synolog analysis.

Column	Name	Description
1	Org	Org identifier in the analysis

hitmap.tsv

This file is generated at the start of a Synolog analysis and lists all the BLAST hit files being used.

Column	Name	Description
1	Hit File	Name of cached blast hits file between the Query and Subject
2	Query	Org identifier that was set as the query in the BLAST step
3	Subject	Org identifier that was set as the subject in the BLAST step
4	File Type	A short tag denoting whether the hit file is compressed or uncompressed

org_annotation.tsv

For each organism in a Synolog analysis, this file lists the location of every annotated genetic element.

Column	Name	Description
1	Org	Org identifier
2	ID	Genetic Element identifier
3	Name	Genetic Element Name for ID
4	Chrom	Chromosome identifier containing ID
5	StartBp	Starting base pair for ID
6	EndBp	Ending base pair for ID
7	Orientation	Chromosomal orientation (+ or -) for ID

org_scaffold_bounds.tsv

For each organism in a Synolog analysis, the information denoting the structure of the chromosomal sequences is written to a list.

Column	Name	Description
1	Chromosome	Chromosome identifier
2	Scaffold	Scaffold identifier
3	StartBp	Start base pair for Scaffold
4	EndBp	End base pair for Scaffold

chromosomes.tsv

This file contains a list of all the chromosomal sequences and their corresponding lengths in a Synolog analysis.

Column	Name	Description
1	Org	Org identifier
2	Chromosome	Chromosome identifier
3	Length	Chromosome length in base pairs

orgA-orgB_homologs.tsv

For each pairwise comparison, Synolog will write a homologs file containing the homolog pairings found between two organisms.

Column	Name	Description
1	GroupID	Homolog group identifier
2	Type	Type of genetic element (gene or ncGene)
3	OrgA	First org identifier in the pairwise comparison
4	OrgB	Second org identifier in the pairwise comparison
5	IdA	Genetic element identifier for OrgA
6	NameA	Genetic element name for IdA
7	Chr	Chromosome identifier containing the genetic element IdA
8	StartBP	Starting base pair for IdA
9	EndBp	Ending base pair for IdA
10	IdB	Genetic element identifier for OrgB
11	NameB	Genetic element name for IdB
12	Chr	Chromosome identifier containing the genetic element IdB
13	StartBP	Starting base pair for IdB
14	EndBp	Ending base pair for IdB
15	SyntenyCluster	If not None, synteny cluster identifier if IdA and IdB are syntenic

orthologs.tsv

After all ortholog relationships have been determined, Synolog merges all the orthologs into orthogroups. If a phylogenetic tree was provided, the CopyNumState and ParalogCount columns are included in this file.

CopyNumState:

B: Basal – if the organism is the most basal member in the orthogroup or if the organism was found to have the same number of tandem duplicates as the most basal member in the orthogroup
E: Expansion – if the organism was found to have a greater number of tandem duplicates than the most basal member in the orthogroup
C: Contraction – if the organism was found to have a lower number of tandem duplicates than the most basal member in the orthogroup
T: Tandem – the specific element in the current record is labeled as a tandem duplicate
R: Retrocopy – the specific element in the current record is labeled as a retrocopy

Column	Name	Description
1	GroupID	Orthogroup identifier
2	Type	Type of genetic elements in the orthogroup (gene or ncGene)
3	ElementCount	Total number of genetic elements in the orthogroup
4	OrgCount	Total number of orgs in the orthogroup
5	Org	Org identifier
6	ElementId	Genetic element identifier for Org
7	Name	Genetic element name for ElementId
8	CopyNumState	If not marked as a duplicate, B if the element has the same local gene count (gene + tandem duplicate count) as the most basal org in the orthogroup, E if the element has a greater gene count than the most basal org in the orthogroup, C if the element has a lower gene count than the most basal org. If marked as a duplicate, T for tandem duplicate or R for retrogene duplicate.
9	ParalogCount	If this element is not a duplicate (tandem or retro), the number of additional paralogs for Org in this orthogroup
10	Chr	Chromosome identifier containing the genetic element ElementId
11	StartBp	Starting base pair for ElementId
12	EndBp	Ending base pair for ElementId

orthogroups.tsv

This file contains one record per orthogroup with a short description for each orthogroup. If a phylogenetic tree is provided, the number of duplicated elements (tandem + retro) is reported. Retrogene counts do not influence whether an orthogroup is considered SingleCopy.

Column	Name	Description
1	GroupID	Orthogroup identifier
2	Type	Type of genetic elements in the orthogroup (gene or ncGene)
3	OrgCount	Total number of orgs in the orthogroup
4	ElementCount	Total number of genetic elements in the orthogroup
5	DuplicateCount	Total number of gene duplicates in the orthogroup
6	SingleCopy	True if the local element count is 1 and if all orgs in the analysis are in the orthogroup
7	Orgs	Comma-separated list of all orgs in the orthogroup
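
As an example of working with this file, the Python sketch below collects the identifiers of all orthogroups flagged as SingleCopy and writes them one per line, which is the single-column layout that synolog_fasta.py accepts for its --orthogroups option. It assumes that the SingleCopy column holds the literal text True and that any header line either starts with # or begins with GroupID; check both against your own output. The function names are hypothetical.

```python
import csv


def single_copy_orthogroups(path):
    """Return GroupIDs of records whose SingleCopy column (column 6) is True."""
    ids = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank lines and an (assumed) header or comment line.
            if not row or row[0].startswith("#") or row[0] == "GroupID":
                continue
            if row[5].strip().lower() == "true":
                ids.append(row[0])
    return ids


def write_id_list(ids, out_path):
    """Write one orthogroup ID per line."""
    with open(out_path, "w") as out:
        for group_id in ids:
            out.write(f"{group_id}\n")
```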

orthogroups.gene.counts.tsv

This file is a matrix of gene counts for each of the orthogroups that is compatible with the software CAFE (Mendes et al. 2020).

Column	Name	Description
1	Desc	Set as (null) as a placeholder for orthogroup description
2	Orthogroup.ID	Orthogroup identifier
3	Org	Number of gene copies in the orthogroup for Org

putative_orthologous_chromosomes.tsv

This file contains the pairwise predictions of orthologous chromosomes in a Synolog analysis.

Column	Name	Description
1	QueryOrg	First org identifier in the pairwise comparison
2	QueryChr	Chromosome identifier for QueryOrg
3	SubjOrg	Second org identifier in the pairwise comparison
4	SubjChr	Chromosome identifier for SubjOrg
5	OrthGeneCnt	Number of syntenic ortholog pairings shared between the two chromosomes
6	ClusterCnt	Number of synteny clusters shared between the two chromosomes
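
One simple way to summarize this file is to keep, for each chromosome in one organism, the chromosome in the other organism that shares the most syntenic orthologs. The Python sketch below does this using the column positions listed above. The function name is hypothetical, and the handling of a header line (skipped if it starts with # or with QueryOrg) is an assumption about the file layout.

```python
import csv


def best_chromosome_matches(path, query_org, subj_org):
    """Map each query chromosome to (subject chromosome, OrthGeneCnt)."""
    best = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#") or row[0] == "QueryOrg":
                continue
            q_org, q_chr, s_org, s_chr = row[:4]
            if (q_org, s_org) != (query_org, subj_org):
                continue
            count = int(row[4])  # column 5: OrthGeneCnt
            if q_chr not in best or count > best[q_chr][1]:
                best[q_chr] = (s_chr, count)
    return best
```

Ranking by OrthGeneCnt is only one reasonable choice; ClusterCnt (column 6) could be used instead, or as a tie-breaker.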

org_ortholog_stats.tsv

This file contains organism specific ortholog statistics across all the other organisms in a Synolog analysis. If a phylogenetic tree is provided, columns containing tree cluster information, segmental duplication information, one to many ortholog relationships, and local gene expansions/contractions are reported.

Tree spanning clusters/elements denote elements that span across all the organisms in the analysis.

Tree partial clusters/elements denote elements that span at least 3 but not all the organisms in the analysis.

Column	Name	Description
1	Org	Org identifier
2	Elements	Total number of genetic elements in the analysis
3	Orthologs	Total number of Elements found orthologous with Org
4	%Orthologs	Percent of Elements found orthologous
5	OrthoGroups	Number of shared orthogroups with Org
6	SyntenicElements	Total number of Elements found syntenic with Org
7	%SyntenicElements	Percent of Elements found syntenic
8	One-to-One	Total number of Elements found to be one-to-one with Org
9	%One-to-One	Percent of Elements found to be one-to-one with Org
10	One-to-Many	Total number of Elements found to be one-to-many with Org
11	%One-to-Many	Percent of Elements found to be one-to-many with Org
12	Many-to-One	Total number of Elements found to be many-to-one with Org
13	%Many-to-One	Percent of Elements found to be many-to-one with Org
14	Many-to-Many	Total number of Elements found to be many-to-many with Org
15	%Many-to-Many	Percent of Elements found to be many-to-many with Org
16	Tree-Spanning-Clusters	Total number of tree spanning clusters shared with Org
17	Tree-Spanning-Elements	Total number of Elements found in Tree-Spanning-Clusters
18	%Tree-Spanning-Elements	Percent of Elements found in Tree-Spanning-Clusters
19	Tree-Partial-Clusters	Total number of tree partial clusters shared with Org
20	Tree-Partial-Elements	Total number of Elements found in Tree-Partial-Clusters
21	%Tree-Partial-Elements	Percent of Elements found in Tree-Partial-Clusters
22	Segmental	Total number of segmental duplication clusters identified using Org
23	SegmentalElements	Total number of segmental duplicated genetic elements identified using Org
24	%SegmentalElements	Percent of Elements identified as segmental duplications using Org
25	#OrthoGroupCopyNumber	List of Element counts found across orthogroups
26	NumOfOrthogroups	List of the number of orthogroups containing a specific Element count
27	#Expansions	Total number of local element expansions
28	%Expansions	Percent of Elements found to be locally expanded
29	Contractions	Total number of local element contractions
30	%Contractions	Percent of Elements found to be locally contracted

overall_orthology_stats.tsv

This file contains the overall ortholog statistics in a Synolog analysis. If a phylogenetic tree is provided, information regarding tandem duplications, retrogenes, tree clusters, and local gene expansions/contractions is reported.

Column	Name	Description
1	Org	Org identifier
2	Elements	Total number of genetic elements for Org in the Synolog analysis
3	Orthologs	Total number of Elements found to be orthologous across the Synolog analysis
4	%Orthologs	Percent of Elements found to be orthologous across the Synolog analysis
5	Orthologs+Duplicates	Total number of Elements found to be orthologous (including tandemly duplicated) across the Synolog analysis
6	%Orthologs+Duplicates	Percent of Elements found to be orthologous (including tandemly duplicated) across the Synolog analysis
7	RetroCopies	Total number of Elements predicted to be a retrocopy
8	%RetroCopies	Percent of Elements predicted to be a retrocopy
9	OrthoGroups	Total number of orthogroups
10	One-to-One-Orthogroups	Total number of one-to-one orthogroups
11	SyntenicElements	Total number of Elements found to be syntenic
12	%SyntenicElements	Percent of Elements found to be syntenic
13	undetermined	Total number of undetermined Elements
14	%Undetermined	Percent of undetermined Elements
16	Elements-With-Tandem-Duplicates	Total number of Elements with a tandem duplicate
17	%Elements-With-Tandem-Duplicates	Percent of Elements with a tandem duplicate
18	Tandem-Duplicates	Total number of Elements found to be a tandem duplicate
19	%Tandem-Duplicates	Percent of Elements found to be a tandem duplicate
20	Local-Expansions	Total number of Elements found to have a local expansion
21	%Local-Expansions	Percent of Elements found to have a local expansion
22	Local-Contractions	Total number of Elements found to have a local contraction
23	%Local-Contractions	Percent of Elements found to have a local contraction
24	Elements-With-Retrocopies	Total number of Elements found to have a retrocopy
25	%Elements-With-Retrocopies	Percent of Elements found to have a retrocopy
26	Tree-Spanning-Clusters	Total number of tree spanning clusters
27	Tree-Spanning-Elements	Total number of Elements found in tree spanning clusters
28	%Tree-Spanning-Elements	Percent of Elements found in tree spanning clusters
29	Tree-Partial-Clusters	Total number of tree partial clusters
30	Tree-Partial-Elements	Total number of Elements found in tree partial clusters
31	%Tree-Partial-Elements	Percent of Elements found in tree partial clusters

orgA-orgB_consyn_cluster_membership.tsv

For each pairwise comparison, Synolog will report the syntenic orthologs alongside their locations.

Column	Name	Description
1	ClusterID	Synteny cluster identifier
2	Type	Type of genetic elements in the pairing (gene or ncGene)
3	OrgA	First org identifier in the pairwise comparison
4	OrgB	Second org identifier in the pairwise comparison
5	IdA	Genetic element identifier for OrgA
6	NameA	Genetic element name for IdA
7	Chr	Chromosome identifier containing the genetic element IdA
8	Index	1-based index along Chr for IdA, sorted by chromosomal position
9	StartBP	Starting base pair for IdA
10	EndBp	Ending base pair for IdA
11	IdB	Genetic element identifier for OrgB
12	NameB	Genetic element name for IdB
13	Chr	Chromosome identifier containing the genetic element IdB
14	Index	1-based index along Chr for IdB, sorted by chromosomal position
15	StartBP	Starting base pair for IdB
16	EndBp	Ending base pair for IdB

orgA-orgB_consyn_clusters.tsv

For each pairwise synteny cluster, Synolog will report a brief description of the cluster including its location relative to both organisms and its size.

Column	Name	Description
1	ID	Synteny cluster identifier
2	ElementCnt	Number of syntenic ortholog pairs in the cluster
3	SlidingWinSize	Size of the sliding window used in the analysis
4	Orientation	Whether the cluster was found in the forward or inverse orientation relative to both organisms
5	OrgA	First org identifier in the pairwise comparison
6	Chr	Chromosome identifier for OrgA that contains the synteny cluster
7	StartIndex	1-based index along Chr for the 5’ most element in the synteny cluster for OrgA, sorted by chromosomal position
8	EndIndex	1-based index along Chr for the 3’ most element in the synteny cluster for OrgA, sorted by chromosomal position
9	StartBp	Start base pair for the synteny cluster on Chr in OrgA
10	EndBp	End base pair for the synteny cluster on Chr in OrgA
11	OrgB	Second org identifier in the pairwise comparison
12	Chr	Chromosome identifier for OrgB that contains the synteny cluster
13	StartIndex	1-based index along Chr for the 5’ most element in the synteny cluster for OrgB, sorted by chromosomal position
14	EndIndex	1-based index along Chr for the 3’ most element in the synteny cluster for OrgB, sorted by chromosomal position
15	StartBp	Start base pair for the synteny cluster on Chr in OrgB
16	EndBp	End base pair for the synteny cluster on Chr in OrgB
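
The Python sketch below shows one way to load this file and keep only the larger synteny clusters. Columns are read by position rather than by name because several names (Chr, StartIndex, EndIndex, StartBp, EndBp) occur once for OrgA and once for OrgB, so a reader keyed on column names would overwrite one value with the other. The Cluster type and function name are hypothetical, and the header handling (a line starting with # or with ID is skipped) is an assumption.

```python
import csv
from typing import NamedTuple


class Cluster(NamedTuple):
    cluster_id: str
    element_count: int
    chr_a: str
    start_a: int
    end_a: int
    chr_b: str
    start_b: int
    end_b: int


def read_clusters(path, min_elements=1):
    """Return clusters with at least min_elements syntenic ortholog pairs."""
    clusters = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row or row[0].startswith("#") or row[0] == "ID":
                continue
            cluster = Cluster(
                cluster_id=row[0],
                element_count=int(row[1]),
                chr_a=row[5], start_a=int(row[8]), end_a=int(row[9]),
                chr_b=row[11], start_b=int(row[14]), end_b=int(row[15]),
            )
            if cluster.element_count >= min_elements:
                clusters.append(cluster)
    return clusters
```

From the returned records, the extent of a cluster in either organism follows directly from its start and end base pairs.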

duplicate_regions.tsv

If a phylogenetic tree is provided, Synolog will generate this file that contains the coordinates of loci that are predicted to harbor tandem duplicates.

Column	Name	Description
1	GroupID	Anchor group identifier
2	Org	Org identifier
3	Chr	Chromosome identifier containing the tandem duplicates
4	Start	Start base pair of the 5’ most gene copy
5	End	End base pair of the 3’ most gene copy
6	TandemCount	Number of tandem duplicates in the region

orgA-orgB_segmentalduplications.tsv

If a phylogenetic tree is provided, Synolog will report the segmental duplications detected for each pairwise combination of organisms.

Column	Name	Description
1	ID	Segmental duplication identifier
2	ElementCount	Total number of segmentally duplicated genetic elements
3	OrgA	Org identifier for the duplicated genetic elements
4	DupID	Genetic element identifier for the duplicated gene copy
5	DupName	Genetic element name for DupID
6	DupChr	Chromosome identifier containing DupID
7	DupIndex	1-based index along DupChr, sorted by chromosomal position
8	StartBP	Start base pair for DupID
9	EndBP	End base pair for DupID
10	ElementID	Genetic element identifier for the syntenic gene copy
11	ElementName	Genetic element name for ElementID
12	Chr	Chromosome identifier containing ElementID
13	Index	1-based index along Chr, sorted by chromosomal position
14	StartBP	Start base pair for ElementID
15	EndBP	End base pair for ElementID
16	OrgB	Org identifier used to identify the DupID
17	ElementID	Genetic element identifier for the syntenic ortholog used to identify DupID
18	ElementName	Genetic element name for ElementID
19	Chr	Chromosome identifier containing ElementID
20	Index	1-based index along Chr, sorted by chromosomal position
21	StartBP	Start base pair for ElementID
22	EndBP	End base pair for ElementID

org_segmentalduplication_clusters.tsv

If a phylogenetic tree is provided, Synolog will generate a file containing clusters of genetic elements that were flagged as a segmental duplication in at least one of the pairwise segmental duplication comparisons.

Column	Name	Description
1	ClusterID	Segmental duplication cluster identifier
2	ElementCount	Total number of genetic elements in the segmental duplication cluster
3	Chrom	Chromosome identifier containing the segmentally duplicated genetic elements
4	ElementID	Genetic element identifier
5	Name	Genetic element name for ElementID
6	Index	1-based index along Chrom, sorted by chromosomal position
7	StartBP	Start base pair for ElementID
8	EndBP	End base pair for ElementID
9	Source	Comma separated list of all element IDs, org IDs, and

tree_clusters_membership.tsv

If a phylogenetic tree is provided, Synolog will generate this file detailing the syntenic genetic elements found to span three or more organisms.

Column	Name	Description
1	Org	Number of orgs in the Synolog analysis
2	TreeClusterID	Tree cluster identifier
3	NumElements	Total number of genetic elements in the tree cluster
4	OrgCount	Total number of orgs in the tree cluster
5	Org	Org identifier
6	Chrom	Chromosome identifier containing the genetic element
7	ElementType	Type of genetic element (gene or ncGene)
8	ElementID	Genetic element identifier for org
9	ElementName	Genetic element name for ElementID
10	Index	1-based index along Chrom, sorted by chromosomal position
11	StartBP	Starting base pair for ElementID
12	EndBP	Ending base pair for ElementID
13	OrthologIDs	A comma separated list of identifiers for syntenic orthologs from the other orgs and the corresponding genetic identifiers

Working with the species cache

The species cache is just a directory containing an ordered set of subdirectories and files that is meant to hold annotation and structure information for each genome in an analysis, along with a set of gene models and related BLAST databases, and the BLAST output files needed to complete a synteny analysis. One can interact with the files in the species cache using standard commands on a UNIX terminal, but it is designed to be managed by the synolog_cctl.py, synolog_blastctl.py, and synolog_linkint.py programs.

One can create as many independent species caches as desired; when running the various synolog programs you will always specify the location of the particular cache for that analysis.

A species cache is structured in the following way. There are two major types of data in the cache. The first is an annotation, which includes the location of genes and similar objects (encoded in a GTF or GFF file) and the structural definition of how contigs and scaffolds are joined together into chromosomes (encoded in an AGP file). The second is a set of gene models (the protein or nucleotide sequence of each transcript, encoded in a FASTA file). For each organism, the cache allows there to be multiple annotations and sets of gene models, since the set of gene models may not change when the underlying assembly changes (say scaffolds are moved around in one version of the assembly versus another; or, as a new assembly is curated, it is improved in stages without re-annotating the genes). This allows one to update a genome without having to re-execute all the underlying BLAST analyses.

So, in the simple case, each organism has one annotation and one set of gene models. But in a more complex case there can be more than one of each of these. Each of these data types is assigned an ID for use in the cache.

Once these data are added to the cache, they are integrated in the cache, which tells Synolog to join one particular annotation with one set of gene models and the associated BLAST hits for those genes. Then, to run the Synolog pipeline, the user will specify the IDs of the genomes they want to include in the analysis, and the synolog_run.py program will form the proper command, similar to the one shown above, in order to execute the pipeline.

The synolog_cctl.py program handles creating a cache, adding various data to a cache, and integrating different sets of data.

Adding data to the cache

Creating the cache

We first create a new cache. The cache consists of a set of defined but ordinary directories, and we can have as many distinct caches as we like, located wherever we like within our disk space.

synolog_cctl.py new -p ./path/to/species_cache

Add a species to the cache

After creating a new cache, we can add one or more species to that cache by choosing a species ID and specifying the species name and common name for that species. Whenever we refer to the species, with respect to Synolog, we will use the species ID we define at this step. (The \ characters just continue the command onto the next line and are not part of the actual command.)

synolog_cctl.py species -p ./path/to/species_cache --id gacu \
    --name "Gasterosteus aculeatus" --common "Threespine stickleback"

Add an annotation to the cache

The annotation file should be in GTF or GFF format and should describe the location of protein-coding (and optionally non-coding) genes in the genome. An annotation can also include an AGP file, which describes how contigs and/or scaffolds are put together into chromosomes. The files can optionally be compressed (gzipped). The user must also provide the species ID to which these annotation files belong. Finally, the user can specify an annotation ID as well. If one is not provided, the annotation will be given the ID "def", for default. Typically, if this is the only genome from this species the user plans to add, then using the default ID is fine, but if the user wants to add two or more genomes from this species, then each annotation should have its own meaningful ID. (The \ characters just continue the command onto the next line and are not part of the actual command.)

Here is the simplest form of the command:

synolog_cctl.py annotation -p ./path/to/species_cache --id gacu --gtf ./data/g.aculeatus/reference/g.aculeatus.gtf

And here is a more complex form:

synolog_cctl.py annotation -p ./path/to/species_cache --id gacu --ann-id gannv1 \
    --gtf ./data/g.aculeatus/reference/g.aculeatus.gtf.gz \
    --agp ./data/g.aculeatus/reference/g.aculeatus.agp.gz

Add a set of genes to the cache

A set of gene models must be supplied for each species in FASTA format (and optionally gzipped) -- these will be used to execute BLAST. The user can choose to supply protein-coding gene models as amino acid sequences (flag --aa); this is the most common way to run BLAST between species and will cause BLASTp to be used. The user can alternatively supply gene models in nucleotides (flag --nuc), which will result in BLASTn being used. Finally, the user can also supply a set of non-coding genes (in nucleotide format) using the --non-coding flag. Non-coding genes will be stored separately, and BLASTn will be executed with a slightly modified set of matching parameters (to optimize for the difference in genetic distance between non-coding genes).

When this command is executed, the FASTA file will be stored in the cache with a standardized name and compressed, and a BLAST database will be generated from the FASTA file.

Here is the simplest form of the command supplying a default ID and amino acid genic sequences:

synolog_cctl.py genes -p ./path/to/species_cache --id gacu --aa ./data/g.aculeatus/reference/g.aculeatus.aa

Here is a form of the command supplying a specific ID and amino acid genic sequences:

synolog_cctl.py genes -p ./path/to/species_cache --id gacu --genes-id gacaav1 \
    --aa ./data/g.aculeatus/reference/g.aculeatus.aa

Finally, here is a form of the command supplying a specific ID and non-coding gene sequences:

synolog_cctl.py genes -p ./path/to/species_cache --id gacu --genes-id gacncv1 \
    --non-coding ./data/g.aculeatus/reference/g.aculeatus.noncoding.fa

Integrating an annotation with a set of genes

The final step is to integrate, or link together, a set of gene models and an underlying genome annotation. In the default case, there is only one of each, so the integration step does not add much. However, in a more complex case, there may be several versions of a genome assembly, with one or more sets of gene models, depending on how/when the underlying assembly was annotated.

Here is the simplest case, integrating a default annotation and a default set of amino acid gene models:

synolog_cctl.py integration -p ./path/to/species_cache --id gacu

And here is a case specifying particular IDs for each annotation and set of gene models. Here we also can supply a description so the user can keep track of different versions of data in the cache.

synolog_cctl.py integration -p ./path/to/species_cache --id gacu --ann-id gannv1 --genes-id gacaav1 \
    --desc "This is the default Ensembl BROAD stickleback assembly"
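
When several species need to be added, the steps in this section can be scripted. The Python sketch below assembles the simplest form of each synolog_cctl.py command shown above (new, species, annotation, genes, integration) for one species and, by default, only prints them. The function names and the dry_run switch are hypothetical conveniences; the subcommands and options are limited to those that appear in this section.

```python
import shlex
import subprocess


def cache_setup_commands(cache, species_id, name, common, gtf, aa_fasta,
                         new_cache=True):
    """Return the synolog_cctl.py argument lists for one species, in order."""
    base = ["synolog_cctl.py"]
    commands = []
    if new_cache:
        commands.append(base + ["new", "-p", cache])
    commands.append(base + ["species", "-p", cache, "--id", species_id,
                            "--name", name, "--common", common])
    commands.append(base + ["annotation", "-p", cache, "--id", species_id,
                            "--gtf", gtf])
    commands.append(base + ["genes", "-p", cache, "--id", species_id,
                            "--aa", aa_fasta])
    commands.append(base + ["integration", "-p", cache, "--id", species_id])
    return commands


def run_all(commands, dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_all(cache_setup_commands(
        "./path/to/species_cache", "gacu",
        "Gasterosteus aculeatus", "Threespine stickleback",
        "./data/g.aculeatus/reference/g.aculeatus.gtf",
        "./data/g.aculeatus/reference/g.aculeatus.aa",
    ))
```

For a second species in the same cache, pass new_cache=False so that the cache is not created again.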

Running BLAST

Executing BLAST runs

Linking BLAST results

Run the pipeline using the species cache

Visualizing the results

Synolog is able to readily generate figures using the synolog_plot.py utility. There are five types of figures that are supported.

genome – visualize synteny at the genome-wide level
region – visualize synteny at a specific region/locus
synteny – visualize a synteny block inferred by Synolog
segmental – visualize a pairwise segmental duplication inferred by Synolog
tree_cluster – visualize a tree cluster inferred by Synolog

Exporting data from Synolog

Beyond the output files produced in a Synolog analysis, Synolog can export orthogroups in FASTA format for downstream analysis. The utility synolog_fasta.py makes use of the species cache along with the output files generated.

# all sequences for all orthogroups
% synolog_fasta.py --path /path/to/synolog/output_files/ \
    --cache /path/to/species_cache/ \
    --out-path /path/to/output_directory/ \
    --all

# longest sequence for all orthogroups
% synolog_fasta.py --path /path/to/synolog/output_files/ \
    --cache /path/to/species_cache/ \
    --out-path /path/to/output_directory/ \
    --all \
    --longest

# longest sequence for all single copy orthogroups
% synolog_fasta.py --path /path/to/synolog/output_files/ \
    --cache /path/to/species_cache/ \
    --out-path /path/to/output_directory/ \
    --single_copy \
    --longest

# longest sequence for a specific set of orthogroups
% synolog_fasta.py --path /path/to/synolog/output_files/ \
    --cache /path/to/species_cache/ \
    --out-path /path/to/output_directory/ \
    --longest \
    --orthogroups orthogroup_ids.txt  # single column list of orthogroup IDs
