Gio Madrigal¹ and Julian Catchen¹
¹Department of Evolution, Ecology, and Behavior, University of Illinois at Urbana-Champaign, Urbana, Illinois, 61820 USA
Synolog is a software suite designed to identify orthologs across species using sequence similarity and blocks of conserved synteny. The core program, synolog, is implemented in C++ and parallelized using OpenMP. We also provide a suite of Python scripts to manage data input and export, as well as to visualize syntenic regions.
Synolog should build on any standard UNIX-like environment (Apple OS X, Linux, etc.). The supporting scripts are written in Python v3.
The only dependencies required for the software are Python modules included in the standard library, as well as the Python Cairo library (PyCairo) if you want to generate visualizations.
Synolog uses the output of BLAST to determine sequence similarity between genes/genome objects, so in most cases you will want to make sure BLAST is installed and available on the executable path (or you can give Synolog a local path to it when executing).
Synolog uses the standard autotools install:
    % tar xfvz synolog-1.xx.tar.gz
    % cd synolog-1.xx
    % ./configure
    % make
    (become root)
    # make install
    (or, use sudo)
    % sudo make install
You can change the root of the install location (/usr/local/ on most operating systems) by specifying the --prefix command line option to the configure script.
% ./configure --prefix=/home/smith/local
Or, if you want to group the install in a single location, e.g.,
% ./configure --prefix=/projects/smithlab/local/opt/synolog-v1.0
You can speed up the build if you have more than one processor:
% make -j 8
A default install will install files in the following way:
| /usr/local/bin | Synolog executable and Python scripts. |
Install the PyCairo Python module for visualizing synteny blocks.
The pipeline is now ready to run.
Synolog requires the following data for each genome you wish to compare:
AGP structure files and GTF or GFF annotation files are generally available from repositories such as Ensembl or NCBI, distributed with a genome assembly. This is also true of FASTA files containing gene sequences. Alternatively, if you are generating your own assembly, these files are output by annotation pipelines such as Braker. If a genome is distributed without an AGP file, a default file can be generated from the FASTA file containing the genome assembly with several available scripts, such as agp2fasta.py distributed with Synolog.
In simple cases, we can directly specify the required and/or optional files to synolog and execute it. This requires us to have produced sets of reciprocal BLAST hits for each pair of genomes we want to examine.
For this example, we will consider three species: the gulf pipefish (Syngnathus scovelli), the threespine stickleback (Gasterosteus aculeatus), and the platyfish (Xiphophorus maculatus). In this hypothetical example, we have assembled and annotated the pipefish ourselves, which gave us a GFF annotation, and downloaded the other two species from Ensembl, which distributes GTF files.
To reference each genome in the analysis, we are going to assign it an organism ID that we create. We will use ssc for pipefish, gac for stickleback, and xma for platyfish.
Prior to running synolog we created BLAST databases from the FASTA files containing the set of genes from each species and then ran blastp pairwise (with the -outfmt 6 option to give us TSV files):
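As a rough sketch of this step, the following Python snippet assembles makeblastdb and blastp command lines for every ordered pair of organisms. The directory layout (./genes, ./blast), the FASTA file names, and the query_subject.tsv naming scheme are assumptions made for illustration; only the use of blastp and the -outfmt 6 option come from the text above.

```python
from itertools import permutations
import shlex


def blast_commands(orgs, fasta_dir="./genes", blast_dir="./blast"):
    """Build makeblastdb and blastp argument lists for every ordered pair.

    The directory layout and the <query>_<subject>.tsv naming are
    assumptions made for this sketch, not Synolog requirements.
    """
    cmds = []
    # One protein database per organism.
    for org in orgs:
        cmds.append(["makeblastdb", "-in", f"{fasta_dir}/{org}.fa",
                     "-dbtype", "prot", "-out", f"{blast_dir}/{org}"])
    # Reciprocal searches: every organism is queried against every other one.
    for query, subject in permutations(orgs, 2):
        cmds.append(["blastp", "-query", f"{fasta_dir}/{query}.fa",
                     "-db", f"{blast_dir}/{subject}",
                     "-outfmt", "6",
                     "-out", f"{blast_dir}/{query}_{subject}.tsv"])
    return cmds


if __name__ == "__main__":
    # Print the commands for inspection rather than running them.
    for cmd in blast_commands(["ssc", "gac", "xma"]):
        print(shlex.join(cmd))
```

For three organisms this yields three database builds and six searches. The resulting TSV files can optionally be gzipped before they are passed to synolog.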
Now we are ready to run synolog. We use options to tell synolog what file formats to expect, and those options persist along the command line until a different option occurs. So, gzgff, gff, gzgtf, and gtf tell the software what format the succeeding annotation files are in, while gzblast or blast tell the software whether the succeeding BLAST hit files are compressed or not. Here is the final command (the \ characters just continue the command onto the next line):
synolog -o ./synolog/ \
    --ann_type gzgff \
    --org ssc --annotation ./gff/ssc_v1.gff.gz \
    --ann_type gzgtf \
    --org gac --annotation ./gtf/Gasterosteus_aculeatus.BROADS1.81.gtf.gz \
    --org xma --annotation ./gtf/Xiphophorus_maculatus.Xipmac4.4.2.81.gtf.gz \
    --hit_type gzblast \
    --hits ./blast/gac_ssc.tsv.gz \
    --hits ./blast/gac_xma.tsv.gz \
    --hits ./blast/ssc_gac.tsv.gz \
    --hits ./blast/ssc_xma.tsv.gz \
    --hits ./blast/xma_gac.tsv.gz \
    --hits ./blast/xma_ssc.tsv.gz
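The persistence of the format options can be illustrated with a small Python sketch that assembles such a command from a list of organisms. The function name and its arguments are hypothetical; the option names (-o, --ann_type, --org, --annotation, --hit_type, --hits) are the ones used in the command above. A new --ann_type is emitted only when the format changes from one annotation to the next.

```python
import shlex


def build_synolog_command(out_dir, annotations, hits, hit_type="gzblast"):
    """Assemble a synolog argument list.

    annotations: list of (org_id, ann_type, path) tuples.
    hits:        list of BLAST hit file paths, all of format hit_type.
    """
    cmd = ["synolog", "-o", out_dir]
    current_type = None
    for org, ann_type, path in annotations:
        # Format options persist, so repeat --ann_type only on a change.
        if ann_type != current_type:
            cmd += ["--ann_type", ann_type]
            current_type = ann_type
        cmd += ["--org", org, "--annotation", path]
    cmd += ["--hit_type", hit_type]
    for path in hits:
        cmd += ["--hits", path]
    return cmd


if __name__ == "__main__":
    annotations = [
        ("ssc", "gzgff", "./gff/ssc_v1.gff.gz"),
        ("gac", "gzgtf", "./gtf/Gasterosteus_aculeatus.BROADS1.81.gtf.gz"),
        ("xma", "gzgtf", "./gtf/Xiphophorus_maculatus.Xipmac4.4.2.81.gtf.gz"),
    ]
    hits = [f"./blast/{a}_{b}.tsv.gz"
            for a in ("gac", "ssc", "xma")
            for b in ("gac", "ssc", "xma") if a != b]
    print(shlex.join(build_synolog_command("./synolog/", annotations, hits)))
```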
It is easy to see that as more organisms are added to an analysis, keeping track of and specifying the correct files to synolog becomes more difficult. That is why we recommend using the species cache: this system, and the programs designed to interact with it, automate and/or order many of these steps.
This is a single column output file that lists all the organisms used in a Synolog analysis.
| Column | Name | Description |
|---|---|---|
| 1 | Org | Org identifier in the analysis |
This file is generated at the start of a Synolog analysis and lists all the BLAST hit files being used.
| Column | Name | Description |
|---|---|---|
| 1 | Hit File | Name of cached blast hits file between the Query and Subject |
| 2 | Query | Org identifier that was set as the query in the BLAST step |
| 3 | Subject | Org identifier that was set as the subject in the BLAST step |
| 4 | File Type | A short tag denoting whether the hit file is compressed or uncompressed |
For each organism in a Synolog analysis, this file records the location of all the annotated genetic elements.
| Column | Name | Description |
|---|---|---|
| 1 | Org | Org identifier |
| 2 | ID | Genetic Element identifier |
| 3 | Name | Genetic Element Name for ID |
| 4 | Chrom | Chromosome identifier containing ID |
| 5 | StartBp | Starting base pair for ID |
| 6 | EndBp | Ending base pair for ID |
| 7 | Orientation | Chromosomal orientation (+ or -) for ID |
For each organism in a Synolog analysis, the information denoting the structure of the chromosomal sequences is written to a list.
| Column | Name | Description |
|---|---|---|
| 1 | Chromosome | Chromosome identifier |
| 2 | Scaffold | Scaffold identifier |
| 3 | StartBp | Start base pair for Scaffold |
| 4 | EndBp | End base pair for Scaffold |
This file contains a list of all the chromosomal sequences and their corresponding lengths in a Synolog analysis.
| Column | Name | Description |
|---|---|---|
| 1 | Org | Org identifier |
| 2 | Chromosome | Chromosome identifier |
| 3 | Length | Chromosome length in base pairs |
For each pairwise comparison, Synolog will write a homologs file containing the homolog pairings found between two organisms.
| Column | Name | Description |
|---|---|---|
| 1 | GroupID | Homolog group identifier |
| 2 | Type | Type of genetic element (gene or ncGene) |
| 3 | OrgA | First org identifier in the pairwise comparison |
| 4 | OrgB | Second org identifier in the pairwise comparison |
| 5 | IdA | Genetic element identifier for OrgA |
| 6 | NameA | Genetic element name for IdA |
| 7 | Chr | Chromosome identifier containing the genetic element IdA |
| 8 | StartBP | Starting base pair for IdA |
| 9 | EndBp | Ending base pair for IdA |
| 10 | IdB | Genetic element identifier for OrgB |
| 11 | NameB | Genetic element name for IdB |
| 12 | Chr | Chromosome identifier containing the genetic element IdB |
| 13 | StartBP | Starting base pair for IdB |
| 14 | EndBp | Ending base pair for IdB |
| 15 | SyntenyCluster | If not None, synteny cluster identifier if IdA and IdB are syntenic |
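A minimal Python sketch for reading such a homologs file and keeping only the syntenic pairs is shown here. It assumes the file is tab-delimited, that any header line starts with "#" or "GroupID", and that non-syntenic pairs carry the literal string None in the last column; none of these details are confirmed above, so adjust them to the actual output. The two duplicated Chr/StartBP/EndBp column names are suffixed with A and B in the sketch to keep the dictionary keys unique, and the sample rows are invented.

```python
import csv
import io

# Column names follow the table above; A/B suffixes are added in this sketch.
HOMOLOG_COLUMNS = [
    "GroupID", "Type", "OrgA", "OrgB",
    "IdA", "NameA", "ChrA", "StartBpA", "EndBpA",
    "IdB", "NameB", "ChrB", "StartBpB", "EndBpB",
    "SyntenyCluster",
]


def read_homologs(handle):
    """Yield one dict per homolog pairing from a tab-delimited handle."""
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and anything that looks like a header (assumption).
        if not row or row[0].startswith("#") or row[0] == "GroupID":
            continue
        yield dict(zip(HOMOLOG_COLUMNS, row))


def syntenic_pairs(records):
    """Return (IdA, IdB, cluster) for pairings assigned to a synteny cluster."""
    return [(r["IdA"], r["IdB"], r["SyntenyCluster"])
            for r in records if r["SyntenyCluster"] != "None"]


if __name__ == "__main__":
    # Invented rows, for illustration only.
    sample = (
        "G1\tgene\tssc\tgac\ta1\tnameA1\tchr1\t10\t20\tb1\tnameB1\tgrpI\t5\t15\tC7\n"
        "G2\tgene\tssc\tgac\ta2\tnameA2\tchr1\t30\t40\tb2\tnameB2\tgrpII\t50\t60\tNone\n"
    )
    print(syntenic_pairs(read_homologs(io.StringIO(sample))))
```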
After all ortholog relationships have been determined, Synolog merges all the orthologs into orthogroups. If a phylogenetic tree was provided, the CopyNumState and ParalogCount columns are included in this file.
| Column | Name | Description |
|---|---|---|
| 1 | GroupID | Orthogroup identifier |
| 2 | Type | Type of genetic elements in the orthogroup (gene or ncGene) |
| 3 | ElementCount | Total number of genetic elements in the orthogroup |
| 4 | OrgCount | Total number of orgs in the orthogroup |
| 5 | Org | Org identifier |
| 6 | ElementId | Genetic element identifier for Org |
| 7 | Name | Genetic element name for ElementId |
| 8 | CopyNumState | If not marked as a duplicate: B if the element has the same local gene count (gene + tandem duplicate count) as the most basal org in the orthogroup, E if the element has a greater gene count than the most basal org in the orthogroup, C if the element has a lower gene count than the most basal org. If marked as a duplicate: T for tandem duplicate or R for retrogene duplicate. |
| 9 | ParalogCount | If this element is not a duplicate (tandem or retro), the number of additional paralogs for Org in this orthogroup |
| 10 | Chr | Chromosome identifier containing the genetic element ElementId |
| 11 | StartBp | Starting base pair for ElementId |
| 12 | EndBp | Ending base pair for ElementId |
This file contains one record per orthogroup with a short description for each orthogroup. If a phylogenetic tree is provided, the number of duplicated elements (tandem + retro) is reported. Retrogene counts do not influence whether an orthogroup is considered SingleCopy.
| Column | Name | Description |
|---|---|---|
| 1 | GroupID | Orthogroup identifier |
| 2 | Type | Type of genetic elements in the orthogroup (gene or ncGene) |
| 3 | OrgCount | Total number of orgs in the orthogroup |
| 4 | ElementCount | Total number of genetic elements in the orthogroup |
| 5 | DuplicateCount | Total number of gene duplicates in the orthogroup |
| 6 | SingleCopy | True if the local element count is 1 and if all orgs in the analysis are in the orthogroup |
| 7 | Orgs | Comma-separated list of all orgs in the orthogroup |
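To pull the single-copy orthogroups out of this summary, something along the following lines could be used. This is a sketch under assumptions: the file is taken to be tab-delimited, a header line (if present) is assumed to start with "#" or "GroupID", and the SingleCopy column is assumed to hold the literal string True, as the description suggests. The sample rows are invented.

```python
import csv
import io

# Column order as listed in the table above.
SUMMARY_COLUMNS = ["GroupID", "Type", "OrgCount", "ElementCount",
                   "DuplicateCount", "SingleCopy", "Orgs"]


def single_copy_orthogroups(handle):
    """Return the GroupIDs whose SingleCopy column is 'True'."""
    ids = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and header-like lines (assumption).
        if not row or row[0].startswith("#") or row[0] == "GroupID":
            continue
        record = dict(zip(SUMMARY_COLUMNS, row))
        if record["SingleCopy"] == "True":
            ids.append(record["GroupID"])
    return ids


if __name__ == "__main__":
    # Invented rows, for illustration only.
    sample = (
        "OG1\tgene\t3\t3\t0\tTrue\tssc,gac,xma\n"
        "OG2\tgene\t2\t4\t1\tFalse\tssc,gac\n"
    )
    # One ID per line gives a single-column list of orthogroup IDs.
    print("\n".join(single_copy_orthogroups(io.StringIO(sample))))
```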
This file is a matrix of gene counts for each orthogroup that is compatible with the software CAFE (Mendes et al. 2020).
| Column | Name | Description |
|---|---|---|
| 1 | Desc | Set as (null) as a placeholder for orthogroup description |
| 2 | Orthogroup.ID | Orthogroup identifier |
| 3 | Org | Number of gene copies in the orthogroup for Org |
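The layout of this matrix can be sketched in Python as follows. The snippet tallies (orthogroup, org) pairs into one row per orthogroup with one count column per organism. It is only an illustration of the table structure: naming the count columns after the org IDs is an assumption, and the sketch counts every listed element, which may differ from how Synolog treats duplicates when it produces this file.

```python
from collections import Counter, defaultdict


def count_matrix(records, orgs):
    """Build tab-delimited matrix lines from (group_id, org) pairs.

    records: iterable of (orthogroup ID, org ID), one per genetic element.
    orgs:    org IDs, in the column order wanted in the output.
    """
    counts = defaultdict(Counter)
    for group_id, org in records:
        counts[group_id][org] += 1
    # Header: Desc, Orthogroup.ID, then one column per org (assumed naming).
    lines = ["\t".join(["Desc", "Orthogroup.ID"] + list(orgs))]
    for group_id in sorted(counts):
        # Orgs absent from an orthogroup get a count of 0.
        row = ["(null)", group_id] + [str(counts[group_id][org]) for org in orgs]
        lines.append("\t".join(row))
    return lines


if __name__ == "__main__":
    # Invented element assignments, for illustration only.
    records = [("OG1", "ssc"), ("OG1", "gac"), ("OG1", "gac"), ("OG2", "xma")]
    print("\n".join(count_matrix(records, ["ssc", "gac", "xma"])))
```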
This file contains the pairwise predictions of orthologous chromosomes in a Synolog analysis.
| Column | Name | Description |
|---|---|---|
| 1 | QueryOrg | First org identifier in the pairwise comparison |
| 2 | QueryChr | Chromosome identifier for QueryOrg |
| 3 | SubjOrg | Second org identifier in the pairwise comparison |
| 4 | SubjChr | Chromosome identifier for SubjOrg |
| 5 | OrthGeneCnt | Number of syntenic ortholog pairings shared between the two chromosomes |
| 6 | ClusterCnt | Number of synteny clusters shared between the two chromosomes |
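One way to use this file is to pick, for every query chromosome, the subject chromosome sharing the most syntenic orthologs. The sketch below does this in Python, assuming a tab-delimited file whose header line (if any) starts with "#" or "QueryOrg". Ranking by OrthGeneCnt is a choice made for this example and is not necessarily how Synolog ranks chromosome pairs; the sample rows are invented.

```python
import csv
import io


def best_chromosome_matches(handle):
    """Map (QueryOrg, QueryChr, SubjOrg) to (SubjChr, OrthGeneCnt).

    For each query chromosome and subject organism, keep the subject
    chromosome with the highest OrthGeneCnt.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip short, blank, or header-like lines (assumption).
        if len(row) < 6 or row[0].startswith("#") or row[0] == "QueryOrg":
            continue
        q_org, q_chr, s_org, s_chr = row[:4]
        gene_cnt = int(row[4])
        key = (q_org, q_chr, s_org)
        if key not in best or gene_cnt > best[key][1]:
            best[key] = (s_chr, gene_cnt)
    return best


if __name__ == "__main__":
    # Invented rows, for illustration only.
    sample = (
        "ssc\tchr1\tgac\tgrpI\t120\t8\n"
        "ssc\tchr1\tgac\tgrpII\t15\t2\n"
        "ssc\tchr2\tgac\tgrpII\t90\t5\n"
    )
    for key, value in best_chromosome_matches(io.StringIO(sample)).items():
        print(key, value)
```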
This file contains organism specific ortholog statistics across all the other organisms in a Synolog analysis. If a phylogenetic tree is provided, columns containing tree cluster information, segmental duplication information, one to many ortholog relationships, and local gene expansions/contractions are reported.
Tree spanning clusters/elements denote elements that span across all the organisms in the analysis.
Tree partial clusters/elements denote elements that span at least 3 but not all the organisms in the analysis.
| Column | Name | Description |
|---|---|---|
| 1 | Org | Org identifier |
| 2 | Elements | Total number of genetic elements in the analysis |
| 3 | Orthologs | Total number of Elements found orthologous with Org |
| 4 | %Orthologs | Percent of Elements found orthologous |
| 5 | OrthoGroups | Number of shared orthogroups with Org |
| 6 | SyntenicElements | Total number of Elements found syntenic with Org |
| 7 | %SyntenicElements | Percent of Elements found syntenic |
| 8 | One-to-One | Total number of Elements found to be one-to-one with Org |
| 9 | %One-to-One | Percent of Elements found to be one-to-one with Org |
| 10 | One-to-Many | Total number of Elements found to be one-to-many with Org |
| 11 | %One-to-Many | Percent of Elements found to be one-to-many with Org |
| 12 | Many-to-One | Total number of Elements found to be many-to-one with Org |
| 13 | %Many-to-One | Percent of Elements found to be many-to-one with Org |
| 14 | Many-to-Many | Total number of Elements found to be many-to-many with Org |
| 15 | %Many-to-Many | Percent of Elements found to be many-to-many with Org |
| 16 | Tree-Spanning-Clusters | Total number of tree spanning clusters shared with Org |
| 17 | Tree-Spanning-Elements | Total number of Elements found in Tree-Spanning-Clusters |
| 18 | %Tree-Spanning-Elements | Percent of Elements found in Tree-Spanning-Clusters |
| 19 | Tree-Partial-Clusters | Total number of tree partial clusters shared with Org |
| 20 | Tree-Partial-Elements | Total number of Elements found in Tree-Partial-Clusters |
| 21 | %Tree-Partial-Elements | Percent of Elements found in Tree-Partial-Clusters |
| 22 | Segmental | Total number of segmental duplication clusters identified using Org |
| 23 | SegmentalElements | Total number of segmental duplicated genetic elements identified using Org |
| 24 | %SegmentalElements | Percent of Elements identified as segmental duplications using Org |
| 25 | #OrthoGroupCopyNumber | List of Element counts found across orthogroups |
| 26 | NumOfOrthogroups | List of the number of orthogroups containing a specific Element count |
| 27 | #Expansions | Total number of local element expansions |
| 28 | %Expansions | Percent of Elements found to be locally expanded |
| 29 | Contractions | Total number of local element contractions |
| 30 | %Contractions | Percent of Elements found to be locally contracted |
This file contains the overall ortholog statistics in a Synolog analysis. If a phylogenetic tree is provided, information regarding tandem duplications, retrogenes, tree clusters, and local gene expansions/contractions is reported.
| Column | Name | Description |
|---|---|---|
| 1 | Org | Org identifier |
| 2 | Elements | Total number of genetic elements for Org in the Synolog analysis |
| 3 | Orthologs | Total number of Elements found to be orthologous across the Synolog analysis |
| 4 | %Orthologs | Percent of Elements found to be orthologous across the Synolog analysis |
| 5 | Orthologs+Duplicates | Total number of Elements found to be orthologous (including tandemly duplicated) across the Synolog analysis |
| 6 | %Orthologs+Duplicates | Percent of Elements found to be orthologous (including tandemly duplicated) across the Synolog analysis |
| 7 | RetroCopies | Total number of Elements predicted to be a retrocopy |
| 8 | %RetroCopies | Percent of Elements predicted to be a retrocopy |
| 9 | OrthoGroups | Total number of orthogroups |
| 10 | One-to-One-Orthogroups | Total number of one-to-one orthogroups |
| 11 | SyntenicElements | Total number of Elements found to be syntenic |
| 12 | %SyntenicElements | Percent of Elements found to be syntenic |
| 13 | undetermined | Total number of undetermined Elements |
| 14 | %Undetermined | Percent of undetermined Elements |
| 16 | Elements-With-Tandem-Duplicates | Total number of Elements with a tandem duplicate |
| 17 | %Elements-With-Tandem-Duplicates | Percent of Elements with a tandem duplicate |
| 18 | Tandem-Duplicates | Total number of Elements found to be a tandem duplicate |
| 19 | %Tandem-Duplicates | Percent of Elements found to be a tandem duplicate |
| 20 | Local-Expansions | Total number of Elements found to have a local expansion |
| 21 | %Local-Expansions | Percent of Elements found to have a local expansion |
| 22 | Local-Contractions | Total number of Elements found to have a local contraction |
| 23 | %Local-Contractions | Percent of Elements found to have a local contraction |
| 24 | Elements-With-Retrocopies | Total number of Elements found to have a retrocopy |
| 25 | %Elements-With-Retrocopies | Percent of Elements found to have a retrocopy |
| 26 | Tree-Spanning-Clusters | Total number of tree spanning clusters |
| 27 | Tree-Spanning-Elements | Total number of Elements found in tree spanning clusters |
| 28 | %Tree-Spanning-Elements | Percent of Elements found in tree spanning clusters |
| 29 | Tree-Partial-Clusters | Total number of tree partial clusters |
| 30 | Tree-Partial-Elements | Total number of Elements found in tree partial clusters |
| 31 | %Tree-Partial-Elements | Percent of Elements found in tree partial clusters |
For each pairwise comparison, Synolog will report the syntenic orthologs alongside their locations.
| Column | Name | Description |
|---|---|---|
| 1 | ClusterID | Synteny cluster identifier |
| 2 | Type | Type of genetic elements in the pairing (gene or ncGene) |
| 3 | OrgA | First org identifier in the pairwise comparison |
| 4 | OrgB | Second org identifier in the pairwise comparison |
| 5 | IdA | Genetic element identifier for OrgA |
| 6 | NameA | Genetic element name for IdA |
| 7 | Chr | Chromosome identifier containing the genetic element IdA |
| 8 | Index | 1-based index along Chr for IdA, sorted by chromosomal position |
| 9 | StartBP | Starting base pair for IdA |
| 10 | EndBp | Ending base pair for IdA |
| 11 | IdB | Genetic element identifier for OrgB |
| 12 | NameB | Genetic element name for IdB |
| 13 | Chr | Chromosome identifier containing the genetic element IdB |
| 14 | Index | 1-based index along Chr for IdB, sorted by chromosomal position |
| 15 | StartBP | Starting base pair for IdB |
| 16 | EndBp | Ending base pair for IdB |
For each pairwise synteny cluster, Synolog will report a brief description of the cluster including its location relative to both organisms and its size.
| Column | Name | Description |
|---|---|---|
| 1 | ID | Synteny cluster identifier |
| 2 | ElementCnt | Number of syntenic ortholog pairs in the cluster |
| 3 | SlidingWinSize | Size of the sliding window used in the analysis |
| 4 | Orientation | Whether the cluster was found in the forward or inverse orientation relative to both organisms |
| 5 | OrgA | First org identifier in the pairwise comparison |
| 6 | Chr | Chromosome identifier for OrgA that contains the synteny cluster |
| 7 | StartIndex | 1-based index along Chr for the 5’ most element in the synteny cluster for OrgA, sorted by chromosomal position |
| 8 | EndIndex | 1-based index along Chr for the 3’ most element in the synteny cluster for OrgA, sorted by chromosomal position |
| 9 | StartBp | Start base pair for the synteny cluster on Chr in OrgA |
| 10 | EndBp | End base pair for the synteny cluster on Chr in OrgA |
| 11 | OrgB | Second org identifier in the pairwise comparison |
| 12 | Chr | Chromosome identifier for OrgB that contains the synteny cluster |
| 13 | StartIndex | 1-based index along Chr for the 5’ most element in the synteny cluster for OrgB, sorted by chromosomal position |
| 14 | EndIndex | 1-based index along Chr for the 3’ most element in the synteny cluster for OrgB, sorted by chromosomal position |
| 15 | StartBp | Start base pair for the synteny cluster on Chr in OrgB |
| 16 | EndBp | End base pair for the synteny cluster on Chr in OrgB |
If a phylogenetic tree is provided, Synolog will generate this file that contains the coordinates of loci that are predicted to harbor tandem duplicates.
| Column | Name | Description |
|---|---|---|
| 1 | GroupID | Anchor group identifier |
| 2 | Org | Org identifier |
| 3 | Chr | Chromosome identifier containing the tandem duplicates |
| 4 | Start | Start base pair of the 5’ most gene copy |
| 5 | End | End base pair of the 3’ most gene copy |
| 6 | TandemCount | Number of tandem duplicates in the region |
If a phylogenetic tree is provided, Synolog will report the segmental duplications detected for each pairwise combination of organisms.
| Column | Name | Description |
|---|---|---|
| 1 | ID | Segmental duplication identifier |
| 2 | ElementCount | Total number of segmentally duplicated genetic elements |
| 3 | OrgA | Org identifier for the duplicated genetic elements |
| 4 | DupID | Genetic element identifier for the duplicated gene copy |
| 5 | DupName | Genetic element name for DupID |
| 6 | DupChr | Chromosome identifier containing DupID |
| 7 | DupIndex | 1-based index along DupChr, sorted by chromosomal position |
| 8 | StartBP | Start base pair for DupID |
| 9 | EndBP | End base pair for DupID |
| 10 | ElementID | Genetic element identifier for the syntenic gene copy |
| 11 | ElementName | Genetic element name for ElementID |
| 12 | Chr | Chromosome identifier containing ElementID |
| 13 | Index | 1-based index along Chr, sorted by chromosomal position |
| 14 | StartBP | Start base pair for ElementID |
| 15 | EndBP | End base pair for ElementID |
| 16 | OrgB | Org identifier used to identify the DupID |
| 17 | ElementID | Genetic element identifier for the syntenic ortholog used to identify DupID |
| 18 | ElementName | Genetic element name for ElementID |
| 19 | Chr | Chromosome identifier containing ElementID |
| 20 | Index | 1-based index along Chr, sorted by chromosomal position |
| 21 | StartBP | Start base pair for ElementID |
| 22 | EndBP | End base pair for ElementID |
If a phylogenetic tree is provided, Synolog will generate a file containing clusters of genetic elements that were flagged as a segmental duplication in at least one of the pairwise segmental duplication comparisons.
| Column | Name | Description |
|---|---|---|
| 1 | ClusterID | Segmental duplication cluster identifier |
| 2 | ElementCount | Total number of genetic elements in the segmental duplication cluster |
| 3 | Chrom | Chromosome identifier containing the segmentally duplicated genetic elements |
| 4 | ElementID | Genetic element identifier |
| 5 | Name | Genetic element name for ElementID |
| 6 | Index | 1-based index along Chrom, sorted by chromosomal position |
| 7 | StartBP | Start base pair for ElementID |
| 8 | EndBP | End base pair for ElementID |
| 9 | Source | Comma separated list of all element IDs, org IDs, and |
If a phylogenetic tree is provided, Synolog will generate this file detailing the syntenic genetic elements found to span three or more organisms.
| Column | Name | Description |
|---|---|---|
| 1 | Org | Number of orgs in the Synolog analysis |
| 2 | TreeClusterID | Tree cluster identifier |
| 3 | NumElements | Total number of genetic elements in the tree cluster |
| 4 | OrgCount | Total number of orgs in the tree cluster |
| 5 | Org | Org identifier |
| 6 | Chrom | Chromosome identifier containing the genetic element |
| 7 | ElementType | Type of genetic element (gene or ncGene) |
| 8 | ElementID | Genetic element identifier for org |
| 9 | ElementName | Genetic element name for ElementID |
| 10 | Index | 1-based index along Chrom, sorted by chromosomal position |
| 11 | StartBP | Starting base pair for ElementID |
| 12 | EndBP | Ending base pair for ElementID |
| 13 | OrthologIDs | A comma separated list of identifiers for syntenic orthologs from the other orgs and the corresponding genetic identifiers |
The species cache is just a directory containing an ordered set of subdirectories and files that is meant to hold annotation and structure information for each genome in an analysis, along with a set of gene models, the related BLAST databases, and the BLAST output files needed to complete a synteny analysis. One can interact with the files in the species cache using standard commands on a UNIX terminal, but it is designed to be managed by the synolog_cctl.py, synolog_blastctl.py, and synolog_linkint.py programs.
One can create as many independent species caches as desired; when running the various synolog programs you will always specify the location of the particular cache for that analysis.
A species cache is structured in the following way. There are two major types of data in the cache. The first is an annotation, which includes the location of genes and similar objects (encoded in a GTF or GFF file) and the structural definition of how contigs and scaffolds are joined together into chromosomes (encoded in an AGP file). The second is a set of gene models (the protein or nucleotide sequence of each transcript, encoded in a FASTA file). For each organism, the cache allows for there to be multiple annotations and sets of gene models, since the set of gene models may not change when the underlying assembly changes (say scaffolds are moved around in one version of the assembly versus another; or, as a new assembly is curated, it is improved in stages without re-annotating the genes). This allows one to update a genome without having to re-execute all the underlying BLAST analyses.
So, in the simple case, each organism has one annotation and one set of gene models. But in a more complex case there can be more than one of each of these. Each annotation and each set of gene models is assigned an ID for use in the cache.
Once these data are added to the cache, they are integrated, which tells Synolog to join one particular annotation with one set of gene models and the associated BLAST hits for those genes. Then, to run the Synolog pipeline, the user will specify the IDs of the genomes they want to include in the analysis and the synolog_run.py program will form the proper command, similar to the one shown above, in order to execute the pipeline.
The synolog_cctl.py program handles creating a cache, adding various data to a cache, and integrating different sets of data.
We first create a new cache. The cache consists of a set of defined but ordinary directories, and we can have as many distinct caches as we like, located wherever we like within our disk space.
synolog_cctl.py new -p ./path/to/species_cache
After creating a new cache, we can add one or more species to that cache by choosing a species ID and specifying the species name and common name for that species. Whenever we refer to the species, with respect to Synolog, we will use the species ID we define at this step. (The \ characters just continue the command onto the next line and are not part of the actual command.)
synolog_cctl.py species -p ./path/to/species_cache --id gacu \
    --name "Gasterosteus aculeatus" --common "Threespine stickleback"
The annotation file should be in GTF or GFF format and should describe the location of protein-coding (and optionally non-coding) genes in the genome. An annotation can also include an AGP file, which describes how contigs and/or scaffolds are put together into chromosomes. The files can optionally be compressed (gzipped). The user must also provide the species ID to which these annotation files belong. Finally, the user can specify an annotation ID as well. If one is not provided, the annotation will be given the ID "def", for default. Typically, if this is the only genome from this species the user plans to add, then using the default ID is fine, but if the user wants to add two or more genomes from this species, then each annotation should have its own, meaningful ID. (The \ characters just continue the command onto the next line and are not part of the actual command.)
Here is the simplest form of the command:
synolog_cctl.py annotation -p ./path/to/species_cache --id gacu --gtf ./data/g.aculeatus/reference/g.aculeatus.gtf
And here is a more complex form:
synolog_cctl.py annotation -p ./path/to/species_cache --id gacu --ann-id gannv1 \
    --gtf ./data/g.aculeatus/reference/g.aculeatus.gtf.gz \
    --agp ./data/g.aculeatus/reference/g.aculeatus.agp.gz
A set of gene models must be supplied for each species in FASTA format (optionally gzipped); these will be used to execute BLAST. The user can choose to supply protein-coding gene models as amino acid sequences (flag --aa); this is the most common way to run BLAST between species and will cause BLASTp to be used. The user can alternatively supply gene models as nucleotide sequences (flag --nuc), which will result in BLASTn being used. Finally, the user can also supply a set of non-coding genes (in nucleotide format) using the --non-coding flag. Non-coding genes will be stored separately and BLASTn will be executed with a slightly modified set of matching parameters (to optimize for the difference in genetic distance between non-coding genes).
When this command is executed, the FASTA file will be stored in the cache with a standardized name and compressed, and a BLAST database will be generated from the FASTA file.
Here is the simplest form of the command supplying a default ID and amino acid genic sequences:
synolog_cctl.py genes -p ./path/to/species_cache --id gacu --aa ./data/g.aculeatus/reference/g.aculeatus.aa
Here is a form of the command supplying a specific ID and amino acid genic sequences:
synolog_cctl.py genes -p ./path/to/species_cache --id gacu --genes-id gacaav1 \
    --aa ./data/g.aculeatus/reference/g.aculeatus.aa
Finally, here is a form of the command supplying a specific ID and non-coding gene sequences:
synolog_cctl.py genes -p ./path/to/species_cache --id gacu --genes-id gacncv1 \
    --non-coding ./data/g.aculeatus/reference/g.aculeatus.noncoding.fa
The final step is to integrate, or link together, a set of gene models and an underlying genome annotation. In the default case, there is only one of each, so the integration step does not add much. However, in a more complex case, there may be several versions of a genome assembly, with one or more sets of gene models, depending on how and when the underlying assembly was annotated.
Here is the simplest case, integrating a default annotation and a default set of amino acid gene models:
synolog_cctl.py integration -p ./path/to/species_cache --id gacu
And here is a case specifying particular IDs for each annotation and set of gene models. Here we also can supply a description so the user can keep track of different versions of data in the cache.
synolog_cctl.py integration -p ./path/to/species_cache --id gacu --ann-id gannv1 --genes-id gacaav1 \
    --desc "This is the default Ensembl BROAD stickleback assembly"
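When several species need to be added, the four per-species steps above (species, annotation, genes, integration) can be scripted. The following Python sketch only builds the command lines, using the synolog_cctl.py subcommands and flags shown in this section with default annotation and gene model IDs; the helper names and the file paths in the usage example are hypothetical. Printing the commands first makes it easy to review them before running anything.

```python
import shlex

CCTL = "synolog_cctl.py"


def new_cache_command(cache):
    """Command that creates a new, empty species cache."""
    return [CCTL, "new", "-p", cache]


def species_commands(cache, species_id, name, common, gtf, aa, agp=None):
    """Commands that add one species with default annotation/gene IDs."""
    annotation = [CCTL, "annotation", "-p", cache, "--id", species_id, "--gtf", gtf]
    if agp is not None:
        # The AGP file is optional.
        annotation += ["--agp", agp]
    return [
        [CCTL, "species", "-p", cache, "--id", species_id,
         "--name", name, "--common", common],
        annotation,
        [CCTL, "genes", "-p", cache, "--id", species_id, "--aa", aa],
        [CCTL, "integration", "-p", cache, "--id", species_id],
    ]


if __name__ == "__main__":
    cache = "./path/to/species_cache"
    commands = [new_cache_command(cache)]
    commands += species_commands(
        cache, "gacu", "Gasterosteus aculeatus", "Threespine stickleback",
        gtf="./data/g.aculeatus/reference/g.aculeatus.gtf",
        aa="./data/g.aculeatus/reference/g.aculeatus.aa",
    )
    for cmd in commands:
        print(shlex.join(cmd))
```

To execute the commands instead of printing them, each list could be passed to subprocess.run(cmd, check=True).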
Synolog is able to readily generate figures using the synolog_plot.py utility. There are five types of figures that are supported.
In addition to the output files produced in a Synolog analysis, Synolog can export orthogroups in FASTA format for downstream analysis. The utility synolog_fasta.py makes use of the species cache along with the output files generated.
    # all sequences for all orthogroups
    % synolog_fasta.py --path /path/to/synolog/output_files/ \
        --cache /path/to/species_cache/ \
        --out-path /path/to/output_directory/ \
        --all

    # longest sequence for all orthogroups
    % synolog_fasta.py --path /path/to/synolog/output_files/ \
        --cache /path/to/species_cache/ \
        --out-path /path/to/output_directory/ \
        --all \
        --longest

    # longest sequence for all single copy orthogroups
    % synolog_fasta.py --path /path/to/synolog/output_files/ \
        --cache /path/to/species_cache/ \
        --out-path /path/to/output_directory/ \
        --single_copy \
        --longest

    # longest sequence for a specific set of orthogroups
    % synolog_fasta.py --path /path/to/synolog/output_files/ \
        --cache /path/to/species_cache/ \
        --out-path /path/to/output_directory/ \
        --longest \
        --orthogroups orthogroup_ids.txt # single column list of orthogroup IDs