genotypes
This program exports a Stacks data set either as a set of observed haplotypes at each locus in the population, or with
the haplotypes encoded into genotypes. The -r option allows only loci that exist in a certain
number of population individuals to be exported. In a mapping context, raising or lowering this limit is an effective way
to control the quality level of markers exported as genuine markers will be found in a large number of progeny. If
exporting a set of observed haplotypes in a population, the -m option can be used to restict
exported loci to those that have a minimum depth of reads.
By default, when executing the pipeline (either denovo_map.pl or ref_map.pl)
the genotypes program will be executed last and will identify mappable markers in the population and
export both a set of observed haplotypes and a set of generic genotypes with -r 1. If SQL interaction
is enabled, these files will be uploaded to the database where Stacks will store the genotyping information in a neutral way.
From the web interface, additional, manual corrections can be made, as well as marker annotations and all of this data can
be exported directly from the web, after specifying a particular map type (if exporting data from a genetic cross).
Making Corrections
If enabled with the -c option, the genotypes program
will make automated corrections to the data. Since loci are matched up in the population, the script
can correct false-negative heterozygote alleles since it knows the existence of alleles at a particular
locus in the other individuals. For example, the program will identify loci with SNPs that didn’t have high
enough coverage to be identified by the SNP caller. It will also check that homozygous tags have a minimum
depth of coverage, since a low-coverage polymorphic locus may appear homozygous simply because the other allele
wasn’t sequenced.
Correction Thresholds
The thresholds for automatic corrections can be modified by changing the default values
for the min_hom_seqs, min_het_seqs, and
max_het_seqs parameters to genotypes. min_hom_seqs
is the minimum number of reads required to consider a stack homozygous (default of 5). The
min_het_seqs and max_het_seqs variables represent fractions. If
the ratio of the depth of the the smaller allele to the bigger allele is greater than
max_het_seqs (default of 1/10) a stack is called a het. If the ratio is less than
min_het_seqs (default of 1/20) a stack is called homozygous. If the ratio is in between
the two values it is is unknown and a genotype will not be assigned.
Automated corrections made by the program are shown in the output file in capital letters.
Making genotypes appear in the web interface
If the -s option is specified, a second file will be output containing the
genotypes in SQL format — which can be imported back in to the database (into the
catalog_genotypes table). These genotypes can then be seen in the web interface
and additional, manual corrections can be made through the web. The manual corrections can then be included
in the output by exporting the results directly from the web interface.
Example Usage
Exporting a set of observed haplotypes, with a minimum stack depth of 5 reads:
~/% genotypes -P ./stacks/ -b 1 -m 5 -r 3
Exporting a set of generic, map-agnostic genotypes, requiring a marker to be present in at least three progeny:
~/% genotypes -P ./stacks/ -b 1 -t gen -c -r 3 -s
Exporting a set of map-specific genotypes:
~/% genotypes -P ./stacks/ -b 1 -t BC1 -c -o joinmap -r 3 -s
Other Pipeline Programs
Raw Reads
|
Core
|
Execution control
|
Utilities
|
The process_radtags program examines raw reads from an Illumina sequencing run and
first, checks that the barcode and the restriction enzyme cutsite are intact (correcting minor errors).
Second, it slides a window down the length of the read and checks the average quality score within the window.
If the score drops below 90% probability of being correct, the read is discarded. Reads that pass quality
thresholds are demultiplexed if barcodes are supplied.
The process_shortreads program performs the same task as process_radtags
for fast cleaning of randomly sheared genomic or transcriptomic data. This program will trim reads that are below the
quality threshold instead of discarding them, making it useful for genomic assembly or other analyses.
The clone_filter program will take a set of reads and reduce them according to PCR
clones. This is done by matching raw sequence or by referencing a set of random oligos that have been included in the sequence.
The kmer_filter program allows paired or single-end reads to be filtered according to the
number or rare or abundant kmers they contain. Useful for both RAD datasets as well as randomly sheared genomic or
transcriptomic data.
The ustacks program will take as input a set of short-read sequences and align them into
exactly-matching stacks. Comparing the stacks it will form a set of loci and detect SNPs at each locus using a
maximum likelihood framework.
The pstacks program will extract stacks that have been aligned to a reference genome by a
program such as Bowtie or GSnap and identify SNPs at each locus using a maximum likelihood framework.
A catalog can be built from any set of samples processed
by the ustacks program. It will create a set of consensus loci, merging alleles together. In the case
of a genetic cross, a catalog would be constructed from the parents of the cross to create a set of
all possible alleles expected in the progeny of the cross.
Sets of stacks constructed by the ustacks
or pstacks programs can be searched against a catalog produced by the cstacks program. In the case of a
genetic map, stacks from the progeny would be matched against the catalog to determine which progeny
contain which parental alleles.
This populations program will compute population-level summary statistics such
as π, FIS, and FST. It can output site level SNP calls in VCF format and
can also output SNPs for analysis in STRUCTURE or in Phylip format for phylogenetics analysis.
The genotypes program exports Stacks data as a set of observed haplotypes, or with
the haplotypes encoded into genotypes, in either a generic format or for a particular linkage mapper such as
JoinMap, OneMap, or R/QTL. It also provides methods for making automated corrections to certain types of loci.
The rxstacks program makes corrections to individual genotypes and haplotypes based
on data from a population of samples.
The denovo_map.pl program executes each of the Stacks components to create a genetic
linkage map, or to identify the alleles in a set of populations. It also handles uploading data to the database.
The ref_map.pl program takes reference-aligned input data and executes each of the Stacks
components, using the reference alignment to form stacks, and identifies alleles. It can be used in a genetic map
of a set of populations. It also handles uploading data to the database.
The load_radtags.pl program takes a set of data produced by either the denovo_map.pl or
ref_map.pl progams (or produced by hand) and loads it into the database. This allows the data to be generated on
one computer, but loaded from another. Or, for a database to be regenerated without re-executing the pipeline.
The index_radtags.pl program indexes the database to speed execution in the web interface
and enable web-based filtering after all of the data has been loaded. It will be run by
denovo_map.pl or ref_map.pl at the end of execution.
The export_sql.pl program provides for the export of all of the Stacks data in a compact form.
For each locus, the script reports the consensus sequence, the number of parents and progeny that Stacks could find the
locus in, the number of SNPs found at that locus and a listing of the individual SNPs and the observed alleles, followed
by the particular allele observed in each individual. The script allows you to specify a number of filters for the data,
the same filters as available through the web interface.
The sort_read_pairs.pl program collates paired-end sequences for each catalog locus, creating
a FASTA file of paired-reads for each catalog locus. These FASTA files can then be assembled to make paired-end contigs.
The exec_velvet.pl program will execute Velvet on the collated set of reads generated by
sort_read_pairs.pl.