rxstacks
The rxstacks program makes corrections to genotype and haplotype calls in individual samples based on data accumulated from a population-wide examination.
The rxstacks program applies four types of corrections to a Stacks analysis:
- SNP model corrections: rxstacks will reevaluate model calls that were not significant in each individual. It uses the catalog to examine the alleles present at a particular nucleotide position across the population and to filter out putative sequencing errors from the individual
under examination. It will then recall the nucleotide position as a heterozygote or homozygote using the bounded SNP model (which bounds the known sequencing error rate).
- Log likelihood filtering: starting in Stacks 1.20, when a stack is assembled, a log likelihood is assigned to that stack by either the
ustacks or pstacks program. A log likelihood limit can be set in the rxstacks program
to remove loci whose log likelihood falls below the threshold. Loci with poor log likelihoods tend to have low coverage, high sequencing error, or both, and do not
add to a Stacks analysis. Good log likelihoods are close to zero while bad log likelihoods are highly negative, so the program filters out loci with log likelihoods below the
specified value.
- Confounded locus filter: for each locus in the catalog, we expect each individual in the analysis to have a single, matching locus. When multiple loci from an
individual match a single catalog locus, we refer to that locus as confounded and remove it from the analysis. These loci are typically composed of
repetitive sequence. Given a catalog locus in which a certain percentage of individuals are confounded, the rxstacks program will remove
those loci from the analysis.
- Haplotype pruning: the rxstacks program will prune excess haplotypes from individual loci according to the prevalence of those haplotypes in
the population for a given catalog locus.
The rxstacks program must be run after the core pipeline: ustacks/pstacks,
cstacks, and sstacks. Once rxstacks has been run, cstacks and
sstacks should be run again to rebuild and match against the catalog (minus all the filtered loci and haplotypes, and with the new SNP calls).
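For example, if rxstacks writes its corrected files to a corr directory, the rebuild might look like the following sketch (the cstacks and sstacks flags follow the Stacks 1.x manual; the sample names here are hypothetical, and each sample from the original run should be listed):
% cstacks -b 1 -o ./corr -s ./corr/sample_01 -s ./corr/sample_02 -p 8
% sstacks -b 1 -c ./corr/batch_1 -s ./corr/sample_01 -o ./corr -p 8
% sstacks -b 1 -c ./corr/batch_1 -s ./corr/sample_02 -o ./corr -p 8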
If an output directory is specified, new Stacks files will be written to that directory. In this way, the uncorrected and corrected data
sets can both be loaded into the web interface or vStacks and compared. If only an input directory is specified, the original files will
be overwritten.
If you specify the --verbose option, two additional log files will be written to your output directory,
batch_X.rxstacks.haplotypes.log and batch_X.rxstacks.snps.log, which detail each pruned haplotype and each modified SNP call.
Filtering by locus log likelihood
Starting in Stacks 1.20 the ustacks and pstacks programs will assign each assembled locus a
log likelihood based on the summation of all nucleotide model calls made in the locus. We can use these values to remove loci from our analysis that
consistently show poor log likelihoods. Loci that have low depth of coverage or high sequencing error will exhibit poor log likelihood scores that
are highly negative.
We can tell the rxstacks program to filter catalog loci from the analysis that have a mean log likelihood below our
threshold (this is done not by physically removing the catalog locus, but by blacklisting the constituent matching loci so that when the catalog is
rebuilt after running rxstacks the catalog locus will not reappear).
Determining the distribution of catalog loci log likelihoods
We can tell the rxstacks program to print the set of mean log likelihoods for each catalog locus by specifying the
--lnl_dist parameter which will create the batch_X.rxstacks_lnls.tsv file in the output directory.
We can bucket the log likelihood values to plot them more conveniently using the shell (all on one line):
% cat batch_1.rxstacks_lnls.tsv |
grep -v "^#" |
awk '{bucket=(int($2)); lnls[bucket] += 1} END { for (bucket in lnls) print bucket, "\t", lnls[bucket]}' |
sort -n > lnls.tsv
We can then plot this file, for example with a short Gnuplot script.
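A minimal sketch of such a script, driven from the shell (the output file name and axis labels are our own choices, not taken from the manual):
% gnuplot <<EOF
set terminal png size 800,600
set output "lnls.png"
set xlabel "Mean log likelihood of catalog locus"
set ylabel "Number of catalog loci"
plot "lnls.tsv" using 1:2 with boxes notitle
EOF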
You can then choose an appropriate threshold that will retain enough loci for your experimental question.
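To gauge how many catalog loci a candidate threshold would retain, you can count directly against the lnl file; this assumes, as the bucketing command above does, that the mean log likelihood is in the second column:
% grep -v "^#" batch_1.rxstacks_lnls.tsv | wc -l
% grep -v "^#" batch_1.rxstacks_lnls.tsv | awk '$2 >= -10.0' | wc -l
The first command counts all catalog loci; the second counts those that would survive a --lnl_lim of -10.0.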
Example Usage
This assumes that the main pipeline (ustacks/pstacks, cstacks, and sstacks) has already been run, that the Stacks output files are in the
stacks directory, and that the corrected output files will be placed in the corr directory:
~/% rxstacks -b 1 -P ./stacks/ -o ./corr/ --conf_lim 0.25 --prune_haplo --model_type bounded --bound_high 0.1 --lnl_lim -10.0 -t 8
Any catalog locus in which 25% or more individuals have a confounded match to the catalog will be removed
(--conf_lim 0.25), excess haplotypes will be pruned (--prune_haplo), SNPs will be recalled once
sequencing errors are removed using the bounded SNP model (--model_type bounded --bound_high 0.1), and catalog loci
with an average log likelihood less than -10.0 will be removed (--lnl_lim -10.0; log likelihoods closer to zero are better).
Other Pipeline Programs
The remaining pipeline programs fall into four groups: raw-read processing, the core pipeline, execution control, and utility programs.
The process_radtags program examines raw reads from an Illumina sequencing run. First, it checks that
the barcode and the restriction enzyme cut site are intact (correcting minor errors).
Second, it slides a window down the length of the read and checks the average quality score within the window.
If the score drops below a 90% probability of being correct, the read is discarded. Reads that pass the quality
thresholds are demultiplexed if barcodes are supplied.
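A typical invocation might look like the following sketch (the paths, barcode file, and enzyme are hypothetical; -r rescues barcodes and RAD-Tags, -c cleans the data, and -q applies the quality filter):
% process_radtags -p ./raw/ -b ./barcodes.tsv -o ./samples/ -e sbfI -r -c -q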
The process_shortreads program performs the same task as process_radtags
for fast cleaning of randomly sheared genomic or transcriptomic data. This program will trim reads that are below the
quality threshold instead of discarding them, making it useful for genomic assembly or other analyses.
The clone_filter program will take a set of reads and reduce them by removing PCR
clones. This is done by matching raw sequence or by referencing a set of random oligos that have been included in the sequence.
The kmer_filter program allows paired or single-end reads to be filtered according to the
number of rare or abundant k-mers they contain. It is useful for RAD datasets as well as randomly sheared genomic or
transcriptomic data.
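For example, discarding reads that contain rare k-mers, which are likely to carry sequencing errors, might look like this sketch (the paths are hypothetical; the --rare switch is taken from the kmer_filter documentation):
% kmer_filter -f ./samples/sample_01.fq -o ./filtered/ --rare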
The ustacks program will take as input a set of short-read sequences and align them into
exactly-matching stacks. Comparing the stacks, it will form a set of loci and detect SNPs at each locus using a
maximum likelihood framework.
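A sketch of a typical invocation (the paths and sample ID are hypothetical; -m sets the minimum stack depth, -M the number of mismatches allowed between stacks, and -p the number of threads):
% ustacks -t gzfastq -f ./samples/sample_01.fq.gz -o ./stacks -i 1 -m 3 -M 2 -p 8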
The cstacks program builds a catalog from any set of samples processed
by the ustacks program. It creates a set of consensus loci, merging alleles together. In the case
of a genetic cross, a catalog would be constructed from the parents of the cross to create a set of
all possible alleles expected in the progeny of the cross.
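For a genetic cross, building the catalog from the two parents might look like this sketch (the sample prefixes are hypothetical; -n sets the number of mismatches allowed between sample loci):
% cstacks -b 1 -o ./stacks -s ./stacks/father -s ./stacks/mother -n 1 -p 8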
Sets of stacks constructed by the ustacks
program can be searched against a catalog produced by the cstacks program. In the case of a
genetic map, stacks from the progeny would be matched against the catalog to determine which progeny
contain which parental alleles.
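Matching a progeny sample against that catalog might look like this sketch (the paths are hypothetical):
% sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/progeny_01 -o ./stacks -p 8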
The tsv2bam program will transpose data so that it is oriented by locus, instead of by sample.
In addition, if paired-end reads are available, the program will pull in the set of paired reads that are associated with each
single-end locus that was assembled de novo.
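A sketch of a typical invocation (the flags follow the Stacks 2 manual; the population map and the -R directory holding the paired-end reads are hypothetical):
% tsv2bam -P ./stacks -M ./popmap.tsv -R ./samples -t 8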
The gstacks program, for de novo analyses, will pull in paired-end
reads, if available, assemble the paired-end contig and merge it with the single-end locus, align reads
to the locus, and call SNPs. For reference-aligned analyses, this program will build loci from the
single- and paired-end reads that have been aligned and sorted.
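Sketches of the two modes (the flags follow the Stacks 2 manual; the paths and population map are hypothetical):
% gstacks -P ./stacks -M ./popmap.tsv -t 8
% gstacks -I ./aligned -M ./popmap.tsv -O ./stacks -t 8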
The populations program will compute population-level summary statistics such
as π, FIS, and FST. It can output site-level SNP calls in VCF format and
can also output SNPs for analysis in STRUCTURE or in Phylip format for phylogenetic analysis.
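A sketch of a typical invocation (the paths and population map are hypothetical; the output switches shown are standard populations options):
% populations -b 1 -P ./stacks -M ./popmap.tsv --vcf --structure --phylip -t 8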
The denovo_map.pl program executes each of the Stacks components to create a genetic
linkage map, or to identify the alleles in a set of populations.
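A sketch of a typical invocation (the flags follow the Stacks 2 manual and may differ in other versions; the paths and population map are hypothetical):
% denovo_map.pl --samples ./samples --popmap ./popmap.tsv -o ./stacks -M 2 -n 2 -T 8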
The ref_map.pl program takes reference-aligned input data and executes each of the Stacks
components, using the reference alignment to form stacks and identify alleles. It can be used to build a genetic map
or to analyze a set of populations.
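Similarly, for reference-aligned data (the flags follow the Stacks 2 manual; the directory of sorted, aligned samples and the population map are hypothetical):
% ref_map.pl --samples ./aligned --popmap ./popmap.tsv -o ./stacks -T 8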
The load_radtags.pl program takes a set of data produced by either the denovo_map.pl or
ref_map.pl programs (or produced by hand) and loads it into the database. This allows the data to be generated on
one computer but loaded from another, or a database to be regenerated without re-executing the pipeline.
The stacks-dist-extract script will pull data distributions from the log and distribs
files produced by the Stacks component programs.
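One documented use is pulling per-sample coverage out of the gstacks log:
% stacks-dist-extract gstacks.log.distribs effective_coverages_per_sample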
The stacks-integrate-alignments script will take loci produced by the de novo pipeline,
align them against a reference genome, and inject the alignment coordinates back into the de novo-produced data.
The stacks-private-alleles script will extract private allele data from the populations program
outputs, producing useful summaries and preparing the data for plotting.