rxstacks
The rxstacks program makes corrections to genotype and haplotype calls in individual samples based on data accumulated from a population-wide examination.
The rxstacks program applies four types of corrections to a Stacks analysis:
- SNP model corrections: rxstacks will reevaluate model calls that were not significant in each individual. It uses the catalog to examine the alleles present at a particular nucleotide position across the population and to filter out putative sequencing errors from the individual
under examination. It will then recall the nucleotide position as a heterozygote or homozygote using the bounded SNP model (which bounds the known sequencing error rate).
- Log likelihood filtering: starting in Stacks 1.20, when a stack is assembled, a log likelihood is assigned to that stack by either the
ustacks or pstacks program. A log likelihood limit can be set in the rxstacks program
to remove loci whose log likelihood falls below the threshold. Loci with poor log likelihoods tend to have low coverage, high sequencing error, or both, and do not
add to a Stacks analysis. Good log likelihoods are close to zero while bad log likelihoods are highly negative, so the program filters out loci with log likelihoods below the
specified value.
- Confounded locus filter: for each locus in the catalog, we expect each individual in the analysis to have a single, matching locus. When multiple loci from an
individual match a single catalog locus, we refer to that locus as confounded and remove it from the analysis. These loci are typically composed of
repetitive sequence. Given a catalog locus in which a certain percentage of individuals are confounded, the rxstacks program will remove
those loci from the analysis.
- Haplotype pruning: the rxstacks program will prune excess haplotypes from individual loci according to the prevalence of those haplotypes in
the population for a given catalog locus.
The rxstacks program must be run after the core pipeline: ustacks/pstacks,
cstacks, and sstacks. Once rxstacks has been run, cstacks and
sstacks should be run again to rebuild and match against the catalog (minus all the filtered loci and haplotypes, and with the new SNP calls).
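For example, if rxstacks writes its corrected files to a corr directory, the rebuild might look like the following sketch (the cstacks and sstacks flags follow the Stacks 1.x manual; the sample names here are hypothetical, and each sample from the original run should be listed):
% cstacks -b 1 -o ./corr -s ./corr/sample_01 -s ./corr/sample_02 -p 8
% sstacks -b 1 -c ./corr/batch_1 -s ./corr/sample_01 -o ./corr -p 8
% sstacks -b 1 -c ./corr/batch_1 -s ./corr/sample_02 -o ./corr -p 8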
If an output directory is specified, new Stacks files will be written to that directory. In this way, the uncorrected and corrected data
sets can both be loaded into the web interface or vStacks and compared. If only an input directory is specified, the original files will
be overwritten.
If you specify the --verbose option, two additional log files will be written to your output directory,
batch_X.rxstacks.haplotypes.log and batch_X.rxstacks.snps.log, which detail each pruned haplotype and each modified SNP call.
Filtering by locus log likelihood
Starting in Stacks 1.20 the ustacks and pstacks programs will assign each assembled locus a
log likelihood based on the summation of all nucleotide model calls made in the locus. We can use these values to remove loci from our analysis that
consistently show poor log likelihoods. Loci that have low depth of coverage or high sequencing error will exhibit poor log likelihood scores that
are highly negative.
We can tell the rxstacks program to filter catalog loci from the analysis that have a mean log likelihood below our
threshold (this is done not by physically removing the catalog locus, but by blacklisting the constituent matching loci so that when the catalog is
rebuilt after running rxstacks the catalog locus will not reappear).
Determining the distribution of catalog loci log likelihoods
We can tell the rxstacks program to print the set of mean log likelihoods for each catalog locus by specifying the
--lnl_dist parameter which will create the batch_X.rxstacks_lnls.tsv file in the output directory.
We can bucket the log likelihood values to plot them more conveniently using the shell (all on one line):
% cat batch_1.rxstacks_lnls.tsv |
grep -v "^#" |
awk '{bucket=(int($2)); lnls[bucket] += 1} END { for (bucket in lnls) print bucket, "\t", lnls[bucket]}' |
sort -n > lnls.tsv
We can then plot this file, for example with a short Gnuplot script.
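A minimal sketch of such a script, driven from the shell (the output file name and axis labels are our own choices, not taken from the manual):
% gnuplot <<EOF
set terminal png size 800,600
set output "lnls.png"
set xlabel "Mean log likelihood of catalog locus"
set ylabel "Number of catalog loci"
plot "lnls.tsv" using 1:2 with boxes notitle
EOF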
You can then choose an appropriate threshold that will retain enough loci for your experimental question.
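To gauge how many catalog loci a candidate threshold would retain, you can count directly against the lnl file; this assumes, as the bucketing command above does, that the mean log likelihood is in the second column:
% grep -v "^#" batch_1.rxstacks_lnls.tsv | wc -l
% grep -v "^#" batch_1.rxstacks_lnls.tsv | awk '$2 >= -10.0' | wc -l
The first command counts all catalog loci; the second counts those that would survive a --lnl_lim of -10.0.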
Example Usage
This assumes that the main pipeline (ustacks/pstacks, cstacks, and sstacks) has already been run, that the Stacks output files are in the
stacks directory, and that the corrected output files will be placed in the corr directory:
~/% rxstacks -b 1 -P ./stacks/ -o ./corr/ --conf_lim 0.25 --prune_haplo --model_type bounded --bound_high 0.1 --lnl_lim -10.0 -t 8
Any catalog locus in which 25% or more individuals have a confounded match to the catalog will be removed
(--conf_lim 0.25), excess haplotypes will be pruned (--prune_haplo), SNPs will be recalled once
sequencing errors are removed using the bounded SNP model (--model_type bounded --bound_high 0.1), and catalog loci
with an average log likelihood less than -10.0 will be removed (--lnl_lim -10.0; log likelihoods closer to zero are better).
Other Pipeline Programs
The remaining pipeline programs fall into four groups: raw-read processing, the core pipeline, execution control, and utility programs.
The process_radtags program examines raw reads from an Illumina sequencing run. First, it checks that
the barcode and the restriction enzyme cut site are intact (correcting minor errors).
Second, it slides a window down the length of the read and checks the average quality score within the window.
If the score drops below a 90% probability of being correct, the read is discarded. Reads that pass the quality
thresholds are demultiplexed if barcodes are supplied.
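A typical invocation might look like the following sketch (the paths, barcode file, and enzyme are hypothetical; -r rescues barcodes and RAD-Tags, -c cleans the data, and -q applies the quality filter):
% process_radtags -p ./raw/ -b ./barcodes.tsv -o ./samples/ -e sbfI -r -c -q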
The process_shortreads program performs the same task as process_radtags
for fast cleaning of randomly sheared genomic or transcriptomic data. This program will trim reads that are below the
quality threshold instead of discarding them, making it useful for genomic assembly or other analyses.
The clone_filter program will take a set of reads and reduce them by removing PCR
clones. This is done by matching raw sequence or by referencing a set of random oligos that have been included in the sequence.
The kmer_filter program allows paired or single-end reads to be filtered according to the
number of rare or abundant k-mers they contain. It is useful for RAD datasets as well as randomly sheared genomic or
transcriptomic data.
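For example, discarding reads that contain rare k-mers, which are likely to carry sequencing errors, might look like this sketch (the paths are hypothetical; the --rare switch is taken from the kmer_filter documentation):
% kmer_filter -f ./samples/sample_01.fq -o ./filtered/ --rare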
The ustacks program will take as input a set of short-read sequences and align them into
exactly-matching stacks. Comparing the stacks, it will form a set of loci and detect SNPs at each locus using a
maximum likelihood framework.
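A sketch of a typical invocation (the paths and sample ID are hypothetical; -m sets the minimum stack depth, -M the number of mismatches allowed between stacks, and -p the number of threads):
% ustacks -t gzfastq -f ./samples/sample_01.fq.gz -o ./stacks -i 1 -m 3 -M 2 -p 8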
The cstacks program builds a catalog from any set of samples processed
by the ustacks program. It creates a set of consensus loci, merging alleles together. In the case
of a genetic cross, a catalog would be constructed from the parents of the cross to create a set of
all possible alleles expected in the progeny of the cross.
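For a genetic cross, building the catalog from the two parents might look like this sketch (the sample prefixes are hypothetical; -n sets the number of mismatches allowed between sample loci):
% cstacks -b 1 -o ./stacks -s ./stacks/father -s ./stacks/mother -n 1 -p 8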
Sets of stacks constructed by the ustacks
program can be searched against a catalog produced by the cstacks program. In the case of a
genetic map, stacks from the progeny would be matched against the catalog to determine which progeny
contain which parental alleles.
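Matching a progeny sample against that catalog might look like this sketch (the paths are hypothetical):
% sstacks -b 1 -c ./stacks/batch_1 -s ./stacks/progeny_01 -o ./stacks -p 8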
The tsv2bam program will transpose data so that it is oriented by locus, instead of by sample.
In addition, if paired-end reads are available, the program will pull in the set of paired reads that are associated with each
single-end locus that was assembled de novo.
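A sketch of a typical invocation (the flags follow the Stacks 2 manual; the population map and the -R directory holding the paired-end reads are hypothetical):
% tsv2bam -P ./stacks -M ./popmap.tsv -R ./samples -t 8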
The gstacks program, for de novo analyses, will pull in paired-end
reads, if available, assemble the paired-end contig and merge it with the single-end locus, align reads
to the locus, and call SNPs. For reference-aligned analyses, this program will build loci from the
single- and paired-end reads that have been aligned and sorted.
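Sketches of the two modes (the flags follow the Stacks 2 manual; the paths and population map are hypothetical):
% gstacks -P ./stacks -M ./popmap.tsv -t 8
% gstacks -I ./aligned -M ./popmap.tsv -O ./stacks -t 8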
The populations program will compute population-level summary statistics such
as π, FIS, and FST. It can output site-level SNP calls in VCF format and
can also output SNPs for analysis in STRUCTURE or in Phylip format for phylogenetic analysis.
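A sketch of a typical invocation (the paths and population map are hypothetical; the output switches shown are standard populations options):
% populations -b 1 -P ./stacks -M ./popmap.tsv --vcf --structure --phylip -t 8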
The denovo_map.pl program executes each of the Stacks components to create a genetic
linkage map, or to identify the alleles in a set of populations.
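A sketch of a typical invocation (the flags follow the Stacks 2 manual and may differ in other versions; the paths and population map are hypothetical):
% denovo_map.pl --samples ./samples --popmap ./popmap.tsv -o ./stacks -M 2 -n 2 -T 8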
The ref_map.pl program takes reference-aligned input data and executes each of the Stacks
components, using the reference alignment to form stacks and identify alleles. It can be used to build a genetic map
or to analyze a set of populations.
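Similarly, for reference-aligned data (the flags follow the Stacks 2 manual; the directory of sorted, aligned samples and the population map are hypothetical):
% ref_map.pl --samples ./aligned --popmap ./popmap.tsv -o ./stacks -T 8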
The load_radtags.pl program takes a set of data produced by either the denovo_map.pl or
ref_map.pl programs (or produced by hand) and loads it into the database. This allows the data to be generated on
one computer but loaded from another, or a database to be regenerated without re-executing the pipeline.
The stacks-dist-extract script will pull data distributions from the log and distribs
files produced by the Stacks component programs.
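One documented use is pulling per-sample coverage out of the gstacks log:
% stacks-dist-extract gstacks.log.distribs effective_coverages_per_sample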
The stacks-integrate-alignments script will take loci produced by the de novo pipeline,
align them against a reference genome, and inject the alignment coordinates back into the de novo-produced data.
The stacks-private-alleles script will extract private allele data from the populations program
outputs, producing useful summaries and preparing the data for plotting.