Stacks

ref_map.pl

The ref_map.pl program executes the Stacks pipeline by running each of the Stacks components in turn. It is the simplest way to run Stacks and it handles many of the details, such as sample numbering. The ref_map.pl program expects data to have been aligned to a reference genome, and can accept data from any aligner that can produce SAM or BAM formatted files. The program performs several stages, including:

  1. Running gstacks on the samples specified, assembling loci according to the alignment positions provided for each read, and calling SNPs in each sample.
  2. Running populations to generate population-level summary statistics. The population map you specify (--popmap option) is supplied to populations.

Specifying Samples

The raw data for each sample in the analysis has to be specified to Stacks. All of your samples have to be specified together for a single run of the pipeline. Samples are specified to ref_map.pl by supplying a population map (--popmap option) and the path to the directory containing all samples (--samples option). ref_map.pl will read the contents of the population map file and search for each specified sample in the --samples directory.
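
Before starting a run, it can help to confirm that every sample named in the population map is actually present in the samples directory. The following is a minimal sketch, not part of Stacks: the function name missing_samples is hypothetical, and it assumes each sample is stored as <sample name>.bam directly inside the --samples directory.

```shell
# Hypothetical helper (not part of Stacks): print the name of every sample
# listed in a population map that has no matching file in the samples directory.
# Assumption: each sample is stored as <samples_dir>/<sample_name>.bam.
missing_samples() {
    popmap=$1
    samples_dir=$2
    # The first tab-separated column of the population map is the sample name.
    cut -f 1 "$popmap" | while IFS= read -r sample || [ -n "$sample" ]; do
        # Skip blank lines.
        [ -n "$sample" ] || continue
        [ -e "$samples_dir/$sample.bam" ] || printf '%s\n' "$sample"
    done
}

# Example call, using the paths from the usage examples in this document:
#   missing_samples ./treestudy_popmap ./samples
```

If the function prints nothing, every sample in the map was found under the assumed naming scheme.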

Using a population map

A population map contains assignments of each of your samples to a particular population. See the manual for more information on how they work. The ref_map.pl program does not use the file directly, beyond reading the sample names out of it. It is the populations program that actually uses the population map for statistical calculations, and ref_map.pl will provide the map to the populations program. After the pipeline completes its first execution, you can run the populations program by hand, specifying other population maps as you like.
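
As an illustration of the file layout, the sketch below writes a small population map and summarizes it. It assumes the usual layout of one sample per line, with the sample name and the population name separated by a tab. The sample and population names are made up, and both function names are hypothetical.

```shell
# Write a small, made-up population map to the path given as $1:
# one sample per line, sample name and population separated by a tab.
write_example_popmap() {
    printf '%s\t%s\n' \
        sample_01 north \
        sample_02 north \
        sample_03 south > "$1"
}

# Count how many samples are assigned to each population in the map at $1.
# Prints "<population><TAB><count>", sorted by population name.
samples_per_population() {
    awk -F '\t' 'NF >= 2 { n[$2]++ } END { for (p in n) print p "\t" n[p] }' "$1" | sort
}
```

Checking the per-population counts this way is a quick guard against misspelled population names, which would otherwise show up as extra, very small populations.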

Passing additional arguments to Stacks component programs

The Stacks component programs have many options. It would be impractical to expose them all through the ref_map.pl wrapper program. Instead, you can pass additional options to the internal programs that ref_map.pl executes by using the -X option. To use it, you specify (in quotes) the program the option goes to, followed by a colon (":"), followed by the option itself. For example, -X "populations:--fstats" will pass the --fstats option to the populations program. As another example, -X "populations:-r 0.8" will pass the -r option, with the argument 0.8, to the populations program. Each option should be specified separately with its own -X. See below for examples.
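
The rule of one -X per option can be sketched as follows. This is only an illustration: the function name show_ref_map_cmd is hypothetical, the paths are placeholders taken from the usage examples, and the command is printed one argument per line rather than executed, so that the quoting is easy to inspect.

```shell
# Print a ref_map.pl command line, one argument per line, with each extra
# option wrapped in its own -X. Nothing is executed.
# Usage: show_ref_map_cmd "prog:option" ["prog:option" ...]
show_ref_map_cmd() {
    printf '%s\n' ref_map.pl -o ./stacks --popmap ./treestudy_popmap --samples ./samples
    for opt in "$@"; do
        # Quoting "$opt" keeps "populations:-r 0.8" together as one argument.
        printf '%s\n' -X "$opt"
    done
}

show_ref_map_cmd "populations:--fstats" "populations:-r 0.8"
```

Note that "populations:-r 0.8" appears on a single output line: the option and its argument travel together inside one quoted string.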

Some notes on alignments

The ref_map.pl program takes aligned reads as input. It does not provide the assembly parameters that denovo_map.pl does, because the job of assembling the loci is taken over by your aligner (e.g. BWA or GSnap). You must take care that your alignments are good: discard reads with multiple alignments, do not allow too many gaps in your sequences (otherwise loci with repeat elements can easily be collapsed during alignment), and do not allow soft-masking in the alignments. Soft-masking occurs when an aligner cannot make a full alignment and instead masks the portion of the read that could not be aligned (pretending that this part of the read does not exist). If left unchecked, these factors can cause spurious SNP calls and problems in the downstream analysis.
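
One way to apply these checks is to filter the alignments before handing them to ref_map.pl. The sketch below is not part of Stacks and rests on several assumptions: it works on SAM text, it treats the soft-masking described above as soft clipping (an S operation in the CIGAR string), and it uses a mapping-quality threshold as a stand-in for discarding reads with multiple alignments. How MAPQ relates to multiple alignments depends on the aligner, and the default of 20 is an arbitrary choice. The function name filter_sam is hypothetical.

```shell
# Read SAM text on stdin and write SAM text on stdout, keeping header lines
# and only those alignments that are mapped, meet a minimum mapping quality,
# and contain no soft-clipped bases.
# Usage: filter_sam [min_mapq]     (min_mapq defaults to 20, an assumption)
filter_sam() {
    awk -F '\t' -v min_mapq="${1:-20}" '
        /^@/                 { print; next }  # header lines pass through
        int($2 / 4) % 2 == 1 { next }         # FLAG bit 0x4 set: read is unmapped
        $5 + 0 < min_mapq    { next }         # column 5 is MAPQ
        $6 ~ /S/             { next }         # column 6 is CIGAR; S = soft clip
                             { print }
    '
}
```

With BAM files, a filter like this could sit between `samtools view -h` (BAM to SAM, keeping the header) and `samtools view -b` (SAM back to BAM). Many aligners also offer their own options for the same purpose, which may be preferable.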

When not to use ref_map.pl

There are a few reasons to run the pipeline manually instead of using the ref_map.pl wrapper.

  1. If you have a very large number of samples, you may not want to put them all in the catalog. In a population analysis, all of the samples specified to ref_map.pl will be loaded into the catalog. In a de novo analysis, each sample added to the catalog will also add a small fraction of error to the catalog. With a very large number of samples, the error can overwhelm true loci in the population. In this case you may only want to load a subset of each population in your analysis.
  2. Again, if you have a lot of samples, you may want to speed your analysis by splitting up your samples and running them on a number of nodes in a cluster. In this case, you would have to queue up pstacks to run on different nodes with different samples. This can't be done using ref_map.pl.

Program Options

ref_map.pl --samples dir --popmap path [-s spacer] [--paired] -o dir [-X prog:"opts" ...]

Input/Output files:

General options:

SNP model options:

  • --var-alpha — significance level at which to call variant sites (for gstacks; default: 0.05).
  • --gt-alpha — significance level at which to call genotypes (for gstacks; default: 0.05).

Miscellaneous:

  • --time-components — (for benchmarking)

Example Usage

  1. In this example, I will supply a population map to ref_map.pl containing the names of the samples I want to analyze, and I will tell ref_map.pl the directory the samples are located in.

    % ref_map.pl -T 15 -o ./stacks --popmap ./treestudy_popmap --samples ./samples

  2. In this example, I will instruct ref_map.pl to tell populations to enable F statistics.

    % ref_map.pl -o ./stacks --popmap ./treestudy_popmap --samples ./samples -S -X "populations:--fstats"
