Stacks: denovo

denovo_map.pl

The denovo_map.pl program will execute the Stacks pipeline by running each of the Stacks components individually. It is the simplest way to run Stacks and it handles many of the details. The program performs several stages, including:

Running ustacks on each of the samples specified, building loci de novo in each sample.
Executing cstacks to create a catalog of all loci across the population (or from just the parents if processing a genetic map). Loci are matched up across samples according to sequence similarity.
Next, sstacks will be executed to match each sample against the catalog. In the case of a genetic map, the parents and progeny are matched against the catalog.
The tsv2bam program will be executed in order to transpose the data from being organized by sample, to being organized by RAD locus. This allows the downstream pipeline components to look at all data from across the meta-population for each locus. If paired-end reads are available, they will be fetched by tsv2bam and stored with the single-end reads for later use.
The gstacks program is executed next. If paired-end reads are available, a contig will be assembled from them and overlapped with the single-end locus. SNPs will be called, and for each locus, the SNPs will be phased into haplotypes.
The populations program will be run to generate population-level summary statistics and export data in a variety of formats. Computation is now complete.

Specifying Samples

The raw data for each sample in the analysis has to be specified to Stacks. The denovo_map.pl program expects (but does not require) that your raw sequencing data were demultiplexed and cleaned by the process_radtags program.

All of your samples have to be sepcified together for a single run of the pipeline. This is done by specifying your list of samples to denovo_map.pl by using a population map (--popmap) as well as specifying the path to the directory containing all samples using the --samples option. denovo_map.pl will read the contents of the population map file and search for each specified sample in the --samples directory.

Using a population map

A population map contains assignments of each of your samples to a particular population. See the manual for more information on how they work. The denovo_map.pl program will not directly use the file except to read the sample names for processing. It is the populations program that acutally uses the population map for statistical calculations, and denovo_map.pl will provide the map to the populations program. You can run the populations program by hand, specifying other population maps as you like, after the pipeline completes its first execution.

Passing additional arguments to Stacks component programs

The Stacks component programs contain a lot of possible options that can be invoked. It would be impractical to expose them all througth the denovo_map.pl wrapper program. Instead, you can pass additional options to internal programs that denovo_map.pl will execute using the -X. To use this option, you specify (in quotes) the program the option goes to, followed by a colon (":"), followed by the option itself. For example, -X "populations:--fstats" will pass the --fstats option to the populations program. Each option should be specified separately with -X. See below for examples.

When not to use denovo_map.pl

There are a few reasons to run the pipeline manually instead of using the denovo_map.pl wrapper.

If you have a very large number of samples, you may not want to put them all in the catalog. In a population analysis, all of the samples specified to denovo_map.pl will be loaded into the catalog. In a de novo analysis, each sample added to the catalog will also add a small fraction of error to the catalog. With a very large number of samples, the error can overwhelm true loci in the population. In this case you may only want to load a subset of each population in your analysis.
Again, if you have a lot of samples, you may want to speed your analysis by splitting up your samples and running them on a number of nodes in a cluster. In this case, you would have to queue up ustacks to run on different nodes with different samples. This can't be done using denovo_map.pl.

Post-processing with populations

Stacks is designed so that you run the core pipeline once (or several times to optimize parameters, then complete the final run), but once it is complete, the core information (assembled loci, genotyped and phased into haplotypes across the meta-population) is complete and does not need to be further modified. It is common to then run the populations program multiple times. populations will read the core pipeline outputs and can then be used to filter the data in many ways, to export the data in different formats, or to apply different population maps so the population genetics statistics are calculated according to those different maps (e.g. by geography, phenotype, or environmnetal vairable).

Program Options

denovo_map.pl --samples dir --popmap path -o dir [--paired [--rm-pcr-duplicates]] (assembly options) (filtering options) [-X prog:"opts" ...]

Input/Output files:

--samples [path] — specify a path to the directory of samples (samples will be read from population map).
--popmap [path] — path to a population map file (format is "[name] TAB [pop]", one sample per line).
-o [path] — path to write pipeline output files.

General options:

-X — additional options for specific pipeline components, e.g. -X "populations: --min-maf 0.05".
-T, --threads — the number of threads/CPUs to use (default: 1).
--dry-run — dry run. Do not actually execute anything, just print the commands that would be executed.
--resume — resume executing the pipeline from a previous run.

Stack assembly options:

-M — number of mismatches allowed between stacks within individuals (for ustacks).
-n — number of mismatches allowed between stacks between individuals (for cstacks; default: set to ustacks -M).

SNP model options:

--var-alpha — significance level at which to call variant sites (for gstacks; default: 0.05).
--gt-alpha — significance level at which to call genotypes (for gstacks; default: 0.05).

Paired-end options:

--paired — after assembling RAD loci, assemble contigs for each locus from paired-end reads.
--rm-pcr-duplicates — remove all but one set of read pairs of the same sample that have the same insert length.

Population filtering options:

-p,--min-populations — minimum number of populations a locus must be present in to process a locus (for populations; default: 1)
-r,--min-samples-per-pop — minimum percentage of individuals in a population required to process a locus for that population (for populations; default: 0)

For large datasets:

--catalog-popmap — path to a second population map file containing a subset of samples for use only in building the catalog.

Miscellaneous:

--time-components — (for benchmarking)

Example Usage

Single-end Reads

In this example, we will process single-end reads. We will supply a population map to denovo_map.pl containing the names of the samples we want to analyze and we will tell denovo_map.pl the directory the samples are located in.
Your samples directory should look similar to this, after processing with process_radtags:

% ls samples/ sample_06.fq.gz sample_07.01.fq.gz sample_07.02.fq.gz sample_09.fq.gz ...

The population map would look like this:

% cat ./treestudy_popmap sample_07.01 redlake sample_07.02 redlake sample_06 blueriver sample_09 blueriver ...

And we supply both of these paths to denovo_map.pl, along with an output directory to store results and some parameter settings for builing loci.

% denovo_map.pl -M 5 -T 8 -o ./stacks --popmap ./treestudy_popmap --samples ./samples
Building on the previous example, we will specifically tell the populations program to output only one SNP per RAD locus.
% denovo_map.pl -M 5 -o ./stacks --popmap ./treestudy_popmap --samples ./samples -X "populations:--write-single-snp"
In this example, perhaps we are optimizing our de novo assembly paramters, following the manual. When doing so, we usually want to examine r80 loci. To filter to those loci we can specify the --min-samples-per-pop flag (which denovo_map.pl will pass onto the populations program). We may run several iterations of denovo_map.pl, with different parameter choices, going to different output directories:
% denovo_map.pl -M 4 -o ./stacks/M4 --popmap ./treestudy_popmap --samples ./samples --min-samples-per-pop 0.80 % denovo_map.pl -M 5 -o ./stacks/M5 --popmap ./treestudy_popmap --samples ./samples --min-samples-per-pop 0.80 % denovo_map.pl -M 6 -o ./stacks/M6 --popmap ./treestudy_popmap --samples ./samples --min-samples-per-pop 0.80 % denovo_map.pl -M 7 -o ./stacks/M7 --popmap ./treestudy_popmap --samples ./samples --min-samples-per-pop 0.80 % denovo_map.pl -M 8 -o ./stacks/M8 --popmap ./treestudy_popmap --samples ./samples --min-samples-per-pop 0.80

Paired-end Reads

In this example, we will process paired-end reads. We will supply a population map to denovo_map.pl containing the common prefix prefix of the names of the samples. The denovo_map.pl program expects for a single sample that the file containing the single-end reads will end in a .1. and the corresponding paired-end file will end in a .2. (which is how process_radtags will output them). We we will tell denovo_map.pl the directory these sample files are located in.
Your samples directory should look similar to this, after processing with process_radtags:

% ls samples/ sample_06.1.fq.gz sample_07.01.1.fq.gz sample_07.02.1.fq.gz sample_09.1.fq.gz sample_06.2.fq.gz sample_07.01.2.fq.gz sample_07.02.2.fq.gz sample_09.2.fq.gz ...

The population map would look like this:

% cat ./treestudy_popmap sample_07.01 redlake sample_07.02 redlake sample_06 blueriver sample_09 blueriver ...

And we supply both of these paths to denovo_map.pl, along with an output directory to store results and some parameter settings for builing loci.

% denovo_map.pl -M 5 -T 8 -o ./stacks --popmap ./treestudy_popmap --samples ./samples --paired

The denovo_map.pl program will handle specifying the single-ends to the core pipeline (ustacks→cstacks→sstacks) and the paired-ends to tsv2bam.
Building on the previous example, it is good practice to filter paired-end data for PCR duplicates (it must be single-digest, sheared data). We can do that by passing the the --rm-pcr-duplicates flag to the denovo_map.pl (which will pass it on to gstacks).
% denovo_map.pl -M 7 -T 8 -o ./stacks --popmap ./treestudy_popmap --samples ./samples --rm-pcr-duplicates --paired

Program Outputs

The main output of denovo_map.pl is the log file, denovo_map.log (and, of course, all of the individual pipeline components will create their own output). The denovo_map.log file will capture all of the outputs from the component programs, so each sample run in ustacks, the cstacks, sstacks, tsv2bam, gstacks, and populations.

The denovo_map.log log file will also provide a table listing the depth of sequencing coverage for each sample processed (as calculated by ustacks) for the single-end reads. It will look similar to this:

sample loci assembled depth of cov max cov number reads incorporated % reads incorporated sample_07.01 41228 19.77 291 303663 81.6 sample_07.02 39101 18.62 212 249467 74.0 sample_06 35506 17.87 231 199709 86.0 sample_09 48445 12.24 295 233270 83.6 ...

A convenient way to extract this information from the large log file is to use the stacks-dist-extract utility, like this: