Please cite one or both of these papers:
J. Catchen, P. Hohenlohe, S. Bassham, A. Amores, and W. Cresko. Stacks: an analysis tool set for population genomics. Molecular Ecology. 2013. [reprint]
J. Catchen, A. Amores, P. Hohenlohe, W. Cresko, and J. Postlethwait. Stacks: building and genotyping loci de novo from short-read sequences. G3: Genes, Genomes, Genetics, 1:171-182, 2011. [reprint]
Stacks is written in C++ and can be built in any UNIX-like environment. Compiling on Linux requires GCC version 4.9 or higher. The software will also build on Apple's OS X, which uses the Clang compiler instead. Stacks relies on OpenMP to run in multi-threaded mode; OpenMP is built into most compilers, including GCC, but is not yet supported by Apple's Clang, so Stacks will run in single-threaded mode on OS X. (There are several ways to install GCC on OS X, in which case you can use it to get a multi-threaded Stacks build.)
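Stacks follows the usual autotools build steps. A minimal sketch (the version number in the tarball name is an example; substitute the release you downloaded):

```shell
# Unpack the source distribution (file name is illustrative).
tar xzf stacks-1.48.tar.gz
cd stacks-1.48

# Configure; on OS X you can point at an OpenMP-capable GCC, e.g.:
#   ./configure CC=gcc-7 CXX=g++-7
./configure

# Build and install (installs under /usr/local by default).
make
sudo make install
```
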
Stacks provides an optional web-based user interface, which can be very useful for visualizing genotypes across a set of populations or in a genetic cross. Stacks relies on a standard set of open-source programs (commonly called a LAMP stack) to provide that interface, including the Apache web server, the PHP scripting language, and the MySQL database. To get data from your local workstation into the web interface, Stacks' Perl scripts use the DBD::mysql module; similarly, exporting data from the web interface as an Excel spreadsheet requires the Spreadsheet::WriteExcel module.
Stacks is meant to track a set of loci, and the alleles present at those loci, in a population of organisms. At small genetic distances, the population could be the parents and resulting progeny of a genetic cross in a linkage mapping analysis. At intermediate distances, it could be a set of related populations from a single species in a population genetic analysis, and at larger distances, a set of closely related species in a phylogenetic analysis.
The genome of each individual in the population is sampled (e.g. using a restriction enzyme, such as done in RADseq) and then sequenced (typically using the Illumina platform, although the Ion Torrent platform is also used). The sampled loci from each individual are then reassembled from the sequenced reads (creating Stacks!) either de novo (using ustacks), or by first aligning them to a reference genome and then assembling them using pstacks.
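The two assembly routes might look like the following for a single individual (sample paths, sample ID, and parameter values are illustrative, not recommendations):

```shell
# De novo: assemble loci within one individual from raw reads.
#   -i gives the sample a numeric ID, -M caps mismatches between
#   stacks merged into a locus, -p sets the number of threads.
ustacks -f samples/sample_01.fq.gz -i 1 -o stacks/ -M 2 -p 8

# Reference-based: assemble loci from reads already aligned to a genome.
#   -t names the input format (sam or bam).
pstacks -f aligned/sample_01.bam -i 1 -o stacks/ -t bam -p 8
```
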
Next, we need to synthesize the loci assembled independently in each individual into a single view of all the loci and alleles present in the population. There is a trade-off when constructing this catalog of loci: each individual we add brings a small amount of error with it, so we want to minimize that error while still capturing all the true loci segregating in the population. Having a reference genome can simplify this problem because it acts like a filter, excluding most (but not all) of the error (and some of the true loci as well).
In a genetic cross, constructing the catalog is easily done by matching together the loci from the two parents of the cross, since all the progeny will contain the same loci as the parents. In a set of related populations from the same species, if we have a moderate number of individuals, we will add all of them to the catalog. However, if we have hundreds or thousands of individuals, we may choose to add only a fraction of the individuals from each sub-population -- just enough to include (almost) all of the loci segregating in the populations. In a phylogenetic analysis, where we have a small number of individuals from a number of different species, we will likely have to add all individuals to the catalog to maximize the number of shared loci found across the species. This process is implemented by the cstacks program.
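For the genetic-cross case, a catalog built from the two parents might be invoked like this (the batch ID and sample prefixes are illustrative; each -s names the file prefix of one individual's assembled loci):

```shell
# Build the catalog from the two parents of the cross.
#   -b is the batch ID, -n allows mismatches when merging
#   parental loci, -p sets the number of threads.
cstacks -b 1 -o stacks/ \
    -s stacks/parent_father \
    -s stacks/parent_mother \
    -n 1 -p 8
```

For a population sample, you would instead pass one -s per individual (or per chosen representative) to be folded into the catalog.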
Finally, all individuals in the experiment are matched against the catalog (using sstacks), resulting in a map of all the alleles present in the set of loci sampled from across the population. Once the above process is completed, we can interpret the loci in a number of different ways, without recalculating the above steps. So, the process is to run the components of the main pipeline, and then once complete, we can repeatedly interpret them. Using photography as a metaphor, first we develop the negative, then we can print the photo from the negative using a number of different tools and filters.
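The matching step can be sketched as a loop over individuals (sample prefixes and the catalog prefix are illustrative):

```shell
# Match each individual's loci against the catalog built by cstacks.
#   -c names the catalog file prefix produced for batch 1.
for sample in stacks/progeny_01 stacks/progeny_02; do
    sstacks -b 1 -c stacks/batch_1 -s $sample -o stacks/ -p 8
done
```
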
For a genetic map, we interpret the loci as a set of markers and export those markers that are mappable given a particular cross design (different types of markers are mappable in an F1 cross versus an F2, for example). This is done with the genotypes program.
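A run of the genotypes program might look like this (the map type, output format, and threshold shown are illustrative choices, not defaults):

```shell
# Export mappable markers for an F2 cross design in JoinMap format.
#   -t names the cross type, -o the output format,
#   -r requires a marker to appear in at least 3 progeny.
genotypes -b 1 -P stacks/ -t F2 -o joinmap -r 3
```
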
Given a set of related populations, or closely related species, we use the populations program to interpret the sampled loci. We can choose to filter the loci using a number of criteria, such as how frequently they occur in the population, or based on properties such as heterozygosity. We can choose to group the individuals in different ways, say by sampling location or phenotype, or exclude particular individuals. Each grouping is defined by a population map, and we can feed these different population maps to the populations program to change how the set of individuals is being interpreted, and then calculate summary statistics (such as FST or π) based on each of the different groupings. This is very powerful because we don't have to change the set of loci the main pipeline defined; we just repeatedly run populations using these tools to identify and hone the biological signal in our data. For each different grouping we define, we can export sets of loci for further analysis in programs like STRUCTURE, GenePop, Adegenet, or RAxML.
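As a sketch, a population map is a simple tab-separated file assigning each sample to a group, and a populations run consumes it along with filtering flags (the file names, group labels, and thresholds below are illustrative):

```shell
# popmaps/by_habitat.tsv -- one sample per line, tab-separated:
#   sample_01    oceanic
#   sample_02    freshwater

# Compute F-statistics and export STRUCTURE/GenePop input, keeping
# only loci present in at least 75% of individuals in each group.
populations -b 1 -P stacks/ -M popmaps/by_habitat.tsv \
    -r 0.75 --fstats --structure --genepop
```

Swapping in a different population map (say, grouped by phenotype instead of habitat) reinterprets the same underlying loci without rerunning the main pipeline.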
In the simplest case, we can use the wrapper programs, denovo_map.pl and ref_map.pl to run the entire above process. Or, we can run all the components by hand to customize exactly how the data are processed.
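A wrapper invocation for a genetic cross might look like the following (file names and parameter values are illustrative):

```shell
# Run the full de novo pipeline (ustacks, cstacks, sstacks) in one step.
#   -p names a parent, -r a progeny; -m, -M and -n are passed through
#   to the underlying components; -T sets the number of threads.
denovo_map.pl -m 3 -M 2 -n 1 -T 8 -b 1 -o stacks/ \
    -p samples/father.fq.gz -p samples/mother.fq.gz \
    -r samples/progeny_01.fq.gz -r samples/progeny_02.fq.gz
```
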
Yes, Stacks can handle both types of input data. The process_radtags program will demultiplex these types of data and check and correct both single- and double-digested restriction enzyme cutsites. See the process_radtags documentation for examples and the manual for additional information.
Yes! Support for gapped alignments was added in Stacks version 1.38.
Although we recommend using the latest version of Stacks, sometimes bugs in the newer versions make it necessary to revert to an older version. All the versions are still available on the website; here are a few:
Yes, Stacks uses a sliding-window quality filtering algorithm that allows for isolated low-quality base calls while discarding reads that show a degenerating quality level across the read length. The program will also demultiplex data according to a set of barcodes and check for the presence of a restriction enzyme cutsite -- for both single and double-digested data. The process_radtags program can process a number of different file formats, including single and paired-end data in FASTQ format (gzipped or not), as well as in FASTA or BAM format. It can also handle inline barcodes or Illumina barcode indexes, as well as combinatorial barcodes. The process_radtags program can also filter adapter sequence from raw reads while allowing for sequencing error in the adapter sequence. Lots of examples for processing different file formats can be found in the process_radtags documentation.
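A typical cleaning-and-demultiplexing run might look like this (directory names, barcode file, and enzyme are illustrative; the flags shown enable quality filtering, rescue of mutated barcodes/cutsites, and removal of uncalled bases):

```shell
# Demultiplex gzipped paired-end reads cut with SbfI.
#   -P        paired-end input; -p names the input directory
#   -b        barcode file (one barcode per line)
#   -e        restriction enzyme used; -i input format
#   -q -r -c  quality-filter, rescue barcodes/cutsites, clean reads
process_radtags -P -p raw/ -b barcodes.tsv -o samples/ \
    -e sbfI -i gzfastq -q -r -c
```
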
Yes! We are happy to add restriction enzymes to the process_radtags program. However, we ask that in exchange for adding it you will be willing to try it out on your data and let us know that it is working correctly. We could simply add all the restriction enzymes available, but we want to make sure the code works right and we don’t have access to all those types of data.
In addition, you can specify the --disable_rad_check flag to the process_radtags program to stop it from checking the integrity of the RAD site. Of course, this will affect the quality of the data input to Stacks.
In the de novo case, data is read by the ustacks program, which currently accepts either FASTA or FASTQ format. To save time and disk space, ustacks can also read gzipped FASTA or FASTQ files. When a reference genome is available, data is read by the pstacks program, which accepts either SAM or BAM format.
Each of the Stacks programs outputs tab-separated value files, which can be loaded into a database or easily parsed with scripts. If you are using the database, the export_sql.pl script can export a compact TSV file or a Microsoft Excel spreadsheet; if configured, the spreadsheet can be exported directly through the web interface.
The populations program can output SNP locations in the VCF format. It will also generate summary statistics and compute population genetic measures such as π, heterozygosity, FIS within populations and FST between populations, and these data will be exported in tab-separated format. It can also output data formatted for the GenePop program, for Structure, as well as for standard phylogenetic programs that can read Phylip files. The populations program can also output the alleles from each locus in each individual as a set of FASTA sequences and it can output data for a number of phasing programs such as PHASE, fastPHASE, Beagle, and Plink.
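Several of these exports can be requested in a single run; for example (file names are illustrative):

```shell
# Export SNP calls as VCF, plus Phylip input for phylogenetic
# programs and per-individual haplotypes as FASTA sequences.
populations -b 1 -P stacks/ -M popmap.tsv \
    --vcf --phylip --fasta
```
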
See the Stacks manual.
No; however, Stacks expects data to be stringently filtered prior to running the pipeline.
If you are running Stacks on Fedora Linux, or another distribution that makes use of SELinux, you may be having a permissions problem. On a Fedora FC14 distribution, we executed the following commands (as root) to enable permissions for this feature:
Allow the Apache server access to the network to send notification emails:
# setsebool -P httpd_can_network_connect 1
Mark both Perl scripts as executable by the Apache server:
# chcon -t httpd_sys_script_exec_t /usr/local/bin/stacks_export_notify.pl
# chcon -t httpd_sys_script_exec_t /usr/local/bin/export_sql.pl
Mark the export directory as writable by the Apache web server:
# chcon -t httpd_sys_script_rw_t /usr/local/share/stacks/php/export