ref_map.pl
The ref_map.pl program will execute the Stacks pipeline by running each of the Stacks components individually. It is the
simplest way to run Stacks and it handles many of the details, such as sample numbering and loading data to the MySQL database, if desired. The
ref_map.pl program expects data to have been aligned to a reference genome, and can accept data from any aligner that
can produce SAM or BAM formatted files. The program performs several stages, including:
- Running pstacks on each of the samples specified, assembling loci
according to the alignment positions provided for each read, and calling SNPs in each sample.
- Executing cstacks to create a catalog of all loci across the
population (or from just the parents if processing a genetic map). Loci from different samples are matched up across the data set according
to alignment position.
- Next, sstacks will be executed to match each sample against the
catalog. In the case of a genetic map, the parents and progeny are matched against the catalog.
- In the case of a population analysis, the populations program will
be run to generate population-level summary statistics. If you specified a population map (-O option) it will be
supplied to populations. If you are analyzing a genetic map, the
genotypes program will be executed to generate a set of markers
and a set of initial genotypes for export to a linkage mapping program.
- Computation is now complete. If database interaction is enabled, ref_map.pl will upload the results of each stage
of the analysis: individual loci, the catalog, matches against the catalog, and genotypes or summary statistics into the database.
- Lastly, if database interaction is enabled, index_radtags.pl
will be run to build a database index to speed up access to the database and enable web-based filtering.
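As a minimal sketch, a run on a few reference-aligned samples with database interaction disabled might look like the following (the sample paths, output directory, and thread count are hypothetical):

% ref_map.pl -S -b 1 -T 8 -o ./stacks \
      -s ./aligned/sample_01.bam \
      -s ./aligned/sample_02.bam \
      -s ./aligned/sample_03.bam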
Specifying Samples
The raw data for each sample in the analysis has to be specified to Stacks. All of your samples have to be
specified together for a single run of the pipeline. There are two ways to do this in
ref_map.pl. First, each sample can be specified separately on the command
line. If processing a genetic map, each parent and/or progeny can be specified using -p
and -r, respectively. If analyzing a population, each sample can be specified using
-s (see examples below).
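For instance, a genetic map analysis might name the parents and progeny individually (the file names here are hypothetical):

% ref_map.pl -S -b 1 -o ./stacks \
      -p ./aligned/male_parent.bam \
      -p ./aligned/female_parent.bam \
      -r ./aligned/progeny_01.bam \
      -r ./aligned/progeny_02.bam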
The second way to specify samples is to use a population map. In this case, you specify the path to the directory
containing all samples using the --samples option, and then ref_map.pl
will read the contents of the population map file (specified with the -O option) and search
for each specified sample in the --samples directory.
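For example, assuming the aligned BAM files live in ./aligned and the population map is ./popmap (both hypothetical paths), the samples listed in the map can be pulled in by name:

% ref_map.pl -S -b 1 -o ./stacks --samples ./aligned -O ./popmap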
Using a population map
A population map is an optional file that contains assignments of each of your samples to a particular population. See the
manual for more information on how they work. The ref_map.pl
program will not directly use the file (beyond potentially reading the sample names out of it); however, it will upload the
population mappings to the database for use in the web interface, if enabled. It is the populations
program that actually uses the population map for statistical calculations, and ref_map.pl will provide
the map to the populations program. You can run the populations program by
hand, specifying other population maps as you like, after the pipeline completes its first execution.
Using the Database (and Web Interface)
If you have the database and web interface installed (MySQL and the Apache Webserver) then ref_map.pl
can upload the output from the pipeline to the database for viewing in a web browser. To do so:
- Pick a batch ID, a number to represent the batch of data processed by Stacks. Specify
it to the pipeline with the -b option.
- Choose a database name and specify it to ref_map.pl using the
-B option. For the database to be displayed by the web interface
it needs to have a "_radtags" suffix, e.g. fishstudy_radtags
(this allows the web interface to ignore other databases that may be present in your server
unrelated to Stacks).
- You can specify a description of this batch of data with the
-D option. The description will be displayed in the web interface;
it can help you differentiate between runs of the pipeline with, for example, different
parameter choices.
- If the database already exists, you are ready to start ref_map.pl. If
the database does not yet exist, you can specify the --create_db option
to create it. If it already exists, but you want to delete and recreate it, specify the
--overw_db option.
Different batches can be loaded into the database by running ref_map.pl repeatedly with different
batch IDs and specifying the same database. This is convenient to collect a set of pipeline runs, with different parameter
settings, in one place for comparison.
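As a sketch, a run that creates and loads a database named fishstudy_radtags might look like this (the batch ID, description, and paths are hypothetical):

% ref_map.pl -b 2 -B fishstudy_radtags --create_db -D "Lane 1, first parameter set" \
      -o ./stacks --samples ./aligned -O ./popmap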
If you do not wish to use the database, specify the -S option to
ref_map.pl and all database interactions will be disabled. If you
want to load data into the database after a ref_map.pl execution,
you can do so using the load_radtags.pl
program.
Passing additional arguments to Stacks component programs
The Stacks component programs have many options that can be invoked. It would be impractical to expose them all
through the ref_map.pl wrapper program. Instead, you can pass additional options to the internal
programs that ref_map.pl will execute using the -X option. To use this option,
you specify (in quotes) the program the option goes to, followed by a colon (":"), followed by the option itself. For example,
-X "populations:--fstats" will pass the --fstats option to the
populations program. Another example, -X "populations:-r 0.8" will pass the
-r option, with the argument 0.8, to the populations
program. Each option should be specified separately with -X. See
below for examples.
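For example, a single run can hand several extra options to the populations program, each with its own -X (the paths here are hypothetical):

% ref_map.pl -S -b 1 -o ./stacks --samples ./aligned -O ./popmap \
      -X "populations:--fstats" -X "populations:-r 0.8"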
Some notes on alignments
The ref_map.pl program takes as input aligned reads. It does not provide the assembly parameters that
denovo_map.pl does and this is because the job of assembling the loci is being taken over by your
aligner program (e.g. BWA or GSnap). You must take care that you have good alignments -- discarding reads with multiple alignments,
making sure that you do not allow too many gaps in your sequences (otherwise loci containing repeat elements can easily be collapsed
during alignment), and taking care not to allow soft-masking in the alignments. Soft-masking occurs when an aligner cannot make
a full alignment and instead soft-masks the portion of the read that could not be aligned (pretending that this part of
the read does not exist). These factors, if not attended to, can cause spurious SNP calls and problems in the downstream analysis.
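As a minimal sketch of this kind of pre-filtering (assuming samtools is available and that a minimum mapping quality of 20 is a sensible cutoff for your data), mapped reads with a reasonable mapping quality can be kept before handing the BAM file to ref_map.pl; note that this does not remove soft-clipped alignments, which must be checked separately:

# keep only mapped reads (exclude flag 4) with mapping quality >= 20
% samtools view -b -q 20 -F 4 sample_01.bam > sample_01.filtered.bam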
When not to use ref_map.pl
There are a few reasons to run the pipeline manually instead of using the ref_map.pl wrapper.
- If you have a very large number of samples, you may not want to put them all in the catalog. In a population analysis,
all of the samples specified to ref_map.pl will be loaded into the catalog. In a de novo
analysis, each sample added to the catalog will also add a small fraction of error to the catalog. With a very large
number of samples, the error can overwhelm true loci in the population. In this case you may only want to load a subset
of each population in your analysis.
- Again, if you have a lot of samples, you may want to speed up your analysis by splitting up your samples and running them
on a number of nodes in a cluster. In this case, you would have to queue up pstacks to run on
different nodes with different samples. This can't be done using ref_map.pl.
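As a sketch of that manual approach, each sample could be submitted as its own cluster job running pstacks with a unique SQL ID (the directories, minimum stack depth of 3, and thread count are assumptions here):

% pstacks -t bam -f ./aligned/sample_01.bam -o ./stacks -i 1 -m 3 -p 8
% pstacks -t bam -f ./aligned/sample_02.bam -o ./stacks -i 2 -m 3 -p 8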
Other Pipeline Programs
The remaining pipeline programs fall into four groups: raw read processing, core components, execution control, and utilities.
The process_radtags program examines raw reads from an Illumina sequencing run. It
first checks that the barcode and the restriction enzyme cut site are intact (correcting minor errors).
Second, it slides a window down the length of the read and checks the average quality score within the window.
If the score drops below 90% probability of being correct, the read is discarded. Reads that pass quality
thresholds are demultiplexed if barcodes are supplied.
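A typical single-end invocation, assuming sbfI was the restriction enzyme and using hypothetical paths, might look like:

% process_radtags -p ./raw -b ./barcodes -o ./samples -e sbfI -r -c -q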
The process_shortreads program performs the same task as process_radtags
for fast cleaning of randomly sheared genomic or transcriptomic data. This program will trim reads that are below the
quality threshold instead of discarding them, making it useful for genomic assembly or other analyses.
The clone_filter program will take a set of reads and reduce them according to PCR
clones. This is done by matching raw sequence or by referencing a set of random oligos that have been included in the sequence.
The kmer_filter program allows paired or single-end reads to be filtered according to the
number of rare or abundant kmers they contain. It is useful for RAD datasets as well as randomly sheared genomic or
transcriptomic data.
The ustacks program will take as input a set of short-read sequences and align them into
exactly-matching stacks. Comparing the stacks, it will form a set of loci and detect SNPs at each locus using a
maximum likelihood framework.
The pstacks program will extract stacks that have been aligned to a reference genome by a
program such as Bowtie or GSnap and identify SNPs at each locus using a maximum likelihood framework.
The cstacks program builds a catalog from any set of samples processed
by the ustacks program. It will create a set of consensus loci, merging alleles together. In the case
of a genetic cross, a catalog would be constructed from the parents of the cross to create a set of
all possible alleles expected in the progeny of the cross.
The sstacks program searches sets of stacks constructed by the ustacks
or pstacks programs against a catalog produced by cstacks. In the case of a
genetic map, stacks from the progeny would be matched against the catalog to determine which progeny
contain which parental alleles.
The populations program will compute population-level summary statistics such
as π, FIS, and FST. It can output site level SNP calls in VCF format and
can also output SNPs for analysis in STRUCTURE or in Phylip format for phylogenetics analysis.
The genotypes program exports Stacks data as a set of observed haplotypes, or with
the haplotypes encoded into genotypes, in either a generic format or for a particular linkage mapper such as
JoinMap, OneMap, or R/QTL. It also provides methods for making automated corrections to certain types of loci.
The rxstacks program makes corrections to individual genotypes and haplotypes based
on data from a population of samples.
The denovo_map.pl program executes each of the Stacks components to create a genetic
linkage map, or to identify the alleles in a set of populations. It also handles uploading data to the database.
The ref_map.pl program takes reference-aligned input data and executes each of the Stacks
components, using the reference alignment to form stacks, and identifies alleles. It can be used for a genetic map
or for a set of populations. It also handles uploading data to the database.
The load_radtags.pl program takes a set of data produced by either the denovo_map.pl or
ref_map.pl programs (or produced by hand) and loads it into the database. This allows the data to be generated on
one computer but loaded from another, or for a database to be regenerated without re-executing the pipeline.
The index_radtags.pl program indexes the database to speed execution in the web interface
and enable web-based filtering after all of the data has been loaded. It will be run by
denovo_map.pl or ref_map.pl at the end of execution.
The export_sql.pl program provides for the export of all of the Stacks data in a compact form.
For each locus, the script reports the consensus sequence, the number of parents and progeny that Stacks could find the
locus in, the number of SNPs found at that locus and a listing of the individual SNPs and the observed alleles, followed
by the particular allele observed in each individual. The script allows you to specify a number of filters for the data,
the same filters as available through the web interface.
The sort_read_pairs.pl program collates paired-end sequences for each catalog locus, creating
a FASTA file of paired reads for each catalog locus. These FASTA files can then be assembled to make paired-end contigs.
The exec_velvet.pl program will execute Velvet on the collated set of reads generated by
sort_read_pairs.pl.