The ref_map.pl program will execute the Stacks pipeline by running each of the Stacks components individually. It is the simplest way to run Stacks and it handles many of the details, such as sample numbering. The ref_map.pl program expects data to have been aligned to a reference genome, and can accept data from any aligner that can produce BAM formated files. The program performs several stages, including:
The raw data for each sample in the analysis has to be specified to Stacks. All of your samples have to be specified together for a single run of the pipeline. Samples are specified to ref_map.pl by supplying a population map (--popmap) and the path to the directory containing all samples (--samples option). ref_map.pl will read the contents of the population map file and search for each specified sample in the --samples directory.
For ref_map.pl, samples must be aligned to a reference genome and the resulting BAM files must be sorted. To process paired-end reads, in addition to single-end reads, make sure the aligner is given both single- and paired-end reads when you align the reads to the reference genome. The resulting BAM file will then contain both sets of reads and gstacks will detect that and continue with a paired-end analysis.
A population map contains assignments of each of your samples to a particular population. See the manual for more information on how they work. The ref_map.pl program will not directly use the file, beyond reading the sample names out of it. It is the populations program that acutally uses the population map for statistical calculations, and ref_map.pl will provide the map to the populations program. You can run the populations program by hand, specifying other population maps as you like, after the pipeline completes its first execution.
The Stacks component programs contain a lot of possible options that can be invoked. It would be impractical to expose them all througth the ref_map.pl wrapper program. Instead, you can pass additional options to internal programs that ref_map.pl will execute using the -X. To use this option, you specify (in quotes) the program the option goes to, followed by a colon (":"), followed by the option itself. For example, -X "populations:--fstats" will pass the --fstats option to the populations program. Another example, -X "populations:-r 0.8" will pass the -r option, with the argument 0.8, to the populations program. Each option should be specified separately with -X.
The ref_map.pl program takes as input aligned reads. It does not provide the assembly parameters that denovo_map.pl does and this is because the job of assembling the loci is being taken over by your aligner program (e.g. BWA or Bowtie2). You must take care that you have good alignmnets — discarding reads with multiple alignments, making sure that you do not allow too many gaps in your sequences (otherwise loci with repeat elements can easily be collapsed during alignments), and take care not to allow soft-masking in the alignments. This occurs when an aligner can not make a full alignment and instead soft-masks the portion of the read that could not be aligned (pretending that this part of the read does not exist). These factors, if not cared for, can cause spurious SNP calls and problems in the downstream analysis.
Stacks is designed so that you run the core pipeline once, and the core information (assembled loci, genotyped and phased into haplotypes across the meta-population) is complete and does not need to be further modified. It is common to then run the populations program multiple times. populations will read the core pipeline outputs and can then be used to filter the data in many ways, to export the data in different formats, or to apply different population maps so the population genetics statistics are calculated according to those different maps (e.g. by geography, phenotype, or environmnetal vairable).
ref_map.pl --samples dir --popmap path [-s spacer] --out-path path [--rm-pcr-duplicates] [-X prog:"opts" ...]
Your samples directory should look similar to this, after cleaning/demultiplexing with process_radtags and then aligning and sorting them with, e.g. BWA-MEM and samtools:
% ls aligned/ blue_1827.19.bam blue_1827.25.bam blue_1827.05.bam red_2827.01.bam red_2827.07.bam red_2827.13.bam ...
The population map would look like this:
% cat ./fishes_popmap red_2827.01 redlake red_2827.07 redlake blue_1827.19 blueriver blue_1827.25 blueriver ...
And we supply both of these paths to ref_map.pl:
% ref_map.pl -T 8 -o ./stacks --popmap ./fishes_popmap --samples ./aligned
% ref_map.pl -o ./stacks --popmap ./fishes_popmap --samples ./aligned -X "populations:--fstats"
The main output of ref_map.pl is the log file, ref_map.log (and, of course, all of the individual pipeline components will create their own output). The ref_map.log file will capture all of the outputs from the component programs, including gstacks and populations.
Two key pieces of information from this file are output from gstacks and include how many of the alignments could be incorporated by Stacks, which looks similar to this:
Read 29454283 BAM records: kept 26626342 primary alignments (92.5%), of which 12643825 reverse reads skipped 273219 primary alignments with insufficient mapping qualities (0.9%) skipped 1049409 excessively soft-clipped primary alignments (3.6%) skipped 840054 unmapped reads (2.9%) skipped some suboptimal (secondary/supplementary) alignment records
The mean coverage per-sample can also be found in this file, however per-sample coverages are found in the gstacks.log.distribs file. The mean coverage looks like this:
Genotyped 99505 loci: effective per-sample coverage: mean=15.1x, stdev=1.3x, min=12.9x, max=17.1x
Raw reads |
Core |
Execution control |
Utility programs |