The populations program will analyze a population of individual samples computing a number of population genetics statistics as
well as exporting a variety of standard output formats. A map specifying which individuals belong to which population is submitted to the program and the
program will then calculate population genetics statistics such as expected/observed heterozygosity, π, and FIS at each nucleotide position.
The populations program will compare all populations pairwise to compute FST. If a set of data is reference aligned,
then a kernel-smoothed FST will also be calculated. The populations program can also compute a number of
haplotype-based population genetics statistics including haplotype diversity, ΦST, and FST’.
populations -P dir -b batch_id [-O dir] [-M popmap] (filters) [--fstats] [-k [--window_size=150000] [--bootstrap [-N 100]]] (output formats)
populations -V vcf -O dir [-M popmap] (filters) [--fstats] [-k [--window_size=150000] [--bootstrap [-N 100]]] (output formats)
- -P,--in_path — path to the directory containing the Stacks files.
- -b,--batch_id — Batch ID to examine when exporting from the catalog (required by -P).
- -V,--in_vcf — path to an input VCF file.
- -O,--out_path — path to a directory where to white the output files. (Required by -V; otherwise defaults to value of -P.)
- -M,--popmap — path to a population map. (Format is 'SAMPLE1POP1\n...'.)
- -t,--threads — number of threads to run in parallel sections of code.
- -s,--sql_out — output a file to import results into an SQL database.
- -p [int] — minimum number of populations a locus must be present in to process a locus.
- -r [float] — minimum percentage of individuals in a population required to process a locus for that population.
- --min_maf [float] — specify a minimum minor allele frequency required to process a nucleotide site at a locus (0 < min_maf < 0.5).
- --max_obs_het [float] — specify a maximum observed heterozygosity required to process a nucleotide site at a locus.
- -m [int] — specify a minimum stack depth required for individuals at a locus.
- --lnl_lim [float] — filter loci with log likelihood values below this threshold.
- --write_single_snp — restrict data analysis to only the first SNP per locus.
- --write_random_snp — restrict data analysis to one random SNP per locus.
- -B — path to a file containing Blacklisted markers to be excluded from the export.
- -W — path to a file containing Whitelisted markers to include in the export.
Merging and Phasing:
- -e,--renz — restriction enzyme name.
- --merge_sites — merge loci that were produced from the same restriction enzyme cutsite (requires reference-aligned data).
- --merge_prune_lim — when merging adjacent loci, if at least X% samples posses both loci prune the remaining samples out of the analysis.
- --fstats — enable SNP and haplotype-based F statistics.
- --fst_correction — specify a correction to be applied to Fst values: 'p_value', 'bonferroni_win', or 'bonferroni_gen'. Default: off.
- --p_value_cutoff [float] — maximum p-value to keep an Fst measurement. Default: 0.05. (Also used as base for Bonferroni correction.)
- -k,--kernel_smoothed — enable kernel-smoothed π, FIS, FST, FST', and ΦST calculations.
- --sigma [int] — standard deviation of the kernel smoothing weight distribution. Default 150kb.
- --bootstrap — turn on boostrap resampling for all smoothed statistics.
- -N,--bootstrap_reps [int] — number of bootstrap resamplings to calculate (default 100).
- --bootstrap_pifis — turn on boostrap resampling for smoothed SNP-based Pi and Fis calculations.
- --bootstrap_fst — turn on boostrap resampling for smoothed Fst calculations based on pairwise population comparison of SNPs.
- --bootstrap_div — turn on boostrap resampling for smoothed haplotype diveristy and gene diversity calculations based on haplotypes.
- --bootstrap_phist — turn on boostrap resampling for smoothed Phi_st calculations based on haplotypes.
- --bootstrap_wl [path] — only bootstrap loci contained in this whitelist.
File output options:
- --ordered_export — if data is reference aligned, exports will be ordered; only a single representative of each overlapping site.
- --genomic — output each nucleotide position (fixed or polymorphic) in all population members to a file (requires --renz).
- --fasta — output full sequence for each unique haplotype, from each sample locus in FASTA format, regardless of plausibility.
- --fasta_strict — output full sequence for each haplotype, from each sample locus in FASTA format, only for biologically plausible loci.
- --vcf — output SNPs in Variant Call Format (VCF).
- --vcf_haplotypes — output haplotypes in Variant Call Format (VCF).
- --genepop — output results in GenePop format.
- --structure — output results in Structure format.
- --phase — output genotypes in PHASE format.
- --fastphase — output genotypes in fastPHASE format.
- --beagle — output genotypes in Beagle format.
- --beagle_phased — output haplotypes in Beagle format.
- --plink — output genotypes in PLINK format.
- --hzar — output genotypes in Hybrid Zone Analysis using R (HZAR) format.
- --phylip — output nucleotides that are fixed-within, and variant among populations in Phylip format for phylogenetic tree construction.
- --phylip_var — include variable sites in the phylip output encoded using IUPAC notation.
- --phylip_var_all — include all sequence as well as variable sites in the phylip output encoded using IUPAC notation.
- --treemix — output SNPs in a format useable for the TreeMix program (Pickrell and Pritchard).
- -h,--help— display this help messsage.
- -v,--version— print program version.
- --verbose— turn on additional logging.
- --log_fst_comp— log components of FST/ΦST calculations to a file.
Individuals run through the Stacks pipeline can be split into different population groups for the purposes of
computing summary statistics such as π, FIS and FST. By feeding the populations program different population maps
for the same individuals you can split the data in different dimensions. For example, if I collect 25 individuals
from one environment, I can run all 25 as a single population, or I could split the population according to
phenotypic groups present in that environment by changing the population map. I can also exclude individuals from an analysis,
or process subsets of the full dataset by excluding samples from the population map.
Statistics such as FIS and FST will
be computed relative to the population groups. You can have as many groups as you like, summary statistics will be
computed for each population, then FST will be computed across every pair of populations specified. Here are a few
examples of a population map for six individuals, in the first case treated as a single population, in the second
case treated as two populations. These are simple, tab-separated text files:
A single population in the population map
Two populations in the population map
We can also use short strings to represent the populations
See the manual for additional information on using population maps.
Whitelists and Blacklists
The populations program
allows the user to specify a list of catalog locus IDs (also referred to as markers)
to the program. In the case of a whitelist, the program will only process the loci
provided in the list, ignoring all other loci. In the case of a blacklist, the listed loci
will be excluded, while all other loci will be processed. These lists apply to the entire locus,
including all SNPs within a locus if they exist.
A whitelist or blacklist are simple files containing one catalog locus per line, like this:
% more whitelist
In the populations program it is possible to specify a whitelist that contains
catalog loci and specific SNPs within those loci. To create a SNP-specific whitelist, simply add
a second column (separated by tabs) to the standard whitelist where the second column represents the
column within the locus where the SNP can be found. Here is an example:
% more whitelist
See the manual for additional information on using whitelists.
The bootstrap resampling procedures are designed to determine the statistical significance of a particular sliding window
value relative to the generated empirical distribution. Bootstrap resampling will generate a p-value describing the
statistical significance of a particular sliding window and therefore requires a reference genome.
The bootstrap resampling process will center a window on each variable nucleotide position in the population and resample it X
times (with replacement), and then calculate a p-value. Bootstrap resampling can be applied to all smoothed values, including
the population summary statistics FIS, Π, and haplotype diversity, as well as the calculation of FST
and ΦST between pairs of populations. If you have tens of thousands of variable sites (not unusual) and lots
of populations, this calculation has to be repeated for every variable site in each population to bootstrap the summary
statistics and for all variable sites between each pair of populations for FST and ΦST. So, bootstrap
resampling can take a while.
Since bootstrapping is so computationally intensive, there are several command line options to the
populations program to allow one to turn bootstrapping on for only a subset of the statistics. In
addition, a bootstrap "whitelist" is available so you can choose to only bootstrap certain loci (say the loci on a single
chromosome). This allows one to take the following strategy for bootstrapping to appropriate levels:
- Bootstrap all loci (for example) to 1,000 repetitions.
- Identify those loci that are below some p-value threshold (say 0.05).
- Add these loci to the bootstrapping whitelist.
- Bootstrap again to 10,000 repetitions (now only those loci in the whitelist will be bootstrapped).
- Identify those loci that are below some p-value threshold (say 0.005)
- Add these loci to the bootstrapping whitelist.
- Bootstrap again to 100,000 repetitions (now only those loci in the whitelist will be bootstrapped).
- And so on to the desired level of significance...
If instead you are interested in the statistical significance of a particular point estimate of an FST
measure, you will want to use the p-value from Fisher's Exact Test, which is calculated for each variable position
between pairs of populations and is provided in the FST output files.
- Calculate population statistics in a single population and output a Variant Call Format (VCF) SNP file. Run populations on 36 processors:
~/% populations -P ./stacks/ -b 1 --vcf -t 36
- Include multiple populations using a population map, and turn on kernel smoothing for π, FIS, and FST:
~/% populations -P ./stacks/ -M ./samples/popmap -b 1 -k -t 36
- Filter input so that to process a locus it must be present in 10 of the populations and in 75% of individuals in each population:
~/% populations -P ./stacks/ -M ./samples/popmap -b 1 -k -p 10 -r 0.75 -t 36
- Include a p-value correction to FST scores, if an FST score isn't significantly different from 0 (according to Fisher's Exact Test), set the value to 0:
~/% populations -P ./stacks/ -M ./samples/popmap -b 1 -k -p 10 -r 0.75 -f p_value -t 36
- Output data for STRUCTURE and in the GenePop format. Only write the first SNP from any RAD
locus, to prevent linked data from being processed by STRUCTURE:
~/% populations -P ./stacks/ -M ./samples/popmap -b 1 -k -p 10 -r 0.75 -f p_value -t 36 --structure --genepop --write_single_snp
- Include a whitelist of 1,000 random loci so that we output a computationally manageable amount of data to STRUCTURE:
~/% populations -P ./stacks/ -M ./samples/popmap -b 1 -k -p 10 -r 0.75 -f p_value -t 36 --structure --genepop --write_single_snp -W ./wl_1000
Here is one method to generate a list of random loci from a populations summary statistics file (this command goes all on one line):
~/% grep -v "^#" batch_1.sumstats.tsv |
cut -f 2 |
head -n 1000 |
sort -n > whitelist.tsv
This command does the following at each step:
- Grep pulls out all the lines in the sumstats file, minus the commented header lines. The sumstats file contains all the polymorphic loci in the analysis.
- cut out the second column, which contains locus IDs
- sort those IDs
- reduce them to a unique list of IDs (remove duplicate entries)
- randomly shuffle those lines
- take the first 1000 of the randomly shuffled lines
- sort them again and capture them into a file.
Other Pipeline Programs
The process_shortreads program performs the same task as process_radtags
for fast cleaning of randomly sheared genomic or transcriptomic data. This program will trim reads that are below the
quality threshold instead of discarding them, making it useful for genomic assembly or other analyses.
The clone_filter program will take a set of reads and reduce them according to PCR
clones. This is done by matching raw sequence or by referencing a set of random oligos that have been included in the sequence.
The kmer_filter program allows paired or single-end reads to be filtered according to the
number or rare or abundant kmers they contain. Useful for both RAD datasets as well as randomly sheared genomic or
The ustacks program will take as input a set of short-read sequences and align them into
exactly-matching stacks. Comparing the stacks it will form a set of loci and detect SNPs at each locus using a
maximum likelihood framework.
The pstacks program will extract stacks that have been aligned to a reference genome by a
program such as Bowtie or GSnap and identify SNPs at each locus using a maximum likelihood framework.
A catalog can be built from any set of samples processed
by the ustacks program. It will create a set of consensus loci, merging alleles together. In the case
of a genetic cross, a catalog would be constructed from the parents of the cross to create a set of
all possible alleles expected in the progeny of the cross.
Sets of stacks constructed by the ustacks
or pstacks programs can be searched against a catalog produced by the cstacks program. In the case of a
genetic map, stacks from the progeny would be matched against the catalog to determine which progeny
contain which parental alleles.
This populations program will compute population-level summary statistics such
as π, FIS, and FST. It can output site level SNP calls in VCF format and
can also output SNPs for analysis in STRUCTURE or in Phylip format for phylogenetics analysis.
The genotypes program exports Stacks data as a set of observed haplotypes, or with
the haplotypes encoded into genotypes, in either a generic format or for a particular linkage mapper such as
JoinMap, OneMap, or R/QTL. It also provides methods for making automated corrections to certain types of loci.
The rxstacks program makes corrections to individual genotypes and haplotypes based
on data from a population of samples.
The denovo_map.pl program executes each of the Stacks components to create a genetic
linkage map, or to identify the alleles in a set of populations. It also handles uploading data to the database.
The ref_map.pl program takes reference-aligned input data and executes each of the Stacks
components, using the reference alignment to form stacks, and identifies alleles. It can be used in a genetic map
of a set of populations. It also handles uploading data to the database.
The load_radtags.pl program takes a set of data produced by either the denovo_map.pl or
ref_map.pl progams (or produced by hand) and loads it into the database. This allows the data to be generated on
one computer, but loaded from another. Or, for a database to be regenerated without re-executing the pipeline.
The index_radtags.pl program indexes the database to speed execution in the web interface
and enable web-based filtering after all of the data has been loaded. It will be run by
denovo_map.pl or ref_map.pl at the end of execution.
The export_sql.pl program provides for the export of all of the Stacks data in a compact form.
For each locus, the script reports the consensus sequence, the number of parents and progeny that Stacks could find the
locus in, the number of SNPs found at that locus and a listing of the individual SNPs and the observed alleles, followed
by the particular allele observed in each individual. The script allows you to specify a number of filters for the data,
the same filters as available through the web interface.
The sort_read_pairs.pl program collates paired-end sequences for each catalog locus, creating
a FASTA file of paired-reads for each catalog locus. These FASTA files can then be assembled to make paired-end contigs.
The exec_velvet.pl program will execute Velvet on the collated set of reads generated by