Stacks

populations

The populations program will analyze a population of individual samples computing a number of population genetics statistics as well as exporting a variety of standard output formats. A map specifying which individuals belong to which population is submitted to the program and the program will then calculate population genetics statistics such as expected/observed heterozygosity, π, and FIS at each nucleotide position. The populations program will compare all populations pairwise to compute FST. If a set of data is reference aligned, then a kernel-smoothed FST will also be calculated. The populations program can also compute a number of haplotype-based population genetics statistics including haplotype diversity, ΦST, and FST’.

Program Options

populations -P dir -b batch_id [-O dir] [-M popmap] (filters) [--fstats] [-k [--window_size=150000] [--bootstrap [-N 100]]] (output formats) populations -V vcf -O dir [-M popmap] (filters) [--fstats] [-k [--window_size=150000] [--bootstrap [-N 100]]] (output formats)

Data Filtering:

Merging and Phasing:

Fstats:

Kernel-smoothing algorithm:

File output options:

Additional options:

Population Map

Individuals run through the Stacks pipeline can be split into different population groups for the purposes of computing summary statistics such as π, FIS and FST. By feeding the populations program different population maps for the same individuals you can split the data in different dimensions. For example, if I collect 25 individuals from one environment, I can run all 25 as a single population, or I could split the population according to phenotypic groups present in that environment by changing the population map. I can also exclude individuals from an analysis, or process subsets of the full dataset by excluding samples from the population map.

Statistics such as FIS and FST will be computed relative to the population groups. You can have as many groups as you like, summary statistics will be computed for each population, then FST will be computed across every pair of populations specified. Here are a few examples of a population map for six individuals, in the first case treated as a single population, in the second case treated as two populations. These are simple, tab-separated text files:

A single population in the population map

indv_01<tab>1 indv_02 1 indv_03 1 indv_04 1 indv_05 1 indv_06 1

Two populations in the population map

indv_01 1 indv_02 1 indv_03 1 indv_04 2 indv_05 2 indv_06 2

We can also use short strings to represent the populations

samp01 red samp02 red samp03 red samp04 green samp05 green samp06 green

See the manual for additional information on using population maps.

Whitelists and Blacklists

The populations program allows the user to specify a list of catalog locus IDs (also referred to as markers) to the program. In the case of a whitelist, the program will only process the loci provided in the list, ignoring all other loci. In the case of a blacklist, the listed loci will be excluded, while all other loci will be processed. These lists apply to the entire locus, including all SNPs within a locus if they exist.

A whitelist or blacklist are simple files containing one catalog locus per line, like this:

% more whitelist 3 7 521 11 46 103 972 2653 22

SNP-specific Whitelists

In the populations program it is possible to specify a whitelist that contains catalog loci and specific SNPs within those loci. To create a SNP-specific whitelist, simply add a second column (separated by tabs) to the standard whitelist where the second column represents the column within the locus where the SNP can be found. Here is an example:

% more whitelist 1916<tab>12 517 14 517 76 1318 1921 13 195 28 260 5 28 44 28 90 5933 19369 18

See the manual for additional information on using whitelists.

Bootstrap resampling

The bootstrap resampling procedures are designed to determine the statistical significance of a particular sliding window value relative to the generated empirical distribution. Bootstrap resampling will generate a p-value describing the statistical significance of a particular sliding window and therefore requires a reference genome.

The bootstrap resampling process will center a window on each variable nucleotide position in the population and resample it X times (with replacement), and then calculate a p-value. Bootstrap resampling can be applied to all smoothed values, including the population summary statistics FIS, Π, and haplotype diversity, as well as the calculation of FST and ΦST between pairs of populations. If you have tens of thousands of variable sites (not unusual) and lots of populations, this calculation has to be repeated for every variable site in each population to bootstrap the summary statistics and for all variable sites between each pair of populations for FST and ΦST. So, bootstrap resampling can take a while.

Since bootstrapping is so computationally intensive, there are several command line options to the populations program to allow one to turn bootstrapping on for only a subset of the statistics. In addition, a bootstrap "whitelist" is available so you can choose to only bootstrap certain loci (say the loci on a single chromosome). This allows one to take the following strategy for bootstrapping to appropriate levels:

  1. Bootstrap all loci (for example) to 1,000 repetitions.
  2. Identify those loci that are below some p-value threshold (say 0.05).
  3. Add these loci to the bootstrapping whitelist.
  4. Bootstrap again to 10,000 repetitions (now only those loci in the whitelist will be bootstrapped).
  5. Identify those loci that are below some p-value threshold (say 0.005)
  6. Add these loci to the bootstrapping whitelist.
  7. Bootstrap again to 100,000 repetitions (now only those loci in the whitelist will be bootstrapped).
  8. And so on to the desired level of significance...

If instead you are interested in the statistical significance of a particular point estimate of an FST measure, you will want to use the p-value from Fisher's Exact Test, which is calculated for each variable position between pairs of populations and is provided in the FST output files.

Example Usage

  1. Calculate population statistics in a single population and output a Variant Call Format (VCF) SNP file. Run populations on 36 processors:

    ~/% populations -P ./stacks/ -b 1 --vcf -t 36

  2. Include multiple populations using a population map, and turn on kernel smoothing for π, FIS, and FST:

    ~/% populations -P ./stacks/ -M ./samples/popmap -b 1 -k -t 36

  3. Filter input so that to process a locus it must be present in 10 of the populations and in 75% of individuals in each population:

    ~/% populations -P ./stacks/ -M ./samples/popmap -b 1 -k -p 10 -r 0.75 -t 36

  4. Include a p-value correction to FST scores, if an FST score isn't significantly different from 0 (according to Fisher's Exact Test), set the value to 0:

    ~/% populations -P ./stacks/ -M ./samples/popmap -b 1 -k -p 10 -r 0.75 -f p_value -t 36

  5. Output data for STRUCTURE and in the GenePop format. Only write the first SNP from any RAD locus, to prevent linked data from being processed by STRUCTURE:

    ~/% populations -P ./stacks/ -M ./samples/popmap -b 1 -k -p 10 -r 0.75 -f p_value -t 36 --structure --genepop --write_single_snp

  6. Include a whitelist of 1,000 random loci so that we output a computationally manageable amount of data to STRUCTURE:

    ~/% populations -P ./stacks/ -M ./samples/popmap -b 1 -k -p 10 -r 0.75 -f p_value -t 36 --structure --genepop --write_single_snp -W ./wl_1000

    Here is one method to generate a list of random loci from a populations summary statistics file (this command goes all on one line):

    ~/% grep -v "^#" batch_1.sumstats.tsv | cut -f 2 | sort | uniq | shuf | head -n 1000 | sort -n > whitelist.tsv

    This command does the following at each step:

    1. Grep pulls out all the lines in the sumstats file, minus the commented header lines. The sumstats file contains all the polymorphic loci in the analysis.
    2. cut out the second column, which contains locus IDs
    3. sort those IDs
    4. reduce them to a unique list of IDs (remove duplicate entries)
    5. randomly shuffle those lines
    6. take the first 1000 of the randomly shuffled lines
    7. sort them again and capture them into a file.

Other Pipeline Programs

Raw Reads

Core

Execution control

Utilities