The populations program will analyze a population of individual samples computing a number
of population genetics statistics as well as exporting a variety of standard output formats. A population map
specifying which individuals belong to which population is submitted to the program and the program will then
calculate population genetics statistics such as expected/observed heterozygosity, π, and F_{IS} at each
nucleotide position. The populations program will compare all populations pairwise to
compute F_{ST}. If a set of data is reference aligned, then a kernel-smoothed F_{ST} will also be
calculated. The populations program can also compute a number of haplotype-based
population genetics statistics including haplotype diversity, Φ_{ST}, and F_{ST}’. For
more information on how to specify a population map, see the manual.

The populations program provides strong filtering options to only include loci or variant sites that occur at
certain frequencies in each population or in the metapopulation. In addition, the program accepts
*whitelists* and *blacklists* if you want to include a specific list of loci (or exclude a specific
list of loci). For more information on whitelists and blacklists, see the manual.

populations -P dir -b batch_id [-O dir] [-M popmap] (filters) [--fstats] [-k [--window_size=150000] [--bootstrap [-N 100]]] (output formats)
populations -V vcf -O dir [-M popmap] (filters) [--fstats] [-k [--window_size=150000] [--bootstrap [-N 100]]] (output formats)
### Data Filtering:

### Merging and Phasing:

### Locus stats:

### Fstats:

### Kernel-smoothing algorithm:

### File output options:

### Genetic map output options (population map must specify a genetic cross):

### Additional options:

- -P,--in_path — path to the directory containing the Stacks files.
- -V,--in_vcf — path to an input VCF file.
- -O,--out_path — path to a directory where to white the output files. (Required by -V; otherwise defaults to value of -P.)
- -M,--popmap — path to a population map. (Format is 'SAMPLE1POP1\n...'.)
- -t,--threads — number of threads to run in parallel sections of code.

- -p,--min-populations [int] — minimum number of populations a locus must be present in to process a locus.
- -r,--min-samples-per-pop [float] — minimum percentage of individuals in a population required to process a locus for that population.
- -R,--min-samples-overall [float] — minimum percentage of individuals across populations required to process a locus.
- -H,--filter-haplotype-wise — apply the above filters haplotype wise (unshared SNPs will be pruned to reduce haplotype-wise missing data).
- --min-maf [float] — specify a minimum minor allele frequency required to process a nucleotide site at a locus (0 < min_maf < 0.5).
- --min-mac [int] — specify a minimum minor allele count required to process a SNP.
- --max-obs-het [float] — specify a maximum observed heterozygosity required to process a nucleotide site at a locus.
- --write-single-snp — restrict data analysis to only the first SNP per locus.
- --write-random-snp — restrict data analysis to one random SNP per locus.
- -B,--blacklist — path to a file containing Blacklisted markers to be excluded from the export.
- -W,--whitelist — path to a file containing Whitelisted markers to include in the export.

- -e,--renz — restriction enzyme name.
- --merge-sites — merge loci that were produced from the same restriction enzyme cutsite (requires reference-aligned data).
- --merge-prune-lim — when merging adjacent loci, if at least X% samples posses both loci prune the remaining samples out of the analysis.

- --hwe — calculate divergence from Hardy-Weinberg equilibrium for each locus.

- --fstats — enable SNP and haplotype-based F statistics.
- --fst_correction — specify a correction to be applied to Fst values: 'p_value', 'bonferroni_win', or 'bonferroni_gen'. Default: off.
- --p_value_cutoff [float] — maximum p-value to keep an Fst measurement. Default: 0.05. (Also used as base for Bonferroni correction.)

- -k,--smooth — enable kernel-smoothed π, F
_{IS}, F_{ST}, F_{ST}', and Φ_{ST}calculations. - --smooth-fstats — enable kernel-smoothed Fst, Fst', and Phi_st calculations.
- --smooth-popstats — enable kernel-smoothed Pi and Fis calculations. (Note: turning on smoothing implies --ordered_export.)
- --sigma [int] — standard deviation of the kernel smoothing weight distribution. Default 150kb.
- --bootstrap — turn on boostrap resampling for all smoothed statistics.
- -N,--bootstrap-reps [int] — number of bootstrap resamplings to calculate (default 100).
- --bootstrap-pifis — turn on boostrap resampling for smoothed SNP-based Pi and Fis calculations.
- --bootstrap-fst — turn on boostrap resampling for smoothed Fst calculations based on pairwise population comparison of SNPs.
- --bootstrap-div — turn on boostrap resampling for smoothed haplotype diveristy and gene diversity calculations based on haplotypes.
- --bootstrap-phist — turn on boostrap resampling for smoothed Phi_st calculations based on haplotypes.
- --bootstrap-wl [path] — only bootstrap loci contained in this whitelist.

- --ordered-export — if data is reference aligned, exports will be ordered; only a single representative of each overlapping site.
- --fasta-loci — output locus consensus sequences in FASTA format..
- --fasta-samples — output the sequences of the two haplotypes of each (diploid) sample, for each locus, in FASTA format.
- --vcf — output SNPs and haplotypes in Variant Call Format (VCF).
- --genepop — output results in GenePop format.
- --structure — output results in Structure format.
- --plink — output genotypes in PLINK format.
- --hzar — output genotypes in Hybrid Zone Analysis using R (HZAR) format.
- --phylip — output nucleotides that are fixed-within, and variant among populations in Phylip format for phylogenetic tree construction.
- --phylip-var — include variable sites in the phylip output encoded using IUPAC notation.
- --phylip-var-all — include all sequence as well as variable sites in the phylip output encoded using IUPAC notation.
- --treemix — output SNPs in a format useable for the TreeMix program (Pickrell and Pritchard).
- --no-hap-exports — omit haplotype outputs.
- --fasta-samples-raw — output all haplotypes observed in each sample, for each locus, in FASTA format.

- --map-type — genetic map type to write. 'CP', 'DH', 'F2', and 'BC1' are the currently supported map types.
- --map-format — mapping program format to write, 'joinmap', 'onemap', and 'rqtl' are currently supported.

- -h,--help— display this help messsage.
- -v,--version— print program version.
- --verbose— turn on additional logging.
- --log-fst-comp— log components of F
_{ST}/Φ_{ST}calculations to a file.

The bootstrap resampling procedures are designed to determine the statistical significance of a particular sliding window value relative to the generated empirical distribution. Bootstrap resampling will generate a p-value describing the statistical significance of a particular sliding window and therefore requires a reference genome.

The bootstrap resampling process will center a window on each variable nucleotide position in the population and resample it X
times (with replacement), and then calculate a p-value. Bootstrap resampling can be applied to all smoothed values, including
the population summary statistics F_{IS}, Π, and haplotype diversity, as well as the calculation of F_{ST}
and Φ_{ST} between pairs of populations. If you have tens of thousands of variable sites (not unusual) and lots
of populations, this calculation has to be repeated for every variable site in each population to bootstrap the summary
statistics and for all variable sites between each pair of populations for F_{ST} and Φ_{ST}. So, bootstrap
resampling can take a while.

Since bootstrapping is so computationally intensive, there are several command line options to the populations program to allow one to turn bootstrapping on for only a subset of the statistics. In addition, a bootstrap "whitelist" is available so you can choose to only bootstrap certain loci (say the loci on a single chromosome). This allows one to take the following strategy for bootstrapping to appropriate levels:

- Bootstrap all loci (for example) to 1,000 repetitions.
- Identify those loci that are below some p-value threshold (say 0.05).
- Add these loci to the bootstrapping whitelist.
- Bootstrap again to 10,000 repetitions (now only those loci in the whitelist will be bootstrapped).
- Identify those loci that are below some p-value threshold (say 0.005)
- Add these loci to the bootstrapping whitelist.
- Bootstrap again to 100,000 repetitions (now only those loci in the whitelist will be bootstrapped).
- And so on to the desired level of significance...

If instead you are interested in the statistical significance of a particular point estimate of an F_{ST}
measure, you will want to use the p-value from Fisher's Exact Test, which is calculated for each variable position
between pairs of populations and is provided in the F_{ST} output files.

- Calculate population statistics in a single population and output a Variant Call Format (VCF) SNP file. Run populations on 8 processors:
~/% populations -P ./stacks/ --vcf -t 8

- Include multiple populations using a population map, and turn on kernel smoothing for π, F
_{IS}, and F_{ST}(data must be reference aligned for smoothing):~/% populations -P ./stacks/ --popmap ./samples/popmap --smooth -t 8

- Filter input so that to process a locus it must be present in 10 of the populations and in 75% of individuals in each population:
~/% populations -P ./stacks/ --popmap ./samples/popmap --smooth -p 10 -r 0.75 -t 8

- Output data for STRUCTURE and in the GenePop format. Only write the first SNP from any RAD
locus, to prevent linked data from being processed by STRUCTURE:
~/% populations -P ./stacks/ --popmap ./samples/popmap --smooth -p 10 -r 0.75 -f p_value -t 8 --structure --genepop --write-single-snp

- Include a whitelist of 1,000 random loci so that we output a computationally manageable amount of data to STRUCTURE:
~/% populations -P ./stacks/ --popmap ./samples/popmap --smooth -p 10 -r 0.75 -f p_value -t 12 --structure --genepop --write-single-snp -W ./wl_1000

Here is one method to generate a list of random loci from a populations summary statistics file (this command goes all on one line):

~/% grep -v "^#" populations.sumstats.tsv | cut -f 1 | sort | uniq | shuf | head -n 1000 | sort -n > whitelist.tsv

This command does the following at each step:

- Grep pulls out all the lines in the sumstats file, minus the commented header lines. The sumstats file contains all the polymorphic loci in the analysis.
- cut out the second column, which contains locus IDs
- sort those IDs
- reduce them to a unique list of IDs (remove duplicate entries)
- randomly shuffle those lines
- take the first 1000 of the randomly shuffled lines
- sort them again and capture them into a file.

The populations program has a number of standard outputs, besides any particular export that is asked for. The formats of these files are described in detail in the manual.

- populations.log: This file contains everything output on the screen, including the version number of Stacks and when it was run. It also includes a lot of useful information describing the data after filters have been applied, including how many loci were discarded/kept, the mean length of loci, and mean number of sites.
- populations.log.distribs: This file contains a number of distributions describing the data. The distribution are
provided in two forms: prior to filtering and after filtering. You can view the file manually, or you can use the
stacks-dist-extract utility we provide to pull out a particular distribution. One distribution that can tell you
a lot about your data is the number of samples per locus, pre- and post-filtering. You can extract it like this:
% stacks-dist-extract populations.log.distribs samples_per_loc_prefilters # Distribution of valid samples matched to a catalog locus prior to filtering. n_samples n_loci 1 17857 2 6103 3 3855 4 4253 5 5591 6 2786 7 3264 8 6518 9 41362 10 40333

You can see in this data set that more than 40k loci each were found in 9 or 10 samples. But that 17k loci were only found in a single sample -- these loci are likely unreliable.

When we look at this distribution after filtering, we can see the effect of filtering out the low-value loci:

% stacks-dist-extract populations.log.distribs samples_per_loc_postfilters # Distribution of valid samples matched to a catalog locus after filtering. n_samples n_loci 8 3040 9 41362 10 40333

- populations.sumstats.tsv: This file contains summary statistics describing every SNP found in every population defined in the
population map. This includes the frequecy of alleles, expected/observed heterozygosity, π, F
_{IS}, and so on. - populations.hapstats.tsv: This file contains summary statistics describing every locus (phased SNPs) found in every population defined in the population map. This includes the frequecy of haplotypes, gene and haplotype diversity.
- populations.sumstats_summary.tsv: This file contains population-level, mean summary statistics.
- If F-statistics was enabled, populations.fst_X-Y.tsv: This file contains SNP-level measures of F
_{ST}. - If F-statistics was enabled, populations.phistats_X-Y.tsv: This file contains haplotype-level measures of F
_{ST}.

## Raw Reads |
## Core |
## Execution control |