Stacks: gstacks

gstacks

The gstacks program will examine a RAD data set one locus at a time, looking at all individuals in the metapopulation for that locus.

For de novo analyses, the gstacks program will start with the results of the core single-end pipeline (ustacks→cstacks→sstacks→tsv2bam), incorporate the paired-end reads (if available), as fetched by tsv2bam, assemble the paired-end reads into a contig, merge the contig with the single-end locus, and align reads from individual samples to the locus.

For reference-aligned analyses, the gstacks program is the first program executed and it will create loci by incorporating single- or paired-end reads that have been aligned to the reference genome and sorted, using a sliding window algorithm.

In either mode, gstacks will identify SNPs within the metapopulation for each locus and then genotype each individual at each identified SNP. Once SNPs have been identified and genotyped, gstacks will phase the SNPs at each locus, in each individual, into a set of haplotypes.

The gstacks program is able to remove PCR duplicates (pairs of reads with identical insert lengths) if requested.
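To make the definition of a PCR duplicate concrete, here is a minimal Python sketch that keeps one read pair per insert length. It is a conceptual illustration only, not the gstacks implementation: the function name, the tuple layout, and the choice to keep the first pair seen are all assumptions made for this example.

```python
def remove_pcr_duplicates(pairs):
    """Keep the first read pair seen for each insert length.

    `pairs` is a list of (pair_name, insert_length) tuples; this layout is
    hypothetical and chosen only for the sketch.
    """
    seen_lengths = set()
    kept = []
    for name, insert_len in pairs:
        if insert_len in seen_lengths:
            continue  # same insert length as an earlier pair: treated as a duplicate
        seen_lengths.add(insert_len)
        kept.append((name, insert_len))
    return kept


# Hypothetical read pairs from one locus.
pairs = [("pair_a", 312), ("pair_b", 312), ("pair_c", 298), ("pair_d", 312)]
print(remove_pcr_duplicates(pairs))
```

Because duplicates are recognized from the insert length, the approach needs both reads of a pair, which is consistent with --rm-pcr-duplicates implying --rm-unpaired-reads.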

The gstacks program outputs two main files: catalog.fa.gz, which contains the consensus sequence for each assembled locus in the data, and catalog.calls, a custom file that contains the genotyping data. These files are intended to be read by the populations program, which can apply appropriate filters and export the data.

Program Options

De novo mode:

gstacks -P stacks_dir -M popmap

-P — input directory containing '*.matches.bam' files created by the de novo Stacks pipeline, ustacks-cstacks-sstacks-tsv2bam

Reference-based mode:

gstacks -I bam_dir -M popmap [-S suffix] -O out_dir
gstacks -B bam_file [-B ...] -O out_dir

-I — input directory containing BAM files
-S — with -I/-M, suffix to use to build BAM file names: by default this is just '.bam', i.e. the program expects 'SAMPLE_NAME.bam'
-B — input BAM file(s)
The input BAM file(s) must be sorted by coordinate. With -B, records must be assigned to samples using BAM "read groups" (gstacks uses the ID/identifier and SM/sample name fields). Read groups must be consistent if repeated across different files. With -I, read groups are unneeded and ignored.
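Before running with -B, it can help to check how read groups map to samples. The Python sketch below parses @RG lines from SAM header text (for example, text obtained with `samtools view -H`) and reports whether any read-group ID maps to different sample names in different files. The function names are hypothetical, and reading "consistent" as "same ID implies same SM" is an assumption of this sketch, not a statement of what gstacks checks.

```python
def read_group_samples(header_text):
    """Map each read-group ID to its SM (sample name) from SAM header text."""
    mapping = {}
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # Header fields are tab-separated TAG:VALUE pairs after the record type.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:])
        mapping[fields["ID"]] = fields.get("SM")
    return mapping


def read_groups_consistent(headers):
    """Return True if no read-group ID maps to two different samples."""
    merged = {}
    for text in headers:
        for rg_id, sample in read_group_samples(text).items():
            if merged.setdefault(rg_id, sample) != sample:
                return False
    return True


# Hypothetical headers from two BAM files.
header_1 = "@HD\tVN:1.6\tSO:coordinate\n@RG\tID:rg1\tSM:sample_1020\n"
header_2 = "@HD\tVN:1.6\tSO:coordinate\n@RG\tID:rg2\tSM:sample_1069\n"
print(read_group_samples(header_1))
print(read_groups_consistent([header_1, header_2]))
```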

For both modes:

-M — path to a population map giving the list of samples
-O — output directory (default: none with -B; with -P same as the input directory)
-t,--threads — number of threads to use (default: 1)
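Since gstacks builds the input file names from the sample names in the population map, a quick check of which files it will expect can catch naming mistakes early. The Python sketch below assumes the population map is a tab-separated text file with the sample name in the first column; the function name and the population labels are hypothetical.

```python
def expected_bam_paths(popmap_text, in_dir, suffix=".bam"):
    """List the paths 'in_dir/SAMPLE_NAME<suffix>' for each sample in a popmap.

    Assumes one sample per line, with the sample name in the first
    tab-separated column; blank lines and '#' comments are skipped.
    """
    paths = []
    for line in popmap_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sample = line.split("\t")[0]
        paths.append(f"{in_dir}/{sample}{suffix}")
    return paths


# Hypothetical population map with two populations.
popmap_text = "sample_1020\tpop1\nsample_1069\tpop1\nsample_1086\tpop2\n"

# Reference-based mode (-I ./aligned with the default suffix).
print(expected_bam_paths(popmap_text, "./aligned"))
# De novo mode (-P ./stacks), where tsv2bam writes '*.matches.bam' files.
print(expected_bam_paths(popmap_text, "./stacks", suffix=".matches.bam"))
```

Each returned path can then be passed to `os.path.isfile` to confirm that the file exists before launching gstacks.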

SNP Model options:

--model — model to use to call variants and genotypes; one of marukilow (default), marukihigh, or snp
--var-alpha — alpha threshold for discovering SNPs (default: 0.05 for marukilow)
--gt-alpha — alpha threshold for calling genotypes (default: 0.05)

Paired-end options:

--rm-pcr-duplicates — remove read pairs of the same insert length (implies --rm-unpaired-reads)
--rm-unpaired-reads — discard unpaired reads
--ignore-pe-reads — ignore paired-end reads even if present in the input
--unpaired — ignore read pairing (only for paired-end GBS; treat READ2's as if they were READ1's)

Advanced options: (De novo mode)

--kmer-length — kmer length for the de Bruijn graph (default: 31, max. 31)
--max-debruijn-reads — maximum number of reads to use in the de Bruijn graph (default: 1000)
--min-kmer-cov — minimum coverage to consider a kmer (default: 2)
--write-alignments — save read alignments (heavy BAM files)
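To illustrate what the k-mer options control, here is a minimal Python sketch of k-mer counting with a coverage cutoff, the usual first step before building a de Bruijn graph. It is not the gstacks assembler: the function name is hypothetical, and capping the input by taking the first `max_reads` reads is an assumption of this sketch (gstacks may select reads differently).

```python
from collections import Counter


def solid_kmers(reads, k=31, min_cov=2, max_reads=1000):
    """Count k-mers in up to `max_reads` reads and keep those seen >= min_cov times."""
    counts = Counter()
    # Assumption: simply use the first `max_reads` reads.
    for read in reads[:max_reads]:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer: n for kmer, n in counts.items() if n >= min_cov}


# Toy reads with a small k so the result is easy to follow.
reads = ["ACGT", "ACGA"]
print(solid_kmers(reads, k=3, min_cov=2))
```

In this toy input only ACG occurs in both reads, so it is the only k-mer that survives a minimum coverage of 2; k-mers seen once are more likely to contain sequencing errors.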

Advanced options: (Reference-based mode)

--min-mapq — minimum PHRED-scaled mapping quality to consider a read (default: 10)
--max-clipped — maximum soft-clipping level, in fraction of read length (default: 0.20)
--max-insert-len — maximum allowed sequencing insert length (default: 1000)
--details — write a heavier output
--phasing-cooccurrences-thr-range — range of edge coverage thresholds to iterate over when building the graph of allele cooccurrences for SNP phasing (default: 1,2)
--phasing-dont-prune-hets — don't try to ignore dubious heterozygote genotypes during phasing
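The --min-mapq and --max-clipped thresholds can be pictured as a per-read filter. The Python sketch below computes the soft-clipped fraction from a CIGAR string and applies both thresholds with the defaults listed above. The function names are hypothetical, and two details are assumptions of this sketch rather than documented gstacks behavior: read length is taken as the sum of the M, I, S, = and X operations, and the comparisons are inclusive.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_fraction(cigar):
    """Fraction of read bases that are soft-clipped (CIGAR 'S' operations)."""
    clipped = 0
    read_len = 0
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "MIS=X":  # operations that consume read bases
            read_len += n
        if op == "S":
            clipped += n
    return clipped / read_len if read_len else 0.0


def passes_filters(mapq, cigar, min_mapq=10, max_clipped=0.20):
    """Apply the mapping-quality and soft-clipping thresholds to one read."""
    return mapq >= min_mapq and soft_clipped_fraction(cigar) <= max_clipped


print(passes_filters(30, "90M10S"))  # well mapped, 10% clipped
print(passes_filters(30, "70M30S"))  # 30% clipped
print(passes_filters(5, "100M"))     # mapping quality below 10
```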

Example Usage

Processing single-end or paired-end data, de novo.
Your Stacks directory should look similar to this, where the tags/snps/alleles/matches files were produced by the core pipeline (ustacks/cstacks/sstacks) and the matches.bam files were produced by tsv2bam:

% ls stacks/
sample_1020.alleles.tsv.gz  sample_1069.alleles.tsv.gz  sample_1086.alleles.tsv.gz  sample_1095.alleles.tsv.gz
sample_1020.matches.tsv.gz  sample_1069.matches.tsv.gz  sample_1086.matches.tsv.gz  sample_1095.matches.tsv.gz
sample_1020.matches.bam     sample_1069.matches.bam     sample_1086.matches.bam     sample_1095.matches.bam
sample_1020.snps.tsv.gz     sample_1069.snps.tsv.gz     sample_1086.snps.tsv.gz     sample_1095.snps.tsv.gz
sample_1020.tags.tsv.gz     sample_1069.tags.tsv.gz     sample_1086.tags.tsv.gz     sample_1095.tags.tsv.gz

% gstacks -P ./stacks -M ./popmap -t 8

Processing single-end or paired-end data, reference aligned.
Your directory of alignments should look similar to this, where sample reads have been aligned by a standard aligner, such as bwa, and sorted by coordinate:

% ls aligned/
sample_1020.bam  sample_1069.bam  sample_1086.bam  sample_1095.bam

% gstacks -I ./aligned -O ./stacks -M ./popmap -t 8

Given paired-end, single-digest data that is reference aligned, perhaps you want to exclude PCR duplicates.

% gstacks -I ./aligned -O ./stacks -M ./popmap --rm-pcr-duplicates -t 8
