Stacks

gstacks

The gstacks program will examine a RAD data set one locus at a time, looking at all individuals in the metapopulation for that locus.

For de novo analyses, the gstacks program starts from the results of the core single-end pipeline (ustacks→cstacks→sstacks→tsv2bam). It incorporates the paired-end reads (if available) fetched by tsv2bam, assembles them into a contig, merges the contig with the single-end locus, and aligns the reads from individual samples to the locus.

For reference-aligned analyses, the gstacks program is the first program executed. Using a sliding window algorithm, it creates loci from single- or paired-end reads that have been aligned to the reference genome and sorted.

In either mode, gstacks will identify SNPs within the metapopulation for each locus and then genotype each individual at each identified SNP. Once SNPs have been identified and genotyped, gstacks will phase the SNPs at each locus, in each individual, into a set of haplotypes.

The gstacks program is able to remove PCR duplicates (pairs of reads with identical insert lengths) if requested.
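
As a rough illustration of the idea only, and not of how gstacks implements it, the sketch below collapses read pairs that share a sample, a locus, and an insert length in a toy tab-separated table. The three-column layout and the function name dedup_pairs are assumptions made for this example; gstacks itself works on BAM records.

```shell
#!/bin/sh
# Toy illustration: keep the first read pair seen for each
# (sample, locus, insert length) combination and drop the rest.
# Input: tab-separated lines "sample<TAB>locus<TAB>insert_length"
# (a hypothetical layout, not a Stacks file format).
dedup_pairs() {
    awk -F '\t' '!seen[$1 FS $2 FS $3]++' "$@"
}
```

The function reads from the files given as arguments, or from standard input when none are given, for example: printf 'sample_1020\t1\t312\nsample_1020\t1\t312\n' | dedup_pairs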

The gstacks program will output two major files: catalog.fa.gz, which contains the consensus sequence for each assembled locus in the data, and catalog.calls, a custom file that contains genotyping data. These files are intended to be read by the populations program, which can apply appropriate filters and export the data.
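
A quick sanity check after a run is to count the loci in the catalog. The sketch below assumes that catalog.fa.gz is an ordinary gzip-compressed FASTA file with one '>' header line per locus; the function name count_catalog_loci is hypothetical.

```shell
#!/bin/sh
# Count the loci in a gstacks catalog, assuming it is a
# gzip-compressed FASTA file with one '>' header line per locus.
count_catalog_loci() {
    gzip -dc "$1" | grep -c '^>'
}
```

For example: count_catalog_loci ./stacks/catalog.fa.gz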

Program Options

De novo mode:

gstacks -P stacks_dir -M popmap

Reference-based mode:

gstacks -I bam_dir -M popmap [-S suffix] -O out_dir
gstacks -B bam_file [-B ...] -O out_dir

For both modes:

SNP Model options:

Paired-end options:

Advanced options: (De novo mode)

Advanced options: (Reference-based mode)

Example Usage

  1. Processing single-end or paired-end data, de novo.

    Your Stacks directory should look similar to this, where the tags/snps/alleles/matches files were produced by the core pipeline (ustacks/cstacks/sstacks) and the matches.bam files were produced by tsv2bam:

    % ls stacks/
    sample_1020.alleles.tsv.gz  sample_1069.alleles.tsv.gz  sample_1086.alleles.tsv.gz  sample_1095.alleles.tsv.gz
    sample_1020.matches.tsv.gz  sample_1069.matches.tsv.gz  sample_1086.matches.tsv.gz  sample_1095.matches.tsv.gz
    sample_1020.matches.bam     sample_1069.matches.bam     sample_1086.matches.bam     sample_1095.matches.bam
    sample_1020.snps.tsv.gz     sample_1069.snps.tsv.gz     sample_1086.snps.tsv.gz     sample_1095.snps.tsv.gz
    sample_1020.tags.tsv.gz     sample_1069.tags.tsv.gz     sample_1086.tags.tsv.gz     sample_1095.tags.tsv.gz

    % gstacks -P ./stacks -M ./popmap -t 8

  2. Processing single-end or paired-end data, reference aligned.

    Your alignments directory should look similar to this, where the sample reads have been aligned with a standard aligner, such as bwa, and then sorted:

    % ls aligned/
    sample_1020.bam  sample_1069.bam  sample_1086.bam  sample_1095.bam

    % gstacks -I ./aligned -O ./stacks -M ./popmap -t 8

  3. Given paired-end, single-digest data that is reference aligned, perhaps you want to exclude PCR duplicates.

    % gstacks -I ./aligned -O ./stacks -M ./popmap --rm-pcr-duplicates -t 8
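
Before launching any of the runs above, it can help to confirm that every sample in the population map has its input file in place. The following is a minimal sketch, assuming the population map is tab-separated with the sample name in the first column and that input files are named by appending a suffix to the sample name, as in the listings above. The function name missing_inputs is hypothetical and is not part of Stacks.

```shell
#!/bin/sh
# Print the expected input files that are absent from a directory.
# Assumes a tab-separated population map whose first column is the sample name.
# Arguments: directory, population map, file suffix
#   (".matches.bam" for de novo mode, ".bam" for reference-based mode).
missing_inputs() {
    dir=$1; popmap=$2; suffix=$3
    cut -f 1 "$popmap" | while IFS= read -r sample; do
        # Skip blank lines.
        [ -n "$sample" ] || continue
        [ -e "$dir/$sample$suffix" ] || printf '%s\n' "$dir/$sample$suffix"
    done
}
```

For example, "missing_inputs ./stacks ./popmap .matches.bam" covers the de novo case and "missing_inputs ./aligned ./popmap .bam" the reference-aligned case. The function prints nothing when every file is present.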
