Stacks

stacks-dist-extract

The stacks-dist-extract script exports a particular section of a Stacks log or distribs file, either for easy viewing (e.g., using the --pretty option) or for plotting. If you supply a log path alone, stacks-dist-extract prints the names of the available sections. The log file can also be supplied via stdin, in which case the --section option is required.

The Stacks component programs tend to output two types of files: *.log files and *.distribs files. These are all plain text files and can therefore be viewed using standard UNIX tools (e.g., less, more, or cat), but they can be large and can contain a number of different data sets of interest. stacks-dist-extract makes it easy to pull out particular data sets.

Program Options

    stacks-dist-extract logfile [section]
    stacks-dist-extract [--pretty] [--out-path path] logfile [section]
    cat logfile | stacks-dist-extract [--pretty] --section section

Example Usage

  1. If we want to know what sections are available for extraction, we can run the script on a *.log or *.distribs file without specifying a section, and the script will tell us our options:

    % stacks-dist-extract ./stacks/population_r80/populations.log.distribs
    batch_progress
    samples_per_loc_prefilters
    missing_samples_per_loc_prefilters
    snps_per_loc_prefilters
    samples_per_loc_postfilters
    missing_samples_per_loc_postfilters
    snps_per_loc_postfilters
    ...

  2. We can then select a distribution to view:

    % stacks-dist-extract ./stacks/population_r80/populations.log.distribs samples_per_loc_prefilters
    # Distribution of valid samples matched to a catalog locus prior to filtering.
    n_samples	n_loci
    1	810
    2	362
    3	224
    4	213
    5	202
    6	175
    7	224
    8	542
    9	46792
    10	49961

  3. Here is another example from the gstacks distribs file:

    % stacks-dist-extract ./stacks/gstacks.log.distribs
    bam_stats_per_sample
    effective_coverages_per_sample
    phasing_rates_per_sample

  4. If we type enough of a section heading to make it unique, the script will print the results:

    % stacks-dist-extract ./stacks/gstacks.log.distribs bam_stats
    sample	records	primary_kept	kept_frac	primary_kept_read2	primary_disc_mapq	primary_disc_sclip	unmapped	secondary	supplementary
    S1_2023.01	2780637	2515438	0.905	1195103	26801	98337	80108	0	59953
    S1_2023.07	3156646	2860191	0.906	1359700	27987	110763	89513	0	68192
    S2_1999.13	2835542	2574684	0.908	1225169	25379	96962	81343	0	57174
    ...

  5. If we add the --pretty flag, the script lines up the columns for easier reading, but it also removes the tabs, so the output is not useful for plotting (e.g., in R):

    % stacks-dist-extract ./stacks/gstacks.log.distribs --pretty bam_stats
    sample      records  primary_kept  kept_frac  primary_kept_read2  primary_disc_mapq  primary_disc_sclip  unmapped  secondary  supplementary
    S1_2023.01  2780637  2515438       0.905      1195103             26801              98337               80108     0          59953
    S1_2023.07  3156646  2860191       0.906      1359700             27987              110763              89513     0          68192
    S2_1999.13  2835542  2574684       0.908      1225169             25379              96962               81343     0          57174
    ...

  6. Lastly, we can use the program in a UNIX pipeline, but then we need to specify the --section option:

    % cat ./stacks/gstacks.log.distribs | stacks-dist-extract --section bam_stats_per_sample

  7. Here is an example of a UNIX pipeline that will compute the average number of records for all samples:

    % cat ./stacks/gstacks.log.distribs | stacks-dist-extract --section bam_stats_per_sample | tail -n +2 | cut -f 2 | awk '{s+=$1} END {print s/NR}'
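
To see what the pipeline in example 7 does without a Stacks run at hand, here is a minimal sketch that applies the same steps to the three rows shown in example 4. The sample_section function is a hypothetical stand-in for stacks-dist-extract, and the sketch assumes tab-separated output with a single header line. Because those rows are only an excerpt, the result is the mean of three values, not of the full data set. The second command is a variation that finds the records column by its header name rather than by its position.

```shell
# Hypothetical stand-in for `stacks-dist-extract --section bam_stats_per_sample`:
# prints a header plus the three excerpt rows (first three columns only),
# separated by tabs.
sample_section() {
    printf 'sample\trecords\tprimary_kept\n'
    printf 'S1_2023.01\t2780637\t2515438\n'
    printf 'S1_2023.07\t3156646\t2860191\n'
    printf 'S2_1999.13\t2835542\t2574684\n'
}

# Same steps as example 7: drop the header line, keep column 2, average it.
avg=$(sample_section | tail -n +2 | cut -f 2 | awk '{s+=$1} END {print s/NR}')
echo "$avg"

# Variation: locate the "records" column by name in the header line,
# then average that column over the remaining (NR - 1) data lines.
avg_by_name=$(sample_section | awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "records") c = i; next }
    { s += $c }
    END { print s / (NR - 1) }')
echo "$avg_by_name"
```

Both commands print 2924275 for these three rows. Looking the column up by name keeps the calculation correct if the column order ever differs, at the cost of a slightly longer awk program.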
