Stacks: process_radtags

Program Options

process_radtags -p in_dir [-P] [-b barcode_file] -o out_dir -e enz [--threads num] [-c] [-q] [-r] [-t len]
process_radtags -f in_file [-b barcode_file] -o out_dir -e enz [--threads num] [-c] [-q] [-r] [-t len]
process_radtags --in-path in_dir [--paired] [--barcodes barcode_file] --out-path out_dir --renz-1 enz [--renz-2 enz] [--threads num] [-c] [-q] [-r] [-t len]
process_radtags -1 pair_1 -2 pair_2 [-b barcode_file] -o out_dir -e enz [--threads num] [--basename name] [-c] [-q] [-r] [-t len]

-p,--in-path — path to a directory of files.
-P,--paired — files contained within the directory are paired.
-I,--interleaved — specify that the paired-end reads are interleaved in single files.
-b,--barcodes — path to a file containing barcodes for this run, omit to ignore any barcoding.
-o,--out-path — path to output the processed files.
-f — path to the input file if processing single-end sequences.
-1 — first input file in a set of paired-end sequences.
-2 — second input file in a set of paired-end sequences.
--basename — specify the prefix of the output files when using -f or -1/-2.
--threads — number of threads to run.
-c,--clean — clean data, remove any read with an uncalled base.
-q,--quality — discard reads with low quality scores.
-r,--rescue — rescue barcodes and RAD-Tag cut sites.
-t,--truncate — truncate final read length to this value.
-D,--discards — capture discarded reads to a file.
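
To illustrate what rescuing a barcode with -r involves, the sketch below corrects an observed barcode to a known one when exactly one known barcode lies within a given number of mismatches (compare --barcode-dist-1 under Advanced options). It is a minimal Python sketch, not the code process_radtags runs: the function names, the example barcodes, and the choice to reject ambiguous matches are all assumptions made for illustration.

```python
from typing import List, Optional


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def rescue_barcode(observed: str, known: List[str], max_dist: int = 1) -> Optional[str]:
    """Return the one known barcode within max_dist mismatches, else None."""
    if observed in known:
        return observed
    hits = [bc for bc in known
            if len(bc) == len(observed) and hamming(observed, bc) <= max_dist]
    # Assumption: if several barcodes are equally plausible, give up
    # rather than guess.
    return hits[0] if len(hits) == 1 else None


# Hypothetical barcode list, for illustration only.
barcodes = ["ATCACG", "CGATGT", "TTAGGC"]
print(rescue_barcode("ATCACC", barcodes))  # one mismatch away from ATCACG
print(rescue_barcode("GGGGGG", barcodes))  # too far from every barcode
```

A larger allowed distance rescues more reads but needs barcodes that differ from each other at more positions, otherwise more observed barcodes become ambiguous.
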

Barcode options:

--inline-null: barcode is inline with sequence, occurs only on single-end read (default).
--index-null: barcode is provided in FASTQ header, occurs only on single-end read.
--inline-inline: barcode is inline with sequence, occurs on single and paired-end read.
--index-index: barcode is provided in FASTQ header, occurs on single and paired-end read.
--inline-index: barcode is inline with sequence on single-end read, occurs in FASTQ header for paired-end read.
--index-inline: barcode occurs in FASTQ header for single-end read, is inline with sequence on paired-end read.
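
The six barcode options follow one naming pattern: the first word describes where the barcode sits for the single-end read and the second word describes the paired-end read. The short Python sketch below spells that pattern out. The helper name barcode_layout and the lookup table are hypothetical and exist only to make the convention explicit.

```python
from typing import Optional, Tuple

# Meaning of each word in an option name, as described in the list above.
LOCATION = {
    "inline": "inline with sequence",
    "index": "FASTQ header",
    "null": None,  # no barcode on that read
}

OPTIONS = (
    "--inline-null", "--index-null", "--inline-inline",
    "--index-index", "--inline-index", "--index-inline",
)


def barcode_layout(option: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (single-end location, paired-end location) for an option."""
    if option not in OPTIONS:
        raise ValueError(f"unknown barcode option: {option}")
    single, paired = option[2:].split("-")
    return LOCATION[single], LOCATION[paired]


for opt in OPTIONS:
    print(opt, barcode_layout(opt))
```
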

Restriction enzyme options:

-e [enz], --renz-1 [enz]: provide the restriction enzyme used (cut site occurs on single-end read)
--renz-2 [enz]: if a double digest was used, provide the second restriction enzyme used (cut site occurs on the paired-end read).

Currently supported enzymes include:

'aciI', 'aclI', 'ageI', 'aluI', 'apaLI', 'apeKI', 'apoI', 'aseI', 'bamHI', 'bbvCI', 'bfaI', 'bfuCI', 'bgIII', 'bsaHI', 'bspDI', 'bstYI', 'btgI', 'cac8I', 'claI', 'csp6I', 'ddeI', 'dpnII', 'eaeI', 'ecoRI', 'ecoRV', 'ecoT22I', 'haeII', 'haeIII', 'hhaI', 'hinP1I', 'hindIII', 'hpaII', 'hpyCH4IV', 'kpnI', 'mluCI', 'mseI', 'mslI', 'mspI', 'ncoI', 'ndeI', 'ngoMIV', 'nheI', 'nlaIII', 'notI', 'nsiI', 'nspI', 'pacI', 'pspXI', 'pstI', 'pstIshi', 'rsaI', 'sacI', 'sau3AI', 'sbfI', 'sexAI', 'sgrAI', 'speI', 'sphI', 'taqI', 'xbaI', or 'xhoI'

Protocol-specific options:

--bestrad: library was generated using BestRAD, check for restriction enzyme on either read and potentially transpose reads.
--methylrad: library was generated using enzymatic methyl-seq (EM-seq) or bisulphite sequencing.

Adapter options:

--adapter-1 [sequence]: provide adapter sequence that may occur on the single-end read for filtering.
--adapter-2 [sequence]: provide adapter sequence that may occur on the paired-end read for filtering.
--adapter-mm [mismatches]: number of mismatches allowed in the adapter sequence.
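
To show what a mismatch allowance means for adapter filtering, the Python sketch below slides an adapter along a read and reports a hit when the overlapping bases differ at no more than a given number of positions. This is a simplified illustration only: the search that process_radtags performs internally may work differently, and the function name and the min_overlap parameter are assumptions that do not correspond to any program option.

```python
def contains_adapter(read: str, adapter: str,
                     max_mismatches: int = 0, min_overlap: int = 10) -> bool:
    """True if the adapter, or a prefix of it running off the end of the
    read, matches somewhere in the read within max_mismatches."""
    for start in range(len(read) - min_overlap + 1):
        # The adapter may be cut short by the end of the read.
        overlap = min(len(adapter), len(read) - start)
        window = read[start:start + overlap]
        mismatches = sum(r != a for r, a in zip(window, adapter[:overlap]))
        if mismatches <= max_mismatches:
            return True
    return False


# First 20 bases of the --adapter-1 sequence used in the examples on this page.
adapter = "GATCGGAAGAGCGGTTCAGC"
exact = "ACGTACGTAC" + adapter
one_error = "ACGTACGTAC" + "GATCGGAAGTGCGGTTCAGC"  # one substituted base
clean = "ACGT" * 7 + "AC"

print(contains_adapter(exact, adapter))                        # exact copy present
print(contains_adapter(one_error, adapter))                    # no exact copy
print(contains_adapter(one_error, adapter, max_mismatches=2))  # found with tolerance
print(contains_adapter(clean, adapter, max_mismatches=2))      # no adapter
```
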

Input/Output options:

-i,--in-type — input file type, either 'fastq', 'gzfastq' (gzipped fastq), 'bam', or 'bustard' (default: guess, or gzfastq if unable to).
-y,--out-type — output type, either 'fastq', 'gzfastq', 'fasta', or 'gzfasta' (default: match input type).
--retain-header: retain unmodified FASTQ headers in the output.
--merge: if no barcodes are specified, merge all input files into a single output file.

Advanced options:

--filter-illumina: discard reads that have been marked by Illumina’s chastity/purity filter as failing.
--disable-rad-check: disable checking if the RAD site is intact.
--force-poly-g-check: force a check for runs of poly-Gs (default: autodetect based on machine type in FASTQ header).
--disable-poly-g-check: disable checking for runs of poly-Gs (default: autodetect based on machine type in FASTQ header).
--encoding — specify how quality scores are encoded, 'phred33' (Illumina 1.8+/Sanger, default) or 'phred64' (Illumina 1.3-1.5).
--window-size — set the size of the sliding window as a fraction of the read length, between 0 and 1 (default 0.15).
--score-limit — set the phred score limit. If the average score within the sliding window drops below this value, the read is discarded (default 10).
--len-limit [limit]: specify a minimum sequence length (useful if your data has already been trimmed).
--barcode-dist-1: the number of allowed mismatches when rescuing single-end barcodes (default 1).
--barcode-dist-2: the number of allowed mismatches when rescuing paired-end barcodes (defaults to --barcode-dist-1).
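
The --encoding, --window-size and --score-limit options work together, and the following Python sketch shows one way to read them: quality characters are decoded with the chosen offset, a window covering a fraction of the read is slid along the scores, and the read is rejected as soon as a window average falls below the limit. This is an illustration under stated assumptions, not the program's source. In particular, the rounding of the window length and the function names are guesses.

```python
from typing import List

# ASCII offsets of the two encodings named by --encoding.
OFFSETS = {"phred33": 33, "phred64": 64}


def phred_scores(quality: str, encoding: str = "phred33") -> List[int]:
    """Decode a FASTQ quality string into integer Phred scores."""
    offset = OFFSETS[encoding]
    return [ord(ch) - offset for ch in quality]


def passes_sliding_window(quality: str, window_size: float = 0.15,
                          score_limit: int = 10,
                          encoding: str = "phred33") -> bool:
    """False if any window's mean score drops below score_limit."""
    scores = phred_scores(quality, encoding)
    if not scores:
        return False
    # Assumption: window length is the read length times the fraction,
    # rounded, and at least one base.
    win = max(1, round(len(scores) * window_size))
    window_sum = sum(scores[:win])
    if window_sum / win < score_limit:
        return False
    for i in range(win, len(scores)):
        # Slide by one base: add the new score, drop the oldest.
        window_sum += scores[i] - scores[i - win]
        if window_sum / win < score_limit:
            return False
    return True


good = "I" * 20                           # every base scores 40
one_bad = "I" * 10 + "#" + "I" * 9        # a single base scoring 2
run_bad = "I" * 10 + "###" + "I" * 7      # three low bases in a row

print(passes_sliding_window(good))
print(passes_sliding_window(one_bad))
print(passes_sliding_window(run_bad))
```

With a 20-base read and the default fraction the window spans three bases, so an isolated low-quality base is averaged out by its neighbours while a run of low-quality bases fails the read.
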

Example Usage

The process_radtags program is designed to work on several types of data. The latest versions of the Illumina analysis pipeline output all reads from the sequencer in a series of FASTQ formatted files. The FASTQ ID in these files contains a flag as to whether the read passed Illumina’s internal quality filters and may contain a barcode (or index).

If your data do not contain barcodes, simply omit the barcodes file, and process_radtags will place the filtered files in the output directory with the same name as the input files.

Illumina HiSeq Data

If your data are single-end, single- or double-digested Illumina HiSeq data, in a directory called raw:

~/raw% ls
lane3_NoIndex_L003_R1_001.fastq.gz  lane3_NoIndex_L003_R1_006.fastq.gz  lane3_NoIndex_L003_R1_011.fastq.gz
lane3_NoIndex_L003_R1_002.fastq.gz  lane3_NoIndex_L003_R1_007.fastq.gz  lane3_NoIndex_L003_R1_012.fastq.gz
lane3_NoIndex_L003_R1_003.fastq.gz  lane3_NoIndex_L003_R1_008.fastq.gz  lane3_NoIndex_L003_R1_013.fastq.gz
lane3_NoIndex_L003_R1_004.fastq.gz  lane3_NoIndex_L003_R1_009.fastq.gz
lane3_NoIndex_L003_R1_005.fastq.gz  lane3_NoIndex_L003_R1_010.fastq.gz

Then you can run process_radtags in the following way:

% process_radtags -p ./raw/ -o ./samples/ -b ./barcodes/barcodes_lane3 \
    -e sbfI -r -c -q

Note that if your data are double-digested, but only single-end reads were sequenced, then you do not need to specify the second restriction enzyme used.

If your data are paired-end, Illumina HiSeq data, in a directory called raw:

~/raw% ls
lane4_NoIndex_L004_R1_001.fastq.gz  lane4_NoIndex_L004_R1_009.fastq.gz  lane4_NoIndex_L004_R2_005.fastq.gz
lane4_NoIndex_L004_R1_002.fastq.gz  lane4_NoIndex_L004_R1_010.fastq.gz  lane4_NoIndex_L004_R2_006.fastq.gz
lane4_NoIndex_L004_R1_003.fastq.gz  lane4_NoIndex_L004_R1_011.fastq.gz  lane4_NoIndex_L004_R2_007.fastq.gz
lane4_NoIndex_L004_R1_004.fastq.gz  lane4_NoIndex_L004_R1_012.fastq.gz  lane4_NoIndex_L004_R2_008.fastq.gz
lane4_NoIndex_L004_R1_005.fastq.gz  lane4_NoIndex_L004_R2_001.fastq.gz  lane4_NoIndex_L004_R2_009.fastq.gz
lane4_NoIndex_L004_R1_006.fastq.gz  lane4_NoIndex_L004_R2_002.fastq.gz  lane4_NoIndex_L004_R2_010.fastq.gz
lane4_NoIndex_L004_R1_007.fastq.gz  lane4_NoIndex_L004_R2_003.fastq.gz  lane4_NoIndex_L004_R2_011.fastq.gz
lane4_NoIndex_L004_R1_008.fastq.gz  lane4_NoIndex_L004_R2_004.fastq.gz  lane4_NoIndex_L004_R2_012.fastq.gz

Then you simply add the -P flag. process_radtags understands the Illumina naming scheme and will figure out how to properly pair the files together:

% process_radtags -P -p ./raw/ -o ./samples/ -b ./barcodes/barcodes_lane4 \
    -e sbfI -r -c -q

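
If you want to check in advance which files would be treated as mates, the pairing implied by the Illumina naming scheme can be approximated by swapping the _R1_ field of each forward file for _R2_. The Python sketch below does only that. It is an approximation for inspection purposes and an assumption about the naming rule, not a copy of the matching logic inside process_radtags. The function name is hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# "_R1_" followed by the numeric chunk field, e.g. "_R1_001.fastq.gz".
_FORWARD = re.compile(r"_R1_(?=\d+\.)")


def pair_illumina_files(names: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (forward, reverse) pairs for files whose mate is present."""
    available = set(names)
    pairs = []
    for fwd in sorted(available):
        if not _FORWARD.search(fwd):
            continue  # not a forward-read file
        rev = _FORWARD.sub("_R2_", fwd, count=1)
        if rev in available:
            pairs.append((fwd, rev))
    return pairs


files = [
    "lane4_NoIndex_L004_R1_001.fastq.gz",
    "lane4_NoIndex_L004_R2_001.fastq.gz",
    "lane4_NoIndex_L004_R1_002.fastq.gz",
    "lane4_NoIndex_L004_R2_002.fastq.gz",
]
for fwd, rev in pair_illumina_files(files):
    print(fwd, rev)
```

A forward file whose mate is missing is simply left out of the result, which makes incomplete transfers easy to spot before a long run.
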
If your data are double-digested, paired-end, Illumina HiSeq data using combinatorial barcodes, in a directory called raw:

~/raw% ls
GfddRAD1_005_ATCACG_L007_R1_001.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_001.fastq.gz
GfddRAD1_005_ATCACG_L007_R1_002.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_002.fastq.gz
GfddRAD1_005_ATCACG_L007_R1_003.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_003.fastq.gz
GfddRAD1_005_ATCACG_L007_R1_004.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_004.fastq.gz
GfddRAD1_005_ATCACG_L007_R1_005.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_005.fastq.gz
GfddRAD1_005_ATCACG_L007_R1_006.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_006.fastq.gz
GfddRAD1_005_ATCACG_L007_R1_007.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_007.fastq.gz
GfddRAD1_005_ATCACG_L007_R1_008.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_008.fastq.gz
GfddRAD1_005_ATCACG_L007_R1_009.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_009.fastq.gz

Then you specify both restriction enzymes using the --renz-1 and --renz-2 flags. You must also specify the type of combinatorial barcoding used, such as inline/inline or inline/index, which tells process_radtags what kind of barcode to look for on the single-end and paired-end reads:

% process_radtags -P -p ./raw -b ./barcodes/barcodes_lane4 -o ./samples/ \
    -c -q -r --inline-index --renz-1 nlaIII --renz-2 mluCI

See below on how to format the barcodes file.

If your data may contain adapter sequence, and are Illumina HiSeq data, in a directory called raw:

~/raw% ls
lane4_NoIndex_L004_R1_001.fastq.gz  lane4_NoIndex_L004_R1_009.fastq.gz  lane4_NoIndex_L004_R2_005.fastq.gz
lane4_NoIndex_L004_R1_002.fastq.gz  lane4_NoIndex_L004_R1_010.fastq.gz  lane4_NoIndex_L004_R2_006.fastq.gz
lane4_NoIndex_L004_R1_003.fastq.gz  lane4_NoIndex_L004_R1_011.fastq.gz  lane4_NoIndex_L004_R2_007.fastq.gz
lane4_NoIndex_L004_R1_004.fastq.gz  lane4_NoIndex_L004_R1_012.fastq.gz  lane4_NoIndex_L004_R2_008.fastq.gz
lane4_NoIndex_L004_R1_005.fastq.gz  lane4_NoIndex_L004_R2_001.fastq.gz  lane4_NoIndex_L004_R2_009.fastq.gz
lane4_NoIndex_L004_R1_006.fastq.gz  lane4_NoIndex_L004_R2_002.fastq.gz  lane4_NoIndex_L004_R2_010.fastq.gz
lane4_NoIndex_L004_R1_007.fastq.gz  lane4_NoIndex_L004_R2_003.fastq.gz  lane4_NoIndex_L004_R2_011.fastq.gz
lane4_NoIndex_L004_R1_008.fastq.gz  lane4_NoIndex_L004_R2_004.fastq.gz  lane4_NoIndex_L004_R2_012.fastq.gz

Then you specify the adapter sequence you expect to be present on the single-end read, optionally the adapter sequence expected on the paired-end read, and the number of mismatches you want to allow in the adapter sequence (if any):

% process_radtags -P -p ./raw/ -o ./samples/ -b ./barcodes/barcodes_lane4 \
    -e sbfI -r -c -q \
    --adapter-1 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG \
    --adapter-2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT \
    --adapter-mm 2

Generic FASTQ Data

If your data are paired-end but don’t use the Illumina naming scheme, were renamed, or were downloaded from NCBI’s Short Read Archive, you can specify the pairs explicitly. If your data are in a directory called raw:

~/raw% ls
Raw_Rad_data_R1.fastq.gz  Raw_Rad_data_R2.fastq.gz

Then you use the -1 and -2 parameters to specify a pair of files. If you have multiple pairs of files, you can run process_radtags multiple times (using a shell script) and concatenate the outputs together (or you can concatenate the input files together as well).

% process_radtags -1 ./raw/Raw_Rad_data_R1.fastq.gz -2 ./raw/Raw_Rad_data_R2.fastq.gz \
    -o ./samples/ -b ./barcodes/barcodes -e sbfI -r -c -q

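
When you have several pairs of files, the repeated runs can be scripted. The Python sketch below builds one process_radtags command per pair using only the options shown above and prints the commands for review instead of running them. The file names, the run_NN output directories, and the function name are hypothetical. Giving each run its own output directory is a precaution so that the outputs of one run cannot overwrite those of another before you concatenate them.

```python
import shlex
from typing import List, Sequence, Tuple


def build_commands(pairs: Sequence[Tuple[str, str]], out_root: str,
                   barcodes: str, enzyme: str) -> List[List[str]]:
    """Return one process_radtags argument list per (R1, R2) pair."""
    commands = []
    for i, (r1, r2) in enumerate(pairs, start=1):
        # Hypothetical layout: one output directory per run.
        out_dir = f"{out_root.rstrip('/')}/run_{i:02d}"
        commands.append([
            "process_radtags", "-1", r1, "-2", r2,
            "-o", out_dir, "-b", barcodes, "-e", enzyme,
            "-r", "-c", "-q",
        ])
    return commands


# Hypothetical input files, for illustration only.
pairs = [
    ("./raw/libA_R1.fastq.gz", "./raw/libA_R2.fastq.gz"),
    ("./raw/libB_R1.fastq.gz", "./raw/libB_R2.fastq.gz"),
]
commands = build_commands(pairs, "./samples/", "./barcodes/barcodes", "sbfI")
for cmd in commands:
    print(shlex.join(cmd))  # dry run: show the command only
```

To execute the commands instead of printing them, each argument list can be passed to subprocess.run(cmd, check=True) from the Python standard library.
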
If your data are single-end but don’t use the Illumina naming scheme, were renamed, or were downloaded from NCBI’s Short Read Archive, you can specify the single file explicitly. If the file is in a directory called raw:

~/raw% ls
rad_data.fq.gz

Then you use the -f parameter.

% process_radtags -f ./raw/rad_data.fq.gz -o ./samples/ -b ./barcodes/barcodes -e sbfI -r -c -q

By default, these generic input options (-f for single-end, and -1 and -2 for paired-end reads) use the name of the input file(s) as the name prefix of all the output files and logs. The user can provide a specific prefix for all output files using the --basename option.
