Stacks

process_radtags

This program examines raw reads from an Illumina sequencing run. First, it checks that the barcode and the RAD cutsite are intact and demultiplexes the data. If there are errors in the barcode or the RAD site within a certain allowance, process_radtags can correct them. Second, it slides a window down the length of the read and checks the average quality score within the window. If the score drops below a 90% probability of being correct (a raw phred score of 10), the read is discarded. This allows for some sequencing errors while eliminating reads where the sequence is degrading as it is being sequenced. By default the sliding window is 15% of the length of the read; both the window size and the score threshold can be adjusted on the command line. By default, process_radtags also looks for poly-G runs at the 3’ end of the reads, which are associated with synthesis termination in two-channel sequencing platforms. Reads with excess poly-G runs are discarded.
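The sliding-window check described above can be illustrated with a small shell sketch. This is not the Stacks implementation: the function name passes_window_filter is hypothetical, the quality string is assumed to be Phred+33 encoded, and the window size is given as an integer percentage so that only integer arithmetic is needed.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Stacks.
# Returns 0 (keep the read) or 1 (discard the read).
#   $1 = FASTQ quality string, assumed Phred+33
#   $2 = window size as a percentage of read length (default 15)
#   $3 = minimum mean phred score within a window (default 10)
passes_window_filter() {
    local qual=$1 pct=${2:-15} min=${3:-10}
    local len=${#qual}
    local win=$(( len * pct / 100 ))
    if (( win < 1 )); then win=1; fi

    # Convert each quality character to its phred score.
    local -a score
    local i code
    for (( i = 0; i < len; i++ )); do
        printf -v code '%d' "'${qual:i:1}"   # ASCII code of the character
        score[i]=$(( code - 33 ))
    done

    # Keep a running sum over the last $win bases. Comparing the sum with
    # min * win is the same as comparing the window mean with min.
    local sum=0
    for (( i = 0; i < len; i++ )); do
        sum=$(( sum + score[i] ))
        if (( i >= win )); then
            sum=$(( sum - score[i - win] ))
        fi
        if (( i >= win - 1 && sum < min * win )); then
            return 1
        fi
    done
    return 0
}
```

With a 20-base read the window is 3 bases wide, so a single low-quality base surrounded by high-quality bases is tolerated, while a run of low-quality bases at the end of the read causes the read to be rejected.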

The process_radtags program can:

Below you will find additional information on how to:

  1. Run process_radtags with Illumina HiSeq data.
  2. Run process_radtags with generic FASTQ data.

Program Options

process_radtags -p in_dir [-P] [-b barcode_file] -o out_dir -e enz [--threads num] [-c] [-q] [-r] [-t len]
process_radtags -f in_file [-b barcode_file] -o out_dir -e enz [--threads num] [-c] [-q] [-r] [-t len]
process_radtags --in-path in_dir [--paired] [--barcodes barcode_file] --out-path out_dir --renz-1 enz [--renz-2 enz] [--threads num] [-c] [-q] [-r] [-t len]
process_radtags -1 pair_1 -2 pair_2 [-b barcode_file] -o out_dir -e enz [--threads num] [--basename name] [-c] [-q] [-r] [-t len]

Barcode options:

Restriction enzyme options:

Protocol-specific options:

Adapter options:

Input/Output options:

Advanced options:

Example Usage

The process_radtags program is designed to work on several types of data. The latest versions of the Illumina analysis pipeline output all reads from the sequencer in a series of FASTQ formatted files. The FASTQ ID in these files contains a flag indicating whether the read passed Illumina’s internal quality filters and may contain a barcode (or index).

If your data do not contain barcodes, simply omit the barcodes file, and process_radtags will place the filtered files in the output directory with the same name as the input files.

Illumina HiSeq Data

  1. If your data are single-end, single- or double-digested Illumina HiSeq data, in a directory called raw:

    ~/raw% ls
    lane3_NoIndex_L003_R1_001.fastq.gz  lane3_NoIndex_L003_R1_006.fastq.gz  lane3_NoIndex_L003_R1_011.fastq.gz
    lane3_NoIndex_L003_R1_002.fastq.gz  lane3_NoIndex_L003_R1_007.fastq.gz  lane3_NoIndex_L003_R1_012.fastq.gz
    lane3_NoIndex_L003_R1_003.fastq.gz  lane3_NoIndex_L003_R1_008.fastq.gz  lane3_NoIndex_L003_R1_013.fastq.gz
    lane3_NoIndex_L003_R1_004.fastq.gz  lane3_NoIndex_L003_R1_009.fastq.gz
    lane3_NoIndex_L003_R1_005.fastq.gz  lane3_NoIndex_L003_R1_010.fastq.gz

    Then you can run process_radtags in the following way:

    % process_radtags -p ./raw/ -o ./samples/ -b ./barcodes/barcodes_lane3 \
                      -e sbfI -r -c -q

    Note that if your data are double-digested, but only single-end reads were sequenced, then you do not need to specify the second restriction enzyme used.

  2. If your data are paired-end, Illumina HiSeq data, in a directory called raw:

    ~/raw% ls
    lane4_NoIndex_L004_R1_001.fastq.gz  lane4_NoIndex_L004_R1_009.fastq.gz  lane4_NoIndex_L004_R2_005.fastq.gz
    lane4_NoIndex_L004_R1_002.fastq.gz  lane4_NoIndex_L004_R1_010.fastq.gz  lane4_NoIndex_L004_R2_006.fastq.gz
    lane4_NoIndex_L004_R1_003.fastq.gz  lane4_NoIndex_L004_R1_011.fastq.gz  lane4_NoIndex_L004_R2_007.fastq.gz
    lane4_NoIndex_L004_R1_004.fastq.gz  lane4_NoIndex_L004_R1_012.fastq.gz  lane4_NoIndex_L004_R2_008.fastq.gz
    lane4_NoIndex_L004_R1_005.fastq.gz  lane4_NoIndex_L004_R2_001.fastq.gz  lane4_NoIndex_L004_R2_009.fastq.gz
    lane4_NoIndex_L004_R1_006.fastq.gz  lane4_NoIndex_L004_R2_002.fastq.gz  lane4_NoIndex_L004_R2_010.fastq.gz
    lane4_NoIndex_L004_R1_007.fastq.gz  lane4_NoIndex_L004_R2_003.fastq.gz  lane4_NoIndex_L004_R2_011.fastq.gz
    lane4_NoIndex_L004_R1_008.fastq.gz  lane4_NoIndex_L004_R2_004.fastq.gz  lane4_NoIndex_L004_R2_012.fastq.gz

    Then you simply add the -P flag. process_radtags understands the Illumina naming scheme and will figure out how to properly pair the files together:

    % process_radtags -P -p ./raw/ -o ./samples/ -b ./barcodes/barcodes_lane4 \
                      -e sbfI -r -c -q

  3. If your data are double-digested, paired-end, Illumina HiSeq data using combinatorial barcodes, in a directory called raw:

    ~/raw% ls
    GfddRAD1_005_ATCACG_L007_R1_001.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_001.fastq.gz
    GfddRAD1_005_ATCACG_L007_R1_002.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_002.fastq.gz
    GfddRAD1_005_ATCACG_L007_R1_003.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_003.fastq.gz
    GfddRAD1_005_ATCACG_L007_R1_004.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_004.fastq.gz
    GfddRAD1_005_ATCACG_L007_R1_005.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_005.fastq.gz
    GfddRAD1_005_ATCACG_L007_R1_006.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_006.fastq.gz
    GfddRAD1_005_ATCACG_L007_R1_007.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_007.fastq.gz
    GfddRAD1_005_ATCACG_L007_R1_008.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_008.fastq.gz
    GfddRAD1_005_ATCACG_L007_R1_009.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_009.fastq.gz

    Then you specify both restriction enzymes using the --renz-1 and --renz-2 flags. You must also specify the type of combinatorial barcoding used, such as inline/inline or inline/index, which tells process_radtags what type of barcode to look for on the single-end and paired-end reads:

    % process_radtags -P -p ./raw -b ./barcodes/barcodes_lane4 -o ./samples/ \
                      -c -q -r --inline-index --renz-1 nlaIII --renz-2 mluCI

    See below on how to format the barcodes file.

  4. If your data may contain adapter sequence, and are Illumina HiSeq data, in a directory called raw:

    ~/raw% ls
    lane4_NoIndex_L004_R1_001.fastq.gz  lane4_NoIndex_L004_R1_009.fastq.gz  lane4_NoIndex_L004_R2_005.fastq.gz
    lane4_NoIndex_L004_R1_002.fastq.gz  lane4_NoIndex_L004_R1_010.fastq.gz  lane4_NoIndex_L004_R2_006.fastq.gz
    lane4_NoIndex_L004_R1_003.fastq.gz  lane4_NoIndex_L004_R1_011.fastq.gz  lane4_NoIndex_L004_R2_007.fastq.gz
    lane4_NoIndex_L004_R1_004.fastq.gz  lane4_NoIndex_L004_R1_012.fastq.gz  lane4_NoIndex_L004_R2_008.fastq.gz
    lane4_NoIndex_L004_R1_005.fastq.gz  lane4_NoIndex_L004_R2_001.fastq.gz  lane4_NoIndex_L004_R2_009.fastq.gz
    lane4_NoIndex_L004_R1_006.fastq.gz  lane4_NoIndex_L004_R2_002.fastq.gz  lane4_NoIndex_L004_R2_010.fastq.gz
    lane4_NoIndex_L004_R1_007.fastq.gz  lane4_NoIndex_L004_R2_003.fastq.gz  lane4_NoIndex_L004_R2_011.fastq.gz
    lane4_NoIndex_L004_R1_008.fastq.gz  lane4_NoIndex_L004_R2_004.fastq.gz  lane4_NoIndex_L004_R2_012.fastq.gz

    Then you specify the adapter sequence you expect to be present in the forward read, optionally the adapter sequence expected to be present on the paired-end read, and the number of mismatches you want to allow in the adapter sequence (if any):

    % process_radtags -P -p ./raw/ -o ./samples/ -b ./barcodes/barcodes_lane4 \
                      -e sbfI -r -c -q \
                      --adapter-1 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG \
                      --adapter-2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT \
                      --adapter-mm 2
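Because the paired-end examples above rely on the Illumina naming scheme to match forward and reverse files, it can be useful to confirm that every R1 file has an R2 mate before starting a long run. The sketch below is a hypothetical pre-flight check, not part of Stacks, and it does not reproduce how process_radtags pairs files internally. It assumes that mates differ only in the _R1_ / _R2_ part of the file name and that files end in .fastq.gz, as in the listings above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check, not part of Stacks.

# Print the expected mate name for an R1 file name by swapping
# the first "_R1_" for "_R2_" (an assumption about the naming scheme).
mate_of() {
    printf '%s\n' "${1/_R1_/_R2_}"
}

# Print every R1 file in directory $1 whose R2 mate is missing.
missing_mates() {
    local dir=$1 r1 base
    for r1 in "$dir"/*_R1_*.fastq.gz; do
        [ -e "$r1" ] || continue              # the glob matched nothing
        base=${r1##*/}                        # strip the directory part
        if [ ! -e "$dir/$(mate_of "$base")" ]; then
            printf '%s\n' "$r1"
        fi
    done
    return 0
}
```

Running missing_mates ./raw prints nothing when every forward file has a mate; any file name it prints is worth checking before using the -P flag.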

Generic FASTQ Data

  1. If your data are paired-end but don’t use the Illumina naming scheme, were renamed, or were downloaded from NCBI’s Short Read Archive, you can specify the pairs explicitly. If your data are in a directory called raw:

    ~/raw% ls
    Raw_Rad_data_R1.fastq.gz
    Raw_Rad_data_R2.fastq.gz

    Then you use the -1 and -2 parameters to specify a pair of files. If you have multiple pairs of files, you can run process_radtags multiple times (using a shell script) and concatenate the outputs together (or you can concatenate the input files together as well).

    % process_radtags -1 ./raw/Raw_Rad_data_R1.fastq.gz -2 ./raw/Raw_Rad_data_R2.fastq.gz \
                      -o ./samples/ -b ./barcodes/barcodes -e sbfI -r -c -q

  2. If your data are single-end but don’t use the Illumina naming scheme, were renamed, or were downloaded from NCBI’s Short Read Archive, you can specify the single file explicitly. If the file is in a directory called raw:

    ~/raw% ls
    rad_data.fq.gz

    Then you use the -f parameter.

    % process_radtags -f ./raw/rad_data.fq.gz -o ./samples/ -b ./barcodes/barcodes -e sbfI -r -c -q

  3. By default, these generic input options (-f for single-end, and -1 and -2 for paired-end reads) use the name of the input file(s) as the name prefix of all the output files and logs. The user can provide a specific prefix for all output files using the --basename option.
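The multi-pair approach mentioned in item 1 (running process_radtags once per pair from a shell script) might be sketched as follows. Everything here other than the process_radtags options already shown above is an assumption made for illustration: the function name plan_runs, the tab-separated list of pairs, and the choice to give each run its own output directory so that runs sharing a barcodes file do not write to the same file names. The function only prints the commands, so they can be reviewed before anything is executed.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper, not part of Stacks.
# Input list ($1): one pair per line, tab-separated:
#   run_name <TAB> forward_file <TAB> reverse_file
# Prints one process_radtags command per pair, each writing to
# its own directory under the output root ($2).
plan_runs() {
    local list=$1 out_root=$2 barcodes=$3 enz=$4
    local name r1 r2
    while IFS=$'\t' read -r name r1 r2 || [ -n "$name" ]; do
        [ -n "$name" ] || continue            # skip blank lines
        printf 'process_radtags -1 %q -2 %q -o %q -b %q -e %q -r -c -q\n' \
            "$r1" "$r2" "$out_root/$name" "$barcodes" "$enz"
    done < "$list"
}
```

For example, plan_runs pairs.tsv ./samples ./barcodes/barcodes sbfI prints one command per line of pairs.tsv. It is safest to create the per-run output directories (for instance with mkdir -p) before executing the printed commands, and the per-sample outputs from the separate directories can then be concatenated.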
