This program examines raw reads from an Illumina sequencing run. First, it checks that the barcode and the RAD cut site are intact and demultiplexes the data. If there are errors in the barcode or the RAD site, process_radtags can correct them within a certain allowance. Second, it slides a window down the length of the read and checks the average quality score within the window. If the score drops below a 90% probability of being correct (a raw phred score of 10), the read is discarded. This allows for some sequencing errors while eliminating reads where the sequence is degrading as it is being sequenced. By default, the sliding window is 15% of the length of the read; both the score threshold and the window size can be adjusted on the command line. By default, process_radtags also looks for poly-G runs at the 3’ end of the reads, which are associated with synthesis termination on two-channel sequencing platforms. Reads with excess poly-G runs are discarded.
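The sliding-window check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the process_radtags implementation: the function names, the Phred+33 quality encoding, and the way the window length is rounded are all assumptions made for this example.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into phred scores (Phred+33 is assumed)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(score):
    """Probability that a base call is wrong for a given phred score."""
    return 10 ** (-score / 10)


def passes_sliding_window(quality, window_frac=0.15, min_score=10, offset=33):
    """Return True if no window along the read has a mean score below min_score.

    window_frac=0.15 and min_score=10 mirror the defaults described in the text;
    the rounding of the window length is an assumption of this sketch.
    """
    scores = phred_scores(quality, offset)
    if not scores:
        return False
    win = max(1, round(len(scores) * window_frac))
    win = min(win, len(scores))

    # Running sum so each window is evaluated in constant time.
    total = sum(scores[:win])
    if total / win < min_score:
        return False
    for i in range(win, len(scores)):
        total += scores[i] - scores[i - win]
        if total / win < min_score:
            return False
    return True
```

A phred score of 10 corresponds to an error probability of 0.1, that is, a 90% chance the base call is correct. A read with a single poor base in the middle still passes, because the window average stays high, while a read whose quality collapses toward the 3’ end is rejected.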
The process_radtags program can:
Below you will find additional information on how to:
The process_radtags program is designed to work on several types of data. The latest versions of the Illumina analysis pipeline output all reads from the sequencer in a series of FASTQ formatted files. The FASTQ ID in these files contains a flag indicating whether the read passed Illumina’s internal quality filters and may contain a barcode (or index).
If your data do not contain barcodes, simply omit the barcodes file, and process_radtags will place the filtered files in the output directory with the same name as the input files.
~/raw% ls
lane3_NoIndex_L003_R1_001.fastq.gz  lane3_NoIndex_L003_R1_006.fastq.gz  lane3_NoIndex_L003_R1_011.fastq.gz
lane3_NoIndex_L003_R1_002.fastq.gz  lane3_NoIndex_L003_R1_007.fastq.gz  lane3_NoIndex_L003_R1_012.fastq.gz
lane3_NoIndex_L003_R1_003.fastq.gz  lane3_NoIndex_L003_R1_008.fastq.gz  lane3_NoIndex_L003_R1_013.fastq.gz
lane3_NoIndex_L003_R1_004.fastq.gz  lane3_NoIndex_L003_R1_009.fastq.gz
lane3_NoIndex_L003_R1_005.fastq.gz  lane3_NoIndex_L003_R1_010.fastq.gz
Then you can run process_radtags in the following way:
% process_radtags -p ./raw/ -o ./samples/ -b ./barcodes/barcodes_lane3 \
    -e sbfI -r -c -q
Note that if your data are double-digested, but only single-end reads were sequenced, then you do not need to specify the second restriction enzyme used.
~/raw% ls
lane4_NoIndex_L004_R1_001.fastq.gz  lane4_NoIndex_L004_R1_009.fastq.gz  lane4_NoIndex_L004_R2_005.fastq.gz
lane4_NoIndex_L004_R1_002.fastq.gz  lane4_NoIndex_L004_R1_010.fastq.gz  lane4_NoIndex_L004_R2_006.fastq.gz
lane4_NoIndex_L004_R1_003.fastq.gz  lane4_NoIndex_L004_R1_011.fastq.gz  lane4_NoIndex_L004_R2_007.fastq.gz
lane4_NoIndex_L004_R1_004.fastq.gz  lane4_NoIndex_L004_R1_012.fastq.gz  lane4_NoIndex_L004_R2_008.fastq.gz
lane4_NoIndex_L004_R1_005.fastq.gz  lane4_NoIndex_L004_R2_001.fastq.gz  lane4_NoIndex_L004_R2_009.fastq.gz
lane4_NoIndex_L004_R1_006.fastq.gz  lane4_NoIndex_L004_R2_002.fastq.gz  lane4_NoIndex_L004_R2_010.fastq.gz
lane4_NoIndex_L004_R1_007.fastq.gz  lane4_NoIndex_L004_R2_003.fastq.gz  lane4_NoIndex_L004_R2_011.fastq.gz
lane4_NoIndex_L004_R1_008.fastq.gz  lane4_NoIndex_L004_R2_004.fastq.gz  lane4_NoIndex_L004_R2_012.fastq.gz
Then you simply add the -P flag. process_radtags understands the Illumina naming scheme and will figure out how to properly pair the files together:
% process_radtags -P -p ./raw/ -o ./samples/ -b ./barcodes/barcodes_lane4 \
    -e sbfI -r -c -q
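To make the pairing idea concrete, here is a minimal Python sketch of how files following the Illumina naming scheme shown above could be matched by swapping the R1 and R2 fields. It is an illustration only and does not reproduce how process_radtags pairs files internally; the regular expression and the function name pair_illumina_files are hypothetical.

```python
import re

# Assumed pattern: <prefix>_R1_<chunk>.<ext> pairs with <prefix>_R2_<chunk>.<ext>
READ_FILE = re.compile(
    r"^(?P<prefix>.+)_R(?P<read>[12])_(?P<chunk>\d+)(?P<ext>\..+)$"
)


def pair_illumina_files(names):
    """Return (pairs, unpaired) for a collection of file names.

    pairs is a list of (R1 name, R2 name) tuples; unpaired lists every file
    whose mate is missing or whose name does not fit the assumed pattern.
    """
    present = set(names)
    pairs, unpaired = [], []
    for name in sorted(present):
        m = READ_FILE.match(name)
        if m is None:
            unpaired.append(name)
            continue
        mate_read = "2" if m["read"] == "1" else "1"
        mate = f"{m['prefix']}_R{mate_read}_{m['chunk']}{m['ext']}"
        if mate not in present:
            unpaired.append(name)
        elif m["read"] == "1":
            # Record each pair once, from the R1 side.
            pairs.append((name, mate))
    return pairs, unpaired
```

Running a check like this before invoking process_radtags with -P can reveal a missing R2 file early, rather than partway through a long run.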
~/raw% ls
GfddRAD1_005_ATCACG_L007_R1_001.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_001.fastq.gz
GfddRAD1_005_ATCACG_L007_R1_002.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_002.fastq.gz
GfddRAD1_005_ATCACG_L007_R1_003.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_003.fastq.gz
GfddRAD1_005_ATCACG_L007_R1_004.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_004.fastq.gz
GfddRAD1_005_ATCACG_L007_R1_005.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_005.fastq.gz
GfddRAD1_005_ATCACG_L007_R1_006.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_006.fastq.gz
GfddRAD1_005_ATCACG_L007_R1_007.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_007.fastq.gz
GfddRAD1_005_ATCACG_L007_R1_008.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_008.fastq.gz
GfddRAD1_005_ATCACG_L007_R1_009.fastq.gz  GfddRAD1_005_ATCACG_L007_R2_009.fastq.gz
Then you specify both restriction enzymes using the --renz-1 and --renz-2 flags. You must also specify the type of combinatorial barcoding used, such as inline/inline or inline/index, which tells process_radtags what type of barcode to look for on the single-end and paired-end reads:
% process_radtags -P -p ./raw -b ./barcodes/barcodes_lane4 -o ./samples/ \
    -c -q -r --inline-index --renz-1 nlaIII --renz-2 mluCI
See below on how to format the barcodes file.
~/raw% ls
lane4_NoIndex_L004_R1_001.fastq.gz  lane4_NoIndex_L004_R1_009.fastq.gz  lane4_NoIndex_L004_R2_005.fastq.gz
lane4_NoIndex_L004_R1_002.fastq.gz  lane4_NoIndex_L004_R1_010.fastq.gz  lane4_NoIndex_L004_R2_006.fastq.gz
lane4_NoIndex_L004_R1_003.fastq.gz  lane4_NoIndex_L004_R1_011.fastq.gz  lane4_NoIndex_L004_R2_007.fastq.gz
lane4_NoIndex_L004_R1_004.fastq.gz  lane4_NoIndex_L004_R1_012.fastq.gz  lane4_NoIndex_L004_R2_008.fastq.gz
lane4_NoIndex_L004_R1_005.fastq.gz  lane4_NoIndex_L004_R2_001.fastq.gz  lane4_NoIndex_L004_R2_009.fastq.gz
lane4_NoIndex_L004_R1_006.fastq.gz  lane4_NoIndex_L004_R2_002.fastq.gz  lane4_NoIndex_L004_R2_010.fastq.gz
lane4_NoIndex_L004_R1_007.fastq.gz  lane4_NoIndex_L004_R2_003.fastq.gz  lane4_NoIndex_L004_R2_011.fastq.gz
lane4_NoIndex_L004_R1_008.fastq.gz  lane4_NoIndex_L004_R2_004.fastq.gz  lane4_NoIndex_L004_R2_012.fastq.gz
Then you specify the adapter sequence you expect to be present in the forward read, optionally the adapter sequence expected on the paired-end read, and the number of mismatches you want to allow in the adapter sequence (if any):
% process_radtags -P -p ./raw/ -o ./samples/ -b ./barcodes/barcodes_lane4 \
    -e sbfI -r -c -q \
    --adapter-1 GATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG \
    --adapter-2 AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT \
    --adapter-mm 2
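To illustrate what allowing mismatches in the adapter sequence means, the Python sketch below scans a read for an adapter while tolerating a fixed number of differing bases. This is a simple Hamming-distance scan chosen for clarity; process_radtags may search for adapters differently, and the function names and the min_overlap parameter are assumptions of this example.

```python
def mismatches(a, b):
    """Count differing positions over the overlapping length of two sequences."""
    return sum(x != y for x, y in zip(a, b))


def find_adapter(read, adapter, max_mismatches=2, min_overlap=8):
    """Return the first read position where the adapter matches, or -1.

    Near the 3' end of the read only a prefix of the adapter can be present,
    so a match must overlap by at least min_overlap bases (a hypothetical
    safeguard against spurious short matches).
    """
    for start in range(len(read) - min_overlap + 1):
        window = read[start:start + len(adapter)]
        if mismatches(window, adapter) <= max_mismatches:
            return start
    return -1
```

With max_mismatches=2, an adapter copy carrying one sequencing error is still detected, whereas max_mismatches=0 would require an exact match.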
~/raw% ls
Raw_Rad_data_R1.fastq.gz  Raw_Rad_data_R2.fastq.gz
Then you use the -1 and -2 parameters to specify a pair of files. If you have multiple pairs of files, you can run process_radtags multiple times (using a shell script) and concatenate the outputs together (or you can concatenate the input files together as well).
% process_radtags -1 ./raw/Raw_Rad_data_R1.fastq.gz -2 ./raw/Raw_Rad_data_R2.fastq.gz \
    -o ./samples/ -b ./barcodes/barcodes -e sbfI -r -c -q
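When there are several pairs of files, the repeated runs can be scripted. The text suggests a shell script; the sketch below does the equivalent in Python by building one process_radtags command per pair, using only the flags shown in the example above. The per-run output directories (run1, run2, ...) and the function names are hypothetical, chosen so that successive runs do not overwrite each other before the outputs are concatenated.

```python
import subprocess
from pathlib import Path


def build_commands(pairs, barcodes, enzyme, out_root):
    """Build one process_radtags argument list per (R1, R2) pair.

    Each run writes to its own subdirectory of out_root (an assumption of
    this sketch) so that the per-sample outputs can be concatenated later.
    """
    commands = []
    for i, (r1, r2) in enumerate(pairs, start=1):
        out_dir = Path(out_root) / f"run{i}"
        commands.append([
            "process_radtags",
            "-1", str(r1), "-2", str(r2),
            "-o", str(out_dir),
            "-b", str(barcodes),
            "-e", enzyme,
            "-r", "-c", "-q",
        ])
    return commands


def run_all(commands):
    """Create each output directory, then run the commands one after another."""
    for cmd in commands:
        out_dir = Path(cmd[cmd.index("-o") + 1])
        out_dir.mkdir(parents=True, exist_ok=True)
        subprocess.run(cmd, check=True)  # stop at the first failing run
```

Calling build_commands first and printing the result gives a dry run that can be inspected before anything is executed with run_all.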
~/raw% ls
rad_data.fq.gz
Then you use the -f parameter.
% process_radtags -f ./raw/rad_data.fq.gz -o ./samples/ -b ./barcodes/barcodes -e sbfI -r -c -q
By default, these generic input options (-f for single-end, and -1 and -2 for paired-end reads) use the name of the input file(s) as the name prefix of all the output files and logs. The user can provide a specific prefix for all output files using the --basename option.