Stacks: clone

clone_filter

The clone_filter program is designed to identify PCR clones. This can be done in two different ways. In the simplest case, if you have a set of RAD data that is paired-end and is randomly sheared (single-digest RAD or similar), you can identify clones by comparing the single and paired-end reads to find identical sequence. More than one set of identically matching single and paired-end reads will be considered a clone and only one representative of that set will be output. This method will likely overestimate the number of actual clones in a data set.

A second method to identify PCR clones is to include a short random sequence (an 'oligo') with each molecule during molecular library construction. After sequencing we can then compare oligo sequences and identify PCR clones as those sequences with identical oligos. An oligo sequnce can be part of on inline barcode ligated onto each molecule (the program assumes the oligo is at the most 5' end of the read, while an inline barcode will come after the oligo in the sequenced read). An oligo sequence can also be added as either the i5 or the i7 index barcode of the Illumina TruSeq kits. The clone_filter program can work with any combination of these types of data and will reduce each set of identical oligos to a single representative in the output.

The clone_filter program is designed to work with the process_radtags or process_shortreads programs. Depending on how unique your oligos are (they might be unique to an entire library or only unique to a single individual) you can first demultiplex your data and then run clone_filter or vice-versa.

The clone_filter program can also be run multiple times with different subsets of the data. This allows you to filter for clones in increments if necessary due to a lack of memory on your computer.

Program Options

clone_filter [-f in_file | -p in_dir [-P] [-I] | -1 pair_1 -2 pair_2] -o out_dir [-i type] [-y type] [-D] [-h]

f — path to the input file if processing single-end sequences.
p — path to a directory of files.
P — files contained within directory specified by '-p' are paired.
1 — first input file in a set of paired-end sequences.
2 — second input file in a set of paired-end sequences.
i — input file type, either 'bustard' for the Illumina BUSTARD output files, 'fastq', 'fasta', 'gzfasta', or 'gzfastq' (default 'fastq').
o — path to output the processed files.
y — output type, either 'fastq', 'gzfastq', 'fasta' or 'gzfasta' (default is same as input type).
D — capture discarded reads to a file.
h — display this help messsage.
--oligo_len_1 len — length of the single-end oligo sequence in data set.
--oligo_len_2 len — length of the paired-end oligo sequence in data set.
--retain_oligo — do not trim off the random oligo sequence (if oligo is inline).

Oligo sequence options:

--inline_null — random oligo is inline with sequence, occurs only on single-end read (default).
--null_index — random oligo is provded in FASTQ header (Illumina i7 read if both i5 and i7 read are provided).
--index_null — random oligo is provded in FASTQ header (Illumina i5 or i7 read).
--inline_inline — random oligo is inline with sequence, occurs on single and paired-end read.
--index_index — random oligo is provded in FASTQ header (Illumina i5 and i7 read).
--inline_index — random oligo is inline with sequence on single-end read and second oligo occurs in FASTQ header.
--index_inline — random oligo occurs in FASTQ header (Illumina i5 or i7 read) and is inline with sequence on single-end read (if single read data) or paired-end read (if paired data).

Example Usage

Here we run one set of paired-end files through clone_filter. It will identify clones soley based on matching sequence of single and paired-end reads.
% clone_filter -1 ./Sample_ATTACTCG-ATAGAGGC.R1.fastq.gz -2 ./Sample_ATTACTCG-ATAGAGGC.R2.fastq.gz -i gzfastq -o ./filtered/
Here we can run a pair of files that have a 6bp inline oligo on both the single and paired-end read.
% clone_filter -1 ./Sample_1.R1.fastq.gz -2 ./Sample_1.R2.fastq.gz -i gzfastq -o ./filtered/ --inline_inline --oligo_len_1 6 --oligo_len_2 6
Here we can run a set of paired reads (all of them are in the raw subdirectory and are named with the default Illumina scheme) that have a random oligo in the index barcode position.
% clone_filter -P -p ./raw/ -i gzfastq -o ./filtered/ --index_null --oligo_len_1 8
In this case, their is a set of paired reads (all of them are in the raw subdirectory and are named with the default Illumina scheme) that have a barcode in the first index position and a random oligo in the second index barcode position. In this case, the barcode will be output by Illumina's software in the FASTQ header as "AAAAAAAA+BBBBBBBB". We instruct clone_filter to ignore the barcode (AAAAAAAA) and use the random oligo (BBBBBB) to identify clones.
% clone_filter -P -p ./raw/ -i gzfastq -o ./filtered/ --null_index --oligo_len_2 8

clone_filter

Program Options

Oligo sequence options:

Example Usage

Other Pipeline Programs

Raw reads

Core

Execution control

Utility programs