The clone_filter program is designed to identify PCR clones. This can be done in two different ways. In the simplest case, if you have a set of RAD data that is paired-end and randomly sheared (single-digest RAD or similar), you can identify clones by comparing the single and paired-end reads to find identical sequences. When more than one read pair has the same single-end and paired-end sequence, those pairs are considered clones and only one representative of the set is output. This method will likely overestimate the number of actual clones in a data set.
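
The sequence-matching idea can be sketched with a few lines of awk. This is only an illustration of the comparison described above, not the clone_filter implementation: the count_clones helper is hypothetical, and it assumes uncompressed four-line FASTQ records that appear in the same order in both files (gzipped files would need to be decompressed first, for example with gzip -dc).

```shell
# Hypothetical helper (not part of Stacks): count read pairs whose single-end
# and paired-end sequences are both identical to an earlier pair.
# Usage: count_clones R1.fastq R2.fastq
count_clones() {
    awk -v r2="$2" '
        {
            # Read the corresponding line of the paired-end file in lockstep.
            getline mate < r2
        }
        NR % 4 == 2 {
            # Line 2 of each 4-line FASTQ record holds the sequence.
            total++
            key = $0 "\t" mate
            if (!(key in seen)) { seen[key] = 1; distinct++ }
        }
        END {
            printf "%d pairs, %d distinct, %d discarded as clones\n",
                   total, distinct, total - distinct
        }
    ' "$1"
}
```

Only the first pair carrying a given combination of sequences is counted as distinct; every later copy is counted as a clone, which mirrors the "one representative per set" behavior described above.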

A second method to identify PCR clones is to include a short random sequence (an 'oligo') with each molecule during molecular library construction. After sequencing, we can compare oligo sequences and identify PCR clones as those sequences with identical oligos. An oligo sequence can be part of an inline barcode ligated onto each molecule (the program assumes the oligo is at the 5'-most end of the read, while an inline barcode will come after the oligo in the sequenced read). An oligo sequence can also be added as either the i5 or the i7 index barcode of the Illumina TruSeq kits. The clone_filter program can work with any combination of these types of data and will reduce each set of identical oligos to a single representative in the output.
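
As a rough sketch of the inline-oligo case, the following awk snippet takes the first bases of each read as the oligo and tallies how many read pairs share each oligo combination. It follows the description above literally and makes several assumptions: the oligo_groups helper is hypothetical, the input is uncompressed four-line FASTQ with both files in the same order, and the real program may apply additional criteria.

```shell
# Hypothetical helper (not part of Stacks): group read pairs by the oligo found
# at the 5' end of each read and print "oligo1+oligo2 count" per group.
# Usage: oligo_groups R1.fastq R2.fastq OLIGO_LEN_1 OLIGO_LEN_2
oligo_groups() {
    awk -v r2="$2" -v n1="$3" -v n2="$4" '
        {
            # Read the corresponding line of the paired-end file in lockstep.
            getline mate < r2
        }
        NR % 4 == 2 {
            # The oligo is assumed to be the first n1 (or n2) bases of the read.
            oligo = substr($0, 1, n1) "+" substr(mate, 1, n2)
            count[oligo]++
        }
        END {
            for (o in count) print o, count[o]
        }
    ' "$1" | LC_ALL=C sort
}
```

Any group with a count above one would be reduced to a single representative, so a group of size n contributes n - 1 clones.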

The clone_filter program is designed to work with the process_radtags or process_shortreads programs. Depending on how unique your oligos are (they might be unique to an entire library or only unique to a single individual), you can first demultiplex your data and then run clone_filter, or vice versa.

The clone_filter program can also be run multiple times with different subsets of the data. This allows you to filter for clones in increments if necessary due to a lack of memory on your computer.
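
One way to work through the data in increments is to process one pair of files at a time. The sketch below is a dry run: the hypothetical print_clone_filter_cmds helper only prints the commands it would run, using the same flags as the examples on this page. It assumes files follow the .R1.fastq.gz / .R2.fastq.gz naming used in those examples. Note that clones can only be detected among reads that are processed in the same run.

```shell
# Hypothetical helper (not part of Stacks): print one clone_filter command for
# each R1/R2 pair found in a directory, so that each pair is filtered separately.
# Usage: print_clone_filter_cmds IN_DIR OUT_DIR
print_clone_filter_cmds() {
    in_dir="$1"
    out_dir="$2"
    for r1 in "$in_dir"/*.R1.fastq.gz; do
        # Skip the unexpanded pattern when no files match.
        [ -e "$r1" ] || continue
        # Derive the paired-end file name from the single-end file name.
        r2="${r1%.R1.fastq.gz}.R2.fastq.gz"
        if [ ! -e "$r2" ]; then
            echo "missing paired-end file for $r1" >&2
            continue
        fi
        echo "clone_filter -1 $r1 -2 $r2 -i gzfastq -o $out_dir"
    done
}
```

After checking the printed commands, they could be executed by piping the output to sh.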

Program Options

clone_filter [-f in_file | -p in_dir [-P] [-I] | -1 pair_1 -2 pair_2] -o out_dir [-i type] [-y type] [-D] [-h]

Oligo sequence options:

Example Usage

  1. Here we run one set of paired-end files through clone_filter. It will identify clones solely based on matching sequence of single and paired-end reads.

    % clone_filter -1 ./Sample_ATTACTCG-ATAGAGGC.R1.fastq.gz -2 ./Sample_ATTACTCG-ATAGAGGC.R2.fastq.gz -i gzfastq -o ./filtered/

  2. Here we can run a pair of files that have a 6bp inline oligo on both the single and paired-end read.

    % clone_filter -1 ./Sample_1.R1.fastq.gz -2 ./Sample_1.R2.fastq.gz -i gzfastq -o ./filtered/ --inline_inline --oligo_len_1 6 --oligo_len_2 6

  3. Here we can run a set of paired reads (all of them are in the raw subdirectory and are named with the default Illumina scheme) that have a random oligo in the index barcode position.

    % clone_filter -P -p ./raw/ -i gzfastq -o ./filtered/ --index_null --oligo_len_1 8

  4. In this case, there is a set of paired reads (all of them are in the raw subdirectory and are named with the default Illumina scheme) that have a barcode in the first index position and a random oligo in the second index position. Illumina's software will output both in the FASTQ header as "AAAAAAAA+BBBBBBBB". We instruct clone_filter to ignore the barcode (AAAAAAAA) and use the random oligo (BBBBBBBB) to identify clones.

    % clone_filter -P -p ./raw/ -i gzfastq -o ./filtered/ --null_index --oligo_len_2 8
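
To see what the two index positions look like in practice, the following sketch splits the "AAAAAAAA+BBBBBBBB" portion of a FASTQ header using shell parameter expansion. The first_index and second_index helpers are hypothetical and are not part of Stacks. The sketch assumes the index pair is the last colon-separated field of the header line, as in headers written by recent Illumina software.

```shell
# Hypothetical helpers (not part of Stacks): extract the two index sequences
# from a FASTQ header whose last colon-separated field is "INDEX1+INDEX2".

first_index() {
    idx="${1##*:}"               # keep only the text after the last colon
    printf '%s\n' "${idx%%+*}"   # drop everything from the "+" onward
}

second_index() {
    idx="${1##*:}"               # keep only the text after the last colon
    printf '%s\n' "${idx#*+}"    # drop everything up to and including the "+"
}
```

For a header ending in ":ATTACTCG+ATAGAGGC", first_index would print the barcode ATTACTCG and second_index would print ATAGAGGC, which corresponds to the oligo position used in example 4. If the field contains no "+", both helpers return the whole field.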
