Tutorial: building mini-contigs from paired-end sequences

  1. Preparing the environment
  2. Running the pipeline
  3. Collating and assembling paired-end reads
  4. Importing contigs into the database

During the creation of RAD libraries prior to sequencing, genomic DNA is digested with a restriction enzyme and the resulting DNA fragments are sheared to a roughly uniform length. However, that length is not exact, and we can take advantage of this fact to assemble the sequenced paired-end reads into small contigs.

This tutorial will describe how to take a set of sequenced, paired-end Illumina reads and assemble them into contigs, associating these contigs with a catalog locus within Stacks. These contigs can then be viewed from within the web interface of Stacks and used for associating ESTs with catalog loci, or for doing comparative genomics: using the contigs to search the genomes of other organisms.

Our test data set will consist of a single paired-end lane of Illumina sequencing containing RAD libraries from the male and female parents of an F1 genetic cross. After assembling the loci with Stacks, we will collate paired-end reads according to the stack their single-end read ended up within, assemble these collections of paired-end reads with Velvet, and finally upload the results to the Stacks database.

For this tutorial, we will be working in a directory ~/tutorial/, located in the home directory. We assume that Stacks is installed in its default location, with binaries in /usr/local/bin and the support files installed in /usr/local/share/stacks, and that /usr/local/bin is on the $PATH.

Preparing the environment

  1. First, we will create a set of directories to place data in. At each step of the analysis, we will transform the data, moving it from raw/ to samples/ to stacks/.

    ~/tutorial% mkdir raw samples stacks paired assembled
    ~/tutorial% ls
    assembled/  paired/  raw/  samples/  stacks/

  2. For this tutorial, we will start with sequences that have already been cleaned and renamed to their proper sample names. We will assume that you have placed the downloaded data into the samples/ directory. After decompressing, it should look like this:

    ~/tutorial/samples% ls
    f0_female.1.fq  f0_female.2.fq  f0_male.1.fq  f0_male.2.fq
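    Before going further, it can be worth confirming that each pair of files actually contains the same number of reads (a FASTQ record is exactly four lines). The sketch below demonstrates the check on two invented two-read mock files; substitute your real f0_male.1.fq / f0_male.2.fq pair.

```shell
# Sanity check (sketch): the two files of a read pair should hold the same
# number of reads. The mock files below stand in for the real samples.
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGG\n+\nIIII\n' > mock.1.fq
printf '@r1/2\nCCAA\n+\nIIII\n@r2/2\nGGTA\n+\nIIII\n' > mock.2.fq
n1=$(( $(wc -l < mock.1.fq) / 4 ))   # FASTQ records are 4 lines each
n2=$(( $(wc -l < mock.2.fq) / 4 ))
echo "reads in .1: $n1, reads in .2: $n2"
```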

Running the pipeline

  1. The first task is to create a MySQL database to hold the results. The first command sends a bit of SQL to the server to create the database. We always use a constant suffix on our Stacks databases ('_radtags') to make it easy to set up blanket database permissions. Here we will name the database after the paired-end tags. Next, we send the database table definitions to the server to create all the necessary components of the database.

    ~/tutorial% mysql -e "CREATE DATABASE pe_radtags"
    ~/tutorial% mysql pe_radtags < /usr/local/share/stacks/sql/stacks.sql

  2. We are ready to run the pipeline using the denovo_map.pl script. The script will run each of the Stacks components: first, running ustacks on each of the two samples, building loci and calling SNPs in each. Second, cstacks will be run to create a catalog of all loci from the parents of the cross, and finally, sstacks will be executed to match loci against the catalog.

    ~/tutorial% denovo_map.pl -m 3 -M 3 -T 15 -B pe_radtags -b 1 -t \
                    -D "Tutorial Paired-end RAD-Tags" \
                    -o ./stacks \
                    -p ./samples/f0_male.1.fq \
                    -p ./samples/f0_female.1.fq

    After the pipeline completes, our stacks/ directory will look like this:

    ~/tutorial% cd stacks
    ~/tutorial/stacks% ls
    batch_1.catalog.alleles.tsv  batch_1.markers.tsv    f0_female.matches.tsv  f0_male.alleles.tsv  f0_male.tags.tsv
    batch_1.catalog.snps.tsv     denovo_map.log         f0_female.snps.tsv     f0_male.matches.tsv
    batch_1.catalog.tags.tsv     f0_female.alleles.tsv  f0_female.tags.tsv     f0_male.snps.tsv

Collating and assembling paired-end reads

  1. We could assemble the paired-end mini-contig for every RAD locus in the catalog. However, only those loci that contain at least one SNP will be included in our genetic map, so our first task is to create a whitelist of loci that we want to assemble: those that have SNPs. This information is easily obtained from the Stacks database:

    ~/tutorial% export_sql.pl -D pe_radtags -b 1 -a haplo -f haplotypes.tsv -o tsv -F snps_l=1

    We only need a list of catalog IDs for the loci we want to assemble, so we can just take a portion of the file we just exported (we will extract the first column):

    ~/tutorial% cut -f 1 haplotypes.tsv > whitelist.tsv

    Our directory should now look like this:

    ~/tutorial% ls
    assembled/  haplotypes.tsv  paired/  raw/  samples/  stacks/  whitelist.tsv

    The contents of our whitelist file should look like this:

    ~/tutorial% head whitelist.tsv
    1
    4
    5
    7
    9
    14
    17
    18
    19
    22
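    Since downstream scripts silently skip IDs they cannot parse, a quick sanity check of the whitelist can save debugging time: every line should be a unique, purely numeric catalog ID. The sketch below runs the check on an invented four-line mock whitelist; point it at your real whitelist.tsv instead.

```shell
# Optional sanity check (sketch): whitelist lines must be unique numeric IDs.
# A mock whitelist stands in for the real whitelist.tsv here.
printf '1\n4\n5\n7\n' > mock_whitelist.tsv
dups=$(sort -n mock_whitelist.tsv | uniq -d | wc -l)            # duplicated IDs
nonnum=$(grep -c -v '^[0-9][0-9]*$' mock_whitelist.tsv || true) # non-numeric lines
echo "duplicate IDs: $dups, non-numeric lines: $nonnum"
```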

  2. Now we will use the sort_read_pairs.pl program to collate the paired-end sequences for each catalog locus. The program will load the *.tags.tsv files for each sample to determine which reads were assembled into which stacks, the *.matches.tsv files for each sample to determine which stacks were matched to the same catalog locus, and the paired-end reads themselves, which it will bin according to the catalog locus they belong to.

    ~/tutorial% sort_read_pairs.pl -p ./stacks/ -s ./samples/ -o ./paired/ -w whitelist.tsv

    The -w parameter will cause the program to only collate reads for those loci listed in the file whitelist.tsv.

    The paired directory is now full of thousands of FASTA files, one for each catalog locus:

    ~/tutorial% ls paired
    10000.fa  15810.fa  21860.fa  28227.fa  35001.fa  42028.fa  50694.fa  59967.fa  72415.fa
    10004.fa  15814.fa  21864.fa  28232.fa  35003.fa  42029.fa  50697.fa  59970.fa  72441.fa
    10006.fa  15816.fa  21866.fa  28238.fa  35011.fa  42038.fa  50703.fa  59974.fa  72451.fa
    10008.fa  15818.fa  21871.fa  2823.fa   35014.fa  4203.fa   50708.fa  5997.fa   72456.fa
    10011.fa  15821.fa  21877.fa  28253.fa  35016.fa  42051.fa  50709.fa  59987.fa  72459.fa
    10014.fa  15825.fa  2187.fa   28254.fa  3501.fa   42054.fa  50716.fa  5999.fa   7245.fa
    ...
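    A useful spot check at this stage is to count how many paired-end reads were collated into each locus file, by counting FASTA headers. The sketch below runs on two invented mock files in a mock_paired/ directory; run the same loop over your real paired/ directory.

```shell
# Sketch: reads collated per catalog locus = FASTA headers per file.
# Mock files stand in for the real paired/*.fa output.
mkdir -p mock_paired
printf '>read1\nACGT\n>read2\nTTAA\n' > mock_paired/10000.fa
printf '>read3\nGGCC\n' > mock_paired/10004.fa
for f in mock_paired/*.fa; do
    printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
done
```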

    And each file should look something like this:


    The FASTA ID lists the catalog locus ID, the sample tag ID, and the ID of the raw sequence read.
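    The original example file contents were not preserved in this copy of the tutorial. Purely as an illustration of the three ID components just described (the exact header layout may differ between Stacks versions, and all values below are invented):

```
>CatalogLocusID|SampleTagID|RawReadID
GATCTTGCAG...
```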

  3. Now it’s time to assemble the data using Velvet:

    ~/tutorial% exec_velvet.pl -s ./paired/ -o ./assembled/ -c -e /usr/local/bin -M 200

    The program will execute Velvet on the collated set of reads for each catalog locus. It will then extract the assembled contigs and write them to a file called collated.fa. The script allows you to control all the major parameters to Velvet, although describing them is beyond the scope of this tutorial. The -M 200 parameter will cause the program to record only assembled contigs with a length of 200 nucleotides or more.
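    Before loading the contigs into the database, it is worth glancing at what was assembled: how many contigs survived the length filter, and how long they are. The sketch below summarizes an invented two-contig mock file with grep and awk; point the same commands at assembled/collated.fa.

```shell
# Sketch: contig count and per-contig lengths for an assembled FASTA.
# The mock file stands in for the real assembled/collated.fa.
printf '>10\nACGTACGTAC\n>14\nGGCCTTAAGGCC\n' > mock_collated.fa
echo "contigs: $(grep -c '^>' mock_collated.fa)"
# Print the length of each sequence record (concatenating wrapped lines).
awk '/^>/ { if (s) print length(s); s = ""; next } { s = s $0 } END { if (s) print length(s) }' mock_collated.fa
```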

Importing contigs into the database

  1. We can import and view these contigs in the Stacks database. These contigs can later be exported and used for BLAST analysis to connect catalog loci to ESTs (perhaps from an RNA-seq experiment), or used for comparative genomics, to connect loci to the genomes of other organisms (perhaps to determine conserved synteny).

    ~/tutorial% load_sequences.pl -b 1 -D pe_radtags -f ./assembled/collated.fa -t pe_radtag

  2. Finally, we need to re-index the database so that it picks up the newly added contigs.

    ~/tutorial% index_radtags.pl -D pe_radtags -c

  3. We can now view data in the web interface:


    The database detects that you have uploaded either sequence data or BLAST hits and makes available a new set of filters. The "Contains Paired-end RAD-Tags" filter will only show those loci for which a paired-end contig could be assembled.

    The number of sequences stored for each locus is displayed in a newly added column. Clicking on "blast hits" displays the list of sequences. If BLAST hits were also loaded, they would be displayed in the same window, corresponding to the sequence used as the BLAST query. Clicking on the name of a sequence will display the actual sequence in a new window.

  4. Clicking on one of the contig IDs will show us the actual sequence in a new window.