For this tutorial, we will be working in the directory ~/tutorial/, located in the home directory. We assume that Stacks is installed in its default location, with binaries in /usr/local/bin and the support files in /usr/local/share/stacks, and that /usr/local/bin is on the $PATH.
The first part of this tutorial will cover how to set up the work environment for Stacks and how to clean a set of raw data that was received from an Illumina sequencer. Stacks can process data in FASTA or FASTQ format (among others), so if you already have a method to clean and prepare your data, you may want to skip ahead to the next section.
Our test data set will consist of a single lane of Illumina sequencing (lane 1), although what we show below can easily be expanded to accommodate many lanes. Our goal is to genotype a cross consisting of two F0 parents and three F1 progeny. We assume that the five samples have been digested with the SbfI restriction enzyme and were barcoded and multiplexed together prior to sequencing. So, our mapping of barcodes to samples looks like this:

Barcode   Sample
CGATA     f0_male
GAAGC     f0_female
TGACC     progeny_01
AACCC     progeny_02
ACTGC     progeny_03
First, we will create a set of directories to hold the data. At each step of the analysis we will transform the data, moving it from raw/ to samples/ to stacks/.
~/tutorial% mkdir raw samples stacks
~/tutorial% ls
raw/  samples/  stacks/
We will assume that you have placed the raw sequencing data into the raw/ directory. If you are using the raw files output by the BUSTARD part of the Illumina pipeline, your data will look like this:
~/tutorial/raw% ls
s_1_1_0001_qseq.txt  s_1_1_0025_qseq.txt  s_1_1_0049_qseq.txt  s_1_1_0073_qseq.txt  s_1_1_0097_qseq.txt
s_1_1_0002_qseq.txt  s_1_1_0026_qseq.txt  s_1_1_0050_qseq.txt  s_1_1_0074_qseq.txt  s_1_1_0098_qseq.txt
s_1_1_0003_qseq.txt  s_1_1_0027_qseq.txt  s_1_1_0051_qseq.txt  s_1_1_0075_qseq.txt  s_1_1_0099_qseq.txt
s_1_1_0004_qseq.txt  s_1_1_0028_qseq.txt  s_1_1_0052_qseq.txt  s_1_1_0076_qseq.txt  s_1_1_0100_qseq.txt
s_1_1_0005_qseq.txt  s_1_1_0029_qseq.txt  s_1_1_0053_qseq.txt  s_1_1_0077_qseq.txt  s_1_1_0101_qseq.txt
s_1_1_0006_qseq.txt  s_1_1_0030_qseq.txt  s_1_1_0054_qseq.txt  s_1_1_0078_qseq.txt  s_1_1_0102_qseq.txt
s_1_1_0007_qseq.txt  s_1_1_0031_qseq.txt  s_1_1_0055_qseq.txt  s_1_1_0079_qseq.txt  s_1_1_0103_qseq.txt
s_1_1_0008_qseq.txt  s_1_1_0032_qseq.txt  s_1_1_0056_qseq.txt  s_1_1_0080_qseq.txt  s_1_1_0104_qseq.txt
s_1_1_0009_qseq.txt  s_1_1_0033_qseq.txt  s_1_1_0057_qseq.txt  s_1_1_0081_qseq.txt  s_1_1_0105_qseq.txt
s_1_1_0010_qseq.txt  s_1_1_0034_qseq.txt  s_1_1_0058_qseq.txt  s_1_1_0082_qseq.txt  s_1_1_0106_qseq.txt
s_1_1_0011_qseq.txt  s_1_1_0035_qseq.txt  s_1_1_0059_qseq.txt  s_1_1_0083_qseq.txt  s_1_1_0107_qseq.txt
s_1_1_0012_qseq.txt  s_1_1_0036_qseq.txt  s_1_1_0060_qseq.txt  s_1_1_0084_qseq.txt  s_1_1_0108_qseq.txt
s_1_1_0013_qseq.txt  s_1_1_0037_qseq.txt  s_1_1_0061_qseq.txt  s_1_1_0085_qseq.txt  s_1_1_0109_qseq.txt
s_1_1_0014_qseq.txt  s_1_1_0038_qseq.txt  s_1_1_0062_qseq.txt  s_1_1_0086_qseq.txt  s_1_1_0110_qseq.txt
s_1_1_0015_qseq.txt  s_1_1_0039_qseq.txt  s_1_1_0063_qseq.txt  s_1_1_0087_qseq.txt  s_1_1_0111_qseq.txt
s_1_1_0016_qseq.txt  s_1_1_0040_qseq.txt  s_1_1_0064_qseq.txt  s_1_1_0088_qseq.txt  s_1_1_0112_qseq.txt
s_1_1_0017_qseq.txt  s_1_1_0041_qseq.txt  s_1_1_0065_qseq.txt  s_1_1_0089_qseq.txt  s_1_1_0113_qseq.txt
s_1_1_0018_qseq.txt  s_1_1_0042_qseq.txt  s_1_1_0066_qseq.txt  s_1_1_0090_qseq.txt  s_1_1_0114_qseq.txt
s_1_1_0019_qseq.txt  s_1_1_0043_qseq.txt  s_1_1_0067_qseq.txt  s_1_1_0091_qseq.txt  s_1_1_0115_qseq.txt
s_1_1_0020_qseq.txt  s_1_1_0044_qseq.txt  s_1_1_0068_qseq.txt  s_1_1_0092_qseq.txt  s_1_1_0116_qseq.txt
s_1_1_0021_qseq.txt  s_1_1_0045_qseq.txt  s_1_1_0069_qseq.txt  s_1_1_0093_qseq.txt  s_1_1_0117_qseq.txt
s_1_1_0022_qseq.txt  s_1_1_0046_qseq.txt  s_1_1_0070_qseq.txt  s_1_1_0094_qseq.txt  s_1_1_0118_qseq.txt
s_1_1_0023_qseq.txt  s_1_1_0047_qseq.txt  s_1_1_0071_qseq.txt  s_1_1_0095_qseq.txt  s_1_1_0119_qseq.txt
s_1_1_0024_qseq.txt  s_1_1_0048_qseq.txt  s_1_1_0072_qseq.txt  s_1_1_0096_qseq.txt  s_1_1_0120_qseq.txt
If, instead, you are using the output of the GERALD part of the Illumina pipeline, your data will be a little simpler:
~/tutorial/raw% ls
s_1_sequence.txt
The next step is to create a file containing our five barcodes. Using your preferred editor, place the barcodes alone in a file, one per line.
~/tutorial% vi barcodes
~/tutorial% more barcodes
CGATA
GAAGC
TGACC
AACCC
ACTGC
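If you prefer, the same file can be created directly from the shell rather than in an editor; printf writes one barcode per line:

```shell
# Create the barcodes file, one barcode per line
printf '%s\n' CGATA GAAGC TGACC AACCC ACTGC > barcodes
```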
After the last two steps, our working directory should look like this:
~/tutorial% ls
barcodes  raw/  samples/  stacks/
Finally, we are ready to run the cleaning program. It examines every read and does several things. First, it checks that the barcode and the RAD cutsite are intact; if there is a single error in the barcode or the RAD site, the program will correct it. Next, it slides a window down the length of the read and checks the average quality score within the window; if the score drops below 90%, the read is discarded. This allows for some sequencing errors while eliminating reads whose sequence quality degrades toward the end of the read. (These behaviors are all configurable with command line options to the process_radtags program.)
~/tutorial% process_radtags -p ./raw/ -o ./samples/ -b ./barcodes -c -q -r -e sbfI
~/tutorial% cd samples
~/tutorial/samples% ls
sample_AACCC.fq  sample_ACTGC.fq  sample_CGATA.fq  sample_GAAGC.fq  sample_TGACC.fq
If you used GERALD data in the above step instead of the raw BUSTARD files, specify the -f option instead:
~/tutorial% process_radtags -f ./raw/s_1_sequence.txt -o ./samples/ -b ./barcodes -c -q -r
~/tutorial% cd samples
~/tutorial/samples% ls
sample_AACCC.fq  sample_ACTGC.fq  sample_CGATA.fq  sample_GAAGC.fq  sample_TGACC.fq
Next, we will rename the files so they are more easily understood. These names will be used in the pipeline and will identify samples in the web interface, so we want them to be descriptive. If you have multiple lanes of sequencing you will likely be re-using barcodes, so it is important to rename the files after processing each lane.
~/tutorial/samples% mv sample_CGATA.fq f0_male.fq
~/tutorial/samples% mv sample_GAAGC.fq f0_female.fq
~/tutorial/samples% mv sample_TGACC.fq progeny_01.fq
~/tutorial/samples% mv sample_AACCC.fq progeny_02.fq
~/tutorial/samples% mv sample_ACTGC.fq progeny_03.fq
This is easy enough to do by hand when we have a small number of samples. But as a project grows in size, it pays to use a shell script. The script helps prevent errors and lets you remember what was done and reproduce the results at a later time. We will put the commands above into a build_samples.sh script.
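A minimal version of build_samples.sh, collecting the rename commands above, might look like this (a sketch; here it is written to live in ~/tutorial and operate on the samples/ subdirectory, so adjust paths to your own layout):

```shell
# Create build_samples.sh: renames the demultiplexed files to
# descriptive sample names. Run from ~/tutorial after process_radtags.
cat > build_samples.sh <<'EOF'
#!/bin/bash
mv ./samples/sample_CGATA.fq ./samples/f0_male.fq
mv ./samples/sample_GAAGC.fq ./samples/f0_female.fq
mv ./samples/sample_TGACC.fq ./samples/progeny_01.fq
mv ./samples/sample_AACCC.fq ./samples/progeny_02.fq
mv ./samples/sample_ACTGC.fq ./samples/progeny_03.fq
EOF
chmod +x build_samples.sh
```

Keeping the renames in a script means a second lane (with the same barcodes but different sample names) only requires editing a few lines.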
Here is the layout of our workspace up until this point:
~/tutorial% ls
barcodes  build_samples.sh*  raw/  samples/  stacks/
The first task is to create a MySQL database to hold the results. The first command sends a bit of SQL to the server to create the database. We always use a constant suffix on our Stacks databases ('_radtags') to make it easy to set up blanket database permissions. Here we will name the database after the tutorial. Next, we send the database table definitions to the server to create all the necessary components of the database.
~/tutorial% mysql -e "CREATE DATABASE tut_radtags"
~/tutorial% mysql tut_radtags < /usr/local/share/stacks/sql/stacks.sql
We are ready to run the pipeline. The denovo_map.pl script will run each of the Stacks components: first, running ustacks on each of the five samples, building loci and calling SNPs in each; second, running cstacks to create a catalog of all loci from the parents of the cross; and finally, executing sstacks to match each progeny against the catalog. A bit more detail on this process can be found in the FAQ. The denovo_map.pl script will also load the results of each stage of the analysis (individual loci, the catalog, and matches against the catalog) into the database, although this can be disabled. After matching the samples against the catalog, the script will identify loci that represent mappable markers (genotypes) and will build a database index to speed up access (index_radtags.pl).
~/tutorial% denovo_map.pl -m 3 -M 2 -n 3 -T 15 -B tut_radtags -b 1 -t -a 2010-11-30 \
>   -D "Tutorial cross Genetic Map RAD-Tag Samples" \
>   -o ./stacks \
>   -p ./samples/f0_male.fq \
>   -p ./samples/f0_female.fq \
>   -r ./samples/progeny_01.fq \
>   -r ./samples/progeny_02.fq \
>   -r ./samples/progeny_03.fq
Some of the more important command line parameters include:
-m 3    The minimum stack depth parameter, passed to ustacks; controls the number of exactly matching reads that must be found to create a stack in an individual.
-M 2    Also passed to ustacks; controls the maximum distance allowed between stacks for them to be merged into a putative locus in an individual.
-n 3    Passed to cstacks; provides fuzzy catalog matching. If a locus is fixed within each parent but differs between the parents, this parameter allows mismatches between tags when constructing the catalog, so these loci are recognized as homologous. This is useful if your cross involves two distant populations or is a hybrid cross.
-T 15   The number of threads to spawn for each of the constituent programs.
-t      Remove, or break up, highly repetitive RAD-Tags: in the ustacks program, lumberjack stacks are removed from the analysis and moderately oversized stacks are broken up.
-b 1    Batch ID: Stacks can be run multiple times on the same data set and the results stored in the same database by specifying different batch IDs.
Again, it is useful to put the pipeline command in a shell script, particularly when a project grows large, or when you want to run multiple batches, perhaps with different command line parameters. This is the final configuration of our working directory:
~/tutorial% ls
barcodes  build_samples.sh*  build_tags.sh*  raw/  samples/  stacks/
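A sketch of what build_tags.sh might contain follows; it simply wraps the denovo_map.pl command shown above, so that the exact parameters of a run are recorded and reproducible:

```shell
# Create build_tags.sh: wraps the denovo_map.pl invocation so the
# parameters for this batch are recorded. Run from ~/tutorial.
cat > build_tags.sh <<'EOF'
#!/bin/bash
denovo_map.pl -m 3 -M 2 -n 3 -T 15 -B tut_radtags -b 1 -t -a 2010-11-30 \
    -D "Tutorial cross Genetic Map RAD-Tag Samples" \
    -o ./stacks \
    -p ./samples/f0_male.fq \
    -p ./samples/f0_female.fq \
    -r ./samples/progeny_01.fq \
    -r ./samples/progeny_02.fq \
    -r ./samples/progeny_03.fq
EOF
chmod +x build_tags.sh
```

To run a second batch with different parameters, copy the script, change -b and the parameters of interest, and keep both copies as a record.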
After the pipeline completes, our stacks/ directory will look like this:
~/tutorial% cd stacks
~/tutorial/stacks% ls
batch_1.catalog.alleles.tsv  f0_female.snps.tsv      progeny_01.matches.tsv  progeny_03.alleles.tsv
batch_1.catalog.snps.tsv     f0_female.tags.tsv      progeny_01.snps.tsv     progeny_03.matches.tsv
batch_1.catalog.tags.tsv     f0_male.alleles.tsv     progeny_01.tags.tsv     progeny_03.snps.tsv
batch_1.markers.tsv          f0_male.matches.tsv     progeny_02.alleles.tsv  progeny_03.tags.tsv
denovo_map.log               f0_male.snps.tsv        progeny_02.matches.tsv
f0_female.alleles.tsv        f0_male.tags.tsv        progeny_02.snps.tsv
f0_female.matches.tsv        progeny_01.alleles.tsv  progeny_02.tags.tsv
The pipeline output is captured in a log file:
~/tutorial/stacks% more denovo_map.log
Once the pipeline has completed execution, the data can be viewed in the web interface. If you named your database with a _radtags suffix, then you can go to a URL similar to the following (although you need to replace stackshost.edu with the name of your webserver) in your web browser and select the tut_radtags database:
Otherwise, you can enter the URL for the dataset directly in the web browser:
Select the catalog viewer.
Once the catalog data is displayed, you can filter it in a number of different ways, and view the distribution of alleles, or of genotypes, in the progeny. (The image below represents a larger and richer data set than the one used in this tutorial, in order to display the visual features of the web interface more effectively.) Move your mouse over the image to focus in on a particular feature.
There are two primary ways to export data from the Stacks database. The first is to export a compact form of the data as it is represented in the database, either as a tab-separated values file, or as an Excel spreadsheet.
~/tutorial% export_sql.pl -D tut_radtags -b 1 -f ./stacks/tut_alleles.tsv -o tsv -F snps=1
~/tutorial% export_sql.pl -D tut_radtags -b 1 -f ./stacks/tut_alleles.xls -o xls -F snps=1
The export_sql.pl script allows you to export all of the Stacks data in a compact form. For each locus, the script reports the consensus sequence, the number of parents and progeny that Stacks could find the locus in, the number of SNPs found at that locus and a listing of the individual SNPs and the observed alleles, followed by the particular allele observed in each individual. This is illustrated in the example output seen below (some columns have been left out of the data file for brevity):
The script allows you to specify a number of filters to apply to the data. These are the same filters available through the web interface, and, if you have configured it, the script is hooked directly into the web interface to perform the export.
The second method of exporting data from the Stacks database is to generate genotypes specific to a mapping cross, such as a double haploid cross. The command below outputs all mappable markers present in at least three progeny, with genotypes encoded for a double haploid map type, and includes any manual corrections made in the web interface:
~/tutorial/% export_sql.pl -D tut_radtags -b 1 -a geno -m DH -f ./stacks/tut_genotypes.tsv -F mark=Any -F vprog=3 -C
If annotations were added in the web interface they will be included in the output -- merged into the catalog ID (in the image above, rows 6 and 16 include an annotation). They will then be visible in any genetic linkage map built from this data set.
The genotypes program can be used to obtain a set of observed haplotypes, or a set of encoded genotypes, from a Stacks data set. Genotypes may be exported in a generic format, which contains all recognizable markers, or a map type can be specified and the genotypes encoded for that specific map (e.g. certain markers trackable in an F1 cross are not trackable in a backcross, and vice versa). If outputting a genetic map, and if enabled with the -c option, the genotypes program will make automated corrections to the data: since loci are matched up across the population, the program can correct false-negative heterozygous alleles, as well as false-positive homozygous alleles, because it knows which alleles exist at a particular locus in the other individuals. Corrections are shown in capital letters (colored in this image to improve visibility). If the -s option is specified, a second file will be output containing the genotypes in SQL format, which can be imported back into the database.
If observed haplotypes are exported instead, data quality can be controlled with -r, which requires a locus to be present in a certain number of individuals, and with -m, which requires that a locus in a particular individual have a minimum depth of reads before it is included in the export.
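As with the earlier steps, the genotypes invocation can be kept in a small script so the parameters are recorded. The sketch below is a hypothetical example: the flag names (-b batch ID, -P pipeline output directory, -t map type, -r and -m as described above, -c automated corrections, -s SQL output) follow the Stacks 1.x documentation, and the -m 8 depth threshold is an illustrative value, so verify both against your installed version:

```shell
# Create export_genotypes.sh: a hypothetical sketch of running the
# genotypes program directly on the pipeline output in ./stacks.
# -t DH encodes genotypes for a double haploid map; -r 3 requires a
# locus in at least three progeny; -m 8 is an example depth cutoff.
cat > export_genotypes.sh <<'EOF'
#!/bin/bash
genotypes -b 1 -P ./stacks -t DH -r 3 -m 8 -c -s
EOF
chmod +x export_genotypes.sh
```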