Angel Rivera-Colón1, Nicolas Rochette1, and Julian Catchen1
1Department of Evolution, Ecology, and Behavior University of Illinois at Urbana-Champaign Urbana, Illinois, 61820 USA |
RADinitio is a pipeline for the assessment of RADseq experiments via prospective and ]retrospective data simulation. Sequencing data is generated de novo from a population of individuals via a coalescent simulation under a user-defined demographic model using msprime. The genetic variants in each sample are simulated in a genomic context that can impact downstream generation of RAD loci and sequencing reads. The per-individual sequences are then treated to an in silico RADseq library preparation. The components of the library are defined by the user, allowing for the exploration of parameters including restriction enzyme selection, insert size, sequencing coverage, and PCR duplicate distribution (generated using the decoratio software). RADinitio simulations can also be applied retrospectively by comparing and modelling sources of error in empirical datasets. The purpose of RADinitio is for researchers to fully explore possible variables in their data generation process to ensure that their protocol selection and library preparation is performed optimally, within the limitations of technical and experimental error.
RADinitio can be installed from the Python Package Index using:
% python3 -m pip install radinitio
More information regarding this installation can be obtained at the RADinitio PyPI Project page.
Users can install RADinitio in their local Python directories using the --user flag:
% python3 -m pip install radinitio --user
Alternatively, the software can be downloaded from the Catchen Lab website and installed using the following commands:
% python3 -m pip install radinitio-version.tar.gz
Or for a local installation:
% python3 -m pip install radinitio-version.tar.gz --user
For more information regarding the pip installation, please visit the pip user-guide.
RADinitio requires Python 3.7 or higher.
RADinitio depends on the following non-default Python packages: msprime (Kelleher et al. 2016; Baumdicker et al. 2022; [docs]), decoratio (Rochette et al. 2023; [docs]), numpy (Virtanen et al. 2020; [docs]), and scipy (Harris et al. 2020; [docs]).
Running the pip installation, from both PyPI or the the direct installation will take care of all dependencies, including msprime, decoratio, numpy, and scipy.
RADinitio is designed as a series of independent functions that simulate different aspects of the RADseq library preparation process. The pipeline can be broken down into three main steps:
Variants are generated with msprime from a user-defined demographic model. Independent simulations are run for different chromosomes/scaffolds present in a user-provided reference genome. The simulated variants are then projected against the reference genome to obtain the reference alleles, which are then converted into alternative alleles using a user-defined model. By default, RADinitio simulates using msprime's island models; however, users can simulate more complex models using the Python API.
The reference genome is in silico digested to obtain a series of reference RAD loci. This can be done using a single or double enzyme digestion, for single- and double-digest RADseq, respectively. The positions of the reference loci in the genome are then used to filter the simulated variants for all samples to include only variants present in RAD loci, improving downstream performance. When RAD alleles are extracted, the reference RAD loci are modified to include the corresponding genetic variants for each sample. This process can alter a cutsite's sequence, resulting in a dropped allele for that sample.
Once extracted, the alleles for each sample are randomly sampled to obtain paired-end sequences for each allele template. The alleles are sampled with replacement, proportional to the desired sequencing coverage of the library. Each iteration of the sampling is treated as an independent sequence template that is truncated to a random length sampled from a simulated insert size distribution (or digested with a second enzyme, as is the case with ddRAD). An optional PCR duplicate distribution, as specified using the decoratio software, can be applied to the sampling process, resulting in the duplicate sampling of unique template sequences. This process can also introduce random error to the sequence, simulating the generation of PCR errors. Finally, paired end sequences are generated from each template, each with its own unique (simulated) sequencing error.
The corresponding functions for each stage can be run independently. We do provide a wrapper script, radinitio, that calls the top-level radinitio.simulate() function which runs the pipeline from start to finish. Advanced users can run the pipeline through the Python API, which allows for the generation of more complex demographic models, define finer details of the library preparation process, and run components of the pipeline independently.
The simplest way to run RADinitio is to execute the module via the radinitio command line wrapper, which is available with the default pip installation. By default, this wrapper calls the top-level radinitio.simulate() function, which runs all the stages of the pipeline.
The radinitio program options are the following:
radinitio --genome path --out-dir dir [pipeline-stage] [--chromosomes path] [(demographic model options)] [(library options)] [(pcr options)] [(advanced options)]
The simplest RADinitio simulation using the radinitio wrapper requires just the specification of a reference genome fasta (--genome) and an output directory (--out-dir). The reference sequence (reference.fa.gz in this example) is a fasta file containing the genomic sequences of one or more chromosomes/scaffolds. The reference sequences can be at a chromosome-level, or broken into smaller scaffold sequences. The file can be compressed (gzipped) or uncompressed.
Using the default parameters runs the whole simulation pipeline and generates all possible outputs (--simulate-all). This will simulate a standard island model for 2 populations, sequencing 10 individuals per population (20 total). The single-digest library will be made with the enzyme SbfI, using the default insert size distribution (mean 350 bp and stdev 35bp) and generating 2x150bp reads at 20x coverage.
% radinitio --simulate-all \ # Execute the complete pipeline (default) --genome ./genome/reference.fa.gz \ # Path to reference genome --out-dir ./simple-simulation/ # Path to output directory
The radinitio command line options allow the user to control which sequences in the reference genome are used for the simulations, without the need to modify the input fasta file. The next example uses a chromosome list file (specified with --chromosomes). This file contains a list with the sequence IDs that the user wants to use for the simulation, one per line. The IDs in this file must be as they appear on the reference.fa (NOTE: without the ">", as this is part of the fasta header!). The file follows the following structure:
% head chrom_list.txt chr1 chr2 chr3 chr4 chr5
Only the sequences on the list will be used as input for the simulations. This is important when working with highly fragmented assemblies or those with many small unplaced scaffolds, allowing the user to simulate only over the chromosome-scale sequences, for example. If this option is not provided, radinitio will simulate over all the sequences in the input fasta.
Here we run the same default radinitio simulation as before, but instead we provide a chrom_list.txt file containing the IDs of the target sequences of interest.
% radinitio --simulate-all \ # Execute the complete pipeline --genome ./genome/reference.fa.gz \ # Path to reference genome --out-dir ./simple-simulation/ \ # Path to output directory --chromosomes ./genome/chrom_list.txt # Path to the list of the target chromosomes
In addition to the use of --chromosomes, the input sequences can be filtered according to their length using the --min-seq-len option. The simulations will be performed only for sequences larger than the specified value. Here, we run the default radinitio simulation, keeping only sequences larger than 1 Mbp.
% radinitio --simulate-all \ # Execute the complete pipeline --genome ./genome/reference.fa.gz \ # Path to reference genome --out-dir ./simple-simulation/ \ # Path to output directory --min-seq-len 1000000 # Only keep reference sequences larger than 1,000,000 bp
The radinitio.log file reports the number of sequences read from the reference genome fasta file, as well as those kept after selection/filtering. For example, here the reference genome contained 1,844 total sequences; however, we only selected 24 (e.g., 24 chromosomes) for downstream analysis.
Loading the genome... Read 1,844 sequences from the input FASTA. Loaded 24 chromosome/scaffold sequences.
RADinitio allows the user to specify the parameters of their simulated RADseq library. By default, as done in the previous examples, the radinitio wrapper simulates a single-digest RADseq (sdRAD) library, following the protocols published by Baird et al. (2008) and Etter et al. (2011). This library uses the SbfI restriction enzyme to generate a library with an insert size distribution of an mean length of 350bp (standard deviation 37bp), coverage of 20x, read length of 150bp (2x150bp paired-end), and no PCR amplification. The default simulation is equivalent to:
% radinitio --simulate-all \ # Execute the complete pipeline --genome ./genome/reference.fa.gz \ # Path to reference genome --out-dir ./simulations_sdRAD/ \ # Path to output directory --library-type sdRAD \ # Simulate a single-digest library --enz SbfI \ # Use the SbfI enzyme --insert-mean 350 \ # Insert size mean of 350bp --insert-stdev 37 \ # Insert size stdev of 37bp --coverage 20 \ # Depth of coverage of 20x --read-length 150 # Read length of 150bp (paired-end 2x150bp)
Instead, we can generate a library under completely different parameters. Here, we generate a double-digest RADseq (ddRAD) library, based on the Peterson et al. (2012) protocol. This library uses the PstI restriction enzyme as the main (or rare) cutter and MspI as the secondary (or common) cutter. Instead of the default insert size distribution (which uses insert size mean and standard deviation), we are specifying a fixed insert size range of 225-475 bp. We are also sequencing this library using 2x100bp paired-end reads (instead of the default 150bp) at a higher coverage of 30x. Additionally, we are retaining only sequences larger than 10 Kbp for the simulations.
% radinitio --simulate-all \ # Execute the complete pipeline --genome ./genome/reference.fa.gz \ # Path to reference genome --out-dir ./simulations_ddRAD/ \ # Path to output directory --min-seq-len 10000 # Only keep reference sequences larger than 10,000 bp --library-type ddRAD \ # Simulate a double-digest library --enz PstI \ # Use the PstI enzyme as the main (rare) cutter --enz2 MspI \ # Use the MspI enzyme as the secondary (common) cutter --min-insert 225 \ # Min insert size of 225 bp --max-insert 475 \ # Max insert size of 475 bp --coverage 30 \ # Depth of coverage of 30x --read-length 100 # Read length of 100bp (paired-end 2x100bp)
More advanced parameters for the library preparation options are available in the radinitio command line options (e.g., --lib-bar1-len to specify the length of barcode sequences removed from the reads, or --lib-5-error/--lib-3-error to control sequencing errors). However, these options can be left default for the vast majority of simulations. Advanced users can additionally control these options in the radinitio.LibraryOptions() class in the Python API.
RADinitio can be used to simulate and study PCR duplicates, a common artifact in empirical RADseq libraries, which have wide-spread implications for allelic dropout and genotyping accuracy. For further details on the effects of PCR duplicates in RADseq libraries, see Rochette et al. (2023), Rivera-Colón et al. (2021), and Rivera-Colón & Catchen (2022).
The RADinitio pipeline can simulate a PCR model, which generates a distribution of PCR clone sizes and errors, which are then used when sampling sequencing reads. This model introduces both PCR duplicate reads and mutations in the sequencing reads, mimicking the error patterns seen in real-life RADseq libraries. The generation of this PCR model is off by default; however, the user can define the parameters of an amplification model from the radinitio wrapper command line options, or with the radinitio.PCRDups() class in the Python API.
When no --pcr-model is selection, the radinitio.log reports the following information:
PCR model options: PCR amplification model: None (No PCR) PCR cycles: None PCR duplicate rate: 0.00%
We see that no PCR model was selected (defaults to None). Thus, there were no PCR cycles applied, and the PCR duplicate rate is 0% (i.e., each individual sequencing reads was sampled from an unique template).
In contrast, the user can choose and apply a PCR model to the downstream data. Since version 1.2.0, RADinitio uses the PCR models described by the decoratio software (Rochette et al. 2023). These PCR models are largely comprised of two steps: (1) an amplification model (which describes the amplification probability and noise applied over some number of PCR cycles) and (2) a depth/complexity ratio which describes how the sequenced reads are sampled from the pool of amplified molecules.
The different PCR models can be specified in radinitio using the --pcr-model flag. Two main types of models are available, each with two main variants (4 models total). First, the inherited efficiency models described in Best et al. (2015). When choosing --pcr-model inheff the user controls the amplification probability and noise by controlling the mean (--pcr-inheff-mean) and standard deviation (--pcr-inheff-sd) of a normal distribution defining the amplification probability of the virtual molecules across a number of PCR amplification cycles (--pcr-cycles). A related model, --inheffbeta, uses a beta distribution for parametrization purposes instead of the normal distribution, and produces similar results. By default, radinitio uses 0.7 and 0.1 for the mean and standard deviation of these distributions, respectively.
The second class of amplification models defined the amplification probability based on a log-normal distribution. Both models (lognormal and logskewnormal) use the standard deviation of the distribution (specified with --pcr-lognormal-sd) as the main parameter controlling amplification probability and noise. The logskewnormal model uses the skew of the distribution (--pcr-lognormal-skew) as an additional parameter. For more information of these models and their specification please check the decoratio manuscript (Rochette et al. 2023) and documentation.
In addition to the amplification model, the other (and more important parameter) controlling the PCR duplicate distribution is the depth/complexity ratio. This value describes the ratio of sequenced molecules (the depth of coverage) to the molecular complexity (the number of unique molecules). In other words, if sequencing at 30x, but the complexity is 10 molecules, the library has a depth/complexity ratio of 3:1. This means that, while it is possible to sequence 30 reads, only 10 of them can have unique information and the remaining 20 reads will be duplicates. radinitio users can specify this ratio using the --pcr-deco-ratio option. Smaller values will generally reduce the rate of PCR duplicates seen in the simulated library. See Rochette et al. (2023) for further information regarding the depth/complexity ratio.
Using the models and parameters described above, radinitio users can simulate a RADseq library containing a low-to-moderate number of PCR duplicates. We are using the default library construction parameters (single-digest with SbfI, 20x coverage, 2x150bp reads, 350bp insert mean and 37bp insert stdev). Additionally, we are applying an inherited efficiency amplification model (--pcr-model inheff). The mean amplification probability is 45% (--pcr-inheff-mean 0.45) with a standard deviation of 20% (--pcr-inheff-sd 0.2) for 12 cycles of PCR (--pcr-cycles 12). This means that a DNA molecule would have a 45% chance of being amplified in each of the 12 PCR cycles. This distribution produces both lower and noisier amplification probabilities than the default inherited efficiency model (mean of 0.7 and stdev of 0.1). Finally, we are also setting the depth/complexity ratio to 1:10 (--pcr-deco-ratio 0.10).
% radinitio --simulate-all \ # Execute the complete pipeline --genome ./genome/reference.fa.gz \ # Path to reference genome --out-dir ./simulations_sdRAD/ \ # Path to output directory --library-type sdRAD \ # Simulate a single-digest library --enz SbfI \ # Use the SbfI enzyme --insert-mean 350 \ # Insert size mean of 350bp --insert-stdev 37 \ # Insert size stdev of 37bp --coverage 20 \ # Depth of coverage of 20x --read-length 150 \ # Read length of 150bp (paired-end 2x150bp) --pcr-model inheff \ # Inherited efficiency amplification model --pcr-cycles 12 \ # Amplify for 12 PCR cycles --pcr-inheff-mean 0.45 \ # Mean amplification of 45% --pcr-inheff-sd 0.2 \ # Stdev amplification of 20% --pcr-deco-ratio 0.10 # De/Co ratio of 1:10 or 0.1
After running this simulation, we can see the description of the PCR model on the radinitio.log file:
PCR model options: PCR amplification model: inheff:12:0.45:0.2 PCR cycles: 12 depth/complexity ratio: 0.1 PCR duplicates rate: 17.432%
This PCR model generated a PCR duplicate rate of 17.4%. We can see a per-clone size breakdown of the frequency of these duplicates (i.e., the proportion of clones with a given number of duplicates) in the sequenced_clone_distrib.tsv file. Additionally, the log reports the parameters of the model, including the number of cycles, depth/complexity ratio, and the "model string". This "model string" (inheff:12:0.45:0.2 in this example) is the definition of the model, as specified by decoratio. In this example, the string describes (in order, separated by ":"): (1) the selected model (inheff), (2) number of PCR cycles (12), (3) mean amplification probability (0.45), and (4) the amplification probability standard deviation (0.2). Different PCR models used by RADinitio have different s pecifications and model string. See the decoratio documentation for additional details on PCR model specifications.
RADinitio users can also change the PCR amplification models to simulated RADseq libraries with elevated duplicate rates. While this is something researchers want to avoid in their empirical libraries, experimenting with high PCR duplicates in simulations may be useful (e.g., to observe how the added noise affects the recovery and genotyping of loci and the reconstruction of biological information).
In this example, we repeat the general library construction process from above (single-digest with SbfI, 20x coverage, 2x150bp reads, 350bp insert mean); however, we modify the PCR parameters to (1) increase the noise of the inherited efficiency amplification model, and (2) drastically increase the depth/complexity ratio (from 0.1 to 10).
% radinitio --simulate-all \ # Execute the complete pipeline --genome ./genome/reference.fa.gz \ # Path to reference genome --out-dir ./simulations_sdRAD/ \ # Path to output directory --library-type sdRAD \ # Simulate a single-digest library --enz SbfI \ # Use the SbfI enzyme --insert-mean 350 \ # Insert size mean of 350bp --insert-stdev 37 \ # Insert size stdev of 37bp --coverage 20 \ # Depth of coverage of 20x --read-length 150 \ # Read length of 150bp (paired-end 2x150bp) --pcr-model inheff \ # Inherited efficiency amplification model --pcr-cycles 12 \ # Amplify for 12 PCR cycles --pcr-inheff-mean 0.25 \ # Mean amplification of 25% --pcr-inheff-sd 0.5 \ # Stdev amplification of 50% --pcr-deco-ratio 10 # De/Co ratio of 10:1 or 10
After running this simulation, we can see the description of the PCR model on the radinitio.log file:
PCR model options: PCR amplification model: inheff:12:0.25:0.5 PCR cycles: 12 depth/complexity ratio: 10 PCR duplicates rate: 94.019%
We obtain 94% PCR duplicates! Notice how we didn't change the number of PCR cycles, after all, the individual contribution of the number of cycles to PCR duplicate is negligible (see Rochette et al. 2023). This increase in PCR duplicates is mainly the product of the depth/complexity ratio, as we are sequencing 10x more reads than there are available molecules in the library. This is further exacerbated by the increased noisiness in the PCR amplification. While this is an extreme case, it provides an example of the behavior of the PCR models in RADinitio and provides useful guidelines for the monitoring of empirical RADseq experiments when elevated PCR duplicate rates are observed.
WARNING: Very noisy PCR amplification models can take a long time to run. Similarly, models with very extreme values can also fail due to several reasons. For example, if amplified clones reach very large sizes, they can trigger infinity-related errors when extracting the amplified clone size distribution.
The radinitio wrapper script simulates a population using an island model from msprime (Kelleher et al. 2016; Baumdicker et al. 2022). By default, this model simulates two populations, each with an effective population size (Ne) of 5,000 individuals. Ten individuals are sampled and sequenced per population, for a total of 20 individuals in the library.
The wrapper script allow the user to modify some parameters of this model. For example, the user can change the number of populations (--n-pops), the effective size of the populations (--pop-eff-size), and the number of sampled individuals per population (--n-seq-indv). Other parameters, e.g., controlling the migration rate (--pop-mig-rate), growth rate (--pop-growth-rate), and mutation rate (--genome-mut-rate), are available in the advanced options.
As an example, we can run a new simulation (running the whole pipeline with --simulate-all) simulating a new island model with 4 populations, each with an effective population size (Ne) of 2,500 individuals. We sampled 25 individuals per population, for a total of 100 sequenced individuals. We will also increase the total per-population, per-generation immigration rate between these populations to 2.5%. The library is generated with default parameters (e.g., single-digest with SbfI, coverage of 20x, read length of 150bp, insert distribution of 350bp, stdev 35bp, no PCR).
% radinitio --simulate-all \ # Execute the complete pipeline --genome ./genome/reference.fa.gz \ # Path to reference genome --out-dir ./sims_sdRAD_4pops/ \ # Path to output directory --n-pops 4 \ # Number of populations in the island model --pop-eff-size 2500 \ # Effective size (Ne) of the populations --n-seq-indv 25 \ # Number of individuals to sequence per-population (100 total) --pop-mig-rate 0.025 # Per generation immigration rate --library-type sdRAD \ # Simulate a single-digest library --enz SbfI # Use the SbfI enzyme
The options in the radinitio wrapper are useful for more general simulations, e.g., studying the effects of genetic diversity over the final RADseq library. However, they are limited to a relatively simple island model and the parameters are applied equaly to all populations (i.e., all populations will be of the same size and migration rates will be symmetrical). More detailed demographic models are available through the msprime documentation. RADinitio users can implement these complex simulations using the msprime Python API, after which they can generate in silico RADseq libraries from the simulated populations (see section on radinitio.make-library-seq() below).
The most common application of RADinitio is simulating an in silico digestion to estimate the number of RADseq loci in an experiment. This estimation is commonly used to estimate sequencing coverage and individuals in a library during experimental design (see section 2.3 in Rivera-Colón & Catchen (2022) for an example).
An estimation of the number of RADseq loci in a genome can be easily generated using the --tally-rad-loci option in the radinitio script. This option only identifies and filters the RADseq loci found in the genome based on the library properties, skipping the simulation of the population(s) and the generation of sequencing reads.
For example, here we tally the number of RADseq loci in the genome using the default library parameters (single-digest library with SbfI).
% radinitio --tally-rad-loci \ # Execute just the tallying of loci --genome ./genome/reference.fa.gz \ # Path to reference genome --out-dir ./tally_sbfi_loci/ \ # Path to output directory --library-type sdRAD \ # Simulate a single-digest library --enz SbfI # Use the SbfI enzyme
The radinitio.log file will report the distribution of RAD loci found in the genome. In this case, based on the single-digest SbfI simulation we find the following information in the radinitio.log (for versions 1.2.0 or higher):
Extracting RAD loci... Found a total of 6,120 SbfI loci in the genome (use this number for coverage calculations). Removed 3 loci (0.0%) too close to the ends of the reference sequences. Removed 2 loci (0.0%) containing excess (>10.0%) of Ns in sequence. Removed 428 loci (7.0%) in close proximity (<1000bp) to another locus. Retained 5,687 loci (92.9%) for downstream analysis. See `./output/tally-rad-loci/ri_sdRAD_sbfI/ref_loci_vars/reference_rad_loci.stats.gz` for a detailed description of the per-cutsite position and status.
In this example, RADinitio found 6,120 SbfI loci in the genome. Since this is a single-digest library, this means 3,060 instances of the SbfI recognition sequence (CC|TGCAGG) were found in the genome. Each cutsite yields two loci: one in the negative and one in the positive DNA strand. The two loci originating from a single restriction enzyme cutsite are treated as separate objects and are independently filtered. As stated by the log, these 6k loci can be used to estimate sequencing coverage and number of pooled individuals in the real library.
However, not all loci are retained for downstream analysis. RADinitio filters certain loci based on the properties of the sequence. For example, here 3 loci are removed from the analysis as they are too close to the end of the sequence. This distance is equal to the base size of each reference locus. By default, this distance is 1Kbp, and can be controlled with the --lib-base-len advanced option. These loci are removed since it would not be possible to extract a complete sequence from the reference. Also, loci can be removed if their reference sequence contains an excess of uncalled bases (or Ns), by default 10% (controlled with the --lib-max-prop-n advanced option). These loci are removed to prevent extracting loci spanning large gaps or regions with uncertain sequences.
More commonly, loci are removed due to their proximity to one another, as happened for 428 (or 7% of loci here). A locus is removed when it is under 1,000 bp (by default, controlled by --lib-min-dist) away from a another locus originating from a different restriction enzyme cutsite. This filtering process serves two purposes: first, it mimics the real-life process of shearing efficiency during library preparation, where shearing efficiency decreases for smaller DNA molecules, reducing the chance of recovering a molecule of the right size. Secondly, it removes loci originating from tandemly duplicated and/or repetitive cutsites. Many of these loci would be overlapping with one another in a real RADseq library, and their enzymatic digestion and shearing would be impacted by their close proximity.
Lastly, the reference_rad_loci.stats.gz file (full path provided in the radinitio.log) provides a locus-by-locus description of the status of each loci. The file reports the ID for the sequence in which the locus is located (#chrom_id), the basepair coordinate in which the main cutsite was found (cut_pos), the direction of the tag (tag_dir) in the DNA strand, and the status of each locus. An individual locus can be kept for downstream analysis (kept) or removed for one of the reasons described above, e.g., too close the end of the sequence (rm_chrom_end) or too close to an adjacent locus (rm_too_close).
#chrom_id cut_pos tag_dir status chr1 11811 neg kept chr1 11811 pos kept chr2 329 neg rm_chrom_end chr2 329 pos kept chr3 24972 neg kept chr3 24972 pos kept chr4 3940 neg kept chr4 3940 pos kept chr4 15514 neg kept chr4 15514 pos kept chr5 5137 neg kept chr5 5137 pos kept chr5 22829 neg kept chr5 22829 pos rm_too_close chr5 22954 neg rm_too_close chr5 22954 pos kept chr6 14834 neg rm_excess_Ns chr6 14834 pos rm_excess_Ns chr6 27346 neg kept chr6 27346 pos kept ...
We can alter the properties of the simulated library to look at how the number of estimated loci changes. In this example, we will simulate a new single-digest library changing the restriction enzyme from the default SbfI (which was an 8-bp recognition sequence) to EcoRI (which has the 6-bp recognition sequence G|AATTC).
% radinitio --tally-rad-loci \ # Execute just the tallying of loci --genome ./genome/reference.fa.gz \ # Path to reference genome --out-dir ./tally_ecoRI_loci/ \ # Path to output directory --library-type sdRAD \ # Simulate a single-digest library --enz EcoRI # Use the EcoRI enzyme
From this simulation we obtain:
Extracting RAD loci... Found a total of 25,836 EcoRI loci in the genome (use this number for coverage calculations). Removed 14 loci (0.0%) too close to the ends of the reference sequences. Removed 26 loci (0.1%) containing excess (>10.0%) of Ns in sequence. Removed 4,648 loci (18.0%) in close proximity (<1000bp) to another locus. Retained 21,148 loci (81.8%) for downstream analysis.
As expected, given the smaller restriction enzyme recognition sequence, we find far more EcoRI loci in the genome - 25.8 thousand compared to the 6 thousand found with SbfI. When preparing an empirical library, the user would have to take into account how this order of magnitude difference in the number of loci would impact sequencing coverage and the number of samples in a library. Most importantly, we see how the increase in the number of loci impacts their filtering, as a higher proportion of loci are removed due to close proximity.
The identification of loci in single-digest loci in RADseq libraries depends only on the distribution of the restriction enzyme recognition sequence in the genome. However, in ddRAD libraries the retained loci are impacted by the enzyme combination in addition to the insert size range (see Figure 4 in Rivera-Colón et al. 2021). RADinitio allows users to simulate both the enzyme selection and insert size range when tallying loci from ddRAD libraries.
Here, we tally the number of loci in a ddRAD library generated using the EcoRI enzyme as the main cutter and MseI as the secondary (common) cutter. We are using the default insert size distribution (mean of 350bp and stdev of 37bp). This distribution corresponds to a mimumum insert size of 276bp and a max insert size of 424bp (i.e., can also be specified with the --min-insert/--max-insert options).
% radinitio --tally-rad-loci \ # Execute just the tallying of loci --genome ./genome/reference.fa.gz \ # Path to reference genome --out-dir ./count_ddrad_loci/ \ # Path to output directory --library-type ddRAD \ # Simulate double-digest library --enz EcoRI \ # Use the EcoRI enzyme as the main (rare) cutter --enz2 MseI \ # Use the MseI enzyme as the secondary (common) cutter --min-insert 276 \ # Minimum insert size of 276 bp --min-insert 424 # Maximum insert size of 424 bp
From this simulation we find:
Extracting RAD loci... Found a total of 16,064 EcoRI-MseI loci (insert size range: 276-424bp) in the genome (use this number for coverage calculations). Note: an additional 9,550 loci were found outside the target insert size range. Removed 4 loci (0.0%) too close to the ends of the reference sequences. Removed 11 loci (0.1%) containing excess (>10.0%) of Ns in sequence. Removed 2,898 loci (18.0%) in close proximity (<1000bp) to another locus. Retained 13,151 loci (81.9%) for downstream analysis.
As for the single-digest examples above, for ddRAD libraries the radinitio.log file reports a total number of loci found in the genome (16 thousand in this example) which can be used for coverage estimations. Similarly, the log reports the filter of loci due to proximity with other loci and/or to the edge of sequences. A key difference between the single- and double-digest simulations is that radinitio reports the presence of additional ddRAD loci present outside the size range. The line starting with "Note" reports that, in addition to our 16 thousand loci, an additional 9,550 are found outside the 276-424bp size range. These loci are reported with the status of rm_no_dd_cuts in the reference_rad_loci.stats.gz file. By changing the size range, the user can determine the number of additional loci retained under different library conditions.
Here, we simulate the same library as before (ddRAD with EcoRI and MseI); however, we increase the size selection range by about 100bp (50bp in each direction), from 225bp to 475bp.
% radinitio --tally-rad-loci \ # Execute just the tallying of loci --genome ./genome/reference.fa.gz \ # Path to reference genome --out-dir ./count_ddrad_loci/ \ # Path to output directory --library-type ddRAD \ # Simulate double-digest library --enz EcoRI \ # Use the EcoRI enzyme as the main (rare) cutter --enz2 MseI \ # Use the MseI enzyme as the secondary (common) cutter --min-insert 225 \ # Minimum insert size of 225 bp --min-insert 475 # Maximum insert size of 475 bp
After simulating a ddRAD library with this increased insert size range we find:
Extracting RAD loci... Found a total of 20,465 EcoRI-MseI loci (insert size range: 225-475bp) in the genome (use this number for coverage calculations). Note: an additional 5,149 loci were found outside the target insert size range. Removed 4 loci (0.0%) too close to the ends of the reference sequences. Removed 11 loci (0.1%) containing excess (>10.0%) of Ns in sequence. Removed 3,667 loci (17.9%) in close proximity (<1000bp) to another locus. Retained 16,783 loci (82.0%) for downstream analysis.
The number of found loci increased from 16 thousand to over 20 thousand. The 100bp-increase in the insert size range lead to about 4 thousand more loci being included in the library. We see that over 5 thousand more loci are still found outside the size range, and thus this parameter could be further optimized to recover more loci.
Warning: Many of these additional ddRAD loci might not be recoverable experimentally, e.g., too small to be feasibly sequenceable (if they are smaller than the desired read length) or larger than the capacity of the PCR and/or sequencer. Thus, the objective is not to recover all the available loci in the genome.
The --make-population option from the radinitio wrapper allows the user to run the initial stage of the RADinitio pipeline, simulating a population using the **msprime** island model and generating a genome-wide VCF with the simulated genetic variants. This mode is library agnostic, and does not extract RADseq loci nor generates sequencing reads. The purpose of this mode is to generate a fixed set of genetic variants that can then be processed under a variety of library configurations (see --make-library-seq example below).
In this example, we will run radinitio under the --make-population mode. We will simulate an island model of 6 populations or demes, each with an effective population size of 10,000. We will sequence 25 individuals per population, or 150 individuals in total. Additionally, we will filter the input genome to only keep sequences larger than 1 Mbp.
% radinitio --make-population \ # Simulate the population only --genome ./genome/reference.fa.gz \ # Path to the reference genome --min-seq-len 1000000 \ # Only keep reference sequences larger than 1,000,000 bp --out-dir ./mpop_p6_ne10k_n25/ \ # Path to output directory --n-pops 6 \ # Simulate 6 populations --pop-eff-size 10000 \ # Each population has an Ne of 10,000 --n-seq-indv 25 # Sample 25 individuals per population
This mode will generate three main outputs (see description of outputs below):
Using the --make-library-seq option from the radinitio wrapper, the user can generate an in silico RADseq library using an existing population simulation. Most commonly, this existing population will be the product of an radinitio --make-populations run; however, --make-library-seq can be generated from a custom msprime (or other simulators such as SLiM or stdpopsim) run made by the user.
To run radinitio in the --make-library-seq pipeline stage, the user must provide a directory containing an existing population simulation (--make-pop-sim-dir). This directory must not be the same output directory for the run, and must contain a genome-wide master VCF as well as a population map, as described above for --make-population.
Here, we take our previous simulation of 6 populations and 150 individuals and simulate a single-digest SbfI RADseq library. The library will have a mean insert size of 450 bp (stdev of 50bp), will be PCR amplified using a default inherited efficiency model for 9 cycles, and sequenced to 30x of coverage. The output of 2x100 bp reads will be exported in fastq format (--read-out-fmt fastq). We will additionally simulate over a specific set of chromosomes in the reference sequence.
% radinitio --make-library-seq \ # Simulate the population only --genome ./genome/reference.fa.gz \ # Path to the reference genome --chromosomes ./genome/chrom.list \ # Path to the list of the target chromosomes --out-dir ./SbfI_library/ \ # Path to the output directory --make-pop-sim-dir ./mpop_p6_ne10k_n25/ \ # Path to the previous `make-populations` run --library-type sdRAD \ # Simulate single-digest library --enz SbfI \ # Use the SbfI enzyme --insert-mean 450 \ # Insert size mean of 450bp --insert-stdev 50 \ # Insert size std deviation of 50bp --pcr-model inheff \ # Inherited efficiency amplification model --pcr-cycles 9 \ # Amplify for 9 PCR cycles --coverage 30 \ # Sequence to 30x coverage --read-length 100 \ # Read length of 100bp (2x100bp paired-end) --read-out-fmt fastq # Export reads in FASTQ format
Additionally, we will simulate a double-digest library over that same existing population. The ddRAD library will be generated with the PstI and MspI restriction enzymes, and size selected for a 250-450bp insert size range. The library will be PCR amplified using an inherited efficiency model for 9 cycles, and sequenced to 30x of coverage using 2x100bp reads, in fastq format. We will additionally simulate over a specific set of chromosomes in the reference sequence.
% radinitio --make-library-seq \ # Simulate the population only --genome ./genome/reference.fa.gz \ # Path to the reference genome --chromosomes ./genome/chrom.list \ # Path to the list of the target chromosomes --out-dir ./PstI-MspI_library/ \ # Path to the output directory --make-pop-sim-dir ./mpop_p6_ne10k_n25/ \ # Path to the previous `make-populations` run --library-type ddRAD \ # Simulate double-digest library --enz PstI \ # Use the PstI enzyme as the main (rare) cutter --enz2 MspI \ # Use the MspI enzyme as the secondary (common) cutter --min-insert 250 \ # Minimum insert size of 250 bp --min-insert 450 # Maximum insert size of 450 bp --pcr-model inheff \ # Inherited efficiency amplification model --pcr-cycles 9 \ # Amplify for 9 PCR cycles --coverage 30 \ # Sequence to 30x coverage --read-length 100 \ # Read length of 100bp (2x100bp paired-end) --read-out-fmt fastq # Export reads in FASTQ format
The simplest way to generate a chromosome list file is to extract the sequence IDs from the desired fasta. The IDs must not begin with the ">" character as this is part of the fasta header convention and not part of the ID. This can be done by piping a few commands in Bash:
# Create a chrom_list.txt from all the sequences in the gzipped reference % zcat reference.fa.gz | grep '^>' | cut -f1 -d ' ' | tr -d '>' > chrom_list.txt
The user can apply other filters to remove additional sequences.
Another way of generating a chromosome list file is by parsing the genomic index (*.fai) generated by samtools faidx. See SAMtools documentation. From the gzipped fasta...
# Uncompress genome % gunzip reference.fa.gz # Run samtools faidx % samtools faidx reference.fa # Re-compress the genome % gzip reference.fa
The generated index follows the following structure (see documentation):
% head reference.fa.fai scaf1 37000 17 60 61 scaf2 60500 37651 60 61 chr1 37057500 99163 60 61 chr2 31613927 37774292 60 61 chr3 29895000 69915122 60 61 chr4 29166904 100308375 60 61 chr5 29222005 129961399 60 61
The first and second columns correspond to the sequence IDs (notice they don't have the ">") and lengths, respectively. We can generate a chromosome list by, for example, filtering the sequences by length. Here we filter the reference.fa.fai file to generate a chromosome list containing only sequences larger than 1Mbp using the program awk:
% cat reference.fa.fai | awk '$2 > 1000000 {print $1}' > chrom_list.1mbp.txt
The outputs of a default radinitio --simulate-all run are:
% ls * output_directory: popmap.tsv radinitio.log sequenced_clone_distrib.tsv output_directory/msprime_vcfs: chr1.vcf.gz chr7.vcf.gz chr13.vcf.gz chr19.vcf.gz chr2.vcf.gz chr8.vcf.gz chr14.vcf.gz chr20.vcf.gz chr3.vcf.gz chr9.vcf.gz chr15.vcf.gz chr21.vcf.gz chr4.vcf.gz chr10.vcf.gz chr16.vcf.gz chr22.vcf.gz chr5.vcf.gz chr11.vcf.gz chr17.vcf.gz chr23.vcf.gz chr6.vcf.gz chr12.vcf.gz chr18.vcf.gz chr24.vcf.gz output_directory/rad_alleles: msp_00.alleles.fa.gz msp_10.alleles.fa.gz msp_01.alleles.fa.gz msp_11.alleles.fa.gz msp_02.alleles.fa.gz msp_12.alleles.fa.gz msp_03.alleles.fa.gz msp_13.alleles.fa.gz msp_04.alleles.fa.gz msp_14.alleles.fa.gz msp_05.alleles.fa.gz msp_15.alleles.fa.gz msp_06.alleles.fa.gz msp_16.alleles.fa.gz msp_07.alleles.fa.gz msp_17.alleles.fa.gz msp_08.alleles.fa.gz msp_18.alleles.fa.gz msp_09.alleles.fa.gz msp_19.alleles.fa.gz dropped_alleles.tsv.gz output_directory/rad_reads: msp_00.1.fa.gz msp_07.1.fa.gz msp_14.1.fa.gz msp_00.2.fa.gz msp_07.2.fa.gz msp_14.2.fa.gz msp_01.1.fa.gz msp_08.1.fa.gz msp_15.1.fa.gz msp_01.2.fa.gz msp_08.2.fa.gz msp_15.2.fa.gz msp_02.1.fa.gz msp_09.1.fa.gz msp_16.1.fa.gz msp_02.2.fa.gz msp_09.2.fa.gz msp_16.2.fa.gz msp_03.1.fa.gz msp_10.1.fa.gz msp_17.1.fa.gz msp_03.2.fa.gz msp_10.2.fa.gz msp_17.2.fa.gz msp_04.1.fa.gz msp_11.1.fa.gz msp_18.1.fa.gz msp_04.2.fa.gz msp_11.2.fa.gz msp_18.2.fa.gz msp_05.1.fa.gz msp_12.1.fa.gz msp_19.1.fa.gz msp_05.2.fa.gz msp_12.2.fa.gz msp_19.2.fa.gz msp_06.1.fa.gz msp_13.1.fa.gz msp_06.2.fa.gz msp_13.2.fa.gz output_directory/ref_loci_vars: reference_rad_loci_SbfI.fa.gz ri_master.vcf.gz reference_rad_loci_SbfI.stats.gz
The main (or top level) output directory of a radinitio simulation run. Contains three text files, in addition to several subdirectories described below.
A population map file (popmap.tsv) containing the ID of each individual and assignment to a given population. The popmap file is a tab-delimited file, with sample IDs in the first column, and population IDs in the second column. When running the radinitio wrapper, the sample IDs are labelled according to the msprime simulation have the msp_ prefix (or tsk_, in msprime 1.0.0+ versions).
msp_00 pop0 msp_01 pop0 msp_02 pop0 msp_03 pop1 msp_04 pop1 msp_05 pop1
The sequenced_clone_distrib.tsv contains information regarding the sequenced clone distribution. This distribution is the product of decoratio's PCR model and is representative of the PCR duplicates observed in the simulated reads, if any. By default, no PCR cycles are enabled, thus the model returns:
clone_size n_errors clone_prob 0 0 0 1 0 1.0 1 1 0
In this distribution, 100% of reads originate from a sequenced clone of size 1 (a clone containing a single molecule) without any errors. Since there was no PCR amplification, the original template molecule is sampled and "sequenced". However, if PCR cycles are defined by the user using the PCR models more complex clones can be sampled, resulting in PCR duplicates.
clone_size n_errors clone_prob 0 0 0 1 0 0.722149 1 1 8.58233e-05 2 0 0.211999 2 1 3.77304e-05 2 2 1.25639e-05 3 0 0.0521028 3 1 1.10092e-05 3 2 5.04165e-06 3 3 1.72311e-06
Here, the clone_size describes the number of molecules in a clone. A clone size of n means that n molecules where sampled from a single clone. n copies of the same sequence will be saved as reads, with n - 1 being PCR duplicates. Moreover, clones can also contain molecules with PCR error (n_errors). A clone of size n can contain up to n molecules with error. clone_prob is the probability of sampling a clone with given size and error properties, and describes the distribution of PCR duplicates and errors in the simulated data.
The msprime_vcfs/ directory containing the genetic variants simulated by msprime (derived from the tskit.TreeSequence.write_vcf() function) for each individual chromosome.
The ref_loci_vars contains the reference information for the simulated RAD loci as well as the genome-wide variants generated from the population simulation.
A genome-wide VCF containing the combined sequences from all simulated chromosomes (ri_master.vcf.gz). Here, the variants simulated by msprime, which may encode unspecified alleles, are projected against the sequence of the reference genome to determine the identify of the reference alleles. To determine alterative alleles, RADinito implements a MutationModel(), which can mutate the reference alleles based on a substitution model (specified with --mut-subs-model) and introduce indels at a specified rate (--mut-indel-prob).
The reference loci file reports the sequence of all the reference loci present in the genome, i.e., the genomic sequence corresponding to the 1 Kbp spanning the extracted loci. These sequenced are exported as a gzipped fasta file (reference_rad_loci.fa.gz).
>t0000n ref_pos=chr1:10818-11817 TGCAGGACTCTGTACAGCAGTTCACACACTTATACCTCCTGAGACGTGTTGTACATTTGTC... >t0001p ref_pos=chr1:11814-12813 TGCAGGATCAGACCATTGTGTTGTTATTGAGAGCTAGCGTTAGCTCCACCTCAAACCGGTA... >t0002p ref_pos=chr2:332-1331 TGCAGGGAGGGTGGGAGAACTTTCTCCGTTAACAAAAAATGAGGCTCCGTCTCTGTGATTC... >t0003n ref_pos=chr3:23979-24978 TGCAGGTCAGTGAACTCACCCCCACTGCAGGCGTCCTCTGGACTGAACCAACAGAAGCTGG... >t0004n ref_pos=chr3:24974-25973 TGCAGGCGATCCAGTCCACAGACCGGCTACCTGGATCATGCCGACGTCCGACAGTGTCTCC...
The sequence IDs of each locus are composed of t for "true locus", number representing the ordering of the cutsite in the genome (e.g., 0000 for the first cutsite on position 11,811 of chr1), and either n or p to denote loci in the negative and positive strand, respectively. The fasta header also provides the coordinates of the locus sequence in the reference. Note that locus from the negative strand have been reverser complimented, with the restriction enzyme overhang present at the 5' of the sequence.
The reference_rad_loci.stats.gz file provides a locus-by-locus description of the status of each loci. The file reports the ID for the sequence in which the locus is located (#chrom_id), the basepair coordinate in which the main cutsite was found (cut_pos), the direction of the tag (tag_dir) in the DNA strand, and the status of each locus. An individual locus can be kept for downstream analysis (kept) or removed for one of the reasons described above, e.g., too close the end of the sequence (rm_chrom_end) or too close to an adjacent locus (rm_too_close). When simulating ddRAD libraries, the file will also report loci with additional secondary cutsites outside the insert size range (rm_no_dd_cuts).
#chrom_id cut_pos tag_dir status chr1 11811 neg kept chr1 11811 pos kept chr2 329 neg rm_chrom_end chr2 329 pos kept chr3 24972 neg kept chr3 24972 pos kept chr4 3940 neg kept chr4 3940 pos kept chr4 15514 neg kept chr4 15514 pos kept chr5 5137 neg kept chr5 5137 pos kept chr5 22829 neg kept chr5 22829 pos rm_too_close chr5 22954 neg rm_too_close chr5 22954 pos kept chr6 14834 neg rm_excess_Ns chr6 14834 pos rm_excess_Ns chr6 27346 neg kept chr6 27346 pos kept
rad_alleles/ contains per-individual gzipped fasta files contining the RAD alleles for each simulated sample. For each allele, the fasta headers contains the following information:
>reference_locus_id:sample_id:allele_i
Here, reference_locus_id referes to the original reference locus the allele was generated. sample_id is the msprime sample id, and allele_i refers to the allele number (a1 or a2 in diploid individuals).
>t0000n:msp_00:a1 TGCAGGCACGCAAGCGGCTCTAGGGATTCCTTGAGAGGAAAAATGCAACGCTGGTTTAACG... >t0000n:msp_00:a2 TGCAGGCACGCAAGCGGCTCTAGGGATTCCTTGAGAGGAAAAATGCAACGCTGGTTTAACG... >t0001p:msp_00:a1 TGCAGGGGCCCCCGCACCACACGTTGGGAACCCCTGATTGACCAGAAACATATTAAAGTTG... >t0001p:msp_00:a2 TGCAGGGGCCCCCGCACCACACGTTGGGAACCCCTGATTGACCACAAACATATTAAAGTTG...
The rad_reads/ directory contains per-individual gzipped paired-end fasta (or fastq) files contining the RAD reads for each simulated sample. For each individual read, the sequence headers contain the following information (for fasta output in this example):
>reference_locus_id:sample_id:allele_i:clone_id:duplicate_i/read_pair
The basename of the read, which contains >reference_locus_id:sample_id:allele_i, comes from the source locus of the reads. clone_id referes to the source template id, while duplicate_i is the duplicate number for each read in the clone. read_pair is the standard designation for paired-end reads, /1 for forward read and /2 for the corresponding paired read.
Notice in the fasta example below that for reads 3 and 4, they both have a clone_id of 3. Read 3 has a clone_i of 1, meaning is the first read in the clone. Read 4 has a clone_i of 2, meaning it is the second read in the clone and thus a duplicate product of PCR.
>t0005n:msp_00:a2:1:1/1 # Read 1 TGCAGGCTTAGGCTACACCCAACGAGCCACTGGGTGCAGTTGGACCCGAGGGATCTGTAT... >t0004n:msp_00:a1:2:1/1 # Read 2 TGCAGGTCTCCTGCTACCACCATTCCACCTGTGCAGCTCATCCAGAATGCAGCAGCTCCA... >t0002n:msp_00:a2:3:1/1 # Read 3 TGCAGGAAGAGACGAACCCATCGCCAGTACCACCCCCTGTACATCTGACATCACTCTTAC... >t0002n:msp_00:a2:3:2/1 # Read 4 TGCAGGAAGAGACGAACCCATCGCCAGTACCACCCCCTGTACATCTGACATCACTCTTAC... >t0005p:msp_00:a1:4:1/1 # Read 5 TGCAGGGCTTCCACTCCATACTTGTAAAATTAGGGGAAAACACTTTTTTTAACCTCCCAG...
The contents of rad_reads/ contain the main output of RADinitio. These reads behave as empirical paired-end reads and are fully compatible with the majority of bioinformatic software. Reads can be exported in both fasta (default) and fastq formats, as specified by the --read-out-fmt option in the radinitio wrapper. Both output formats report the sample allele information, as well as the identity of PCR clones. For an example in fastq format:
@t0005n:msp_00:a2:1:1/1 # Read 1 TGCAGGCTTAGGCTACACCCAACGAGCCACTGGGTGCAGTTGGACCCGAGGGATCTGTAT... + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF... @t0004n:msp_00:a1:2:1/1 # Read 2 TGCAGGTCTCCTGCTACCACCATTCCACCTGTGCAGCTCATCCAGAATGCAGCAGCTCCA... + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF... @t0002n:msp_00:a2:3:1/1 # Read 3 TGCAGGAAGAGACGAACCCATCGCCAGTACCACCCCCTGTACATCTGACATCACTCTTAC... + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF... @t0002n:msp_00:a2:3:2/1 # Read 4 TGCAGGAAGAGACGAACCCATCGCCAGTACCACCCCCTGTACATCTGACATCACTCTTAC... + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF... @t0005p:msp_00:a1:4:1/1 # Read 5 TGCAGGGCTTCCACTCCATACTTGTAAAATTAGGGGAAAACACTTTTTTTAACCTCCCAG... + FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF...
If you use RADinitio on your work, please cite our Mol Ecol Resour manuscript:
Rivera-Colón AG, Rochette NC, Catchen JM. (2021) Simulation with RADinitio improves RADseq experimental design and sheds light on sources of missing data. Mol Ecol Resour 21: 363-378. DOI: 10.1111/1755-0998.13163
If you use the PCR amplification models, please cite the Mol Ecol Resour decoratio manuscript:
Rochette NC, Rivera-Colón AG, Walsh J, Sanger TJ, Campbell-Staton SC, & Catchen JM (2023). On the causes, consequences, and avoidance of PCR duplicates: Towards a theory of library complexity. Mol Ecol Resour, 00, 1-20. DOI: 10.1111/1755-0998.13800
If your RADinitio simulations use msprime's island models, please cite:
Kelleher J, Etheridge AM, McVean G (2016) Efficient Coalescent simulation and genealogical analysis for large sample sizes. PLoS Comput Biol 12(5): e1004842. DOI: 10.1371/journal.pcbi.1004842
Baumdicker F, Bisschop G, Goldstein D, Gower G, Ragsdale AP, et al. (2022) Efficient ancestry and mutation simulation with msprime 1.0. Genetics (220)3, iyab229. DOI: 10.1093/genetics/iyab229
RADinitio requires Python 3.7 or higher.
If you have any questions or are encountering any problems when running RADinitio, please drop your questions in the Stacks-users mailing list.
The example below described a complete simulation run, more or less equivalent to a radinitio.simulate() run.
import radinitio, os # General inputs and outputs genome_fa = './genome/reference.fa.gz' out_dir = './radinitio_api_sims' chrom_list = './genome/chrom_list.txt' # 1. Prepare the input genome # Read the chromosome list chroms = [] with open(chrom_list) as fh: for line in fh: chroms.append(line.strip('\n')) chroms = set(chroms) # Load the reference into a genome dictionary genome_dict = radinitio.load_genome( fasta_f=genome_fa, chrom_selection=chroms, min_seq_len=1_000_000) # Assign a recombination rate for chrom in genome_dict: genome_dict[chrom].rec = 3e-8 # 2. Prepare the population simulations # Set parameters for the island model msprime_island_model = radinitio.simple_msp_island_model( n_seq_indv=10, pop_eff_size=2500.0, n_pops=4, mutation_rate=7e-8, pop_immigration_rate=0.025) # Simulate with msprime msp_vcf_dir = f'{out_dir}/msprime_vcfs' os.mkdir(msp_vcf_dir) radinitio.sim_all_chromosomes( msprime_simulate_args=msprime_island_model, genome_dict=genome_dict, out_dir=msp_vcf_dir) # Make popmap popmap_f = f'{out_dir}/popmap.tsv' radinitio.write_popmap( popmap_file=popmap_f, population_configurations=msprime_island_model['population_configurations']) # 3. Process the genomic variants # Make mutation model mutation_params = radinitio.MutationModel( substitution_matrix='equal', indel_prob=0.05, ins_del_ratio=1.0) # Process and merge genomic variants ref_dir = f'{out_dir}/ref_loci_vars' os.mkdir(ref_dir) radinitio.merge_vcf( genome_dict=genome_dict, msp_vcf_dir=msp_vcf_dir, out_dir=ref_dir, mutation_opts=mutation_params) # 4. Simulate library and extract loci # Library Parameters library_params = radinitio.LibraryOptions( library_type='sdRAD', renz_1='SbfI', insert_mu=450, insert_sigma=45, coverage=25, read_len=125, output_format='fastq') # Find loci ref_loci_dict = radinitio.extract_reference_rad_loci( genome_dict=genome_dict, out_dir=ref_dir, library_opts=library_params) # 5. Process RAD alleles # Filter to obtain RAD-specific variants loci_pos_set = radinitio.create_rad_pos_set_dict( loci_dict=ref_loci_dict) rad_variants = radinitio.filter_rad_variants( loci_dict=ref_loci_dict, pos_set_dict=loci_pos_set, m_vcf_dir=ref_dir, library_opts=library_params) # Extract alleles alleles_dir = f'{out_dir}/rad_alleles' os.mkdir(alleles_dir) radinitio.extract_rad_alleles( loci_dict=ref_loci_dict, rad_variants_dict=rad_variants, popmap_file=popmap_f, out_dir=alleles_dir, library_opts=library_params) # 6. Amplify and sequence # PCR model pcr_params = radinitio.PCRDups( pcr_main_model='inheff', pcr_cycles=9, inheff_mean=0.45, inheff_sd=0.2, deco_ratio=0.25, library_opts=library_params) # Sequence reads_dir = f'{out_dir}/rad_reads' os.mkdir(reads_dir) radinitio.sequence_library( out_dir=reads_dir, popmap_file=popmap_f, alleles_dir=alleles_dir, library_opts=library_params, mutation_opts=mutation_params, pcr_opts=pcr_params) # Done!