2.3. Read simulation¶
In this chapter, we show on several basic example how to simulate reads using a component of RNFtools called MISmash.
2.3.1. Basic example¶
First, let us show how to simulate reads from a single genome, stored in a FASTA file, using a single simulator. A corresponding RNFtools configuration script can look as follows:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | import rnftools
rnftools.mishmash.sample("simple_example",reads_in_tuple=1)
# put your reference genome here
fa="../../../example1.fa"
rnftools.mishmash.ArtIllumina(
fasta=fa,
number_of_read_tuples=10000,
read_length_1=100,
read_length_2=0,
)
include: rnftools.include()
rule: input: rnftools.input()
|
Lines 1, 15, 16 have been already described in the previous chapter. Function
rnf.mishmash.sample
(line 3) initializes a new sample of simulated
“single-end reads”. When a new instance of rnftools.mishmash.ArtIllumina
is
created (line 8), it is automatically registered to the last created sample.
This class is used for simulating reads by the Art Illumina read simulator. The
parameters signalize parameters of the simulation: fasta
is the reference
file, number_of_read_tuples
sets number of simulated read tuples, and
read_length_1
and read_length_2
indicate lengths of simulated reads.
Within the RNF framework, a single simulated unit is a read tuple, which
consists of one or more reads. For more details, see the RNFtools paper.
read_length_2=0
implies “single-end” simulation (in our notation: a single
read in every read tuple).
When we run snakemake
, reads are simulated and we obtain the final
simple_example.fq
file with all the simulated reads.
RNFtools supports several different read simulators. Their use is similar, though their interfaces are slightly different. A full documentation of all the supported simulators with all their parameters is available on the page MIShmash.
2.3.2. Simulation of ‘paired-end’ reads¶
To simulate “paired-end reads” (i.e., read tuples of two reads), two minor
changes must be made in the original Snakefile
. First,
rnftools.mishmash.sample
must be called with reads_in_tuple=2
, then
the length of second reads of a tuple must be set to a non-zero value.
Then the final Snakefile
can be:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | import rnftools
rnftools.mishmash.sample("simple_example",reads_in_tuple=2)
fa="../../../example1.fa"
rnftools.mishmash.ArtIllumina(
fasta=fa,
number_of_read_tuples=10000,
read_length_1=100,
read_length_2=100,
)
include: rnftools.include()
rule: input: rnftools.input()
|
2.3.3. Different simulator¶
To change a simulator in our example, we just replace
rnftools.mishmash.ArtIllumina
by another simulator, e.g.,
rnftools.mishmash.ArtIllumina
. Parameters as fasta
, read_length_1
,
read_length_2
, and number_of_read_tuples
are the same for all the
simulators, but with several limitations:
- CuReSim supports only “single-end reads”.
- ART Illumina in “paired-end” mode can simulate only reads of equal lengths.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | import rnftools
rnftools.mishmash.sample("simple_example",reads_in_tuple=1)
fa="../../../example1.fa"
rnftools.mishmash.DwgSim(
fasta=fa,
number_of_read_tuples=10000,
read_length_1=100,
read_length_2=0,
)
include: rnftools.include()
rule: input: rnftools.input()
|
2.3.4. More genomes¶
To simulate reads from multiples reference within a single sample (in order to simulate, e.g., a metagenome or a contamination), we create a new instance of class of a simulator for each reference.
In the example below, we are simulating 10.000 read tuples from two reference genomes.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 | import rnftools
rnftools.mishmash.sample("simple_example",reads_in_tuple=1)
fa1="../../../example1.fa"
fa2="../../../example2.fa"
rnftools.mishmash.ArtIllumina(
fasta=fa1,
number_of_read_tuples=10000,
read_length_1=100,
read_length_2=0,
)
rnftools.mishmash.ArtIllumina(
fasta=fa2,
number_of_read_tuples=10000,
read_length_1=100,
read_length_2=0,
)
include: rnftools.include()
rule: input: rnftools.input()
|
Once reads are simulated for each of the references, they are mixed and put into the resulting FASTQ file.
2.3.5. More samples¶
We can also simulated multiple samples using a single Snakemake
.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | import rnftools
rnftools.mishmash.sample("simple_end_simulation",reads_in_tuple=1)
fa="../../../example1.fa"
rnftools.mishmash.ArtIllumina(
fasta=fa,
number_of_read_tuples=10000,
read_length_1=100,
read_length_2=0,
)
rnftools.mishmash.sample("paired_end_simulation",reads_in_tuple=2)
rnftools.mishmash.ArtIllumina(
fasta=fa,
number_of_read_tuples=10000,
read_length_1=100,
read_length_2=100,
)
include: rnftools.include()
rule: input: rnftools.input()
|
2.3.6. Sequence extraction¶
It may be sometimes useful to extract certain sequences from the reference file
before the simulation itself. For instance, reads from each sequence could be
simulated with a different coverage. For this purpose, we can use the
sequences
parameter. Sequences for extraction can be specified either by
their number in the original FASTA file (starting from 0), or using their name.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | import rnftools
rnftools.mishmash.sample("reads_first_sequence",reads_in_tuple=1)
rnftools.mishmash.ArtIllumina(
fasta="example.fa",
sequences=[0],
number_of_read_tuples=10000,
read_length_1=100,
read_length_2=0,
)
rnftools.mishmash.sample("reads_second_sequence",reads_in_tuple=1)
rnftools.mishmash.ArtIllumina(
fasta="example.fa",
sequences=['seq2'],
number_of_read_tuples=10000,
read_length_1=100,
read_length_2=0,
)
include: rnftools.include()
rule: input: rnftools.input()
|
2.3.7. Non-standard parameters¶
Not all command-line parameters of every simulator are directly supported by
RNFtools. However, such parameters can still be passed through the parameter
other_params
like in this example:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | import rnftools
rnftools.mishmash.sample("simple_example",reads_in_tuple=1)
fa="../../../example1.fa"
rnftools.mishmash.ArtIllumina(
fasta=fa,
number_of_read_tuples=10000,
read_length_1=100,
read_length_2=0,
other_params="-qs 5",
)
include: rnftools.include()
rule: input: rnftools.input()
|