2.3. Read simulation

In this chapter, we show on several basic example how to simulate reads using a component of RNFtools called MISmash.

2.3.1. Basic example

First, let us show how to simulate reads from a single genome, stored in a FASTA file, using a single simulator. A corresponding RNFtools configuration script can look as follows:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
import rnftools

rnftools.mishmash.sample("simple_example",reads_in_tuple=1)

# put your reference genome here
fa="../../../example1.fa"

rnftools.mishmash.ArtIllumina(
	fasta=fa,
	number_of_read_tuples=10000,
	read_length_1=100,
	read_length_2=0,
)

include: rnftools.include()
rule: input: rnftools.input()

Lines 1, 15, 16 have been already described in the previous chapter. Function rnf.mishmash.sample (line 3) initializes a new sample of simulated “single-end reads”. When a new instance of rnftools.mishmash.ArtIllumina is created (line 8), it is automatically registered to the last created sample. This class is used for simulating reads by the Art Illumina read simulator. The parameters signalize parameters of the simulation: fasta is the reference file, number_of_read_tuples sets number of simulated read tuples, and read_length_1 and read_length_2 indicate lengths of simulated reads.

Within the RNF framework, a single simulated unit is a read tuple, which consists of one or more reads. For more details, see the RNFtools paper. read_length_2=0 implies “single-end” simulation (in our notation: a single read in every read tuple).

When we run snakemake, reads are simulated and we obtain the final simple_example.fq file with all the simulated reads.

RNFtools supports several different read simulators. Their use is similar, though their interfaces are slightly different. A full documentation of all the supported simulators with all their parameters is available on the page MIShmash.

2.3.2. Simulation of ‘paired-end’ reads

To simulate “paired-end reads” (i.e., read tuples of two reads), two minor changes must be made in the original Snakefile. First, rnftools.mishmash.sample must be called with reads_in_tuple=2, then the length of second reads of a tuple must be set to a non-zero value.

Then the final Snakefile can be:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
import rnftools

rnftools.mishmash.sample("simple_example",reads_in_tuple=2)

fa="../../../example1.fa"

rnftools.mishmash.ArtIllumina(
	fasta=fa,
	number_of_read_tuples=10000,
	read_length_1=100,
	read_length_2=100,
)

include: rnftools.include()
rule: input: rnftools.input()

2.3.3. Different simulator

To change a simulator in our example, we just replace rnftools.mishmash.ArtIllumina by another simulator, e.g., rnftools.mishmash.ArtIllumina. Parameters as fasta, read_length_1, read_length_2, and number_of_read_tuples are the same for all the simulators, but with several limitations:

  • CuReSim supports only “single-end reads”.
  • ART Illumina in “paired-end” mode can simulate only reads of equal lengths.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
import rnftools

rnftools.mishmash.sample("simple_example",reads_in_tuple=1)

fa="../../../example1.fa"

rnftools.mishmash.DwgSim(
	fasta=fa,
	number_of_read_tuples=10000,
	read_length_1=100,
	read_length_2=0,
)

include: rnftools.include()
rule: input: rnftools.input()

2.3.4. More genomes

To simulate reads from multiples reference within a single sample (in order to simulate, e.g., a metagenome or a contamination), we create a new instance of class of a simulator for each reference.

In the example below, we are simulating 10.000 read tuples from two reference genomes.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
import rnftools

rnftools.mishmash.sample("simple_example",reads_in_tuple=1)

fa1="../../../example1.fa"
fa2="../../../example2.fa"

rnftools.mishmash.ArtIllumina(
	fasta=fa1,
	number_of_read_tuples=10000,
	read_length_1=100,
	read_length_2=0,
)

rnftools.mishmash.ArtIllumina(
	fasta=fa2,
	number_of_read_tuples=10000,
	read_length_1=100,
	read_length_2=0,
)

include: rnftools.include()
rule: input: rnftools.input()

Once reads are simulated for each of the references, they are mixed and put into the resulting FASTQ file.

2.3.5. More samples

We can also simulated multiple samples using a single Snakemake.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
import rnftools

rnftools.mishmash.sample("simple_end_simulation",reads_in_tuple=1)

fa="../../../example1.fa"

rnftools.mishmash.ArtIllumina(
	fasta=fa,
	number_of_read_tuples=10000,
	read_length_1=100,
	read_length_2=0,
)

rnftools.mishmash.sample("paired_end_simulation",reads_in_tuple=2)

rnftools.mishmash.ArtIllumina(
	fasta=fa,
	number_of_read_tuples=10000,
	read_length_1=100,
	read_length_2=100,
)

include: rnftools.include()
rule: input: rnftools.input()

2.3.6. Sequence extraction

It may be sometimes useful to extract certain sequences from the reference file before the simulation itself. For instance, reads from each sequence could be simulated with a different coverage. For this purpose, we can use the sequences parameter. Sequences for extraction can be specified either by their number in the original FASTA file (starting from 0), or using their name.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
import rnftools

rnftools.mishmash.sample("reads_first_sequence",reads_in_tuple=1)

rnftools.mishmash.ArtIllumina(
	fasta="example.fa",
	sequences=[0],
	number_of_read_tuples=10000,
	read_length_1=100,
	read_length_2=0,
)

rnftools.mishmash.sample("reads_second_sequence",reads_in_tuple=1)

rnftools.mishmash.ArtIllumina(
	fasta="example.fa",
	sequences=['seq2'],
	number_of_read_tuples=10000,
	read_length_1=100,
	read_length_2=0,
)


include: rnftools.include()
rule: input: rnftools.input()

2.3.7. Non-standard parameters

Not all command-line parameters of every simulator are directly supported by RNFtools. However, such parameters can still be passed through the parameter other_params like in this example:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
import rnftools

rnftools.mishmash.sample("simple_example",reads_in_tuple=1)

fa="../../../example1.fa"

rnftools.mishmash.ArtIllumina(
	fasta=fa,
	number_of_read_tuples=10000,
	read_length_1=100,
	read_length_2=0,
	other_params="-qs 5",
)

include: rnftools.include()
rule: input: rnftools.input()