3.1. MIShmash

This section documents all supported NGS read simulators. Note that classes can be referred to by simplified names such as rnftools.mishmash.ArtIllumina (instead of the longer rnftools.mishmash.ArtIllumina.ArtIllumina). General information about read simulators can be found on a separate page, Exhaustive list of read simulators.

3.1.1. Classes for individual simulators

3.1.1.1. Art-Illumina: rnftools.mishmash.ArtIllumina

class rnftools.mishmash.ArtIllumina.ArtIllumina(fasta, sequences=None, coverage=0, number_of_read_tuples=0, read_length_1=100, read_length_2=0, distance=500, distance_deviation=50.0, rng_seed=1, other_params='')[source]

Bases: rnftools.mishmash.Source.Source

Class for ART Illumina (http://www.niehs.nih.gov/research/resources/software/biostatistics/art/).

Both single-end and paired-end simulations are supported. For paired-end simulations, the lengths of both ends must be equal.

Parameters:
  • fasta (str) – File name of the genome from which read tuples are created (FASTA file). Corresponding ART parameter: -i --in.
  • sequences (set of int or str) – FASTA sequences to extract. Sequences can be specified either by their ids, or by their names.
  • coverage (float) – Average coverage of the genome. Corresponding ART parameter: -f --fcov.
  • number_of_read_tuples (int) – Number of read tuples.
  • read_length_1 (int) – Length of the first read. Corresponding ART parameter: -l --len.
  • read_length_2 (int) – Length of the second read (if zero, then single-end reads are simulated). Corresponding ART parameter: -l --len.
  • distance (int) – Mean inner distance between reads. Corresponding ART parameter: -m --mflen.
  • distance_deviation (int) – Standard deviation of inner distances between reads. Corresponding ART parameter: -s --sdev.
  • rng_seed (int) – Seed for simulator’s random number generator. Corresponding ART parameter: -rs --rndSeed.
  • other_params (str) – Other parameters to be passed on the command line.
Raises:

ValueError

create_fq()[source]

Simulate reads.

get_input()[source]

Get list of input files (required to do simulation).

Returns: List of input files
Return type: list

get_output()[source]

Get list of output files (created during simulation).

Returns: List of output files
Return type: list

3.1.1.2. CuReSim: rnftools.mishmash.CuReSim

class rnftools.mishmash.CuReSim.CuReSim(fasta, sequences=None, coverage=0, number_of_read_tuples=0, read_length_1=100, read_length_2=0, random_reads=False, rng_seed=1, other_params='')[source]

Bases: rnftools.mishmash.Source.Source

Class for CuReSim (http://www.pegase-biosciences.com/curesim-a-customized-read-simulator).

Only single-end reads simulations are supported.

Parameters:
  • fasta (str) – File name of the genome from which reads are created (FASTA file). Corresponding CuReSim parameter: -f.
  • sequences (set of int or str) – FASTA sequences to extract. Sequences can be specified either by their ids, or by their names.
  • coverage (float) – Average coverage of the genome (if number_of_read_tuples is specified, then coverage must be equal to zero).
  • number_of_read_tuples (int) – Number of read tuples (if coverage specified, then it must be equal to zero). Corresponding CuReSim parameter: -n.
  • read_length_1 (int) – Length of the first read. Corresponding CuReSim parameter: -m.
  • read_length_2 (int) – Length of the second read. Fake parameter (unsupported by CuReSim).
  • random_reads (bool) – Simulate random reads (see CuReSim documentation for more details).
  • rng_seed (int) – Seed for simulator’s random number generator. Fake parameter (unsupported by CuReSim).
  • other_params (str) – Other parameters which are used on command-line.
Raises:

ValueError
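The coverage and number_of_read_tuples parameters are mutually exclusive, and CuReSim itself only accepts a number of reads (-n). The following minimal sketch shows the usual relationship between the two quantities. It is not the RNFtools implementation: the helper names and the genome_length argument are hypothetical, and the rounding rule (rounding up) is an assumption.

```python
import math


def read_tuples_for_coverage(coverage, genome_length, read_length_1, read_length_2=0):
    """Number of read tuples needed to reach the given average coverage.

    Hypothetical helper: coverage = tuples * bases_per_tuple / genome_length.
    """
    bases_per_tuple = read_length_1 + read_length_2
    if bases_per_tuple <= 0:
        raise ValueError("Read lengths must sum to a positive number.")
    # Assumption: round up so that the requested coverage is reached.
    return math.ceil(coverage * genome_length / bases_per_tuple)


def resolve_number_of_read_tuples(coverage, number_of_read_tuples, genome_length,
                                  read_length_1, read_length_2=0):
    """Apply the 'exactly one of the two must be non-zero' rule."""
    if coverage != 0 and number_of_read_tuples != 0:
        raise ValueError("Specify either coverage or number_of_read_tuples, not both.")
    if coverage == 0 and number_of_read_tuples == 0:
        raise ValueError("One of coverage and number_of_read_tuples must be non-zero.")
    if number_of_read_tuples != 0:
        return number_of_read_tuples
    return read_tuples_for_coverage(coverage, genome_length, read_length_1, read_length_2)
```

For example, a 10x coverage of a 1000 bp genome with 100 bp single-end reads corresponds to 100 read tuples.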

create_fq()[source]

Simulate reads.

get_input()[source]

Get list of input files (required to do simulation).

Returns: List of input files
Return type: list

get_output()[source]

Get list of output files (created during simulation).

Returns: List of output files
Return type: list

static recode_curesim_reads(curesim_fastq_fo, rnf_fastq_fo, fai_fo, genome_id, number_of_read_tuples=1000000000, recode_random=False)[source]

Recode CuReSim output FASTQ file to the RNF-compatible output FASTQ file.

Parameters:
  • curesim_fastq_fo (file object) – File object of CuReSim FASTQ file.
  • rnf_fastq_fo (file object) – File object of RNF FASTQ.
  • fai_fo (file object) – File object for FAI file of the reference genome.
  • genome_id (int) – RNF genome ID to be used.
  • number_of_read_tuples (int) – Expected number of read tuples (to estimate number of digits in RNF).
  • recode_random (bool) – Recode random reads.
Raises:

ValueError
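The number_of_read_tuples argument is used only to decide how many digits the read tuple IDs need. A minimal sketch of such a computation follows. The function names are hypothetical, and the use of hexadecimal IDs is an assumption made for illustration; consult the RNF specification for the actual encoding.

```python
def id_width(number_of_read_tuples, base=16):
    """Digits needed to write the largest expected read tuple ID.

    Assumption: IDs are written in the given base (hexadecimal by default).
    """
    if number_of_read_tuples < 1:
        raise ValueError("number_of_read_tuples must be positive.")
    width = 0
    remaining = number_of_read_tuples
    while remaining > 0:
        remaining //= base
        width += 1
    return width


def format_read_tuple_id(read_tuple_id, width):
    """Zero-pad a hexadecimal read tuple ID to a fixed width (assumed format)."""
    return format(read_tuple_id, "x").zfill(width)
```

Using a fixed width keeps read names sortable as plain strings, which is why an upper estimate of the number of read tuples is sufficient.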

3.1.1.3. DwgSim: rnftools.mishmash.DwgSim

class rnftools.mishmash.DwgSim.DwgSim(fasta, sequences=None, coverage=0, number_of_read_tuples=0, read_length_1=100, read_length_2=0, distance=500, distance_deviation=50.0, rng_seed=1, haploid_mode=False, error_rate_1=0.02, error_rate_2=0.02, mutation_rate=0.001, indels=0.15, prob_indel_ext=0.3, estimate_unknown_values=False, other_params='', vcf=None)[source]

Bases: rnftools.mishmash.Source.Source

Class for DWGsim (https://github.com/nh13/DWGSIM/wiki).

Both single-end and paired-end simulations are supported. In paired-end simulations, reads can have different lengths. Note that there is a bug in DWGsim documentation: coordinates are 1-based.

Parameters:
  • fasta (str) – File name of the genome from which reads are created (FASTA file).
  • sequences (set of int or str) – FASTA sequences to extract. Sequences can be specified either by their ids, or by their names.
  • coverage (float) – Average coverage of the genome (if number_of_read_tuples is specified, then coverage must be equal to zero). Corresponding DWGsim parameter: -C.
  • number_of_read_tuples (int) – Number of read tuples (if coverage specified, then it must be equal to zero). Corresponding DWGsim parameter: -N.
  • read_length_1 (int) – Length of the first read. Corresponding DWGsim parameter: -1.
  • read_length_2 (int) – Length of the second read (if zero, then single-end simulation performed). Corresponding DWGsim parameter: -2.
  • distance (int) – Mean inner distance between reads. Corresponding DWGsim parameter: -d.
  • distance_deviation (int) – Standard deviation of inner distances between both reads. Corresponding DWGsim parameter: -s.
  • rng_seed (int) – Seed for simulator’s random number generator. Corresponding DWGsim parameter: -z.
  • haploid_mode (bool) – Simulate reads in haploid mode. Corresponding DWGsim parameter: -H.
  • error_rate_1 (float) – Sequencing error rate in the first read. Corresponding DWGsim parameter: -e.
  • error_rate_2 (float) – Sequencing error rate in the second read. Corresponding DWGsim parameter: -E.
  • mutation_rate (float) – Mutation rate. Corresponding DWGsim parameter: -r.
  • indels (float) – Rate of indels in mutations. Corresponding DWGsim parameter: -R.
  • prob_indel_ext (float) – Probability that an indel is extended. Corresponding DWGsim parameter: -X.
  • estimate_unknown_values (bool) – Estimate unknown values (coordinates missing in DWGsim output).
  • other_params (str) – Other parameters which are used on command-line.
  • vcf (str) – File name of the list of mutations (VCF output of DWGSIM).
Raises:

ValueError

create_fq()[source]

Simulate reads.

get_input()[source]

Get list of input files (required to do simulation).

Returns: List of input files
Return type: list

get_output()[source]

Get list of output files (created during simulation).

Returns: List of output files
Return type: list

static recode_dwgsim_reads(dwgsim_prefix, fastq_rnf_fo, fai_fo, genome_id, estimate_unknown_values, number_of_read_tuples=1000000000)[source]

Convert DwgSim FASTQ file to RNF FASTQ file.

Parameters:
  • dwgsim_prefix (str) – DwgSim prefix of the simulation (see its command-line parameters).
  • fastq_rnf_fo (file) – File object of RNF FASTQ.
  • fai_fo (file) – File object for FAI file of the reference genome.
  • genome_id (int) – RNF genome ID to be used.
  • estimate_unknown_values (bool) – Estimate unknown values (right coordinate of each end).
  • number_of_read_tuples (int) – Estimate of number of simulated read tuples (to set width).
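The recoding functions receive the FAI index of the reference genome as an open file object (fai_fo). A minimal sketch of reading such an index follows, assuming the standard samtools faidx layout (tab-separated columns, with the sequence name first and the sequence length second). The function name read_fai is hypothetical and this is not the RNFtools parser.

```python
import io


def read_fai(fai_fo):
    """Return a dict mapping sequence names to lengths, in file order.

    Assumes the samtools faidx layout: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH.
    """
    lengths = {}
    for line in fai_fo:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError("Malformed FAI line: {!r}".format(line))
        lengths[fields[0]] = int(fields[1])
    return lengths


# Usage with an in-memory file object standing in for an opened .fai file.
example_fai = io.StringIO("chr1\t1000\t6\t60\t61\nchr2\t500\t1030\t60\t61\n")
example_lengths = read_fai(example_fai)
```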

3.1.1.4. Mason (Illumina mode): rnftools.mishmash.MasonIllumina

class rnftools.mishmash.MasonIllumina.MasonIllumina(fasta, sequences=None, coverage=0, number_of_read_tuples=0, read_length_1=100, read_length_2=0, distance=500, distance_deviation=50, rng_seed=1, other_params='')[source]

Bases: rnftools.mishmash.Source.Source

Class for Mason in Illumina mode (https://www.seqan.de/projects/mason/).

Both single-end and paired-end simulations are supported. For paired-end simulations, the lengths of both ends must be equal.

Parameters:
  • fasta (str) – File name of the genome from which read tuples are created (FASTA file). Corresponding Mason parameter: -ir, --input-reference.
  • sequences (set of int or str) – FASTA sequences to extract. Sequences can be specified either by their ids, or by their names.
  • coverage (float) – Average coverage of the genome (if number_of_read_tuples is specified, then coverage must be equal to zero).
  • number_of_read_tuples (int) – Number of read tuples (if coverage specified, then it must be equal to zero). Corresponding Mason parameter: -n, --num-fragments.
  • read_length_1 (int) – Length of the first read. Corresponding Mason parameter: --illumina-read-length.
  • read_length_2 (int) – Length of the second read (if zero, then single-end reads are simulated). Corresponding Mason parameter: --illumina-read-length.
  • distance (int) – Mean inner distance between reads. Corresponding Mason parameter: --fragment-mean-size.
  • distance_deviation (int) – Standard deviation of inner distances between reads. Corresponding Mason parameter: --fragment-size-std-dev.
  • rng_seed (int) – Seed for simulator’s random number generator. Corresponding Mason parameter: --seed.
  • other_params (str) – Other parameters which are used on command-line.
Raises:

ValueError

create_fq()[source]

Simulate reads.

get_input()[source]

Get list of input files (required to do simulation).

Returns: List of input files
Return type: list

get_output()[source]

Get list of output files (created during simulation).

Returns: List of output files
Return type: list

3.1.1.5. WgSim: rnftools.mishmash.WgSim

class rnftools.mishmash.WgSim.WgSim(fasta, sequences=None, coverage=0, number_of_read_tuples=0, read_length_1=100, read_length_2=0, distance=500, distance_deviation=50.0, rng_seed=1, haploid_mode=False, error_rate=0.02, mutation_rate=0.001, indels=0.15, prob_indel_ext=0.3, other_params='')[source]

Bases: rnftools.mishmash.Source.Source

Class for WGsim (https://github.com/lh3/wgsim).

Both single-end and paired-end simulations are supported. For paired-end simulations, reads can have different lengths.

Parameters:
  • fasta (str) – File name of the genome from which reads are created (FASTA file).
  • sequences (set of int or str) – FASTA sequences to extract. Sequences can be specified either by their ids, or by their names.
  • coverage (float) – Average coverage of the genome (if number_of_read_tuples specified, then it must be equal to zero).
  • number_of_read_tuples (int) – Number of read tuples (if coverage specified, then it must be equal to zero). Corresponding WGsim parameter: -N.
  • read_length_1 (int) – Length of the first read. Corresponding WGsim parameter: -1.
  • read_length_2 (int) – Length of the second read (if zero, then single-end reads are simulated). Corresponding WGsim parameter: -2.
  • distance (int) – Mean outer distance of reads. Corresponding WGsim parameter: -d.
  • distance_deviation (int) – Standard deviation of outer distances of reads. Corresponding WGsim parameter: -s.
  • rng_seed (int) – Seed for simulator’s random number generator. Corresponding WGsim parameter: -S.
  • haploid_mode (bool) – Simulate reads in haploid mode. Corresponding WGsim parameter: -h.
  • error_rate (float) – Sequencing error rate. Corresponding WGsim parameter: -e.
  • mutation_rate (float) – Mutation rate. Corresponding WGsim parameter: -r.
  • indels (float) – Rate of indels in mutations. Corresponding WGsim parameter: -R.
  • prob_indel_ext (float) – Probability that an indel is extended. Corresponding WGsim parameter: -X.
  • other_params (str) – Other parameters to be passed on the command line.
Raises:

ValueError
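Unlike the other simulator classes, WgSim describes distance as an outer distance rather than an inner distance. A minimal sketch of converting between the two follows, assuming the usual definitions: the outer distance spans the whole fragment including both reads, and the inner distance is the gap between the two reads. The function names are hypothetical.

```python
def outer_from_inner(inner_distance, read_length_1, read_length_2):
    """Outer distance (fragment length) from the gap between the two reads."""
    return inner_distance + read_length_1 + read_length_2


def inner_from_outer(outer_distance, read_length_1, read_length_2):
    """Gap between the two reads from the outer distance (fragment length).

    A negative result means that the two reads overlap.
    """
    return outer_distance - read_length_1 - read_length_2
```

With two 100 bp reads, an outer distance of 500 corresponds to an inner distance of 300, so the same distance value means different fragment sizes depending on which convention a class uses.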

create_fq()[source]

Simulate reads.

get_input()[source]

Get list of input files (required to do simulation).

Returns: List of input files
Return type: list

get_output()[source]

Get list of output files (created during simulation).

Returns: List of output files
Return type: list

static recode_wgsim_reads(rnf_fastq_fo, fai_fo, genome_id, wgsim_fastq_1_fn, wgsim_fastq_2_fn=None, number_of_read_tuples=1000000000)[source]

Convert WgSim FASTQ files to RNF FASTQ files.

Parameters:
  • rnf_fastq_fo (file) – File object of the target RNF file.
  • fai_fo (file) – File object of FAI index of the reference genome.
  • genome_id (int) – RNF genome ID.
  • wgsim_fastq_1_fn (str) – File name of the first WgSim FASTQ file.
  • wgsim_fastq_2_fn (str) – File name of the second WgSim FASTQ file.
  • number_of_read_tuples (int) – Expected number of read tuples (to estimate widths).

3.1.2. Abstract class for a simulator: rnftools.mishmash.Source

class rnftools.mishmash.Source.Source(fasta, reads_in_tuple, rng_seed, sequences, number_of_required_cores=1)[source]

Bases: object

Abstract class for a genome from which read tuples are simulated.

Parameters:
  • fasta (str) – File name of the genome from which reads are created (FASTA file).
  • reads_in_tuple (int) – Number of reads in each read tuple.
  • rng_seed (int) – Seed for simulator’s random number generator.
  • sequences (set of int or str) – FASTA sequences to extract. Sequences can be specified either by their ids, or by their names.
  • number_of_required_cores (int) – Number of cores used by the simulator. This parameter is used to prevent running other threads or programs at the same time.
clean()[source]

Clean working directory.

create_fa()[source]

Create a FASTA file with extracted sequences.

create_fq()[source]

Simulate reads.

fa0_fn()[source]

Get input FASTA file.

Returns: Input FASTA file.
Return type: str

fa_fn()[source]

Get output FASTA file (with selected chromosomes).

Returns: Output FASTA file.
Return type: str

fq_fn()[source]

Get file name of the output FASTQ file.

Returns: Output FASTQ file.
Return type: str

get_dir()[source]

Get working directory.

Returns: Working directory.
Return type: str

get_genome_id()[source]

Get genome ID.

Returns: Genome ID.
Return type: int

get_input()[source]

Get list of input files (required to do simulation).

Returns: List of input files
Return type: list

get_number_of_required_cores()[source]

Get number of required cores.

Returns: Number of required cores.
Return type: int

get_output()[source]

Get list of output files (created during simulation).

Returns: List of output files
Return type: list

get_reads_in_tuple()[source]

Get number of entries in a read tuple.

Returns: Number of reads in a read tuple.
Return type: int

static recode_sam_reads(sam_fn, fastq_rnf_fo, fai_fo, genome_id, number_of_read_tuples=1000000000, simulator_name=None, allow_unmapped=False)[source]

Transform a SAM file to RNF-compatible FASTQ.

Parameters:
  • sam_fn (str) – File name of the SAM/BAM file.
  • fastq_rnf_fo (file) – File object of the output FASTQ file.
  • fai_fo (file) – File object of the FAI index of the reference genome.
  • genome_id (int) – Genome ID for RNF.
  • number_of_read_tuples (int) – Expected number of read tuples (to set width of read tuple id).
  • simulator_name (str) – Name of the simulator. Used for comment in read tuple name.
  • allow_unmapped (bool) – Allow unmapped reads.
Raises:

NotImplementedError
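Transforming SAM records into RNF read names requires the leftmost and rightmost reference coordinates of each alignment. A minimal sketch of deriving them from one SAM line follows, based on the SAM specification (1-based POS in column 4, CIGAR in column 6, flag bits 0x4 for unmapped and 0x10 for reverse strand). It does not reproduce how recode_sam_reads works internally, and the function names are hypothetical.

```python
import re

_CIGAR_OPERATION = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that consume reference bases according to the SAM specification.
_REFERENCE_CONSUMING = set("MDN=X")


def reference_span(cigar):
    """Number of reference bases covered by an alignment with this CIGAR."""
    operations = _CIGAR_OPERATION.findall(cigar)
    # Reject strings that are not fully covered by valid operations (including "*").
    if not operations or "".join(n + op for n, op in operations) != cigar:
        raise ValueError("Unsupported CIGAR string: {!r}".format(cigar))
    return sum(int(n) for n, op in operations if op in _REFERENCE_CONSUMING)


def alignment_interval(sam_line):
    """Return (reference name, direction, left, right), or None if unmapped.

    Coordinates are 1-based and inclusive.
    """
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 0x4:
        return None  # unmapped read
    left = int(fields[3])
    right = left + reference_span(fields[5]) - 1
    direction = "R" if flag & 0x10 else "F"
    return fields[2], direction, left, right
```

Insertions and soft clips do not consume reference bases, so a read with CIGAR 10M2I5D3S covers 15 reference positions even though it contains 15 read bases of a different composition.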