Comparing fixed sampling with minimizer sampling when using kmer indexes to find maximal exact matches
Meznah Almutairy, Eric Torng
Department of Computer Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America; Department of Computer Science, College of Computer and Information Sciences, Imam Muhammad ibn Saud Islamic University, Riyadh, Saudi Arabia
Editor: Ruslan Kalendar, University of Helsinki, FINLAND
Bioinformatics applications and pipelines increasingly use kmer indexes to search for similar sequences. The major problem with kmer indexes is that they require lots of memory. Sampling is often used to reduce index size and query time. Most applications use one of two major types of sampling: fixed sampling and minimizer sampling. It is well known that fixed sampling will produce a smaller index, typically by roughly a factor of two, whereas it is generally assumed that minimizer sampling will produce faster query times since query kmers can also be sampled. However, no direct comparison of fixed and minimizer sampling has been performed to verify these assumptions. We systematically compare fixed and minimizer sampling using the human genome as our database. We use the resulting kmer indexes for fixed sampling and minimizer sampling to find all maximal exact matches between our database, the human genome, and three separate query sets, the mouse genome, the chimp genome, and an NGS data set. We reach the following conclusions. First, using larger kmers reduces query time for both fixed sampling and minimizer sampling at a cost of requiring more space. If we use the same kmer size for both methods, fixed sampling requires typically half as much space whereas minimizer sampling processes queries only slightly faster. If we are allowed to use any kmer size for each method, then we can choose a kmer size such that fixed sampling both uses less space and processes queries faster than minimizer sampling. The reason is that although minimizer sampling is able to sample query kmers, the number of shared kmer occurrences that must be processed is much larger for minimizer sampling than fixed sampling. In conclusion, we argue that for any application where each shared kmer occurrence must be processed, fixed sampling is the right sampling method.

Competing interests: The authors have declared
that no competing interests exist.
Introduction
With each passing year, advances in sequencing technologies, such as ROCHE/454,
Illumina/Solexa, and Pacific Biosciences (PacBio), are producing DNA sequences both faster and
cheaper. Right now, the average number of sequences generated from one sequencing run is
on the order of hundreds of millions to billions. While this explosive growth in DNA datasets
yields exciting new possibilities for biologists, the vast size of the datasets also presents
significant challenges for many compute-intensive biology applications. These applications include
homology search [1–4], detection of single nucleotide polymorphisms (SNPs) [5–7],
mapping cDNA sequences against the corresponding genome [8–10], sequence assembly [11–13],
sequence clustering [14–16], and sequence classification [17–19]. A core operation in all these
applications is to search the dataset for sequences that are similar to a given query sequence.
This search process is often sped up by finding shared exact matches (EM) of a minimum
length L. The value of L is usually set to ensure all the desired similar sequences are recognized.
Depending on the application, the shared EMs can be extended to longer EMs [17]. In some
applications, the EMs are extended to maximal exact matches (MEMs) [20, 21], and MEMs are
often extended further to local alignments by allowing mismatches and/or gaps [4, 8, 22, 23].
Finding EMs, MEMs, and local alignments is often sped up using kmer indexes (k < L).
One of the biggest problems with using kmer indexes is that the size of the index is
significantly larger than the underlying database/datasets. One of the most effective and widely used
ways of mitigating kmer index size and query time is to perform sampling, in which we omit
some kmer occurrences from the index. A sampling strategy can be classified based on how it
chooses kmers. There are two major ways to choose kmers: fixed sampling and minimizer
sampling.
It has widely been assumed that fixed sampling produces smaller indexes than minimizer
sampling but that minimizer sampling leads to faster query processing than fixed sampling.
However, these beliefs have never been empirically verified. In fact, few studies have
empirically tested either method on its own [4, 23]. In this paper, we fill this gap by systematically
evaluating and comparing fixed sampling and minimizer sampling to assess how these
methods perform with respect to index construction time, index size, and query processing time.
Specifically, we compare and contrast the construction time, index size, and query processing
time required to find all MEMs using both fixed and minimizer sampling.
We start by formalizing the problem of finding MEMs between two sequences. Then, we
illustrate how kmer indexes are used to find MEMs. Next, we formally describe the two kmer
sampling methods: fixed sampling and minimizer sampling. We highlight the key similarities
and differences between these two sampling methods. Finally, we set the comparison
framework and conclude with the comparison results.
The MEM enumeration problem
Let Σ be a finite ordered alphabet. We focus on the alphabet for nucleotide databases Σ = {A,
C, G, T}. Let s be a string over Σ of length |s|. We use s[i] to denote the character at position i in
s, for 0 ≤ i < |s|. We use s[0] to denote the first character in string s. We use the ordered pair
s(i, j) to denote the substring in s starting with the character at position i and ending with the
character at position j for 0 ≤ i ≤ j < |s|. We note that substring s(i, j) is also denoted as s[i..j]
in some papers, but we only use s(i, j) in this paper.
Definition 1 Exact Match (EM) For any two strings s1 and s2, a pair of substrings (s1(i1, j1),
s2(i2, j2)) is an exact match if and only if s1(i1, j1) = s2(i2, j2). The length of an exact match is
j1 − i1 + 1.
Definition 2 Maximal Exact Match (MEM) An exact match (s1(i1, j1), s2(i2, j2)) is called
maximal if s1[i1 − 1] ≠ s2[i2 − 1] and s1[j1 + 1] ≠ s2[j2 + 1].
We now formalize the problem of finding MEMs between two datasets.
Definition 3 MEM Enumeration Problem Given two datasets of sequences D1 and D2 and
an integer L, the MEM enumeration problem is to find the set of all MEMs of length at least L
between all sequences in D1 and all sequences in D2. We denote this set as MEM(D1, D2, L). We
use MEM(L) if D1 and D2 are clear from the context.
We illustrate many of these and later definitions using the following example where
D1 = {s1} and D2 = {s2} and s1 and s2 are as follows:
s1 = GTAC T AGG CTA CTA GGGG with length |s1| = 18
s2 = GTAC A AGG CTA CTA CTA TTTT with length |s2| = 21
The two strings s1 and s2 have two MEMs of length at least 6:
AGGCTACTA = (s1(5, 13), s2(5, 13)) with length 9 and
CTACTA = (s1(8, 13), s2(11, 16)) with length 6. Thus,
MEM({s1}, {s2}, 6) = {(s1(5, 13), s2(5, 13)), (s1(8, 13), s2(11, 16))} whereas
MEM({s1}, {s2}, 8) = {(s1(5, 13), s2(5, 13))}.
In this study, we focus on finding MEMs of a minimum length L between a query sequence
and a database of sequences because it is a critical step in searching for local alignments with
tools such as NCBI BLAST.
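The definitions above can be realized directly. The following sketch is a hypothetical brute-force illustration of Definitions 1-3 (not the indexed algorithm evaluated in this paper); function names are ours, and MEMs are reported as (i1, j1, i2, j2) position tuples.

```python
def find_mems(s1, s2, L):
    """Return all MEMs of length at least L as (i1, j1, i2, j2) tuples."""
    mems = set()
    for i1 in range(len(s1)):
        for i2 in range(len(s2)):
            # Only start an exact match where it cannot be extended left
            # (Definition 2's left-maximality condition).
            if s1[i1] != s2[i2]:
                continue
            if i1 > 0 and i2 > 0 and s1[i1 - 1] == s2[i2 - 1]:
                continue
            # Extend right as far as the characters agree.
            j1, j2 = i1, i2
            while j1 + 1 < len(s1) and j2 + 1 < len(s2) and s1[j1 + 1] == s2[j2 + 1]:
                j1, j2 = j1 + 1, j2 + 1
            if j1 - i1 + 1 >= L:
                mems.add((i1, j1, i2, j2))
    return mems

s1 = "GTACTAGGCTACTAGGGG"
s2 = "GTACAAGGCTACTACTATTTT"
print(sorted(find_mems(s1, s2, 6)))  # [(5, 13, 5, 13), (8, 13, 11, 16)]
```

On the running example this reproduces MEM({s1}, {s2}, 6) = {(s1(5, 13), s2(5, 13)), (s1(8, 13), s2(11, 16))}. Its quadratic cost is exactly what the kmer indexes discussed next avoid.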
Using kmer indexes to find MEMs
Suffix trees have been the traditional data structure of choice when searching for MEMs [20, 24–26].
However, Khiste and Ilie [21] recently showed that kmer indexes use much less space
and are more amenable to parallelization compared to suffix trees, at least for larger MEMs.
This is true despite the development of many new compressed and sparse suffix array data
structures [20, 25]. Specifically, Khiste and Ilie use fixed sampling to build a memory-efficient
kmer index to search for "all MEMs of a minimum length 100 between the whole human and
mouse genomes". Their results show that kmer indexes are 2 times faster when both methods
are limited to using 4 GB of memory. In parallel mode with 12-core machines, kmer indexes
are 6 times faster while using 2.6 times less memory. We do note that Khiste and Ilie did not
compare their kmer based solution with suffix trees for finding small MEMs with minimum
lengths as low as 20. It is possible that suffix trees are better for these cases. We focus on finding
larger MEMs, so we focus on kmer index solutions to find MEMs.
Finding MEMs with a minimum length L is often sped up using a kmer index, k ≤ L, at the
cost of additional space. A kmer index supports quickly finding EMs of length k which are
also known as shared kmers. We typically extend these shared kmers in two stages. In the
first stage, we try to extend every shared kmer into an EM of length L ≥ k. If the first stage is
successful, we then further extend the match into an MEM. In the context of searching for
local alignments, every MEM is extended even further by allowing mismatches and/or gaps.
It is possible to skip the first extension step and build an Lmer index to find all shared Lmers.
However, due to technical limitations and the huge memory requirements necessary for
building an Lmer index for L > 32, it is common to build the index using k ≤ 32 < L [4, 17, 21, 23].
We typically work with a kmer index as follows. We save the list of kmers present in the
database and we refer to this list as the dictionary. For each saved kmer, we save some of its
occurrences into a list. When given a query sequence, we extract each kmer from the query, check
whether that kmer appears in the dictionary, and find the corresponding shared kmers using
the stored list of occurrences.
We describe this search process more precisely as follows.
Definition 4 (kmer and kmer occurrence) Consider any length k substring s(j − k + 1, j) of
string s where k − 1 ≤ j ≤ |s| − 1. We call that substring a kmer and more concisely represent this
kmer occurrence using the ordered pair (s, j).
Definition 5 (Shared kmers and shared kmer occurrences) Consider any two strings s1
and s2 that have an exact match (s1(i1, j1), s2(i2, j2)) of length k. We call the common substring
s1(i1, j1) (equivalently s2(i2, j2)) a shared kmer and more concisely represent the corresponding
shared kmer occurrence using the quadruple (s1, j1, s2, j2).
Any string s of length |s| contains exactly |s| − k + 1 kmer occurrences. Using our previous
example with k = 3, s1 and s2 have 16 and 19 3mer occurrences, respectively. Furthermore, s1
and s2 have exactly seven shared 3mers: ACT, AGG, CTA, GCT, GGC, GTA, and TAC. These
shared 3mers result in 24 different shared 3mer occurrences as follows. GCT, GGC, and GTA
appear exactly once in s1 and s2, and thus each of them has exactly one shared 3mer
occurrence: (s1, 9, s2, 9), (s1, 8, s2, 8), and (s1, 2, s2, 2). The 3mer AGG occurs 2 times in s1 and 1 time
in s2, and thus AGG is part of two different shared 3mer occurrences: (s1, 7, s2, 7) and (s1, 15,
s2, 7). Since shared 3mer ACT occurs 2 times in both s1 and in s2, the shared 3mer ACT is
part of 2 × 2 different shared 3mer occurrences: (s1, 4, s2, 12), (s1, 4, s2, 15), (s1, 12, s2, 12), and
(s1, 12, s2, 15). Similarly, TAC occurs 2 times in s1 and 3 times in s2, so shared 3mer TAC is
part of 3 × 2 different shared 3mer occurrences: (s1, 3, s2, 3), (s1, 3, s2, 11), (s1, 3, s2, 14), (s1, 11,
s2, 3), (s1, 11, s2, 11), and (s1, 11, s2, 14). Finally, CTA occurs 3 times in s1 and 3 times in s2,
leading to 3 × 3 different shared 3mer occurrences: (s1, 5, s2, 10), (s1, 5, s2, 13), (s1, 5, s2, 16), (s1, 10,
s2, 10), (s1, 10, s2, 13), (s1, 10, s2, 16), (s1, 13, s2, 10), (s1, 13, s2, 13), and (s1, 13, s2, 16).
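This enumeration can be checked mechanically. The sketch below (hypothetical helper names; occurrences identified by their ending positions as in Definitions 4 and 5) lists all shared kmer occurrences between two strings:

```python
from collections import defaultdict

def shared_kmer_occurrences(s1, s2, k):
    """All (j1, j2) ending-position pairs of shared kmers (Definition 5)."""
    ends = defaultdict(list)  # kmer -> list of ending positions in s2
    for j in range(k - 1, len(s2)):
        ends[s2[j - k + 1:j + 1]].append(j)
    return [(j1, j2)
            for j1 in range(k - 1, len(s1))
            for j2 in ends[s1[j1 - k + 1:j1 + 1]]]

def is_false_positive(s1, s2, j1, j2, k, L):
    """True if the maximal EM around this shared kmer is shorter than L."""
    i1, i2 = j1 - k + 1, j2 - k + 1
    while i1 > 0 and i2 > 0 and s1[i1 - 1] == s2[i2 - 1]:
        i1, i2 = i1 - 1, i2 - 1
    while j1 + 1 < len(s1) and j2 + 1 < len(s2) and s1[j1 + 1] == s2[j2 + 1]:
        j1, j2 = j1 + 1, j2 + 1
    return j1 - i1 + 1 < L

s1 = "GTACTAGGCTACTAGGGG"
s2 = "GTACAAGGCTACTACTATTTT"
occs = shared_kmer_occurrences(s1, s2, 3)
print(len(occs))  # 24 shared 3mer occurrences
```

Running `is_false_positive` over `occs` with L = 6 and L = 8 reproduces the false-positive counts discussed next.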
Since k ≤ L, it is possible that a shared kmer occurrence is not part of an MEM of length at
least L; we call such a shared kmer occurrence a false positive. In general, decreasing the
value of k increases the chance that a shared kmer occurrence is a false positive.
Using the above 24 shared 3mer occurrences and assuming L = 6, the MEM
AGGCTACTA can be found by extending any of the following seven shared 3mer occurrences:
(s1, 7, s2, 7), (s1, 8, s2, 8), (s1, 9, s2, 9), (s1, 10, s2, 10), (s1, 11, s2, 11), (s1, 12, s2, 12), or
(s1, 13, s2, 13). Similarly, the MEM CTACTA can be found by extending any of the following
four shared 3mer occurrences: (s1, 10, s2, 13), (s1, 11, s2, 14), (s1, 12, s2, 15), or (s1, 13, s2, 16). The
remaining thirteen shared 3mer occurrences are false positives. If L = 8, then only the seven shared
3mer occurrences that can be extended to AGGCTACTA are not false positives. The
remaining seventeen shared 3mer occurrences are false positives.
Every EM of length L has L − k + 1 shared kmer occurrences. Finding and extending one
of these shared kmer occurrences is sufficient for finding that EM. Therefore, when building
kmer indexes, we can store a sampled subset of kmer occurrences in the index and still find
every possible MEM of length at least L. With sampling, we not only reduce the index's
memory requirements, we also reduce query time by not discovering the same MEM multiple
times. Sampling, therefore, is a very effective method for improving a kmer index's efficiency
(reducing construction time, space, and query time).
Fixed sampling versus minimizer sampling
In bioinformatics, two sampling methods are commonly used to build kmer indexes: fixed
sampling [4, 21] and minimizer sampling [17, 23]. To ensure that a kmer index achieves
100% sensitivity, which means that it finds all shared MEMs of length at least L, both methods
ensure that within every MEM of length at least L, at least one kmer occurrence is saved to the
index. We now define both sampling methods, comparing and contrasting their relative
strengths and weaknesses.
Fixed sampling is a simple greedy sampling strategy that minimizes the number of kmer
occurrences stored in the index. The goal is to ensure we choose one complete kmer from
every possible substring of length L from each database sequence s. For example, we must
choose one kmer from s(0, L − 1) to store in the index; we greedily choose the kmer that ends
at s[L − 1] since it not only covers this substring but also the next L − k substrings up to but
not including s(L − k + 1, 2L − k). To cover the substring s(L − k + 1, 2L − k), we again greedily
choose the kmer that ends at s[2L − k] since it again covers the next L − k substrings. In
general, the jth kmer occurrence that we sample ends at position L − 1 + (j − 1)w where
w = L − k + 1. We typically refer to w = L − k + 1 as our sampling step or sampling window for
fixed sampling. During the query phase, we extract every kmer from the query sequence q to
search for shared kmer occurrences. Since every kmer is extracted from q, if s and q have an
MEM of length at least L, then some shared kmer from that MEM will be in the kmer index
and the MEM can be recovered. Fixed sampling has several advantages. First, it is very fast to
construct the index requiring no kmer comparisons and skipping over significant portions of
the database sequence s. Second, it stores the minimum possible number of kmer occurrences
in the index to guarantee 100% sensitivity and thus minimizes index size. The disadvantage is
that all kmers from the query sequence need to be processed which may slow query time.
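Fixed sampling of a database sequence can be sketched in a few lines (a hypothetical illustration; occurrences are recorded by their ending positions):

```python
def fixed_sample(s, k, L):
    """Sample the kmer ending at every w-th position, w = L - k + 1,
    so that every length-L substring of s contains a sampled kmer."""
    w = L - k + 1
    return [(s[j - k + 1:j + 1], j) for j in range(L - 1, len(s), w)]

# With k = 3 and L = 8 on the running example, only two kmers are stored.
print(fixed_sample("GTACTAGGCTACTAGGGG", 3, 8))  # [('AGG', 7), ('CTA', 13)]
```

Note that index construction touches only every w-th position and performs no kmer comparisons, which is the source of fixed sampling's construction-time advantage.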
Minimizer sampling uses a more sophisticated sampling strategy that allows sampling of
both the database sequence s and the query sequence q. We must again choose one kmer from
s(0, L − 1) to store in the index. This time, we choose to store the minimum kmer from
s(0, L − 1) in our index where we order substrings in some canonical order. For simplicity, one
can use the alphabetical order where A < C < G < T, which implies AAA < AGA < AGG <
TAA. Roberts et al. recognized that using an alphabetical ordering can have issues, particularly in
low complexity regions where there may be repeats of characters or short substrings. To
handle the low complexity regions, Roberts et al. write "In general, we want to devise our ordering
to increase the chance of rare kmers being minimizers." For example, Roberts et al. proposed
using C < A < T < G in odd-numbered bases and the reverse ordering in even-numbered
bases [23]. Alternatively, Wood et al. [17] suggest using the exclusive-or (XOR) operation to
scramble the standard ordering of each kmer before comparing the kmers to each other
using lexicographical ordering. Once an ordering is established, we process each length L
substring of s (equivalently each window of w = L − k + 1 kmers) in turn storing the minimum
kmer occurrence in the index. The first view focusing on substrings of length L seems more
intuitive; the second view focusing on windows of w kmers is useful when predicting the
expected size reduction from using minimizer sampling. We note a few things. First, there
may be multiple occurrences of a minimum kmer within a length L substring; in this case, the
original minimizer algorithm (though not the variant we focus on in this paper) stores each
occurrence in the index. Second, minimizer sampling is likely to store more occurrences than fixed
sampling as it does not maximize the distance between kmer occurrences stored in the index.
Third, the time to construct the index is a bit slower as more work is done for each window
including comparing different kmers. The advantage that minimizer has comes at query time.
Rather than choosing all kmers from query q, minimizer sampling applies the same sampling
strategy to q. That is, we consider every substring of length L and choose only the minimum
kmers from within each length L substring to consider for extension. Since both the query and
each database string extract the minimum kmer(s) from every substring of length L, if there is
an EM of length L, the same minimum kmer will be extracted and then extended into the EM
and then MEM. In summary, minimizer sampling requires a bit more time to construct its
index though it can do so in linear time and builds a larger index than fixed sampling, but it
processes fewer kmers from the query sequence and thus may have faster query processing
times.
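The window-based view of minimizer sampling can be sketched as follows. This is a hypothetical illustration of the basic variant that keeps every occurrence of a window's minimum kmer; the `order` argument is our own hook where a scrambled ranking (such as the XOR approach of Wood et al.) could be substituted for plain lexicographic order.

```python
def minimizer_sample(s, k, L, order=None):
    """In every window of w = L - k + 1 consecutive kmers, keep every
    occurrence of the window's minimum kmer under the given ordering."""
    order = order or (lambda kmer: kmer)  # plain lexicographic default
    w = L - k + 1
    kmers = [(s[j - k + 1:j + 1], j) for j in range(k - 1, len(s))]
    sampled = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        smallest = min(order(km) for km, _ in window)
        sampled.update(occ for occ in window if order(occ[0]) == smallest)
    return sorted(sampled, key=lambda occ: occ[1])

# Index side and query side of the running example (k = 3, L = 8):
print(minimizer_sample("GTACTAGGCTACTAGGGG", 3, 8))
# [('ACT', 4), ('AGG', 7), ('ACT', 12)]
print(minimizer_sample("GTACAAGGCTACTACTATTTT", 3, 8))
# [('AAG', 6), ('ACT', 12), ('ACT', 15)]
```

Because the same function is applied to both the database and the query, any EM of length at least L yields the same minimum kmer on both sides, which is what guarantees 100% sensitivity.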
We illustrate the two algorithms using our previous example where we use k = 3 and L = 8
so w = L − k + 1 = 6, and we use D1 = {s1} as our database and D2 = {s2} as our query dataset.
The goal is to return MEM({s1}, {s2}, 8). With fixed sampling, we store the 3mer AGG with its
occurrence (s1, 7) and the 3mer CTA and its occurrence (s1, 13) in the 3mer index.
All 19 3mers from s2 are extracted with both AGG and CTA being shared 3mers. Since
CTA occurs three times in s2, we consider four shared 3mer occurrences for extension: (s1, 7,
s2, 7), (s1, 13, s2, 10), (s1, 13, s2, 13), (s1, 13, s2, 16). The first and third can be extended to the
same MEM of length 9 whereas the other two cannot be extended to a length 8 MEM and thus
are false positives.
With minimizer sampling, we store three 3mer occurrences to the index, two with ACT
and one with AGG: (s1, 4), (s1, 7), and (s1, 12). Minimizer sampling is also applied to the query
s2 and three 3mer occurrences are chosen to test for extension, two with ACT and one with
AAG: (s2, 6), (s2, 12), and (s2, 15). The only shared 3mer is ACT, and since ACT appears twice
in both sequences, we consider four shared 3mer occurrences for extension: (s1, 4, s2, 12), (s1,
4, s2, 15), (s1, 12, s2, 12), and (s1, 12, s2, 15). Only (s1, 12, s2, 12) can be extended to an MEM of
length at least 8; the other three shared kmer occurrences are false positives.
There are two possible optimizations that can be used to improve the performance of
minimizer sampling. The first one is to avoid using lexicographical ordering. The second one is not
to sample duplicate minimizers. In this paper, we study the effects of each optimization
individually and combined. To avoid lexicographical ordering, we use the randomization method
suggested by Wood et al. [17]. To prevent sampling duplicate minimizers, we use the robust
winnowing method proposed by Schleimer et al. [
27
], but only apply robust winnowing to the
index, not to the query sequences. We sample all minimizers from a query sequence window.
In this scenario, the correct minimizer occurrence matches are guaranteed to be found. We
finally apply both methods together to test the effectiveness of using both optimizations
simultaneously.
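The duplicate-removal optimization can be sketched as follows, under our reading of the robust winnowing rule of Schleimer et al.; the tie-break of taking the rightmost minimal occurrence is an assumption of this sketch.

```python
def robust_winnow(s, k, L):
    """Sample one minimizer per window, re-sampling only when the
    previous pick leaves the window or stops being minimal."""
    w = L - k + 1
    kmers = [(s[j - k + 1:j + 1], j) for j in range(k - 1, len(s))]
    sampled, prev = [], None
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        smallest = min(km for km, _ in window)
        if prev in window and prev[0] == smallest:
            continue  # previous pick still minimal: add nothing new
        # duplicates of the minimum: keep only the rightmost occurrence
        prev = [occ for occ in window if occ[0] == smallest][-1]
        sampled.append(prev)
    return sampled

# In a low-complexity run, duplicate minimizers are no longer all sampled:
print(robust_winnow("AAAAAAAA", 3, 5))  # [('AAA', 4), ('AAA', 7)]
```

Plain minimizer sampling of AAAAAAAA with these parameters would store every one of the six AAA occurrences, which is exactly the low-complexity blow-up this optimization targets.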
Problem statement and overall aims
It has widely been assumed that fixed sampling produces smaller indexes than minimizer
sampling but that minimizer sampling leads to faster query processing than fixed sampling. For
example, Roberts et al. [23] highlight the importance of sampling kmers at query time during
the search procedure, saying that "the procedure would still be more efficient if we could
compare only a fraction of the kmers in T to the database" where T is their notation for a set of
query strings. However, these beliefs have never been empirically verified. In fact, few studies
have empirically tested either method on its own [4, 23].
In this paper, we fill this gap by systematically evaluating and comparing fixed sampling
and minimizer sampling to assess how these methods perform with respect to index
construction time, index size, and query processing time. Specifically, we compare and contrast
the construction time, index size, and query processing time required to solve the MEM
Enumeration Problem using both fixed and minimizer sampling. We use the human genome as
our database and the mouse genome, the chimp genome, and an NGS data set as our three
query sets. Our goal is to provide guidance to developers of kmer-based bioinformatics tools
so that they can choose the best method for their application.
Our main contributions are the following: First, we systematically compare fixed sampling
with minimizer sampling using real biological datasets to assess how well they find all MEMs
with respect to index construction time, index size, and query processing time. Our results
show that if we use the same kmer size, minimizer sampling is only slightly faster than fixed
sampling, despite sampling from the query sequence. If we are allowed to use any kmer size
for each method, then we can choose a kmer size such that fixed sampling both uses less space
and processes queries faster than minimizer sampling. Second, we evaluate the impact of the k
value on the effectiveness of fixed and minimizer sampling methods to find all MEMs. Previous
studies usually focus on only one value of k. We show that the value of k has a significant
impact on a sampling method's index size and query processing time. When the value of k
decreases, fewer kmer occurrences are saved, resulting in smaller indexes. However, when the
value of k increases, the index processes queries much faster. On average over all query sets,
queries with k = 32 are 37 times faster than with k = 12 for fixed sampling and 136 times
faster for minimizer sampling.
Related work
Over the last decade, there has been a dramatic increase in the use of kmer indexes in biological
applications and pipelines to accelerate the search for EMs, MEMs, or local alignments. Since
kmer indexes require a lot of memory compared to the underlying databases, sampling has
been widely used to reduce index size. Two primary sampling methods have been proposed in
the literature for building kmer indexes: fixed sampling [4, 21] and minimizer sampling [23, 28, 29].
Fixed sampling is the dominant sampling method in practice since it is used in indexed
BLAST [4]. Minimizer sampling was proposed in 2004 by Roberts et al. but was largely ignored
for many years until roughly 2012 when papers began using the minimizer sampling concept
for a wide variety of bioinformatics applications [17, 28–34]. Recently, tools have been
developed that use minimizer sampling to reduce kmer index size and increase efficiency [28, 29];
both tools map long reads (10kb or longer) that are produced by new sequencing technologies
(Single Molecule Real Time and Oxford Nanopore Technologies) against large reference
databases.
Despite the fact that both sampling methods have existed for some time, no previous work
has compared fixed sampling with minimizer sampling to determine their relative benefits and
weaknesses with respect to building kmer indexes to search for EMs, MEMs, or local
alignments. Specifically, Roberts et al. did not compare to fixed sampling when they presented
minimizer sampling; they compared instead to no sampling [23]. Morgulis et al. did not compare
to minimizer sampling when they presented their work on fixed sampling in indexed BLAST [4].
Khiste and Ilie only consider fixed sampling when analyzing kmer based indexes [21]. Li
as well as Jain et al. did not consider fixed sampling in their work [28, 29]. We fill in this gap
by carefully comparing these two strategies to determine how well they perform for this
important problem.
Schleimer et al. [27] independently introduced minimizer sampling, calling it winnowing
sampling. While winnowing is identical to minimizer sampling, winnowing has been studied and used
for different problems and domains than minimizer. Minimizer has been used with biological
datasets to solve the problem of searching for local alignments whereas winnowing has been
used with text document datasets to solve the problems of kmer counting and ranking to
detect document plagiarism. A key difference between minimizer and winnowing sampling is
that minimizer sampling requires saving sampled kmer positions whereas winnowing
sampling does not. This is because in MEM and HSLA search problems, the positions define
anchor points to compare sequences whereas in kmer counting and ranking problems, only
the number of occurrences is used to estimate the similarity between two documents. Similar
to Roberts et al., Schleimer et al. observed that in low complexity regions (or low-entropy
strings in the text mining literature), a kmer might occur more than once and all of its
occurrences are sampled. Schleimer et al. addressed this problem by not sampling duplicate kmers in
both the indexing and the querying phases; they call this version robust winnowing. Schleimer
et al. did not compare the performance of winnowing and robust winnowing.
We now summarize some other applications which have used kmer sampling in
bioinformatics. We emphasize that these all differ from our application and are not directly
comparable. We start with sequence assembly. Ye et al. [30] proposed using sampled kmers, instead of
all kmers, to reduce the memory requirements for De Bruijn graph (DBG) based assemblers.
Ye et al. used fixed sampling to sample kmers; the penalty is that links or edges between
kmers are longer and slightly more complex. Ye et al. report that fixed sampling with step w
reduces their dictionary to roughly 1/w of the size of tools that use a full list of kmers [35–37].
Ye et al. note the existence of minimizer sampling and express interest in comparing
minimizer sampling to fixed sampling in future work but did not compare the two in their work. In
genome assembly, we only use the dictionary (the list of kmers); we do not use the lists of
kmer occurrences. In other applications, where we use the lists of kmer occurrences, the size
of these lists is the dominant factor in index size. Therefore it is important to understand how
fixed and minimizer sampling affect both the number of kmers and the number of kmer
occurrences. Li et al. [38] and Movahedi et al. [31] both proposed disk-based DBG assemblers
to avoid loading the whole graph into RAM. They load small segments of the graph
incrementally, and complete the assembly in this fashion. Since completing the assembly requires
identifying adjacent EMs, both papers use minimizer sampling as a hashing mechanism to find
adjacent EMs and group them into the same segment. We do not focus on these applications
for two reasons. First, our goal is to focus on applications that use kmer indexes in RAM,
which means that index size is critical. Second, minimizer sampling, in the above context, is
used as a hashing function to minimize disk I/O operations rather than reducing the list of
kmers.
In the kmer counting problem, the task is to build a histogram of occurrences of every
kmer in a given data set where k is relatively large (k > 20) and it is infeasible to list all kmers in
RAM. Similar to disk based DBG assemblers, minimizer sampling is used to select mmers
(m < k) from every kmer. These mmers are later used to reduce disk I/O operations in
disk-based kmer counting tools such as MSPKmerCounter [33] and KMC2 [34]. Again, this
problem is significantly different than our motivating problems which are searching for MEMs and
HSLAs. In MEM and HSLA search problems, the location of sampled kmers is important.
In metagenomic sequence classification, Kraken [17] uses the idea of minimizer to
accelerate the classification process in large data sets. Kraken starts with creating a database that
contains entries of an Lmer and the lowest common ancestor (LCA) of all organisms whose
genomes contain that Lmer. When a query sequence is given, Kraken searches the database
for each Lmer in a sequence, and then uses the resulting set of LCA taxa to determine an
appropriate label for the sequence. To find all Lmers efficiently, Kraken builds a kmer index
(k < L) where each kmer is associated with all Lmers containing this kmer as its
minimizer. Since a simple lexicographical ordering of kmers can be biased to sample more
minimizers over low-complexity regions, Kraken uses the exclusive-or (XOR) operation to
scramble the standard ordering of each kmer's canonical representation before comparing
the kmers to each other using lexicographical ordering.
Recently, Orenstein et al. [39] proposed a new kmer sampling method called DOCKS. For a
given data set, L and k (k < L), the task is to find the minimum-size set of kmers such that for
every Lmer in the data set, at least one kmer from the set occurs in this Lmer. They show that DOCKS
sampling results in a much smaller set compared to minimizer sampling. It will be interesting to
compare fixed sampling and minimizer sampling with DOCKS in future work.
In this study, we focus on kmer indexes that find all shared EMs of length L. All indexes are
built with w = L − k + 1 and thus achieve 100% sensitivity. Almutairy and Torng [40] proposed
to study the impact of the sampling parameter w on the effectiveness of kmer indexes to find
all highly similar local alignments (HSLAs) using NCBI BLAST. In their paper, kmer indexes
are built using the default BLAST setting where k = 12 and fixed sampling. To study the impact
of w, they build indexes with a wide range of w values where w > L − k + 1. They compared the
indexes' effectiveness with respect to baseline indexes, which are indexes created using
w = L − k + 1. They show that very large w can still achieve high sensitivity; that is, even with very
large w, fixed sampling can find almost all HSLAs. This is slightly different than our study
where we focus on finding all MEMs and not all HSLAs. Our results about the impact of the
value of k on fixed sampling would enhance their findings since both k and w play a major role
in kmer index efficiency.
Materials and methods
We represent nucleotides using two bits and store kmers for k ≤ 32 in a 64-bit block. All
indexes are saved as hash tables where a key is a kmer and its value is a pointer to that kmer's
list of occurrences. We store each kmer's list of occurrences in a set data structure; each
occurrence is an ordered pair of 64-bit positive integers (s, j) where s is a sequence ID and j is the
ending position of the kmer occurrence in s.
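The index layout just described can be sketched in Python. This is an illustrative sketch only; the helper names (pack_kmer, add_occurrence) are ours, not from the paper. It packs each kmer into a 64-bit integer at two bits per base and maps it to a set of (sequence ID, ending position) pairs.

```python
# Illustrative sketch of the index layout described above; the helper
# names (pack_kmer, add_occurrence) are ours, not from the paper.

ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack_kmer(kmer):
    """Pack a kmer with k <= 32 into one 64-bit integer, two bits per base."""
    assert len(kmer) <= 32
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value

# The index: a hash table mapping a packed kmer to its set of occurrences,
# where each occurrence is an ordered pair (sequence ID, ending position).
index = {}

def add_occurrence(kmer, seq_id, end_pos):
    index.setdefault(pack_kmer(kmer), set()).add((seq_id, end_pos))

add_occurrence("ACGT", 0, 3)
print(pack_kmer("ACGT"))  # 27, i.e. 0b00011011
```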
Sampling methods
We compare fixed sampling and minimizer sampling. Minimizer sampling can be improved
using two different optimizations: randomized ordering and duplicate minimizer removal. We
list all possible combinations of our two optimizations for minimizer sampling in Table 1.
Using both optimizations results in the most effective minimizer method, minrand,one.
Thus, we compare minrand,one with fixed sampling (fix) to test the effectiveness of the two
major sampling methods. We compare minlex,one with minrand,one to determine how much
effect the randomization optimization has, and we compare minrand,many with minrand,one
to determine how effective duplicate removal is. We do not consider minlex,many in any
comparison since it is the worst version of minimizer sampling and known to be inefficient. Next, we
formally describe each sampling method. Note that we choose parameters that ensure
100% sensitivity, which means we find all MEMs.
In fixed sampling (fix), as described earlier, we build the kmer index for a database of
sequences by sampling from every database sequence s the kmer occurrences ending at
positions L − 1 + jw where w = L − k + 1 and 0 ≤ j ≤ ⌊(|s| − L + 1)/w⌋. We refer to
w = L − k + 1 as our sampling step. During the query phase, we extract all kmer occurrences
from each query sequence q and consider them for extension. Since every kmer is extracted
from q, if a database sequence s and q have an MEM of length at least L, then some shared
kmer from that MEM will be in the kmer index and the MEM can be recovered.
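The sampling positions and the 100% sensitivity guarantee can be sketched as follows. The function names are ours, and the upper bound on the positions is simplified to "while the position stays inside s".

```python
def fixed_sample_positions(seq_len, L, k):
    """Ending positions sampled by fixed sampling: L - 1, L - 1 + w,
    L - 1 + 2w, ... with sampling step w = L - k + 1."""
    w = L - k + 1
    return list(range(L - 1, seq_len, w))

def covers_every_window(seq_len, L, k):
    """Check 100% sensitivity: every window of length L contains at least
    one sampled kmer, i.e. a sampled ending position in
    [start + k - 1, start + L - 1]."""
    ends = set(fixed_sample_positions(seq_len, L, k))
    return all(
        any(e in ends for e in range(start + k - 1, start + L))
        for start in range(seq_len - L + 1)
    )

print(fixed_sample_positions(20, 10, 5))  # [9, 15]
print(covers_every_window(200, 50, 12))   # True
```

Because consecutive sampled ending positions differ by exactly w = L − k + 1, and a window of length L contains exactly w possible kmer ending positions, every window contains one sampled kmer.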
In standard minimizer sampling without any optimization (minlex,many), for every substring
of length L in a database sequence s, we store the minimum kmer occurrence in our index; if
there is more than one minimum kmer within any length-L substring, all minimum kmer
occurrences are stored. We use the normal lexicographical ordering, where A < C < G < T, to
define minimum kmers in our work. Unlike fixed sampling, during the query phase we apply
the same sampling to a query sequence q. That is, we consider only the minimum kmers
from each substring of length L in q for extension. Since both the query and each database
string extract the minimum kmer(s) from every substring of length L, if there is an EM of
length L, the same minimum kmer will be extracted and then extended into the EM and then
the MEM.

Table 1. The four minimizer sampling variants, combining kmer ordering with duplicate handling.
Ordering          Store duplicates   Remove duplicates
Lexicographical   minlex,many        minlex,one
Randomized        minrand,many       minrand,one
We find minimum kmers from s and q using a linear-time sliding window approach
proposed by Smith [41]. The basic idea is to store kmers in a double-ended queue, or deque, that
allows fast insertion and deletion at both its beginning and its end. We maintain the following
invariant for the deque. The deque contains kmers from the eligible window sorted in two
ways: by kmer value where the smallest kmer value is at the front and the largest kmer value
is at the rear and by location where the oldest kmer is at the front and the newest kmer is at
the rear. The basic reason for these invariants is that any kmer that is larger and further to the
left of the most recent kmer will never be a minimizer in any future windows since this new
kmer will be in any such windows and has smaller value. We maintain this invariant in two
ways. First, when we add the new kmer that is in this window, we compare it to the kmer at
the rear of the deque and remove that rear kmer if it is larger than the current kmer. We
continue in this fashion until we find a kmer smaller than the new kmer or the deque is empty
and we add the new kmer to the deque at the rear of the deque. Second, we find the minimizer
by pulling the first element from the deque. However, we must first verify this kmer is still in
the current window. If not, we remove it from the deque and use the next kmer as the
minimizer. We can see this takes linear time using the following charging scheme where we charge
each comparison to the item that was deleted or inserted into the deque. Each item is charged
twice, once for when it is inserted, and once for when it is deleted. Thus, the total number of
comparisons is linear in the number of kmers processed.
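The deque-based scan can be sketched as below, assuming kmers are already packed into integers. For simplicity this sketch reports one (leftmost) minimizer per window; the minlex,many variant described above would record every position achieving the minimum.

```python
from collections import deque

def window_minimizers(kmers, w):
    """Linear-time sliding-window minimum: kmers[i] is the (packed) kmer
    starting at position i; returns one (position, kmer) minimizer for
    each window of w consecutive kmers."""
    dq = deque()  # (position, kmer) pairs, increasing in both coordinates
    out = []
    for i, km in enumerate(kmers):
        # A rear kmer that is larger than the new kmer can never be the
        # minimizer of a future window, since the new kmer outlives it.
        while dq and dq[-1][1] > km:
            dq.pop()
        dq.append((i, km))
        # Discard the front entry once it slides out of the window.
        if dq[0][0] <= i - w:
            dq.popleft()
        if i >= w - 1:
            out.append(dq[0])
    return out

print(window_minimizers([3, 1, 4, 1, 5, 9, 2], 3))
# [(1, 1), (1, 1), (3, 1), (3, 1), (6, 2)]
```

As in the charging argument above, each kmer is inserted once and removed at most once, so the total number of comparisons is linear.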
Now we describe how to apply optimizations that reduce the number of kmers sampled
from the database, which leads to a significant speedup in query time. The first optimization is
to use randomized ordering instead of lexicographical ordering. To do this, we first create a
random kmer mask by uniformly selecting one of A, C, G, or T for each of its k positions. We
then view a kmer as a 2k-bit string. For any kmer, or equivalently 2k-bit string, we create a
new scrambled 2k-bit string by applying an exclusive-or (XOR) operation between every bit of
the 2k-bit string and the 2k-bit mask. We then sort all the scrambled 2k-bit strings to identify a
minimizer. For example, the bit string for the 4mer AACC is 00000101 under the encoding
A = 00, C = 01, G = 10, and T = 11. Let the random 4mer mask be CGAT,
which is equivalent to the 8-bit string 01100011. After applying the XOR operation between
AACC (00000101) and CGAT (01100011), the resulting scrambled bit string for AACC is
01100110. We refer to minimizer sampling that uses this randomized ordering as minrand.
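The scrambling step reduces to a single integer XOR. The sketch below (helper names ours) reproduces the AACC/CGAT example from the text.

```python
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(kmer):
    """Two-bit encode a kmer into an integer (its 2k-bit string)."""
    value = 0
    for base in kmer:
        value = (value << 2) | ENCODE[base]
    return value

def scrambled(kmer, mask):
    """XOR the 2k-bit kmer against the random 2k-bit mask; minimizers are
    then chosen by comparing these scrambled values."""
    return pack(kmer) ^ pack(mask)

# The example from the text: AACC (00000101) XOR CGAT (01100011) = 01100110.
print(format(scrambled("AACC", "CGAT"), "08b"))  # 01100110
```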
The second optimization prevents sampling duplicate minimizers in the indexing phase.
There are two occasions where standard minimizer sampling stores duplicate minimizers in the index.
The first occurs when the current minimizer is not part of the next window and we must
examine all w = L − k + 1 kmers in that window. If we find multiple minimizers, all are stored
in the index under the standard minimizer sampling strategy. To apply the duplicate removal
optimization, we store only the rightmost minimizer in the index. The second possibility for
storing duplicate minimizers occurs when the current minimizer for the previous window still
lies within the next window and is identical to the one new kmer for that window. In this
scenario, to remove duplicates, we do not store the duplicate copy in the index at this time.
However, we do track its position so that if no new minimizer is found before the current
minimizer moves out of the window, we can use this kmer to replace the current minimizer
and still do only one comparison for that window. At that time, we would store this
minimizer in the index if it is the minimizer of that window. We refer to
minimizer sampling that uses duplicate removal as minone.
Indexing and querying
In the indexing phase, we create a kmer index for a given database and sampling method as
follows. We sample kmers and their occurrences from each sequence based on the selected
sampling method (fix, minrand,one, minrand,many, and minlex,one). We save the sampled kmers
into the index dictionary, and for each kmer occurrence, we update the corresponding list of
kmer occurrences.
We then proceed to the querying phase where we sequentially process each query sequence.
If we use fixed sampling (fix), we extract all kmer occurrences from query sequence q. For all
minimizer methods minrand,one, minrand,many, and minlex,one, we extract the minimum kmer
occurrences including duplicates from each window in q.
Once we extract the kmer occurrences from q, we use the index to find shared kmer
occurrences and then MEMs as follows. For every kmer occurrence in q, we check if the
kmer is in the index dictionary. If the kmer is found, then we use the kmer's associated list of
occurrences to find all shared kmer occurrences between q and the database of sequences.
We perform this search in a manner similar to NCBI BLAST with the goal of minimizing
the number of database read operations. Specifically, we group the shared kmer occurrences
between q and the database by database sequence ID s. For the list of kmer occurrences shared between
q and s, we sort them first by kmer and then by position. We store this information
in a hash table keyed by s, where each entry is a pointer to the sorted list of
shared kmer occurrences. We then read each relevant database sequence s exactly once and
process all the corresponding shared kmers in kmer order.
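The grouping scheme might be sketched as follows. This is a hypothetical sketch (the paper does not give implementation details): each shared occurrence is grouped under its database sequence ID, and each group is sorted by kmer and then by position.

```python
from collections import defaultdict

def group_shared_occurrences(query_kmers, index):
    """query_kmers: (kmer, query ending position) pairs extracted from q.
    index: kmer -> set of (database sequence ID, ending position).
    Returns s -> sorted list of (kmer, db end pos, query end pos), so that
    each relevant database sequence s is read exactly once."""
    by_seq = defaultdict(list)
    for kmer, q_end in query_kmers:
        for s, db_end in index.get(kmer, ()):
            by_seq[s].append((kmer, db_end, q_end))
    for hits in by_seq.values():
        hits.sort()  # kmer order first, then positional order
    return by_seq

toy_index = {5: {(0, 49), (0, 12)}, 7: {(1, 99)}}
groups = group_shared_occurrences([(5, 30), (7, 80)], toy_index)
print(dict(groups))  # {0: [(5, 12, 30), (5, 49, 30)], 1: [(7, 99, 80)]}
```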
For every query sequence q and database sequence s, we report any shared kmer
occurrence that can be extended to length at least L as an MEM. Our extension method is similar to
that of Khiste and Ilie [21]. Before we try to extend a shared kmer occurrence, we first check whether
it is contained within our list of discovered MEMs, which is, of course, initially empty. If so, we
skip this shared kmer occurrence and move on to the next one. If not, we try to extend
the kmer in both directions to see if it is part of an MEM of length at least L. If the extension
succeeds, we add the new MEM to our list of discovered MEMs. If the extension fails, we report
this shared kmer occurrence as a false positive. This ensures we only extend one shared kmer
occurrence within any MEM.
We check whether a shared kmer occurrence is part of a discovered MEM using the following
properties. A shared kmer occurrence (q, j′q, s, j′s), ending at position j′q in q and at position
j′s in s, is part of a discovered MEM = (q(iq, jq), s(is, js)) if the following conditions hold:
(1) iq ≤ j′q − k + 1 < j′q ≤ jq, (2) is ≤ j′s − k + 1 < j′s ≤ js, and
(3) j′q − k + 1 − iq = j′s − k + 1 − is. Checking these
conditions takes constant time per discovered MEM, and typically the number of
discovered MEMs per pair of sequences q and s is small, so this verification step typically takes
constant time.
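These conditions can be implemented as a constant-time check. In this sketch (names ours), condition (3) is written in the equivalent form j′q − iq = j′s − is.

```python
def contained_in_mem(jq_p, js_p, k, mem):
    """Does the shared kmer occurrence ending at jq_p in q and js_p in s
    lie inside the discovered MEM (iq, jq, is_, js) at the same offset?"""
    iq, jq, is_, js = mem
    return (iq <= jq_p - k + 1 and jq_p <= jq       # (1) inside the q span
            and is_ <= js_p - k + 1 and js_p <= js  # (2) inside the s span
            and jq_p - iq == js_p - is_)            # (3) same relative offset

# A MEM covering q[10..60] and s[110..160]: a kmer of length 12 ending at
# (30, 130) lies inside it at the same offset; one ending at (30, 131) does not.
print(contained_in_mem(30, 130, 12, (10, 60, 110, 160)))  # True
print(contained_in_mem(30, 131, 12, (10, 60, 110, 160)))  # False
```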
Experimental setting and evaluation metrics
Database and query sets. We consider only nucleotide datasets. Our database is the
human genome. We use three query sets: the mouse genome, the chimp genome and an NGS
dataset. All the datasets are publicly available. The genome datasets can be downloaded from
UCSC (http://hgdownload.cse.ucsc.edu). The NGS dataset can be downloaded from Sequence
Read Archive (SRA) on the NCBI website (https://www.ncbi.nlm.nih.gov/sra). Table 2
describes each dataset used. According to Koning et al. [42], two-thirds of the human genome
consists of repetitive sequences. The mouse genome is also known to contain many
repeats [4, 43]. It is unclear whether this is the case for the chimp and NGS datasets.
Table 2 note: The database is a single large set. Each query set is partitioned into 1000 small query sets, where each small set has the indicated number of processed sequences
(the last set may have fewer sequences). The processed sequences have length 1000 (the last processed sequence from each original sequence may be shorter).
Performing the queries using each query set directly would require substantial computing
power, memory, and time. For example, the number of MEMs of length at least 50 is more than
two billion when we compare the human and chimp genomes. Khiste and Ilie [21]
proposed to store the MEMs in files based on their starting positions in the query genome; later,
each file is sorted, duplicates are removed, and MEMs are output in the right order. In our
case, we preprocess all the datasets by dividing every sequence into non-overlapping
sequences of length 1000. Therefore, we process more but smaller sequences while keeping the
list of MEMs for each sequence manageable in RAM. This preprocessing also supports
parallelization of queries on MSU's High Performance Computing Cluster. The number of resulting
sequences is shown in Table 2. We also only save letters in {A, C, G, T}; that is, ambiguous
characters are removed. For each preprocessed query set, we partition the set into 1000 query
sets of equal size (except the last set, which may be slightly smaller). We recognize that we may not be
able to find MEMs that extend across the preprocessed sequences, but this should not
significantly change our results.
For each of our preprocessed query sets, we compute the actual number of MEMs between
that query set and the preprocessed human genome for both choices of L. These results are
shown in Table 3.
Index parameters and metrics. We study the impact of the sampling methods on the
kmer index creation phase. We consider the following sampling methods: fix, minrand,one,
minlex,one, and minrand,many. We use the index to find all MEMs of length at least L where
L ∈ {50, 100}. For each sampling method, we create a set of indexes for k ∈ [12, 32]. We
consider L = 50 and L = 100 because both are frequently used in biological applications that
compare the mouse and chimp genomes to the human genome [20, 21] or map an NGS
dataset against the human genome [8, 22].
For each index, we report the dictionary size, the lists size, and the total index size, which is the
sum of the dictionary and lists sizes. The dictionary size is measured by counting the number of
kmers. The lists size is measured by counting the number of kmer occurrences in all lists. We
also report the index construction time.
Table 3. The number of MEMs of length at least L between each preprocessed query set and the preprocessed human genome.
Query set   L = 50          L = 100
Mouse       838,857,328     428,609
Chimp       2,077,183,744   101,868,611
NGS         940,731         457,512
Querying parameters and metrics. The index is used to find all MEMs of length at least L
where L ∈ {50, 100}. For each L and each sampling method, we use a set of kmer indexes
where k ∈ {12, 16, 20, 24, 28, 32}. The total number of indexes considered is 2 × 4 × 6 = 48
indexes. All the indexes give the same final results, namely all MEMs of length at least L.
For each query phase, we report the time and the number of "false positives". The number
of false positives for a query set is the number of shared kmer occurrences that failed to be
extended to an MEM of length at least L. Recall that we partition each query set into 1000 query
set partitions. The reported time is the sum of the times that an index needs to answer all queries
in all query set partitions. Likewise, the number of false positives for a query set is the sum of
the numbers of false positives for all queries in all query set partitions.
System specification/configuration. We run the experiments on a cluster that runs the
Community Enterprise Operating System (CentOS) 6.6. The cluster has 24 nodes, where each
node has two 2.5 GHz 10-core Intel Xeon E5-2670 v2 processors, 256 GB RAM, and a 500 GB
local disk.
Results and discussion
Index size and index construction time
Fixed sampling (fix) produces indexes that are less than half the size of those produced by all
minimizer sampling methods (minrand,one, minrand,many, and minlex,one) for almost all choices of k.
Likewise, we can construct fixed sampling's index roughly 1.5 to 1.9 times as fast as we can
construct minimizer's index. We provide full index size and construction time results in Fig 1.
We now explore why fixed sampling produces indexes that are roughly half the size of
indexes produced by minimizer sampling minrand,one. We start with the size of the occurrence
lists. For a fixed value of k, we can accurately predict the size of fixed sampling's kmer
occurrence lists because 1/w of the total number of kmer occurrences will be sampled. For
minimizer sampling, Roberts et al. [23] showed that for random sequences, the number of minimizers would be
roughly 2/(w + 1) of the total number of kmer occurrences. Essentially, each minimizer
covers roughly half a window of length w rather than a full window of length w as we get
from fixed sampling. Thus, we would expect fixed sampling to produce occurrence lists that
are roughly (w + 1)/(2w) the size of the occurrence lists produced by minimizer sampling; that
is, the occurrence lists should be just more than half the size. In our experiments, we see that
the number of sampled occurrences for fixed sampling divided by the number of sampled
occurrences for minimizer sampling ranges from 48% to 55% for both L = 50 and L = 100,
which is consistent with this expectation.
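The expected ratio can be checked numerically. A small sketch under the stated random-sequence assumption: fixed sampling keeps 1/w of all kmer occurrences, minimizer sampling about 2/(w + 1), so their ratio is (w + 1)/(2w).

```python
def expected_occurrence_ratio(L, k):
    """Expected (fixed sampled occurrences) / (minimizer sampled
    occurrences) = (1/w) / (2/(w + 1)) = (w + 1)/(2w), for w = L - k + 1."""
    w = L - k + 1
    return (w + 1) / (2 * w)

# For every (L, k) pair used in the experiments, the predicted ratio is
# just over one half, matching the observed 48% to 55% range.
for L in (50, 100):
    for k in (12, 16, 20, 24, 28, 32):
        print(L, k, round(expected_occurrence_ratio(L, k), 3))
```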
Minimizer sampling minrand,many has essentially identical results to minimizer sampling
minrand,one with respect to the size of occurrence lists; the optimization to remove duplicate
minimizers from a window does not have much effect on the total number of sampled
occurrences. Minimizer sampling minlex,one produces indexes that are larger than those of
minimizer sampling minrand,one with respect to the size of occurrence lists; the optimization of using
randomized ordering instead of lexicographical ordering effectively reduces oversampling of
the same kmer in regions with many repeats, resulting in a 15% to 20% reduction for L = 50 and
L = 100, respectively.
We now consider the dictionary size. For the smallest values of k that we consider, namely
12 to 15, minimizer sampling typically has much smaller dictionaries than fixed sampling. For k = 12 and
L = 100, minimizer sampling's dictionary is almost one-sixth the size of fixed sampling's dictionary. For
these small values of k, many of the sampled kmers are chosen many times, and this is
especially true for minimizer sampling, which leads to its smaller dictionary. However, for these k values,
because many of the sampled kmers are chosen many times, the dictionaries are much smaller
Fig 1. The dictionary sizes, lists sizes, and construction times for kmer indexes built using fixed sampling (fix) and minimizer sampling minrand,one,
minrand,many, and minlex,one. For parts (a), (b), and (c), we use L = 50. For parts (d), (e), and (f), we use L = 100. For all graphs, 12 ≤ k ≤ 32.
than the occurrence lists, so fixed sampling still has a total index size that is roughly half that of
minimizer. For example, for k = 12 and L = 50, for fixed sampling, each kmer in the dictionary
appears roughly 6.5 times in the occurrence lists whereas for minimizer sampling minrand,one,
each kmer in the dictionary appears roughly 52 times in the occurrence lists.
Once we consider k ≥ 16, for fixed sampling, each dictionary consists of mostly unique
kmers. For example, for k = 16 and L = 50, each kmer in fixed sampling's dictionary appears
roughly 1.23 times in the occurrence lists. For minimizer sampling minrand,one, this starts to
happen around k = 21. For example, for k = 21 and L = 50, each kmer in minimizer sampling's
dictionary appears roughly 1.24 times in the occurrence lists. By the time k = 32, each
dictionary kmer appears fewer than 1.13 times in the occurrence lists for both fixed sampling and
minimizer sampling. This implies that for large k, the dictionary size is comparable to the
occurrence lists size. Specifically, we see that fixed sampling's dictionaries are roughly half the
size of minimizer sampling's dictionaries for k ≥ 21 for both L = 50 and L = 100. Finally, we
note that the dictionary size for minimizer sampling minrand,many is identical to that of
minimizer sampling minrand,one, as minrand,many only omits some repeated occurrences of the same
kmer. Minimizer sampling minlex,one produces dictionaries that are larger than those of minimizer
sampling minrand,one, again because randomized ordering reduces oversampling of the same kmer in regions with
many repeats. For example, when k > 16, the reduction ranges from 12% to 23% for L = 50 and
L = 100, respectively.
We note that for all sampling methods, increasing the value of k increases the size of the
index. This is expected, since the sampling step w = L − k + 1 decreases as k increases.
Finally, fixed sampling's faster construction time is easily explained. First, the number of
sampled occurrences is less than half as many as for minimizer sampling. Second, no comparisons
are needed; fixed sampling simply grabs every wth kmer, whereas minimizer sampling must
consider every kmer and perform comparisons to determine whether new kmers are minimizers.
However, as we show later, this reduction has a significant impact on reducing query time.
Query time
We first start with our query time results. Our full query time results for each of our three
sampling methods for L = 50 and L = 100 are shown in Tables 4 and 5.
Our key query time result is that for the same k values and query data with many repeats, such
as the mouse genome, minrand,one processes queries significantly faster than fixed sampling,
especially for commonly used small k values like 12 and 16 [4, 17, 44–49]. For example, when
k = 12 and k = 16, minrand,one answers the queries 126.14% and 369.14% faster, on average, than
fixed sampling for L = 50 and L = 100, respectively. For large k, k > 16, minrand,one processes
the queries 58.36% and 271.64% faster, on average, than fixed sampling for L = 50 and L = 100,
respectively. When there are only a few repeats in the query data, such as in the chimp genome
and NGS datasets, and for small k values, minrand,one is 15.15% to 37.65% faster, on average,
than fixed sampling. On the other hand, when k > 16, minrand,one is 7.73% to
19.82% slower, on average, than fixed sampling.
While we observe that minimizer sampling processes queries faster than fixed sampling for
the same choice of k, we also observe that minimizer sampling uses more space than fixed
sampling for the same choice of k. We will later compare minimizer sampling with fixed sampling
when they are restricted to indexes of the same size to determine which is indeed faster. When
exploring this tradeoff, we find that fixed sampling is faster than minimizer sampling when
both methods have equal-sized indexes.
Our next query time result is that increasing k significantly decreases the query processing time
of all methods. For all query sets, increasing k from 12 to 16 reduces the query time of fixed
sampling and minrand,one by roughly 3 to 5 times; the one exception is minrand,one with the mouse
genome query set for L = 50 and L = 100, where the reduction is only 7.89% and 17.96%,
respectively. For all query sets and all methods, increasing k by an additive factor of 4 above 16
roughly halves the method's query processing time, with a couple of outliers in both directions.
Our final query time result is that the optimization using randomized ordering is significantly
more effective than the optimization that removes duplicate minimizers, especially for large L and
small k values. For the mouse genome and for k = 12 and k = 16, the randomized ordering is
360.46% to 1508.86% faster, on average, than lexicographical ordering. For the chimp and
NGS datasets and k = 12 and k = 16, the randomized ordering is 81.62% to 280.17% faster, on
average, than lexicographical ordering. On the other hand, for the mouse genome and k = 12 and
k = 16, duplicate minimizer removal is only 14.94% to 173.93% faster, on average, than
standard minimizer sampling without duplicate removal. For the chimp and NGS datasets,
duplicate minimizer removal does not improve minimizer sampling's query time; the one exception
is for the chimp data when L = 100 and k = 12 and k = 16, where duplicate removal is 17.16% faster,
on average, than standard minimizer sampling without duplicate removal.
The theoretical vs. empirical query time expectation. Because minimizer sampling only
tests some kmers extracted from the query sequence to see if they are shared kmers, one
might expect that minimizer sampling would process queries as much as w times faster than
fixed sampling. However, the query time results show this is not the case; the speedup is
typically much less than w and often less than twice as fast. This is explained by counting the total
number of shared kmer occurrences found by both fixed and minrand,one. These counts are
shown in Tables 6 and 7.
Recall that fixed sampling will test all q − k + 1 kmers from a query sequence q. On the other
hand, the expected number of query kmers that minimizer sampling will test is 2(q − k + 1)/
(w + 1), roughly a factor of (w + 1)/2 fewer, if the sequences are generated uniformly at random
[23]. Each tested kmer generates x shared kmer occurrences, where x is the length of that
kmer's occurrence list in the index. If, on average, x = c for fixed sampling, then we expect
x = 2c for minimizer sampling since minimizer sampling's occurrence lists are roughly twice
as long. Then fixed sampling will process c(q − k + 1) shared kmer occurrences while
minimizer sampling will process 4c(q − k + 1)/(w + 1). For example, for q = 1000 and L = 100, we
would therefore expect minimizer sampling to be roughly 95% faster
than fixed sampling when 20 ≤ k ≤ 32. However, this is not what we observe: for the chimp and
NGS datasets, the empirical results show that minimizer sampling is 7.73% to 14.24% slower
than fixed sampling.
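The back-of-the-envelope expectation can be computed directly. A sketch under the same random-sequence assumption: minimizer sampling tests 2/(w + 1) as many query kmers, but each matching occurrence list is about twice as long.

```python
def expected_work_ratio(L, k):
    """Expected (minimizer shared occurrences) / (fixed shared occurrences)
    = (2/(w + 1)) * 2 = 4/(w + 1), for w = L - k + 1."""
    w = L - k + 1
    return 4 / (w + 1)

# For L = 100 and 20 <= k <= 32, the expected saving is roughly 95%,
# far more than what is actually observed on the chimp and NGS data.
for k in (20, 24, 28, 32):
    print(k, round(1 - expected_work_ratio(100, k), 3))
```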
To understand the query time, we need to compute the average size of x for kmers that are
in the index dictionary. We show this in Tables 8 and 9 for each sampling method; specifically,
these tables show the mean and the standard deviation of occurrence list lengths in each index.
For L = 100 and k = 12, the mean length of a minimizer sampling occurrence list is just over 13
times larger than the mean length of a fixed sampling occurrence list. For k = 16, this falls to
roughly 1.55 times larger, and for larger k, this falls to just a bit larger. Even more dramatic, the
standard deviation of minimizer sampling's occurrence list lengths ranges from 6.58 times up to 21.5
times larger than the standard deviation of fixed sampling's occurrence list lengths.
What this shows is that some kmers in minimizer sampling have very large occurrence
lists. Furthermore, the kmers that have large occurrence lists are exactly the kmers that are
most likely to be extracted from a query sequence, since the sampling method is biased to
choose them. This explains why, despite testing relatively few query kmers, minimizer
sampling has much larger query times than expected theoretically.
Space and speed
We summarize our comparison of the sampling methods by plotting the space and speed
of the resulting index for each query set and both choices of L in Fig 2.
If we require both methods to use the same space regardless of k, we find that fixed sampling
typically answers queries at least as fast as minrand,one and is often faster. For example, the
index created using fixed sampling with k = 16 has roughly the same size as the index created
using minimizer sampling with k = 12, yet fixed sampling is 37.43%, 51.11%, and
44.52% faster than minimizer sampling for the mouse, chimp, and NGS datasets, respectively.
Combined with the fact that fixed sampling is much simpler than minimizer sampling, fixed
sampling is always the best choice if we can choose k to optimize both time and space.
Finally, we observe that the randomized ordering optimization is more effective than the
duplicate removal optimization. We can see that for all k values we consider, the effectiveness
of minrand,many and minrand,one are very similar. On the other hand, minlex,one is significantly
worse than both minrand,one and minrand,many.
Conclusion
We now summarize our main conclusions. When comparing fixed and minimizer sampling,
our results show that if we use the same kmer size, minimizer sampling is only slightly faster
than fixed sampling, despite sampling from the query sequence. If we are allowed to use any
kmer size for each method, then we can choose a kmer size such that fixed sampling both uses
less space and processes queries faster than minimizer sampling. As is common in many
applications, there is a space versus speed tradeoff. Using a larger k requires more space but results
in smaller query times. The key benefit of increasing k is that there are many fewer false
positives which leads to much faster query processing. On average, the reduction in query times
for all query sets when k = 32 compared to k = 12 is 37 and 136 times faster for fixed sampling
and minimizer sampling, respectively.
Fig 2. Comparing the space and speed of fixed sampling (fix) and minimizer sampling minrand,one, minrand,many, and minlex,one. We use L = 50 for the three
upper figures and L = 100 for the three lower figures. The values for minlex,one when k = 12 are very large and are omitted from the figure.
Acknowledgments
We thank the anonymous reviewers for their very helpful comments that greatly improved the
paper. This work was supported in part by Michigan State University through computational
resources provided by the Institute for CyberEnabled Research. We also would like to
acknowledge AlImam Muhammad ibn Saud Islamic University for support of this work.
Author Contributions
Conceptualization: Meznah Almutairy, Eric Torng.
Data curation: Meznah Almutairy.
Investigation: Meznah Almutairy.
Methodology: Meznah Almutairy, Eric Torng.
Project administration: Meznah Almutairy, Eric Torng.
Software: Meznah Almutairy.
Supervision: Eric Torng.
Validation: Meznah Almutairy, Eric Torng.
Visualization: Meznah Almutairy, Eric Torng.
Writing ± original draft: Meznah Almutairy, Eric Torng.
Writing ± review & editing: Meznah Almutairy, Eric Torng.
17. Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology. 2014;15(3):R46. https://doi.org/10.1186/gb-2014-15-3-r46 PMID: 24580807
1. Pearson WR , Lipman DJ . Improved tools for biological sequence comparison . Proceedings of the National Academy of Sciences . 1988 ; 85 ( 8 ): 2444 ± 2448 . https://doi.org/10.1073/pnas.85.8. 2444
2. Altschul SF , Madden TL , SchaÈffer AA , Zhang J , Zhang Z , Miller W , et al. Gapped BLAST and PSIBLAST: A new generation of protein database search programs . Nucleic Acids Research . 1997 ; 25 ( 17 ): 3389 ± 3402 . https://doi.org/10.1093/nar/25.17.3389 PMID: 9254694
3. Zhang Z , Schwartz S , Wagner L , Miller W. A greedy algorithm for aligning DNA sequences . Journal of Computational Biology . 2000 ; 7 ( 1 2): 203 ± 214 . https://doi.org/10.1089/10665270050081478 PMID: 10890397
4. Morgulis A , Coulouris G , Raytselis Y , Madden TL , Agarwala R , SchaÈffer AA. Database indexing for production MegaBLAST searches . Bioinformatics . 2008 ; 24 ( 16 ): 1757 ± 1764 . https://doi.org/10.1093/ bioinformatics/btn322 PMID: 18567917
5. Irizarry K , Kustanovich V , Li C , Brown N , Nelson S , Wong W , et al. Genomewide analysis of singlenucleotide polymorphisms in human expressed sequences . Nature Genetics . 2000 ; 26 ( 2 ): 233 ± 236 . https://doi.org/10.1038/79981 PMID: 11017085
6. Sachidanandam R , Weissman D , Schmidt SC , Kakol JM , Stein LD , Marth G , et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms . Nature . 2001 ; 409 ( 6822 ): 928 ± 933 . https://doi.org/10.1038/35057149 PMID: 11237013
7. Ng PC , Henikoff S. Predicting deleterious amino acid substitutions . Genome Research . 2001 ; 11 ( 5 ): 863 ± 874 . https://doi.org/10.1101/gr.176601 PMID: 11337480
8. Kent WJ . BLATthe BLASTlike alignment tool . Genome Research . 2002 ; 12 ( 4 ): 656 ± 664 . https://doi. org/10.1101/gr.229202 PMID: 11932250
9. Ning Z , Cox AJ , Mullikin JC . SSAHA: A fast search method for large DNA databases . Genome Research . 2001 ; 11 ( 10 ): 1725 ± 1729 . https://doi.org/10.1101/gr.194201 PMID: 11591649
10. Wu TD , Watanabe CK . GMAP: A genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics . 2005 ; 21 ( 9 ): 1859 ± 1875 . https://doi.org/10.1093/bioinformatics/bti310 PMID: 15728110
11. Simpson JT , Durbin R . Efficient construction of an assembly string graph using the FMindex . Bioinformatics . 2010 ; 26 ( 12 ):i367± i373 . https://doi.org/10.1093/bioinformatics/btq217 PMID: 20529929
12. Pell J , Hintze A , CaninoKoning R , Howe A , Tiedje JM , Brown CT . Scaling metagenome sequence assembly with probabilistic de Bruijn graphs . Proceedings of the National Academy of Sciences . 2012 ; 109 ( 33 ): 13272 ± 13277 . https://doi.org/10.1073/pnas.1121464109
13. Peterlongo P, Chikhi R. Mapsembler, targeted and micro assembly of large NGS datasets on a desktop computer. BMC Bioinformatics. 2012;13(1):48. https://doi.org/10.1186/1471-2105-13-48 PMID: 22443449
14. Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26(19):2460–2461. https://doi.org/10.1093/bioinformatics/btq461 PMID: 20709691
15. Ghodsi M, Liu B, Pop M. DNACLUST: Accurate and efficient clustering of phylogenetic marker genes. BMC Bioinformatics. 2011;12(1):271. https://doi.org/10.1186/1471-2105-12-271 PMID: 21718538
16. Li W, Godzik A. Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–1659. https://doi.org/10.1093/bioinformatics/btl158 PMID: 16731699
18. Ames SK, Hysom DA, Gardner SN, Lloyd GS, Gokhale MB, Allen JE. Scalable metagenomic taxonomy classification using a reference genome database. Bioinformatics. 2013;29(18):2253–2260. https://doi.org/10.1093/bioinformatics/btt389 PMID: 23828782
19. Diaz NN, Krause L, Goesmann A, Niehaus K, Nattkemper TW. TACOA: Taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach. BMC Bioinformatics. 2009;10(1):56. https://doi.org/10.1186/1471-2105-10-56 PMID: 19210774
20. Vyverman M, De Baets B, Fack V, Dawyndt P. essaMEM: Finding maximal exact matches using enhanced sparse suffix arrays. Bioinformatics. 2013;29(6):802–804. https://doi.org/10.1093/bioinformatics/btt042 PMID: 23349213
21. Khiste N, Ilie L. E-MEM: Efficient computation of maximal exact matches for very large genomes. Bioinformatics. 2015;31(4):509–514. https://doi.org/10.1093/bioinformatics/btu687 PMID: 25399029
22. Vyverman M, De Baets B, Fack V, Dawyndt P. A long fragment aligner called ALFALFA. BMC Bioinformatics. 2015;16(1):159. https://doi.org/10.1186/s12859-015-0533-0 PMID: 25971785
23. Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA. Reducing storage requirements for biological sequence comparison. Bioinformatics. 2004;20(18):3363–3369. https://doi.org/10.1093/bioinformatics/bth408 PMID: 15256412
24. Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, et al. Versatile and open software for comparing large genomes. Genome Biology. 2004;5(2):R12. https://doi.org/10.1186/gb-2004-5-2-r12 PMID: 14759262
25. Abouelhoda MI, Kurtz S, Ohlebusch E. Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms. 2004;2(1):53–86. https://doi.org/10.1016/S1570-8667(03)00065-0
26. Khan Z, Bloom JS, Kruglyak L, Singh M. A practical algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays. Bioinformatics. 2009;25(13):1609–1616. https://doi.org/10.1093/bioinformatics/btp275 PMID: 19389736
27. Schleimer S, Wilkerson DS, Aiken A. Winnowing: local algorithms for document fingerprinting. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data. ACM; 2003. p. 76–85.
28. Li H. Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences. Bioinformatics. 2016;32(14):2103–2110. https://doi.org/10.1093/bioinformatics/btw152 PMID: 27153593
29. Jain C, Dilthey A, Koren S, Aluru S, Phillippy AM. A fast approximate algorithm for mapping long reads to large reference databases. In: International Conference on Research in Computational Molecular Biology. Springer; 2017. p. 66–81.
30. Ye C, Ma ZS, Cannon CH, Pop M, Douglas WY. Exploiting sparseness in de novo genome assembly. BMC Bioinformatics. 2012;13(Suppl 6):S1. https://doi.org/10.1186/1471-2105-13-S6-S1
31. Movahedi NS, Forouzmand E, Chitsaz H. De novo co-assembly of bacterial genomes from multiple single cells. In: Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on. IEEE; 2012. p. 1–5.
32. Chikhi R, Limasset A, Jackman S, Simpson JT, Medvedev P. On the representation of de Bruijn graphs. In: Research in Computational Molecular Biology. Springer; 2014. p. 35–55.
33. Li Y, Yan X. MSPKmerCounter: A fast and memory efficient approach for k-mer counting. arXiv preprint arXiv:1505.06550. 2015.
34. Deorowicz S, Kokot M, Grabowski S, Debudaj-Grabysz A. KMC 2: Fast and resource-frugal k-mer counting. Bioinformatics. 2015;31(10):1569–1576. https://doi.org/10.1093/bioinformatics/btv022 PMID: 25609798
35. Zerbino DR, Birney E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research. 2008;18(5):821–829. https://doi.org/10.1101/gr.074492.107 PMID: 18349386
36. Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I. ABySS: A parallel assembler for short read sequence data. Genome Research. 2009;19(6):1117–1123. https://doi.org/10.1101/gr.089532.108 PMID: 19251739
37. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Research. 2010;20(2):265–272. https://doi.org/10.1101/gr.097261.109 PMID: 20019144
38. Li Y, Kamousi P, Han F, Yang S, Yan X, Suri S. Memory efficient minimum substring partitioning. In: Proceedings of the VLDB Endowment. vol. 6(3). VLDB Endowment; 2013. p. 169–180.
39. Orenstein Y, Pellow D, Marçais G, Shamir R, Kingsford C. Compact universal k-mer hitting sets. In: International Workshop on Algorithms in Bioinformatics. Springer; 2016. p. 257–268.
40. Almutairy M, Torng E. The effects of sampling on the efficiency and accuracy of k-mer indexes: Theoretical and empirical comparisons using the human genome. PLOS ONE. 2017;12(7):e0179046. https://doi.org/10.1371/journal.pone.0179046 PMID: 28686614
41. Smith KC. Sliding window minimum implementations; 2016. https://people.cs.uct.ac.za/~ksmith/articles/sliding_window_minimum.html#id2
42. de Koning AJ, Gu W, Castoe TA, Batzer MA, Pollock DD. Repetitive elements may comprise over two-thirds of the human genome. PLOS Genetics. 2011;7(12):e1002384. https://doi.org/10.1371/journal.pgen.1002384
43. Morgulis A, Gertz EM, Schäffer AA, Agarwala R. WindowMasker: Window-based masker for sequenced genomes. Bioinformatics. 2006;22(2):134–141. https://doi.org/10.1093/bioinformatics/bti774 PMID: 16287941
44. Hach F, Hormozdiari F, Alkan C, Hormozdiari F, Birol I, Eichler EE, et al. mrsFAST: A cache-oblivious algorithm for short-read mapping. Nature Methods. 2010;7(8):576–577. https://doi.org/10.1038/nmeth0810-576 PMID: 20676076
45. Alkan C, Kidd JM, Marques-Bonet T, Aksay G, Antonacci F, Hormozdiari F, et al. Personalized copy number and segmental duplication maps using next-generation sequencing. Nature Genetics. 2009;41(10):1061–1067. https://doi.org/10.1038/ng.437 PMID: 19718026
46. Rumble SM, Lacroute P, Dalca AV, Fiume M, Sidow A, Brudno M. SHRiMP: Accurate mapping of short color-space reads. PLOS Computational Biology. 2009;5(5):e1000386. https://doi.org/10.1371/journal.pcbi.1000386
47. Ahmadi A, Behm A, Honnalli N, Li C, Weng L, Xie X. Hobbes: Optimized gram-based methods for efficient read alignment. Nucleic Acids Research. 2012;40(6):e41. https://doi.org/10.1093/nar/gkr1246 PMID: 22199254
48. Hormozdiari F, Hach F, Sahinalp SC, Eichler EE, Alkan C. Sensitive and fast mapping of di-base encoded reads. Bioinformatics. 2011;27(14):1915–1921. https://doi.org/10.1093/bioinformatics/btr303 PMID: 21586516
49. Weese D, Emde AK, Rausch T, Döring A, Reinert K. RazerS: Fast read mapping with sensitivity control. Genome Research. 2009;19(9):1646–1654. https://doi.org/10.1101/gr.088823.108 PMID: 19592482