On-line Searching in IUPAC Nucleotide Sequences.
Bioinformatics (2019)
Abstract
We propose a novel pattern matching algorithm for consensus nucleotide sequences over the IUPAC alphabet, called BADPM (Byte-Aligned Degenerate Pattern Matching). A consensus nucleotide sequence represents the consensus obtained by sequencing a population of the same species, and such sequences are treated as so-called degenerate strings. BADPM works at the level of single bytes and achieves sublinear search time on average. The algorithm is based on tabulating all possible factors of the searched pattern. It needs a data structure of O(m + mα² log m) space and O(mα²) preprocessing time, where m is the length of the pattern and α is the maximum number of variants implied by a 4-gram over the IUPAC alphabet. The worst-case locate time of BADPM is bounded by O(nm²α⁴), where n is the length of the input text. However, experiments performed on real genomic data showed sublinear search time. BADPM can easily be combined with a block q-gram inverted index to achieve even better locate time. We implemented two other pattern matching algorithms for IUPAC nucleotide sequences as baselines: Boyer-Moore-Horspool (BMH) and Parallel Naive Search (PNS). PNS in particular proved efficient and insensitive to the length m of the searched pattern. BADPM proved strongly superior for searching medium-length and long patterns.
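To make the notion of a degenerate string concrete, here is a minimal Python sketch of naive matching over the IUPAC alphabet, together with the expansion of a q-gram into its concrete variants (the quantity that α bounds for 4-grams). This is not BADPM and not the authors' code. The names `symbols_match`, `naive_locate`, and `expand_variants` are hypothetical, and the rule that two symbols match when their base sets intersect is an assumption about the matching semantics.

```python
from itertools import product

# Standard IUPAC nucleotide codes as 4-bit masks: A=1, C=2, G=4, T=8.
IUPAC = {
    "A": 1, "C": 2, "G": 4, "T": 8,
    "R": 1 | 4, "Y": 2 | 8, "S": 2 | 4, "W": 1 | 8,
    "K": 4 | 8, "M": 1 | 2,
    "B": 2 | 4 | 8, "D": 1 | 4 | 8, "H": 1 | 2 | 8, "V": 1 | 2 | 4,
    "N": 15,
}

def symbols_match(a, b):
    """Assumed semantics: two IUPAC symbols match if their base sets intersect."""
    return (IUPAC[a] & IUPAC[b]) != 0

def naive_locate(text, pattern):
    """Return all start positions where pattern matches text (O(n*m) baseline)."""
    n, m = len(text), len(pattern)
    return [
        i for i in range(n - m + 1)
        if all(symbols_match(text[i + j], pattern[j]) for j in range(m))
    ]

def expand_variants(qgram):
    """List every concrete A/C/G/T string that a degenerate q-gram can stand for."""
    bases = "ACGT"
    choices = [
        [b for k, b in enumerate(bases) if (IUPAC[s] >> k) & 1]
        for s in qgram
    ]
    return ["".join(p) for p in product(*choices)]

if __name__ == "__main__":
    print(naive_locate("ACGRTNAC", "GAT"))
    print(expand_variants("RN"))
```

The naive scan inspects every alignment, so it is only a reference point; the abstract's algorithm instead tabulates pattern factors to skip over text.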
Keywords
Consensus Nucleotide Sequences, Genomic Sequences, Degenerate Pattern Matching, q-gram Inverted Index