
Title:
Parallel computing for genome sequence processing.
Authors:
Zou Y; Hunan Provincial Key Lab of Bioinformatics, School of Computer Science and Engineering at Central South University, Changsha, China., Zhu Y; Hunan Provincial Key Lab of Bioinformatics, School of Computer Science and Engineering at Central South University, Changsha, China., Li Y; computer science at Old Dominion University, USA., Wu FX; College of Engineering and the Department of Computer Science at the University of Saskatchewan, Saskatoon, Canada., Wang J; School of Computer Science and Engineering at Central South University, Changsha, Hunan, China.
Source:
Briefings in bioinformatics [Brief Bioinform] 2021 Sep 02; Vol. 22 (5).
Publication Type:
Journal Article; Research Support, Non-U.S. Gov't; Review
Language:
English
Journal Info:
Publisher: Oxford University Press Country of Publication: England NLM ID: 100912837 Publication Model: Print Cited Medium: Internet ISSN: 1477-4054 (Electronic) Linking ISSN: 14675463 NLM ISO Abbreviation: Brief Bioinform Subsets: MEDLINE
Imprint Name(s):
Publication: Oxford : Oxford University Press
Original Publication: London ; Birmingham, AL : H. Stewart Publications, [2000-
Contributed Indexing:
Keywords: cluster computing; genome sequence processing; heterogeneous computing; multi-core computing; parallel computing
Entry Date(s):
Date Created: 20210406 Date Completed: 20211122 Latest Revision: 20211122
Update Code:
20250114
DOI:
10.1093/bib/bbab070
PMID:
33822883
Database:
MEDLINE

Further Information

The rapid increase of genome data brought by gene sequencing technologies poses a massive challenge to data processing. To solve the problems caused by enormous data and complex computing requirements, researchers have proposed many methods and tools which can be divided into three types: big data storage, efficient algorithm design and parallel computing. The purpose of this review is to investigate popular parallel programming technologies for genome sequence processing. Three common parallel computing models are introduced according to their hardware architectures, and each of which is classified into two or three types and is further analyzed with their features. Then, the parallel computing for genome sequence processing is discussed with four common applications: genome sequence alignment, single nucleotide polymorphism calling, genome sequence preprocessing, and pattern detection and searching. For each kind of application, its background is firstly introduced, and then a list of tools or algorithms are summarized in the aspects of principle, hardware platform and computing efficiency. The programming model of each hardware and application provides a reference for researchers to choose high-performance computing tools. Finally, we discuss the limitations and future trends of parallel computing technologies.
(© The Author(s) 2021. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.)


Parallel computing for genome sequence processing 


Keywords: parallel computing; multi-core computing; cluster computing; heterogeneous computing; genome sequence processing


Background

With the rapid development of gene sequencing technologies, sequentially running tools are increasingly unable to process the ever-growing quantities of biological data [[1], [3]]. Sequencing is now 50 000 times faster than it was in 2000, at only 1/25 000 of the cost [[4]]. Alongside this exponential growth of sequence data, many biological data analysis tools and frameworks have been developed for tasks such as sequence alignment, genome assembly, single nucleotide polymorphism (SNP) detection and genome-wide association studies (GWAS) [[5]]. These tools and frameworks share some common computing challenges: massive data processing, long runtimes and hardware architecture dependency. Thus, solving these computationally intensive problems has become increasingly crucial.

Yelick et al. [[6]] review parallel computing tools in the ExaBiome project and classify them by function: k-mer analysis, pairwise alignment, all-to-all alignment, graph traversal for genome assembly, sparse matrix operations for protein clustering, and machine learning for genomics and proteomics. Their review also summarizes seven parallelism motifs drawn from the classified tools and surveys the hardware and software supporting parallel genome analysis. Marco et al. introduce GPU-based parallel computing in bioinformatics, computational biology and systems biology. Their study reviews six common high-performance computing types and sums up the advantages and drawbacks of each. Tools supported by GPUs are introduced by function: sequence alignment, molecular dynamics, molecular docking, prediction and searching of molecular structures, simulation of spatio-temporal dynamics, etc. [[7]]. Zou et al. review MapReduce-based software and projects for next-generation sequencing (NGS) data processing, covering sequence alignment, mapping, assembly, gene expression analysis, SNP analysis, etc. [[8]]. Alex et al. [[9]] review parallel computing for detecting epistasis in genome data. Their work discusses this specific case across different computational methods, and a list of tools for each method is introduced along with their parallel models, advantages and disadvantages.

Among these studies, genome sequence processing is the most common operation in parallel computing. In many genome sequence processing applications, the input data can be naturally divided into multiple parts, each of which is processed separately, so parallel or distributed computing models can feasibly be applied to most genome sequence processing tasks. Many factors must be considered to obtain efficient parallel computing, because parallel computing models with different memory, network and I/O characteristics deliver different computing performance. In this review, we aim to help users choose suitable parallel computing models for their software designs or find applicable tools for their data processing. We first introduce the currently popular parallel programming models in genome sequence processing: the multi-core computing model, the cluster computing model and the heterogeneous computing model. Second, we briefly discuss and compare the main ideas of the three models and introduce applications adopting parallel computing under four common scenarios: genome sequence alignment, SNP calling, genome sequence preprocessing, and pattern detection and searching. We introduce the main steps that adopt a parallelized strategy in each scenario and summarize their classical and relatively new tools or algorithms. Finally, we discuss the challenges and current trends of parallel computing in genome sequence processing.

Parallel computing model

Parallel computing refers to the simultaneous use of multiple computing resources to solve computational problems. Specifically, a problem is divided into several parts that are solved with multiple processors. Parallel computing can make full use of program instructions, data structures and computer hardware to complete calculations efficiently. There are four common forms of parallelism according to granularity: bit-level, instruction-level, data-level and task-level parallelism [[10]], and they can work together in a parallel computing solution.

In 1966, Flynn [[11]] proposed a taxonomy of parallel computing based on the number of concurrent instruction (or control) streams and data streams available in the architecture. His four classifications are single instruction stream, single data stream (SISD); single instruction stream, multiple data streams (SIMD); multiple instruction streams, single data stream (MISD); and multiple instruction streams, multiple data streams (MIMD). According to their memory access models, microprocessors and interconnection networks, MIMD systems can be divided into five categories: symmetric multi-processor (SMP), parallel vector processor (PVP), massively parallel processor (MPP), distributed shared memory (DSM) and cluster. Modern parallel GPUs are SIMD systems, and most modern general-purpose CPUs also employ SIMD instructions and execution units.

To make the classification more specific, we further categorize the most common parallel computing models by hardware architecture into three types: the multi-core computing model, the cluster computing model and the heterogeneous computing model, as shown in Figure 1. The multi-core computing model can be divided into two categories according to working principle: multi-threading computing and multi-process computing. The cluster computing model can be classified into three categories according to the framework: Apache Hadoop [[12]], Spark [[13]] and traditional distributed parallel frameworks (message passing interface (MPI), grid computing, etc.). The heterogeneous computing model employs heterogeneous hardware to accelerate computation; GPUs, FPGAs and Xeon Phi are the most commonly used heterogeneous hardware. Each of these three models follows the SIMD or MIMD fashion depending on its implementation. Furthermore, MapReduce-based frameworks like Hadoop and Spark can be classified as single program, multiple data stream (SPMD) systems, a subcategory of MIMD. All of these models have particular features in both their hardware structures and their programming frameworks. For example, the heterogeneous computing model interacts with coprocessors through low-level directives, whereas the multi-core and cluster computing models are often implemented through high-level languages or library calls.

Figure 1: The difference in hardware architecture among a multi-core processor server, a computing cluster and a heterogeneous computing server.

Multi-core computing model

Multi-core processors integrate two or more independent central processing units (cores) in a single computing node. Each core of the CPU is an independent unit consisting of a control unit (CU), arithmetic logic unit (ALU), registers, caches, etc. In a multi-core computing system, processors read and write random access memory (RAM) to load data and instructions, or exchange data with other components such as the GPU and network card. Within a multi-core processor, a hierarchy of cache levels allows data to be accessed and exchanged quickly. In multi-core computing models, memory spaces are partially shared between processors, and some parts of the memory hierarchy can be read and written asynchronously; for example, the L1 and L2 caches are private per-core caches, while the L3 cache is shared among the cores of the processor. We use the term multi-core computing model for cases where programs run on multi-core processors within a single host, and classify such models into multi-threading computing and multi-process computing according to their programming models.

Multi-threading computing means that a CPU simultaneously executes multiple computing threads, while multi-process computing means that each process runs on an independent CPU or core. In multi-threading applications, all threads share resources including buffers, CPU caches and computing units. In a multi-threading system with multi-core processing units, thread-level and instruction-level parallelism are used to improve execution efficiency; this approach exploits otherwise unused computing resources, avoiding resource idleness and accelerating the overall execution. Processes in multi-process applications are often created through a fork() call; they execute instructions independently and share data via interprocess communication. Usually, the computing efficiency of multi-process is lower than that of multi-threading on the same host, but multi-processing has an advantage in scalability: a multi-process program can run on cores of multiple hosts through communication mechanisms like MPI, whereas multi-threading computing is confined to a single host. To combine their advantages, multi-threading and multi-process computing can be used together to make the best of the computing resources. For example, in a multi-host cluster environment, a parallel application can be implemented in multi-threading mode with the Open Multi-Processing (OpenMP) API, while MPI handles interaction between processes on different hosts. In this way, the application runs in multi-process mode across hosts and in multi-threading mode across the CPU cores of each host.
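As an illustration of the data-level, multi-process pattern described above, here is a minimal Python sketch that splits reads across worker processes with the standard multiprocessing module; the FASTA path and the per-read GC-content task are hypothetical stand-ins for a real workload.

```python
# Minimal sketch of multi-process, data-level parallelism: each worker
# process handles an independent chunk of reads (file name and per-read
# task are illustrative stand-ins).
from multiprocessing import Pool

def gc_content(read):
    # Stand-in for a real per-read computation (e.g. aligning one read).
    seq = read.upper()
    return (seq.count("G") + seq.count("C")) / max(len(seq), 1)

def load_reads(path):
    # Naive FASTA reader: joins the sequence lines of each record.
    reads, current = [], []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                if current:
                    reads.append("".join(current))
                current = []
            else:
                current.append(line.strip())
    if current:
        reads.append("".join(current))
    return reads

if __name__ == "__main__":
    reads = load_reads("reads.fa")      # hypothetical input file
    with Pool() as pool:                # defaults to one process per core
        results = pool.map(gc_content, reads, chunksize=1024)
    print(sum(results) / len(results))
```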

Cluster computing model

A cluster is a group of loosely coupled computing hosts that work closely together through a cluster management and job scheduling system and can be viewed as a single system. Under this distributed architecture, we divide the most popular parallel cluster computing frameworks into three types: Apache Hadoop, the Spark system and traditional distributed frameworks.

Apache Hadoop is one of the most widely used open-source implementations of the MapReduce programming model [[14]], including a data storage layer (HDFS: Hadoop Distributed File System [[15]]) and a resource manager layer (YARN: Yet Another Resource Negotiator [[16]]). MapReduce includes two main phases, map and reduce, separated by an internal shuffle phase that redistributes the intermediate results. The MapReduce framework automatically executes these functions in parallel across any number of processors [[17]]. The Hadoop software library simplifies programming and supports scaling from a single server to thousands of servers, each providing local storage and computing to solve large-data computing problems effectively in parallel. HDFS splits a file into large chunks, distributes them to nodes in the cluster and transfers the packaged code to the nodes so that data is processed in parallel. This approach exploits data locality, making each node operate more efficiently.
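Hadoop exposes the map and reduce phases to any language through Hadoop Streaming. The sketch below counts k-mers: the framework performs the shuffle between the two functions, and the invocation, file names and k-mer length are illustrative assumptions.

```python
# kmer_mr.py -- a Hadoop Streaming style mapper/reducer pair (illustrative;
# e.g. hadoop jar hadoop-streaming.jar -input reads.txt -output counts \
#        -mapper "python3 kmer_mr.py map" -reducer "python3 kmer_mr.py reduce")
import sys

K = 8  # k-mer length (illustrative)

def mapper():
    # Map phase: emit (k-mer, 1) for every k-mer of every input sequence.
    for line in sys.stdin:
        seq = line.strip().upper()
        for i in range(len(seq) - K + 1):
            print(f"{seq[i:i+K]}\t1")

def reducer():
    # Reduce phase: input arrives sorted by key after the shuffle phase,
    # so equal k-mers are adjacent and can be summed in one pass.
    current, total = None, 0
    for line in sys.stdin:
        kmer, count = line.rstrip("\n").split("\t")
        if kmer != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = kmer, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```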

The features of the Hadoop framework allow researchers to create parallel and scalable versions of many existing computing applications. Facing massive amounts of data, Hadoop can still maintain high performance, which suits many research fields in computational biology. It is also compatible with various algorithms including BLAST [[18]], SOAP [[19]], MAQ [[20]] and ZOOM [[21]], especially seed-and-extend style algorithms. Parallel algorithms implemented on the Hadoop model have many advantages, including good scalability, little redundancy, automatic monitoring and restart, and high-performance distributed file access. In further research, combining large-scale computing clouds of thousands of nodes with existing parallel applications can effectively increase the cost-effectiveness of computational biology. Although Hadoop has become a standard data processing technology, it still has shortcomings. Owing to the structure of HDFS, different MapReduce jobs cannot directly share data: intermediate results must be written to disk and read back from HDFS at the beginning of the next iteration, and this intermediate I/O reduces Hadoop's performance. For example, when an iterative algorithm runs in MapReduce form, computing performance drops significantly because intermediate data is updated frequently. Besides, the latency of the MapReduce model is too long to satisfy real-time computational requirements, so it is usually applied to offline batch processing.

Spark [[13]] is an open-source cluster computing system developed by the AMP Lab at the University of California, Berkeley, to speed up data analysis. It supports high-level APIs in Scala, Java, Python and R. Spark uses a distributed memory abstraction called Resilient Distributed Datasets (RDDs) to support computations in memory and on disk in a fault-tolerant manner [[22]]. RDDs support caching in memory across cluster nodes and reuse in multiple MapReduce-like parallel operations. The main advantage of Spark is that intermediate results and working data are kept in memory, so programs can obtain data directly from memory and reduce I/O overhead significantly [[23]]. Although Spark can employ RDDs to achieve faster computing than Hadoop, it places higher demands on machine memory and CPU [[13]] to offset the synchronization overhead and slower memory accesses of the MapReduce framework.
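To make the RDD idea concrete, here is a minimal PySpark sketch of the same k-mer count: the intermediate pairs stay in memory across the flatMap/reduceByKey stages, and cache() keeps the result resident for reuse. The HDFS path and the one-sequence-per-line input format are assumptions.

```python
# Minimal PySpark sketch: k-mer counting on an RDD kept in memory.
from pyspark import SparkContext

K = 8
sc = SparkContext(appName="kmer-count")

# Assumes one sequence per line; the HDFS path is illustrative.
reads = sc.textFile("hdfs:///data/reads.txt")
counts = (reads
          .flatMap(lambda s: [s[i:i+K] for i in range(len(s) - K + 1)])
          .map(lambda kmer: (kmer, 1))
          .reduceByKey(lambda a, b: a + b)   # shuffle, as in MapReduce
          .cache())                          # keep the RDD in memory for reuse
print(counts.take(10))
sc.stop()
```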

As a popular tool for iterative calculations over massive data, Spark has an efficiency related to data quantity: efficiency increases as the number of calculations grows [[24]]. The Spark framework is also suitable for tools that require real-time analysis. When conducting such studies, researchers should note that, due to the particular structure of RDDs, it is challenging for Spark to perform well in applications that must execute asynchronously or be updated continuously in a fine-grained fashion. Researchers also need to consider the potential complexity of cluster computing technology and the latency, cost and data-security issues brought by large-scale data transfer on clouds.

Besides Hadoop and Spark clusters, there are other, traditional distributed parallel approaches such as MPI [[25]] and grid computing. MPI is a portable message-passing standard that spans multiple parallel computing architectures, and it is the most commonly used form in traditional high-performance computing. MPI interfaces provide virtual topology, data synchronization and communication functions within a group of servers. An MPI program works with multiple processes, typically assigning one process to each CPU core, and the processes cooperate to achieve high performance. Each computing task uses its own memory, and tasks can run not only on the same host but also across different hosts, exchanging data by sending and receiving messages. The MPI model supports both point-to-point and collective communication and excels in scalability and portability.
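The message-passing pattern can be sketched with mpi4py: the root process splits the reads, each rank computes on its own chunk in its own memory, and the partial results are gathered back. The toy data and the per-read task are assumptions.

```python
# Minimal MPI sketch (mpi4py); run with e.g. "mpirun -n 4 python3 mpi_gc.py".
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    # Toy reads; a real program would load them from a file.
    reads = ["ACGT" * 25, "GGGCCC" * 10, "ATATAT" * 20, "CGCG" * 30]
    chunks = [reads[i::size] for i in range(size)]   # one chunk per rank
else:
    chunks = None

chunk = comm.scatter(chunks, root=0)        # message: send each rank its chunk
local = [(s.count("G") + s.count("C")) / len(s) for s in chunk]
partials = comm.gather(local, root=0)       # message: collect partial results

if rank == 0:
    flat = [x for part in partials for x in part]
    print(sum(flat) / len(flat))
```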

There are some differences between MapReduce-based computing models (Hadoop, Spark) and traditional distributed computing models (grid computing, MPI). The former have an inner fault-tolerance mechanism to overcome computing task failures, which makes them more programmer-friendly. The latter cannot tolerate faults automatically, but their APIs enable more complicated computations.

Heterogeneous computing model

In the past, the CPU was the only choice for general-purpose computing. Although CPU manufacturers have made great efforts to improve CPU performance and manufacturing processes, the CPU alone cannot satisfy today's growing computing and data processing requirements. To overcome this obstacle, the heterogeneous computing model has been proposed. It adopts various heterogeneous hardware such as GPUs, FPGAs, ASICs and coprocessors together with CPUs to carry out computing tasks. Such heterogeneous hardware usually uses instruction sets different from the CPU's and has a simpler structure, which makes it capable of powerful parallel processing. The heterogeneous hardware most widely used in scientific computing and supercomputing are the graphics processing unit (GPU), the field-programmable gate array (FPGA) and Xeon Phi, an Intel Many Integrated Core (MIC) architecture product (2011-2018). Although these devices differ in hardware structure and instruction sets, they have things in common. First, they are plugged into the motherboard of the computing server and exchange data through peripheral component interconnect express (PCIe). Second, all three share a common programming model: the program code is divided into a host side and a device side. The host-side part runs on CPUs, while the device-side part runs on the heterogeneous hardware. Both data and program are first loaded into the server's RAM, the necessary parts are transferred to the device side, and the results are transferred back to the host side after parallel processing for final output.

The GPU is the most popular heterogeneous hardware, as it is a common component of computer systems. A GPU consists of a large number of computing units and a processing pipeline, making it suitable for simple and uniform data processing procedures [[26]]. A GPU can process image and graphic data far more efficiently than a normal CPU, often matching the computational throughput of a small traditional CPU cluster. Once non-graphic data are mapped into a graphical form, researchers can use GPUs for scanning and analysis to obtain larger speedups [[7]]. General-purpose GPU (GPGPU) computing refers to running routine computing tasks on high-performance GPUs, especially when the amount of data to be processed is much larger than the amount that must be scheduled for transmission. Compute Unified Device Architecture (CUDA) [[27]], developed by Nvidia, is a programming model that enables general processing and parallel computing on a GPU. CUDA provides researchers with APIs to directly access the GPU's parallel computing components and virtual instruction sets to carry out kernel calculations, and it can process non-graphical data effectively in biometrics. With the development of sequencing technology and the rapid growth of genomic data, matrix-based parallel tools have been proposed more and more frequently, and many researchers have turned to many-core architectures, such as CUDA-capable GPUs, for their good performance.
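The host/device pattern can be sketched in Python with Numba's CUDA JIT (an assumption; production tools use C/C++ CUDA): reads are encoded, copied to the device over PCIe, processed one read per thread, and the results are copied back.

```python
# Minimal sketch of the CUDA host/device model via Numba (requires a
# CUDA-capable GPU; encoding and sizes are illustrative).
import numpy as np
from numba import cuda

@cuda.jit
def gc_count_kernel(reads, lengths, out):
    i = cuda.grid(1)                 # one thread per read
    if i < reads.shape[0]:
        gc = 0
        for j in range(lengths[i]):
            base = reads[i, j]       # bases encoded as 0=A, 1=C, 2=G, 3=T
            if base == 1 or base == 2:
                gc += 1
        out[i] = gc

n_reads, read_len = 1024, 100
host_reads = np.random.randint(0, 4, size=(n_reads, read_len)).astype(np.int8)
host_lens = np.full(n_reads, read_len, dtype=np.int32)

d_reads = cuda.to_device(host_reads)   # host -> device transfer (PCIe)
d_lens = cuda.to_device(host_lens)
d_out = cuda.device_array(n_reads, dtype=np.int32)

threads = 128
blocks = (n_reads + threads - 1) // threads
gc_count_kernel[blocks, threads](d_reads, d_lens, d_out)
print(d_out.copy_to_host()[:5])        # device -> host transfer
```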

The GPU is also the most widely used hardware for deep learning because of its efficiency in model training. Deep learning technologies are now used in various fields and have produced many surprising results, and bioinformatics is no exception. Giansanti et al. [[28]] introduce many cases that use deep learning in bioinformatics, including but not limited to proteomic applications, pharmaceutical research and gene expression. These applications usually employ mature frameworks such as TensorFlow [[29]], Theano [[30]] and Torch [[31]] to avoid complicated GPU programming while achieving high utilization.

The FPGA is another kind of heterogeneous hardware; its parallel computing ability comes from its look-up tables (LUTs), RAM and digital signal processing units (DSPs). The FPGA is a low-power device and suits parallel computing situations with (1) large amounts of data, (2) high data transfer efficiency and (3) simple algorithm control. Although the advantages of FPGAs are obvious, programming them is somewhat complicated. In the past, researchers had to use a hardware description language (HDL) for FPGA programming, which is not easy for non-hardware researchers, who must consider many product details and underlying peculiarities of the hardware. To reduce the development difficulty, Altera provided an Open Computing Language (OpenCL) SDK. OpenCL [[32]] is a well-known heterogeneous programming framework that provides a C/C++-like programming environment in which researchers can work without considering the differences between hardware from different vendors. In other words, a program can generally be ported to different FPGA hardware without any changes, which is the main reason FPGA applications have become popular in parallel computing. Although OpenCL makes porting between different hardware easy, it has some drawbacks. First, OpenCL hides implementation details from programmers and works as a middleware, so programs incur extra cost in interacting with the hardware and are somewhat less efficient than programs coded in an HDL. Second, programmers still have to tune their code to hardware features such as the number of LUTs and the size of RAM.
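The OpenCL host/kernel split looks the same whether the device is a GPU or, via a vendor SDK, an FPGA. Below is a minimal pyopencl sketch, with illustrative data, that compares two sequences position by position on whatever OpenCL device is available.

```python
# Minimal OpenCL host/kernel sketch (pyopencl); targeting an FPGA would use
# the same model through a vendor OpenCL SDK.
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags

# Two equal-length sequences encoded as bytes (illustrative data).
a = np.frombuffer(b"ACGTACGTACGTACGT", dtype=np.uint8).copy()
b = np.frombuffer(b"ACGAACGTTCGTACGA", dtype=np.uint8).copy()
matches = np.zeros(a.size, dtype=np.uint8)

a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, matches.nbytes)

prg = cl.Program(ctx, """
__kernel void match(__global const uchar *a,
                    __global const uchar *b,
                    __global uchar *out) {
    int gid = get_global_id(0);   // one work-item per sequence position
    out[gid] = (a[gid] == b[gid]) ? 1 : 0;
}
""").build()

prg.match(queue, (a.size,), None, a_buf, b_buf, out_buf)
cl.enqueue_copy(queue, matches, out_buf)
print(int(matches.sum()), "matching positions")
```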

Xeon Phi [[33]] is a coprocessor product based on the MIC architecture introduced by Intel, which provides powerful computing performance. Compared with GPUs and FPGAs, programming Xeon Phi is much easier. Besides, the Xeon Phi architecture supports ring-based cache coherence protocols [[34]], which makes it suitable for different algorithms. Xeon Phi actually runs a micro operating system on the hardware, and its structure resembles a multi-core CPU, with each core roughly equating to a CPU core; therefore, an original CPU program can run on the MIC platform without changes in the native model. Xeon Phi also supports multiple parallel programming models such as OpenMP and Pthreads in traditional C/C++/Fortran and other programming languages. There are three execution modes on Xeon Phi: (1) Offload mode, which is also the most common mode for GPUs. Here the host side takes charge of the whole workflow: when the parallel part executes, the host transfers data and instructions to the device side to make parallel computing possible, while the serial part runs on the host-side CPU. (2) Peer-to-peer mode, in which the host-side CPUs and Xeon Phi run the same task simultaneously. The Xeon Phi hardware is recognized as a separate host, thanks to its micro-OS. This mode is similar to the MPI multi-host mode: each core, whether from the host-side CPU or the device-side Xeon Phi, is treated equally, and MPI APIs are used to communicate data and synchronize computation. (3) Native mode, in which all tasks execute on the device side. The program is first copied to the Xeon Phi device, much like transferring data from one server to another over the network, and is then executed manually through the Xeon Phi micro-OS.

Overall, Xeon Phi offers powerful computing performance comparable to commonly used GPUs, and it is more programmer-friendly. Although Intel discontinued the product in 2018, many applications have adopted it, and it is still in use at many high-performance computing centers.

Genome sequence alignment

Genome sequence alignment is the most significant task in sequence data analysis and the basic step of many applications such as genome assembly and GWAS analysis. Alignment tasks are often parallelized at the data level, as a sequence file consists of independent reads; for each alignment between two sequences, instruction-level parallelization is used to implement dynamic-programming-based alignment algorithms. There are two main types of genome sequence alignment: aligning a sequence to a reference genome (usually called mapping) and multiple sequence alignment. Usually, genome sequence alignment has two stages, indexing and alignment, and both can be accelerated by parallel computing. There are many ways to construct the index for alignment, such as suffix/prefix trees, enhanced suffix arrays, the FM-index and hash tables. After constructing the index, most alignment tools use the dynamic-programming-based Smith-Waterman or Needleman-Wunsch algorithm to complete the procedure.
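For reference, the alignment stage of most tools is some variant of this dynamic programming recurrence; below is a minimal Smith-Waterman local alignment scorer with assumed match/mismatch/gap scores.

```python
# Minimal Smith-Waterman local alignment score (scores are illustrative:
# match=2, mismatch=-1, gap=-2). H[i][j] is the best local score ending
# at a[:i] and b[:j]; the zero floor restarts the alignment.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACGTTGACC", "ACGTGACC"))
```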

In the task of sequence alignment, each sequence is independent, so data-level parallel computing is intuitive and can be realized in all three parallel computing models mentioned above. Furthermore, many studies develop accelerated algorithms for building indexes and reducing computation.

The Burrows-Wheeler Transform (BWT) backward search algorithm is one of the most important algorithms in genomics data analysis, underlying sequence mapping and sequence alignment, which are the most time-consuming steps overall. Several widely used tools implement the BWT algorithm in multi-threading computing models. For example, BWA [[35]] can align short sequence reads to a large reference effectively while allowing mismatches and gaps. It consists of three algorithms: BWA-backtrack [[35]], BWA-SW [[36]] and BWA-MEM [[37]]. The first is designed for Illumina reads as short as 100 base pairs (bp), while the others focus on longer reads. All of these BWT-derived algorithms can work in multi-threading models and output SAM files for downstream analyses.
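To show the counting logic behind BWT-based mappers, here is a naive sketch: the BWT is built by sorting rotations (real tools build compressed indexes), and backward search narrows a suffix-array interval one pattern character at a time.

```python
# Naive BWT construction plus FM-index backward search; only the counting
# logic is shown, real tools use compressed, sampled index structures.
def bwt(text):
    text += "$"                               # unique terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def backward_search(bwt_str, pattern):
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0                          # C[c]: # of chars smaller than c
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    def occ(ch, k):                           # occurrences of ch in bwt[:k]
        return bwt_str[:k].count(ch)
    lo, hi = 0, len(bwt_str)                  # current suffix-array interval
    for ch in reversed(pattern):              # extend pattern right to left
        lo = C.get(ch, 0) + occ(ch, lo)
        hi = C.get(ch, 0) + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo                            # number of matches in the text

print(backward_search(bwt("ACGTACGTGACG"), "ACG"))   # -> 3
```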

Besides BWA, Ben et al. [[38]] integrate the advantages of full-text minute-space (FM) indexing with the flexibility and efficiency of multi-threading algorithms and propose Bowtie2. Bowtie2 completes alignment quickly and accurately, with higher sensitivity than Bowtie. It takes advantage of the FM-index and accelerates dynamic programming in parallel to implement gapped alignment across an ungapped seed-finding stage and a gapped extension stage. Compared with BWA, Bowtie2 in default parameter mode is faster than any parameter mode of BWA while achieving excellent sensitivity and accuracy.

To improve the efficiency of BWA, Jo et al. [[39]] propose BWA-MT, which evolved from BWA and is designed to make full use of computing resources when running in multi-threading mode. BWA is limited by the fact that multi-threading is not fully supported in its read alignment step, resulting in long processing times. BWA-MT divides the reads into several subsets, processes each subset on a separate thread and parallelizes the refinement steps. In this way, BWA-MT uses computing resources more efficiently while improving data throughput. Working on the same datasets and producing the same results, BWA-MT is 3.66 times faster than the original BWA.

Meng et al. [[40]] propose another high-performance alignment tool, Pblat. Based on the algorithm of blat [[41]], Pblat accelerates blat alignment through multi-threading and cluster computing without changing blat's main algorithm. Pblat shares a single copy of the entire reference genome and stores the indexes in memory shared by all threads. With memory consumption and output kept consistent, Pblat's runtime decreases as the number of threads increases; it is 18.61 times faster than the original blat in experiments on a multi-core node. An MPI version of Pblat (Pblat-Cluster) was also proposed to run on a cluster, improving computing efficiency and reducing blat's runtime from days to minutes.

Multiple sequence alignment (MSA) is a basic step in the structure and function prediction of proteins and RNAs and has drawn much attention in recent years. Siavash et al. [[42]] propose PASTA, one of the most advanced tools for multiple sequence alignment, which produces highly accurate alignments with stability and scalability. The main contribution of PASTA is using the transitivity of homologies to align sequences along the guide tree, which helps accomplish homology detection and distant alignment effectively. It first uses a bootstrap tree for comparison, then estimates the tree with a straightforward HMM-based profile technique, and finally realigns the sequences according to the tree. PASTA implements its parallelism as a multi-process Python program. Experiments show that the multi-process implementation scales effectively across processors and can align a dataset of 200 000 sequences in one day with a small number of processors.

Michael et al. [[43]] develop a parallel read-mapping algorithm, CloudBurst. CloudBurst is a seed-and-extend algorithm that works similarly to RMAP [[44]] and is used to align next-generation sequencing data to the human genome and other reference genomes for bioanalyses including SNP discovery, genotyping and personal genomics. It indexes the non-overlapping k-mers of the reads as seeds, processes the seeds in parallel, and reports all optimal alignments for each read. CloudBurst uses Hadoop to execute in parallel on multiple servers to tackle the time-consuming problem; its runtime grows near-linearly with the number of aligned reads, as the data is divided equally among the computing nodes and each part is processed independently under the MapReduce framework. Compared with the sequentially running tool RMAP, CloudBurst is 30 times faster on a 24-core computing node and 100 times faster on a 96-core one. This striking speedup also benefits from the fact that CloudBurst re-implements RMAP's mapping algorithm and fine-tunes its parameters for optimal performance.
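The seed-and-extend idea that CloudBurst parallelizes can be shown in miniature: hash the reference k-mers, look up the non-overlapping seeds of a read, and verify each candidate locus directly. Parameters and data here are illustrative.

```python
# Miniature seed-and-extend mapper: exact k-mer seeds anchor candidate
# placements, which the "extend" step verifies with a mismatch budget.
from collections import defaultdict

K = 4  # seed length (illustrative)

def build_index(ref):
    index = defaultdict(list)
    for i in range(len(ref) - K + 1):
        index[ref[i:i+K]].append(i)
    return index

def map_read(read, ref, index, max_mismatches=1):
    hits = set()
    for s in range(0, len(read) - K + 1, K):          # non-overlapping seeds
        for pos in index.get(read[s:s+K], []):
            start = pos - s                           # candidate placement
            if start < 0 or start + len(read) > len(ref):
                continue
            mm = sum(r != g for r, g in zip(read, ref[start:start+len(read)]))
            if mm <= max_mismatches:
                hits.add(start)
    return sorted(hits)

ref = "TTACGTGACCATGGACGTTA"
print(map_read("GACCATGG", ref, build_index(ref)))    # -> [6]
```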

As a typical processing step after sequencing, short-read mapping has become the bottleneck of NGS data analysis and attracted extensive attention from researchers. Luca et al. [[45]] describe SEAL, a distributed alignment tool based on BWA that is built from two discrete applications, PairReadsQseq and Seqal. Both use the Hadoop MapReduce framework and can run efficiently on cluster nodes. SEAL can be used for short read-pair mapping and deduplication, and it supports the paired-end alignment of Illumina reads. In experiments on a 16-node Hadoop cluster, SEAL processes 13 GB of data per hour in the combined mapping and duplicate-removal mode and 19 GB per hour in mapping-only mode.

BigBWA [[46]] is also developed on the Hadoop framework to improve BWA's performance. BigBWA leaves BWA's code unchanged, so it remains compatible with various versions of BWA. Furthermore, BigBWA offers high fault tolerance and parallel processing capability for big data, optimizing processing efficiency. The workflow of BigBWA consists of several steps: first converting the FASTQ input file to a Hadoop-compatible format, then copying the input data to the HDFS of a Hadoop cluster, and finally copying the output from HDFS back to the local file system. Tested on an Amazon Web Services (AWS) cluster with five nodes (each with 16 cores) on data from 1000 different genomic experiments, BigBWA is 36.4 times faster than the original BWA in serial execution. As the number of cores increases, BigBWA significantly outperforms SEAL and pBWA [[47]] when running the BWA backtracking algorithm. Besides, compared with BWA-MEM, BigBWA performed better in scalability and performance on all candidate datasets.

José M et al. [[48]] develop SparkBWA, a tool integrating BWA with the Spark framework. The parallel architecture of Spark improves BWA's performance by efficiently processing massive sequencing data in less time. SparkBWA not only has great scalability but also maps DNA reads to large reference genomes. Its workflow includes three main phases: RDD creation, mapping and reduction. The mapping phase uses two separate software layers, so the tool does not need to modify the original BWA source code, ensuring compatibility with any version of BWA. This double-layer design also performs the alignment task with two levels of parallelism: the first level corresponds to the mapping processes across the entire cluster, and the second level corresponds to multi-threaded parallelism within each mapping process, where the BWA implementation runs on shared memory. To verify its performance, SparkBWA was evaluated against five other BWA-related parallel alignment tools in different scenarios. When aligning short reads with the BWA 0.5 algorithms [[49]], SparkBWA is 1.9 times faster than SEAL [[45]] and 1.4 times faster than pBWA [[47]]; when aligning longer reads with the BWA-MEM algorithm, SparkBWA is on average 1.4 times faster than BigBWA [[46]]. SparkBWA also offers fast computing to users who are not familiar with big data or high-performance computing.

Abuin et al. [[50]] apply Spark technology to PASTA, creating PASTASpark, which improves the performance of the alignment phase. PASTASpark takes advantage of the Spark cluster framework and uses RDDs to process the most time-consuming part of the pipeline as a distributed program, significantly reducing I/O costs. Through a Spark cluster, the execution time of PASTA is reduced, so multiple sequence alignment results can be obtained from a huge dataset in a short time. Tests show that PASTASpark is about 10 times faster than single-threaded PASTA under the same conditions and can process massive datasets of 200 000 sequences within 24 h.

To speed up the alignment of sequencing reads against reference DNA sequences, Klus et al. [[51]] develop BarraCUDA. The main idea of BarraCUDA is to run the most time-consuming computational steps of BWA in parallel on the GPU. To ensure accurate results, BarraCUDA inherits the relevant BWA algorithms, supporting the original read-mismatch handling and full gapped alignment. In comparison experiments, the alignment throughput on the GPU is 6 times higher than on the CPU, and BarraCUDA is 2.8 times faster than BWA in multi-threading mode. BarraCUDA therefore provides a favorable choice for large biometric systems in urgent need of higher efficiency and lower cost.

Another tool, CUSHAW, proposed by Liu et al. [[52]], uses the BWT algorithm and CUDA-compatible graphics processors to improve the efficiency of aligning short reads against large genomes. CUSHAW uses two quality-aware bounded search methods (based on the Burrows-Wheeler transform and the FM-index) to reduce the search space while preserving alignment quality, and it uses GPUs to accelerate the alignment procedure. Compared with three BWT-based tools, Bowtie, BWA and SOAP2 [[53]], CUSHAW not only shows good speed but also obtains better-quality alignments. Its limitation is that it only supports ungapped alignment.

CMSA [[54]] is a multiple similar-sequence alignment tool that employs the GPU to accelerate computation. It improves the center-star strategy to reduce the time complexity of center sequence selection and distributes pairwise sequence alignment tasks to the CPU and GPU simultaneously. The workload split between CPU and GPU is determined by the hardware's computing capability in a preparation stage: to balance the load, CMSA first runs a pre-computation to decide the number of sequences for each computing unit. In the best case, CMSA is 24 times faster than the state-of-the-art tool HAlign [[55]].

BGSA [[56]] implements a bit-parallel pairwise alignment algorithm on Xeon Phi and achieves better performance in calculating pairwise edit distances than other tools, even on the same Xeon Phi device. Researchers compared BGSA with the GPU tool NVBio [[57]] in the scoring and seed verification stage of alignment: BGSA on a Xeon Phi 7210 is 4.4 times faster than NVBio on a Titan X GPU, even though the single-precision performance (sufficient precision for sequence data) of the Xeon Phi 7210 is 5324 GFLOPS versus 6600 GFLOPS for the Titan X GPU.

SeqAn [[58]] carries out the standard dynamic programming algorithm for sequence alignment and proposes a generalized vectorization layout on Xeon Phi hardware. SeqAn uses SIMD instruction sets (SSE4, AVX2, AVX512, etc.) to achieve high utilization of the computing resources. To accelerate the dynamic programming procedure, SeqAn adopts a dynamic wavefront model proposed by Blumofe et al. for thread-level parallelization, and the researchers also designed a vector-level scheme for the matrix computation. The acceleration is impressive on both Illumina and PacBio reads: for a single alignment task on Xeon Phi, the parallel mode is 1600 times faster than the sequential mode. This result is surprising but reasonable, since a Xeon Phi core runs at only 1.3GHz and the sequential mode cannot exploit Xeon Phi at all, while the parallel mode is fully optimized at both the thread level and the vector level. When SeqAn in parallel mode on Xeon Phi is compared with SeqAn on a quad-core Intel CPU @3.6GHz node, the former is 2.18 times faster [[56]].

As mentioned above, parallel computing is a common strategy in genome sequence alignment, and tools can be found in all of the computing models. These tools are highly efficient, and their basic ideas are similar; the scale of problem they can solve mainly depends on the hardware configuration. In Table 1, each tool or algorithm is listed with its function description, speedup, input, output and the experiment that consumed the most computational resources. From the list, we can see that BWA is the algorithm most often used as a baseline in contrast experiments. Some tools, such as PASTA, are compared against themselves under different computing resources to demonstrate scalability. The experiment data and the input/output file information can help users choose a suitable tool according to their data size and type. For example, with a cluster available, SparkBWA can align big genome data, as its excellent speedup suits large sequence datasets; for ungapped alignment on a host with GPUs, CUSHAW is worth considering.

Table 1 Studied applications for genome sequence alignment

Bowtie2 [37] (2012) | Multi-threading
Function: align reads to reference. Feature: combine the advantages of the FM-index and SIMD dynamic programming to run faster.
Input: read files (.fa/fq), Ref. file (.fa/fq). Output: alignment results (.sam).
Speedup: ~2.5 (average, runtime). Base: BWA in default mode.
Hardware: 1 node with 4 cores (Intel Xeon X5550 Nehalem 2.66GHz CPU), 68.4GB RAM. Max. Data: align 1 M NGS reads to human Ref.

BWA-MT [38] (2015) | Multi-threading
Function: align reads to reference. Feature: parallelize the alignment generation and refinement steps of BWA to run faster.
Input: read files (.fa/fq), Ref. file (.fa/fq). Output: alignment results (.sam).
Speedup: 3.66 (maximum, runtime). Base: BWA in default mode.
Hardware: 1 node with 24 cores, 128 GB RAM. Max. Data: align 40 M NGS reads to human Ref.

Pblat [39] (2019) | Multi-threading
Function: query reads from datasets. Feature: parallelize the genome or transcriptome mapping steps of blat to run faster.
Input: query file, database file. Output: alignment results (.psl).
Speedup: 18.61 (maximum, runtime). Base: blat.
Hardware: 1 node with 32 cores (Intel Xeon E7-4830 @2.13GHz CPU), 128GB RAM. Max. Data: query reads in ~105 K reads (mean Len. 1868 bp).

PASTA [41] (2015) | Multi-process
Function: multiple sequence alignment. Feature: employ data-level parallelism to estimate multiple sequence alignments for each sequence subset.
Input: read file (.fa). Output: alignment results (.aln), tree results (.tre).
Speedup: ~7 (maximum, runtime). Base: PASTA in sequential mode.
Hardware: 1 node with 48 cores, 256GB RAM. Max. Data: 2 M reads (mean Len. 1556 bp).

CloudBurst [42] (2009) | Hadoop
Function: map NGS data to reference. Feature: use the Landau-Vishkin k-difference alignment algorithm and carry out data-level parallelism on the Hadoop platform to process large datasets.
Input: read files (.fa/fq), Ref. file (.fa/fq). Output: alignment results (.bed).
Speedup: 30 (maximum, runtime). Base: RMAP in sequential mode.
Hardware: 12 nodes, each with 2 cores (@3.2GHz). Max. Data: map 7.06 M 36 bp reads to human Ref.

SEAL [44] (2011) | Hadoop
Function: map NGS data to reference, remove duplicate reads. Feature: integrate BWA into a Hadoop platform through Pydoop to employ data-level parallelism.
Input: paired read files (.prq), Ref. file (.fa), BWA index file. Output: alignment results (.sam).
Speedup: 18.8 (maximum, runtime). Base: BWA in 8 threads.
Hardware: 32 nodes, each with 8 cores (Intel Xeon CPU @2.83GHz), 16GB RAM. Max. Data: align 4.06G NGS reads to human Ref.

BigBWA [45] (2015) | Hadoop
Function: align reads to reference. Feature: integrate BWA into a Hadoop platform through the Java native interface to employ data-level parallelism.
Input: read files (.fq), Ref. file (.fq), BWA index file. Output: alignment results (.sam).
Speedup: 36.4 (maximum, runtime). Base: BWA in sequential mode.
Hardware: 4 nodes, each with 16 cores (Intel Xeon CPUs @2.5GHz). Max. Data: align 98.8 M NGS reads to human Ref.

SparkBWA [47] (2016) | Spark
Function: align reads to reference. Feature: integrate BWA into a Spark platform through the Java native interface to employ data-level parallelism.
Input: read files (.fq), Ref. file (.fq), BWA index file. Output: alignment results (.sam).
Speedup: 1.4 (average, runtime). Base: BigBWA.
Hardware: 6 nodes, each with 64 cores (AMD Opteron CPUs @1.6GHz), 256 GB RAM. Max. Data: align 98.8 M NGS reads to human Ref.

PASTASpark [49] (2017) | Spark
Function: multiple sequence alignment. Feature: integrate PASTA into a Spark platform through PySpark to employ data-level parallelism.
Input: read file (.fa). Output: alignment results (.aln), tree results (.tre).
Speedup: ~10 (maximum, runtime). Base: PASTA in sequential mode.
Hardware: 8 nodes, each with 8 cores (Intel Xeon CPU @2.4GHz), 54.5GB RAM. Max. Data: 200 K reads (mean Len. 1556 bp).

BarraCUDA [50] (2012) | GPU
Function: align NGS reads to reference. Feature: implement the BWT algorithm through CUDA to execute independent mapping tasks in parallel on multiple GPUs.
Input: read files (.fa), Ref. file (.fa). Output: alignment results (.sam).
Speedup: 2.8 (maximum, runtime). Base: BWA in 12 threads.
Hardware: 1 node with 12 cores (Intel Xeon 5679 CPU @2.93GHz), 8GB RAM, 8 NVIDIA Tesla M2050 GPUs. Max. Data: align 13.6 M 96 bp reads to D. melanogaster Ref.

CUSHAW [51] (2012) | GPU
Function: align reads to reference. Feature: implement the BWT algorithm through CUDA like BarraCUDA, but store reference data in global memory instead of texture memory; supports ungapped alignment only.
Input: read files (.fa), Ref. file (.fa). Output: alignment results (.sam).
Speedup: 5.5 (average, runtime). Base: BWA in 4 threads.
Hardware: 1 node with 4 cores (AMD Opteron 2378 CPU @2.4GHz), 8GB RAM, 2 NVIDIA Tesla C2050 GPUs. Max. Data: align 53.5 M NGS reads to human Ref.

CMSA [53] (2017) | GPU
Function: multiple sequence alignment. Feature: run pairwise sequence alignment tasks simultaneously on CPU and GPU instead of allocating all tasks to the GPU.
Input: read file (.fa). Output: alignment results in a self-designed format.
Speedup: 24 (maximum, runtime). Base: HAlign in 6 threads.
Hardware: 1 node with 6 cores (Intel Xeon E5-2620 CPU @2.4GHz), 32GB RAM, 1 NVIDIA Tesla K40 GPU. Max. Data: 1.0 M reads (mean Len. 1388 bp).

BGSA [54] (2019) | Xeon Phi
Function: query reads from datasets. Feature: employ bit-parallel technology to optimize pairwise alignment on many-core hardware.
Input: query file (.fa), database file. Output: job execution log.
Speedup: 4.4 (maximum, GCUPS). Base: NVBio on Titan X GPU.
Hardware: 1 node with 4 cores (Intel Xeon W-2123 CPU @3.6GHz), 32GB RAM, 1 Xeon Phi 7210 CPU (64 cores @1.3GHz). Max. Data: query 5 M short reads in 1 K reads.

SeqAn [55] (2015) | Xeon Phi
Function: query reads from datasets. Feature: implement the dynamic programming algorithm in parallel at both thread level and vector level and run it on Xeon Phi.
Input: query file (.fa), database file. Output: job execution log.
Speedup: 2.18 (maximum, GCUPS). Base: SeqAn on the host CPU (4 cores @3.6GHz).
Hardware: 1 node with 4 cores (Intel Xeon W-2123 CPU @3.6GHz), 32GB RAM, 1 Xeon Phi 7210 CPU (64 cores @1.3GHz). Max. Data: query 5 M short reads in 1 K reads.

The 'Function' and 'Feature' items briefly indicate what each tool/algorithm does and why it runs faster in the contrast experiments. 'Input' and 'Output' are the essential parameters listed in the tool manuals. The 'Speedup' value comes from the cited literature, or is deduced from the results of the contrast experiments, which were carried out on the listed hardware and compared against the 'Base' run. 'Max. Data' is the size of the largest dataset tested in the literature, where G means billion, M means million and K means thousand.

SNP calling

SNP calling through genome sequence analysis identifies the sites at which a sequence differs from the genome reference. The main methods for SNP calling are heuristic and probabilistic [[59]]. Heuristic methods use quality information to identify polymorphic variants, while probabilistic methods use statistics to measure the uncertainty. Probabilistic methods have a higher computational cost because they need to perform likelihood or HMM parameter calculations. SNP calling applications often process each sequence in parallel at the data level and use instruction-level parallelization to find the proper parameters of the calling algorithm.
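As a concrete (and deliberately simplified) view of the probabilistic idea, the sketch below computes genotype likelihoods at one site from a pileup of bases under an independent per-base error model with rate e; real callers such as GATK use far richer models and priors.

```python
# Toy diploid genotype likelihoods at a single site, assuming independent
# per-base errors with probability e (a drastic simplification).
import math

def genotype_log_likelihoods(pileup, e=0.01):
    genotypes = [(a, b) for a in "ACGT" for b in "ACGT" if a <= b]
    logL = {}
    for g in genotypes:
        ll = 0.0
        for base in pileup:
            # P(base | allele) is 1-e on a match, e/3 otherwise; a diploid
            # genotype averages over its two alleles.
            p = sum((1 - e) if base == allele else e / 3 for allele in g) / 2
            ll += math.log(p)
        logL[g] = ll
    return logL

pileup = list("AAAGAAAA")           # eight reads covering the site
logL = genotype_log_likelihoods(pileup)
best = max(logL, key=logL.get)      # maximum-likelihood genotype
print(best, round(logL[best], 3))
```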

Li presents FermiKit [[60]], an SNP calling tool based on de novo assembly. In this tool, the unitig graph is used to call SNPs: FermiKit first represents the reads as unitigs and then uses an FM-index to summarize the sequences. In the time-consuming unitig construction step, the author uses labeled sequences and an all-in-RAM policy to ease the implementation of a multi-threading model.

BS-SNPer [[61]] is an SNP calling package for bisulfite-seq data. It proposes a novel dynamic matrix algorithm and approximate Bayesian modeling to improve memory and computation efficiency. The algorithms can be parallelized because the process can be divided into independent parts. BS-SNPer is 100 times faster than the popular Bis-SNP tool in all tested cases, with higher sensitivity and lower memory usage.

GSAlign [[62]] is an efficient sequence alignment tool for intra-species genomes. It aims at identifying variants between genomes, including SNPs, insertions and deletions. The procedure of GSAlign is divided into three steps: local maximal exact match identification, similar region identification and alignment processing. The main ideas of the tool derive from BWT arrays, PosDiff-based clustering and dynamic programming. By dividing the query sequence into blocks, GSAlign uses multi-threading to accelerate local maximal exact match identification and alignment. Experiments demonstrate that GSAlign achieves better runtime, memory usage and accuracy than Minimap2 [[63]], MUMmer4 [[64]] and LAST [[65]].

PVCTools [[66]] is a multi-process variant calling toolset. To accelerate the procedure on multiple CPUs or a computing cluster, it splits the reference genome and alignment BAM files into smaller pieces, sends each variant detection job to a different CPU and finally collects the variant results in the main process. Experiments on Arabidopsis, rice and human genome datasets show that PVCTools achieves accuracy similar to that of popular tools like samtools, the Genome Analysis Toolkit (GATK) [[67]] and freebayes, but with much lower time consumption.

GATK was developed by Aaron et al. [[67]] in 2010 as a best-practice framework providing pre-processing, variant identification and call-set refinement for whole-genome and whole-exome sequences. GATK uses the MapReduce framework to simplify the analysis of NGS data. Furthermore, it implements the TreeReducible interface, which enables GATK to merge two reduced results. GATK separates the basic data management structure from the analysis and calculation framework, optimizing its stability and efficiency. Researchers have also developed other NGS tools on this framework; for example, GATK-based sequencing tools have served large projects including the 1000 Genomes Project and The Cancer Genome Atlas.

The Halvade framework, proposed by Decap et al. [[68]] in 2015, executes sequencing pipelines in parallel and efficiently in the cluster computing mode. Based on the MapReduce framework, Halvade takes 3 h to process a 50x coverage human whole-genome dataset, with a parallel efficiency of up to 92.1%, on a Cloudera cluster of 15 worker nodes, each with 62GB RAM and two Intel Xeon E5-2695 v2 @2.4GHz CPUs (24 cores in total). Besides, on the same cluster, Halvade processes the NA12878 dataset in under 3 h, which is very efficient. Even on single-core machines, Halvade is faster than a single tool running multiple threads in parallel. Halvade is also more suitable for clinical environments than MegaSeq [[69]], a cluster tool of the same type: MegaSeq relies on a specific large-scale computing platform to perform high-throughput genome computation, whereas Halvade uses local computer clusters to minimize the time needed for DNA analysis of an individual patient.

Halvade-RNA, proposed by Decap et al. [[70]] in 2017, is a multi-node RNA-seq variant calling pipeline based on the GATK best-practice recommendations. Halvade-RNA relies on the MapReduce framework for the creation, management and parallelization of data streams, and it supports running multiple existing tools, such as STAR [[71]] and GATK [[67]], simultaneously. It is built in Java and applies to all Hadoop distributions, including Cloudera and Amazon EMR. While single-threaded processing of a typical RNA-seq sample takes ~28 h, Halvade-RNA on a small cluster with two 20-core machines reduces the running time to about 2 h. Halvade-RNA also performs well in single-core and multi-core environments, significantly shortening the running time and improving the efficiency of RNA-seq data processing.

SparkGA [[72]] is a DNA analysis pipeline similar to GATK that includes a variant-detection function. The Spark framework is used in this pipeline to reduce the running time; however, SparkGA differs from GATK in its read sorting step. SparkGA employs both static and dynamic load balancing policies to distribute the data more evenly, which improves scalability and memory efficiency. It is about 1.71 times faster than the state-of-the-art tool Halvade [[68]].
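
As a rough illustration of the Spark style of work distribution (not SparkGA's own code), the PySpark sketch below parallelizes a read collection into fixed partitions and applies a hypothetical per-partition alignment step:

```python
# PySpark sketch: statically partition reads, then process each partition.
from pyspark import SparkContext

def align_partition(reads):
    # Placeholder: a real pipeline would run an aligner over this partition.
    return [(read, "aligned") for read in reads]

sc = SparkContext("local[*]", "static-partitioning-sketch")
reads = ["ACGT" * 10 for _ in range(100_000)]
chunks = sc.parallelize(reads, numSlices=64)      # static load split
aligned = chunks.mapPartitions(align_partition)   # per-partition work
print(aligned.count())
sc.stop()
```

A dynamic policy like SparkGA's would additionally rebalance partitions at runtime based on the observed load, rather than fixing numSlices up front.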

ADS-HCSpark [[73]] is a MapReduce-based variant calling tool. It implements the HaplotypeCaller algorithm of GATK on the Spark framework, which makes it more efficient on large-scale genome data and scalable across cluster computing resources. ADS-HCSpark uses an adaptive data segmentation algorithm to counter the computation skew of the original HaplotypeCaller algorithm, and, to preserve accuracy, it uses the Hadoop-BAM library to divide a BAM file into overlapped blocks. The experimental results show that its scalability is satisfactory even in a Gigabit Ethernet environment: ADS-HCSpark is 1.58 times faster than GATK and 1.28 times faster than SparkGA.

There are mainly two kinds of GPU-based SNP calling tools. The first kind is based on machine learning algorithms, such as Clairvoyante [[74]] and DeepVariant [[75]], which use GPUs only for model training and are therefore not discussed in this paper. The other kind is originally implemented on CPUs and can be accelerated on GPUs. For example, FamSeq [[76]] is a GPU-based variant calling program proposed in 2014. FamSeq builds on three important algorithms: the Bayesian network algorithm, the Elston-Stewart algorithm and the Markov chain Monte Carlo algorithm. The Bayesian network algorithm has been implemented in both a CPU version and a GPU version; the GPU version accelerates the whole process and is 10 times faster than the CPU version.

mSNP [[77]] is a massively parallel algorithm for large-scale SNP detection running on Xeon Phi, and is an improvement of SOAPsnp [[78]]. mSNP modifies the data structure of SOAPsnp to reduce memory usage and proposes a read-based window division strategy to enlarge the throughput. The Xeon Phi is mainly used to accelerate the Bayesian model computation. Specifically, a 38-fold speedup is achieved when comparing mSNP on Xeon Phi with SOAPsnp on a single-threaded CPU, while the precision of the results is preserved.
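
The read-based window division can be pictured as grouping reads by the fixed-size window containing their start position, so that each window can be scored independently; the window size and grouping rule below are illustrative only (mSNP's actual strategy handles window boundaries more carefully):

```python
# Sketch of read-based window division for independent per-window processing.
from collections import defaultdict

WINDOW = 1_000_000   # illustrative 1 Mb window size

def divide_into_windows(reads):
    """reads: iterable of (chrom, start_pos, sequence) tuples."""
    windows = defaultdict(list)
    for chrom, pos, seq in reads:
        windows[(chrom, pos // WINDOW)].append((pos, seq))
    return windows

reads = [("chr1", 10, "ACGT"), ("chr1", 1_000_050, "TTGA"), ("chr2", 5, "GGCC")]
for window_key, group in divide_into_windows(reads).items():
    print(window_key, "->", len(group), "read(s)")  # each group is independent
```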

Table 2 shows that NA12878 is the most widely used experimental dataset in SNP calling. SparkGA is the best choice when more than one host is available for computing; in single-host mode, GSAlign can be used for read files and PVCTools for BAM files. The 'Speedup' item in Table 2 shows that GATK is the most classical tool for variant detection, while SparkGA is the fastest and can be applied to big genome data.

Table 2 Studied applications for genome sequence SNP calling

FermiKit [[60]] (2015). Computing model: Multi-threading.
Function: assembly-based variant caller. Feature: assembles reads into unitigs and calls variants from a reduced representation of the raw data; runs fast. Input: read files (.fa/.fq). Output: variant results (.vcf).
Contrast experiment: Speedup: -. Base: -. Hardware: 1 node with 16 cores, 85 GB RAM. Max. Data: NA12878.

BS-SNPer [[61]] (2015). Computing model: Multi-threading.
Function: detect variation from alignment. Feature: implements a dynamic matrix algorithm and approximate Bayesian modeling to improve performance. Input: Ref. file (.fa), alignment file (.bam/.sam). Output: variant results (.vcf), methylation results (.txt).
Contrast experiment: Speedup: 100 (average, runtime). Base: Bis-SNP. Hardware: 1 node with 6 cores (Intel Xeon E5650 CPU @2.66GHz), 32 GB RAM. Max. Data: NA12878.

GSAlign [[62]] (2019). Computing model: Multi-threading.
Function: align sequences and identify variants. Feature: splits sequence data and parallelizes local match identification and alignment. Input: read files (.fa). Output: alignment results (.aln), variant results (.vcf), dotplot file (.ps).
Contrast experiment: Speedup: 2.25 (maximum, runtime). Base: Minimap2. Hardware: 1 node with at least 8 cores and 14 GB RAM. Max. Data: NA12878.

PVCTools [[66]] (2019). Computing model: Multi-process.
Function: detect variation from alignment. Feature: splits the reference genome into small pieces and calls variation in parallel. Input: Ref. file (.fa), alignment file (.bam). Output: variant results (.vcf).
Contrast experiment: Speedup: 5 (maximum, runtime). Base: GATK. Hardware: 20 nodes, each with 12 cores (Intel Gold 6126), 512 GB RAM. Max. Data: 3.2 GB genome file.

GATK [[67]] (2010). Computing model: Hadoop.
Function: genome analysis toolkit including variant discovery. Feature: separates the basic data management structure from the analysis and calculation framework. Input: Ref. file (.fa), alignment file (.bam). Output: variant results (.vcf), alignment file (.bam).
Contrast experiment: Speedup: ~11 (maximum, runtime). Base: GATK in sequential mode. Hardware: 1 node with at least 12 cores. Max. Data: NA12878.

Halvade [[68]] (2015). Computing model: Hadoop.
Function: align sequences and identify variants. Feature: executes sequencing pipelines in parallel according to the GATK best practices. Input: read files (.fq), alignment file (.bam). Output: variant results (.vcf), alignment file (.bam).
Contrast experiment: Speedup: ~45 (maximum, runtime). Base: Halvade on a single node. Hardware: 15 nodes, each with 24 cores (Intel Xeon E5-2695 CPU @2.4GHz), 62 GB RAM. Max. Data: NA12878.

Halvade-RNA [[70]] (2017). Computing model: Hadoop.
Function: call variants from transcriptomic data. Feature: parallelizes the RNA-seq variant calling pipeline on multiple nodes. Input: read files (.fq), alignment file (.bam). Output: variant results (.vcf), alignment file (.bam).
Contrast experiment: Speedup: ~14 (maximum, runtime). Base: Halvade-RNA on a single core. Hardware: 2 nodes, each with 20 cores (Intel Xeon E5-2660 CPU @2.60GHz), 256 GB RAM. Max. Data: 175 M paired reads.

SparkGA [[72]] (2017). Computing model: Spark.
Function: an optimization of the GATK pipeline. Feature: employs both static and dynamic load balancing policies to distribute the data more evenly. Input: read files (.fq). Output: variant results (.vcf).
Contrast experiment: Speedup: 1.71 (maximum, runtime). Base: Halvade. Hardware: 20 nodes. Max. Data: 402 GB genome file.

ADS-HCSpark [[73]] (2019). Computing model: Spark.
Function: variation detection from alignment. Feature: parallelizes the GATK HaplotypeCaller algorithm and accelerates variant calling. Input: alignment file (.bam). Output: variant results (.vcf).
Contrast experiment: Speedup: 1.58 (maximum, runtime). Base: GATK. Hardware: 2 nodes with 16 cores (2.60GHz), 128 GB RAM. Max. Data: NA12878.

FamSeq [[76]] (2014). Computing model: GPU.
Function: detect variation from pedigree information. Feature: employs the GPU to run the Bayesian network algorithm. Input: pedigree file (.ped), candidate variant position file (.vcf). Output: variant results (.vcf).
Contrast experiment: Speedup: 10 (maximum, runtime). Base: FamSeq CPU version in sequential mode. Hardware: 1 node with 1 core (Intel Xeon CPU @3.07GHz), NVIDIA Tesla M2090 GPU. Max. Data: VCF file with 1 M candidate variant positions.

mSNP [[77]] (2018). Computing model: Xeon Phi.
Function: accelerate SOAPsnp by MIC. Feature: optimizes the data structure of SOAPsnp to reduce memory usage and enlarge throughput. Input: sorted SOAP result. Output: consensus file.
Contrast experiment: Speedup: 38 (maximum, runtime). Base: SOAPsnp in a single thread. Hardware: 1 node with 24 cores (Intel Xeon E5-2692 CPU @2.2GHz), 3 Xeon Phi 31S1P (each with 57 cores @1.1GHz). Max. Data: 1540 M reads with 1114 M sites.

The items 'Function' and 'Feature' under 'Description' briefly indicate the function of the tool/algorithm and explain why it can run faster in the contrast experiments. The 'Input' and 'Output' files are the essential parameters listed in the tool manual. The value of 'Speedup' in the contrast experiment comes from the cited literature or is deduced from the reported results, which were obtained on the listed hardware and compared against the 'Base' run. 'Max. Data' is the size of the largest dataset tested in the literature, in which G means billion, M means million and K means thousand. The mark '-' denotes missing information.

Pre-processing of genome sequence

Among the various genome sequence pre-processing procedures, we introduce the commonly used ones: quality control and sequence filtering. The former is usually used to select reads from the original dataset according to their quality scores, and the latter decreases the read count to reduce the computational cost. Genome sequence pre-processing tasks are often parallelized at the data level, as there is almost no need to exchange data among the tasks.

Fastp is a multi-threading tool proposed by Chen et al. [[79]] in 2018. It is a fast FASTQ pre-processor that performs quality control, adapter trimming and quality filtering. By packing the FASTQ file into packs of a suitable size and processing each pack in a single thread, Fastp achieves a 2-5 times speedup compared with other FASTQ pre-processing tools.
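
The pack-and-process idea can be sketched as follows; the pack size, quality threshold and file name are arbitrary assumptions, and the filter is a stand-in for fastp's much richer processing:

```python
# Sketch: read FASTQ in fixed-size packs and filter the packs in parallel.
from concurrent.futures import ProcessPoolExecutor
from itertools import islice

def read_fastq_packs(path, pack_size=10_000):
    """Yield packs of (header, seq, plus, qual) records from a FASTQ file."""
    with open(path) as fh:
        while True:
            lines = list(islice(fh, pack_size * 4))
            if not lines:
                return
            yield [tuple(lines[i:i + 4]) for i in range(0, len(lines), 4)]

def filter_pack(records, min_q=20):
    """Keep records whose mean Phred quality reaches the threshold."""
    kept = []
    for header, seq, plus, qual in records:
        q = qual.strip()
        if q and sum(ord(c) - 33 for c in q) / len(q) >= min_q:
            kept.append((header, seq, plus, qual))
    return kept

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:        # one pack per worker process
        results = pool.map(filter_pack, read_fastq_packs("reads.fq"))
    print(sum(len(pack) for pack in results), "reads passed filtering")
```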

Recently, Cantu et al. [[80]] proposed PRINSEQ++ for quality control and sequence preprocessing. It is a high-performance implementation based on the popular tool prinseq-lite [[81]] that supports filtering, trimming and reformatting reads to improve downstream analysis. PRINSEQ++ uses POSIX thread (pthread) technology to accelerate computation: it performs quality control and preprocessing of each sequence pair on a separate thread independently, reducing both running time and memory footprint and achieving a 16.47 times speedup over prinseq-lite.

FastProNGS [[82]] is a quality control tool for NGS data that runs in a multi-threading model and uses an adaptive strategy to remove unnecessary data. Its adapter removal procedure uses auxiliary matrices in a dynamic programming algorithm to calculate error rates, which determine the removal and trimming parameters. It works on both single-end and paired-end sequences. Compared with other multi-threading tools, such as FastQC and Fastp, FastProNGS needs more memory but less I/O. With identical results, FastProNGS on an eight-CPU machine is much faster than the other tools.

To effectively control and filter sequence data, Patel et al. [[83]] developed the NGS QC Toolkit. The toolkit supports multi-process parallelism to improve data processing efficiency: in the multi-process model, each process is allocated a file and processes it concurrently, reducing the overall processing time. The NGS QC Toolkit is useful for the quality control of NGS data and helps researchers perform better downstream analysis.

Researchers employ FPGAs in bioinformatics mainly for sequence alignment. GateKeeper [[84]] and Shouji [[85]] are two pre-alignment tools implemented on FPGAs, which are used to accelerate sequence filtering before mapping or alignment. Both mapping and alignment are computationally demanding because HTS generates an excessive number of sequences, so an efficient approach that filters candidate sequences by pre-alignment can reduce the computation cost massively.

GateKeeper aims to accelerate pre-alignment in DNA short read mapping by quickly filtering out most incorrect candidate locations when mapping short reads to the reference genome. GateKeeper combines three methods to achieve fast alignment matching and base-substitution handling: fast approximate string matching, insertion and deletion detection, and minimization of hardware resource overheads; these are inspired by SHD [[86]] but modified to be FPGA-friendly. GateKeeper first encodes reads on the host side and then transfers the data to the FPGA. The FPGA computes the Hamming distance between every read and the reference in parallel, and the results are transferred back to the host side. The test results show that GateKeeper on a single FPGA is, on average, 90 times faster than the Adjacency Filter on a single CPU and 130 times faster than SHD on a single CPU, while its accuracy matches that of the two CPU-based tools [[87]].
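
In software terms, the core filtering computation can be mimicked like this: 2-bit-encode both sequences, XOR them and count the differing base slots (the real GateKeeper performs this in FPGA logic rather than Python, and additionally handles shifts for indels):

```python
# Software mimic of Hamming-distance pre-alignment filtering.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def encode(seq):
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value

def hamming_bases(a, b):
    """Count positions where two equal-length sequences differ."""
    diff = encode(a) ^ encode(b)
    count = 0
    while diff:
        if diff & 0b11:        # any set bit in a 2-bit slot means a mismatch
            count += 1
        diff >>= 2
    return count

def passes_filter(read, candidate, max_errors):
    return hamming_bases(read, candidate) <= max_errors

print(passes_filter("ACGTACGT", "ACGAACGT", max_errors=2))  # True: 1 mismatch
```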

Shouji [[85]] is also a fast and efficient pre-alignment filter for sequence alignment tools that adopts an FPGA to filter sequences before the dynamic programming procedure. Shouji employs the pigeonhole principle to decide whether two sequences are potentially similar. Moreover, in the step of finding common subsequences, which is usually implemented with brute-force algorithms, it divides the neighborhood map into so-called search windows, so that the whole procedure can be carried out on a scalable architecture. This design also reduces the FPGA look-up table size, which simplifies the overall design and makes the FPGA implementation easy. The researchers compared Shouji with other FPGA-based tools, MAGNET [[88]] and the GateKeeper filter, and with SHD, a CPU filter implemented with Intel SSE instructions. The results show that Shouji is extremely fast and accurate: using Shouji as a pre-alignment step before five state-of-the-art sequence aligners reduces the whole execution time to about 1/19.
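
A toy version of the pigeonhole test (greatly simplified relative to Shouji's search-window algorithm) looks like this: if two sequences are within E edits of each other, splitting the read into E + 1 pieces guarantees at least one piece occurs exactly in the other sequence, so a pair with no exactly matching piece can be discarded before dynamic programming:

```python
# Toy pigeonhole pre-alignment filter (illustrative, not Shouji's algorithm).
def pigeonhole_pass(read, ref_segment, max_edits):
    """Return False only when the pair can be safely filtered out."""
    k = max_edits + 1
    step = max(1, len(read) // k)
    pieces = [read[i:i + step] for i in range(0, len(read), step)][:k]
    return any(piece in ref_segment for piece in pieces)

print(pigeonhole_pass("ACGTACGTACGT", "ACGTAAGTACGT", max_edits=2))  # True
print(pigeonhole_pass("AAAAAAAAAAAA", "CCCCCCCCCCCC", max_edits=2))  # False
```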

The tools in Table 3 fall into two main types: quality control tools and pre-alignment tools. As the computation cost of quality control is not as high as that of sequence alignment, these tools run on a single host in the multi-threading or multi-process model. The pre-alignment tools listed in Table 3 take advantage of FPGAs to accelerate computation; with the same dataset as input, Shouji is faster than GateKeeper.

Table 3 Studied applications for genome sequence preprocessing

Fastp [[79]] (2018). Computing model: Multi-threading.
Function: control quality and preprocess FASTQ files. Feature: samples reads evenly to evaluate overrepresented sequences and eliminate partial distribution bias. Input: read files (.fq). Output: read files (.fq).
Contrast experiment: Speedup: 5 (maximum, runtime). Base: FASTQC. Hardware: -. Max. Data: 9316 MB genome file.

PRINSEQ++ [[80]] (2019). Computing model: Multi-threading.
Function: control quality and preprocess FASTQ files. Feature: reads and writes compressed files without uncompressing the whole file, reducing I/O cost. Input: read files (.fa/.fq). Output: read files (.fq/.fa).
Contrast experiment: Speedup: 16.47 (maximum, runtime). Base: prinseq-lite. Hardware: 1 node with 24 cores (Intel Xeon X5650 CPU @2.67GHz), 189 GB RAM. Max. Data: 12.3 GB genome file.

FastProNGS [[82]] (2019). Computing model: Multi-threading.
Function: control quality and preprocess FASTQ files. Feature: accelerates quality control and adapter removal by splitting raw data files internally. Input: read files (.fq). Output: read files (.fq).
Contrast experiment: Speedup: 2.3 (maximum, runtime). Base: FASTQC. Hardware: 1 node with 8 cores, 284 GB RAM. Max. Data: 12.4 GB genome file.

NGS QC Toolkit [[83]] (2012). Computing model: Multi-process.
Function: control quality and preprocess FASTQ files. Feature: processes large amounts of sequence data in parallel. Input: read file (.fq). Output: read files (.fq).
Contrast experiment: Speedup: 3.59 (maximum, runtime). Base: NGS QC Toolkit in a single thread. Hardware: 1 node with 6 CPUs, at least 1.5 GB RAM. Max. Data: 3.8 GB genome file.

GateKeeper [[84]] (2017). Computing model: FPGA.
Function: filter incorrect data through pre-alignment. Feature: employs an FPGA to accelerate pre-alignment. Input: read files (.fa). Output: read files (.fa).
Contrast experiment: Speedup: 130 (average, runtime). Base: SHD on a single CPU. Hardware: 1 node with 4 cores (Intel i7-3820 CPU @3.6GHz), 8 GB RAM, 1 Virtex-7 XC7VX690T-2FFG1761C FPGA. Max. Data: 30 M paired reads.

Shouji [[85]] (2019). Computing model: FPGA.
Function: filter incorrect data through pre-alignment. Feature: optimizes the common-subsequence finding step by dividing the neighborhood map. Input: read files (.fa). Output: read files (.fa).
Contrast experiment: Speedup: 18.8 (maximum, runtime). Base: MAGNET. Hardware: 1 node with 4 cores (Intel i7-3820 CPU @3.6GHz), 8 GB RAM, 1 Xilinx Virtex-7 VC709 FPGA. Max. Data: 30 M paired reads.

The items 'Function' and 'Feature' under 'Description' briefly indicate the function of the tool/algorithm and explain why it can run faster in the contrast experiments. The 'Input' and 'Output' files are the essential parameters listed in the tool manual. The value of 'Speedup' in the contrast experiment comes from the cited literature or is deduced from the reported results, which were obtained on the listed hardware and compared against the 'Base' run. 'Max. Data' is the size of the largest dataset tested in the literature, in which G means billion, M means million and K means thousand. The mark '-' denotes missing information.

Pattern mining and matching

Pattern mining and matching is an important case of genome sequence processing; it refers to discovering patterns among large numbers of sequences. In this review, we mainly discuss motif recognition and methylation measurement. Pattern mining and matching often needs instruction-level parallelization for acceleration, because a large amount of data is exchanged during the construction of a search strategy.

Various algorithms for motif recognition have been proposed over the years. A typical motif discovery pipeline has three stages [[89]]: (1) pre-processing of the input sequences, including filtering, clustering, etc.; (2) motif discovery, which represents the sequences and adopts a search strategy for motifs; (3) post-processing, which evaluates the resulting motifs. Hashim et al. classified regular motif discovery methods into two types: enumerative approach-based methods and probabilistic approach-based methods. All of these approaches are time-consuming and easily trapped in local optima. Nature-inspired algorithms such as genetic algorithms (GA), particle swarm optimization (PSO) and ant colony optimization (ACO) are most often used to solve these problems [[90]], and none of them can run fast without parallel computing.

EPMA [[90]] is a pattern-matching algorithm proposed by Tahir et al. It uses a suffix tree to solve the string matching problem and encodes every nucleotide into 2 bits to reduce space consumption. In the searching and matching stage, the algorithm adopts multi-threading technology to speed up computation. The researchers compared EPMA with five other algorithms on different datasets and parameter settings, showing that the proposed algorithm outperforms the others.
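
The 2-bit encoding trick is easy to demonstrate; the packing layout below (four bases per byte) is an assumption for illustration, not necessarily EPMA's exact layout:

```python
# Pack nucleotides at 2 bits each: four bases per byte, a 4x space saving
# over one-byte-per-base storage.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq):
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out)

seq = "ACGTTGCA" * 1000
print(len(seq), "bases ->", len(pack(seq)), "bytes")  # 8000 bases -> 2000 bytes
```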

MultiMotifMaker [[92]] is a high-performance tool for DNA methylation motif recognition based on MotifMaker [[93]]. Li et al. found that MotifMaker, developed by PacBio, is inefficient at methylation motif recognition on large genomic data. To reduce the time consumption, they propose pruning strategies and multi-threading techniques to improve the efficiency of identifying DNA methylation motifs. MultiMotifMaker optimizes the branch-and-bound search so that each thread obtains its local optimal solution independently and then updates the global variables if the local optimum is better than the stored global one. In 32-thread mode, MultiMotifMaker is 8-9 times faster than MotifMaker.
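
The local-best/global-best update pattern can be sketched with plain Python threads; the scoring function and candidate branches here are placeholders, not MultiMotifMaker's actual search:

```python
# Each thread searches its own branch, keeps a local best, and takes the
# shared lock only once to (possibly) update the global best.
import threading

best_lock = threading.Lock()
global_best = {"motif": None, "score": float("-inf")}

def score(motif):
    return motif.count("G") + motif.count("C")     # placeholder objective

def search_branch(candidates):
    local_best, local_score = None, float("-inf")
    for motif in candidates:                       # thread-local search
        s = score(motif)
        if s > local_score:
            local_best, local_score = motif, s
    with best_lock:                                # guarded global update
        if local_score > global_best["score"]:
            global_best.update(motif=local_best, score=local_score)

branches = [["GCGC", "ATAT"], ["GGGG", "ACGT"], ["CCAT", "TTTT"]]
threads = [threading.Thread(target=search_branch, args=(b,)) for b in branches]
for t in threads: t.start()
for t in threads: t.join()
print(global_best)                                 # {'motif': 'GGGG', 'score': 4}
```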

PMS6MC [[94]] addresses the planted motif search (PMS) problem, which is NP-hard and mostly solved by approximation approaches. It implements the PMS6 [[95]] algorithm on multicore hardware with multi-threading. The parallel procedure uses two strategies: outer-level parallelism, which mainly plans the threads and thread blocks, and inner-level parallelism, which aims at balancing the load across threads. Compared with the original PMS6 algorithm, PMS6MC is at best 5.4 times faster on a 6-core CPU.

MCM-SPADE [[96]] is a parallel sequential pattern mining tool that uses multi-threading to speed up the SPADE and CM-SPADE [[97]] methods. To balance the load between threads during the divide-and-conquer operation, MCM-SPADE includes a dynamic scheduling policy. The experimental results show that MCM-SPADE achieves better performance in both running time and memory usage than other sequential pattern mining tools, and it scales properly as the number of input sequences increases.

To extract motifs from a single sequence of great length, Sahli et al. [[98]] proposed ACME. In this tool, a suffix tree is used to record sequence counts. To optimize the traversal order of the search space, ACME stores information in CPU caches, which makes the search process more efficient. The tool runs in a cluster environment with thousands of processors through the MPI library. In addition, ACME achieves excellent spatial and temporal efficiency and is 34 times faster than the competing motif extractor MADMX [[99]].

To discover motifs in large DNA datasets, Yu et al. [[100]] proposed MCES, a MapReduce framework-based tool. The main algorithm of MCES transforms the problem of motif identification into the problem of emerging substring mining, as defined in the literature. The whole procedure is divided into a mining step and a combining step, both designed in the MapReduce schema, which makes it possible to run on a Hadoop cluster and handle large input datasets.

For researchers working on DNA methylation measurement, efficiently performing bisulfite sequencing alignment, an important step in measuring DNA methylation, has become a key issue in addressing the bottlenecks of large-scale DNA methylation analysis. BiSpark [[101]], a bisulfite aligner proposed in 2018, offers great scalability, high efficiency and load balance. The core idea of BiSpark is to parallelize the 'three-letter' mapping algorithm, an important method for bisulfite alignment, on Apache Spark to achieve linear scaling. The comparative experiments show that the more reads BiSpark processes, the greater its speed advantage: for 200 M data, BiSpark is 3.5 times faster than Bismark [[102]]. Besides, BiSpark reallocates unbalanced data across cluster nodes, a highly optimized load balancing strategy that improves scalability on large clusters and speeds up the alignment process.
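
The 'three-letter' idea itself is compact: bisulfite treatment converts unmethylated C to T, so both the read and the reference are collapsed into a reduced alphabet before alignment. A serial sketch follows (BiSpark distributes this conversion and the subsequent alignment over Spark):

```python
# Three-letter conversion for bisulfite alignment, shown serially.
def three_letter(seq, strand="CT"):
    """Collapse C->T (or G->A for the opposite strand) before alignment."""
    return seq.replace("C", "T") if strand == "CT" else seq.replace("G", "A")

read = "ACGTCGA"   # bisulfite-treated read
ref  = "ACGCCGA"   # original reference region
# Both sides are converted, so C/T differences caused by bisulfite vanish.
print(three_letter(read) == three_letter(ref))   # True
```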

Liu et al. [[103]] applied the MEME search algorithm to the CUDA model and proposed CUDA-MEME. This tool has two parallel methods, sequence-level parallelization and substring-level parallelization, and it can distribute the computation across both CPU and GPU, fully utilizing the computer system through hybrid computing. In the performance test, the developers ran CUDA-MEME on a GeForce GTX 280 GPU; the results are consistent with those of the original MEME method, with a speedup of 20.5 times.

Peng et al. [[104]] proposed another MEME-based tool, MIC-MEME. Unlike CUDA-MEME, MIC-MEME runs on Intel Xeon Phi hardware. In their work, the starting-point search of the MEME algorithm is parallelized by splitting it into probability score computation and highest-score sorting, and threads are divided into thread groups to decrease the synchronization overhead. Vectorization technology is used to accelerate the iterative updating strategy. The experiments on a Xeon Phi system with 224 cores achieve average speedups of 26.6 to 30.2 times, depending on the computing model. Compared with CUDA-MEME, MIC-MEME is 5.4 times faster in the test experiments.

For many of the tools in Table 4, the source code or manual is unavailable, so it is hard to infer the exact input and output file formats. Although all of these tools target motif discovery, most are designed for different scenarios: MultiMotifMaker uses PacBio reads as input, MCES uses ChIP-seq data, ACME finds motifs in a single sequence, and BiSpark finds motifs in bisulfite sequencing data. Ultimately, it is hard to conclude which method is the best, but the most appropriate method can be found for a particular situation.

Table 4 Studied applications for genome sequence pattern mining/matching

EPMA [[90]] (2017). Computing model: Multi-threading.
Function: a matching algorithm for DNA sequences. Feature: executes the searching and matching stage in multi-threading mode. Input: read file (.fa). Output: read files (.fq).
Contrast experiment: Speedup: 4.44 (maximum, runtime). Base: EBOM [[91]]. Hardware: 1 node with 2 cores (Intel i3-3220 CPU @3.30GHz), 4 GB RAM. Max. Data: 219 MB genome file.

MultiMotifMaker [[92]] (2020). Computing model: Multi-threading.
Function: identify methylation motifs from sequences. Feature: uses pruning strategies and multi-threading to improve efficiency. Input: Ref. file (.fa), modifications file (.gff). Output: motif result (.csv).
Contrast experiment: Speedup: ~9 (maximum, runtime). Base: MotifMaker. Hardware: 1 node with 40 cores (Intel Xeon E5-2630 CPU @2.20GHz), 128 GB RAM. Max. Data: 341 K reads.

PMS6MC [[94]] (2013). Computing model: Multi-process.
Function: discover motifs based on the PMS6 algorithm. Feature: a multicore version of PMS6. Input: query file, database file. Output: alignment results (.psl).
Contrast experiment: Speedup: 5.4 (maximum, runtime). Base: PMS6. Hardware: 1 node with 6 cores (Intel CPU @3.3GHz). Max. Data: 12 KB genome file.

MCM-SPADE [[96]] (2018). Computing model: Multi-process.
Function: discover patterns in sequences. Feature: uses pruning strategies to reduce the search space and a divide-and-conquer policy to speed up processing. Input: read file (.fa). Output: alignment results (.aln), tree results (.tre).
Contrast experiment: Speedup: ~3 (maximum, runtime). Base: CM-SPADE. Hardware: 1 node with 20 cores (Intel Xeon E5-2680 CPU @2.80GHz), 768 GB RAM. Max. Data: 1.1 M reads.

ACME [[98]] (2013). Computing model: Cluster.
Function: discover motifs in a single long sequence. Feature: optimizes the traversal of the search space and uses CPU caches to accelerate data access. Input: read files (.fa/.fq), Ref. file (.fa/.fq). Output: alignment results (.bed).
Contrast experiment: Speedup: ~34 (maximum, runtime). Base: MADMX. Hardware: 1 node with 32 cores (@2.27GHz), 624 GB RAM. Max. Data: 2.6 GB genome file.

MCES [[100]] (2015). Computing model: Hadoop.
Function: discover motifs in ChIP-seq data. Feature: treats motif identification as emerging substring mining and supports Hadoop. Input: read files (.fa), control set file (.fa). Output: motif result (.txt).
Contrast experiment: Speedup: ~4 (maximum, runtime). Base: MCES on 2 nodes. Hardware: 8 nodes, each with 1 CPU @2.67GHz, 4 GB RAM. Max. Data: 5 M reads.

BiSpark [[101]] (2018). Computing model: Spark.
Function: discover motifs in bisulfite sequencing data. Feature: reallocates unbalanced data across cluster nodes to optimize load balancing. Input: Ref. file (.fa). Output: log file (.txt).
Contrast experiment: Speedup: 3.5 (maximum, runtime). Base: Bismark on a 24-core server (Intel Xeon X5650 CPU @2.6GHz, 94 GB). Hardware: 1 node with 4 cores (Intel E5-2407 CPU @2.2GHz), 8 GB RAM, plus 20 nodes, each with 2 cores (Intel i3-3220 CPU @3.3GHz), 8 GB. Max. Data: 32 GB genome file.

CUDA-MEME [[103]] (2010). Computing model: GPU.
Function: discover motifs based on the MEME algorithm. Feature: implements the MEME motif discovery algorithm in the CUDA programming model. Input: Ref. file (.fa). Output: motif result (.txt).
Contrast experiment: Speedup: 20.5 (average, runtime). Base: MEME in sequential mode. Hardware: 1 node with 2 cores (AMD Opteron 248 CPU @2.2GHz), 2 GB RAM, 1 GeForce GTX 280 GPU. Max. Data: 2 MB genome file.

MIC-MEME [[104]] (2018). Computing model: Xeon Phi.
Function: discover motifs based on the MEME algorithm. Feature: implements the MEME motif discovery algorithm on Xeon Phi. Input: Ref. file (.fa). Output: motif result (.txt).
Contrast experiment: Speedup: 5.4 (maximum, runtime). Base: CUDA-MEME. Hardware: 1 node with 144 cores (8 Intel Xeon E7-8800 CPUs @2.3GHz), 2048 GB RAM, 2 Xeon Phi 3120 (57 cores @1.1GHz), 2 NVIDIA K40M GPUs. Max. Data: 2 MB genome file.

The items 'Function' and 'Feature' under 'Description' briefly indicate the function of the tool/algorithm and explain why it can run faster in the contrast experiments. The 'Input' and 'Output' files are the essential parameters listed in the tool manual. The value of 'Speedup' in the contrast experiment comes from the cited literature or is deduced from the reported results, which were obtained on the listed hardware and compared against the 'Base' run. 'Max. Data' is the size of the largest dataset tested in the literature, in which G means billion, M means million and K means thousand.

Conclusion and future work

With the rapid development of NGS technologies, sequencing costs have dropped dramatically, and a vast variety of applications have become available. In future studies, researchers may face hyper-exponential data growth, and analyzing and applying these data will require unprecedented high-performance computing models and platforms [[105]]. In general, such models and platforms can be developed along the lines of massive data storage, algorithm design and parallel computing. In this review, we first introduced the background of parallel computing and divided the currently popular parallel programming technologies into multi-core computing, cluster computing and heterogeneous computing according to the hardware supporting parallelism. We then subdivided multi-core computing into multi-threading and multi-process computing according to the specific model, cluster computing into Apache Hadoop clusters, Spark clusters and traditional distributed clusters, and heterogeneous computing into GPU, FPGA and Xeon Phi according to the hardware type. For each framework, we presented some of its popular applications over the past decade, providing high-performance tool references for readers in need.

Although the parallel computing models differ, the strategies for accelerating computation are similar: increasing the throughput, improving algorithms and improving scalability. We summarize these strategies below:

(i) Increasing data throughput in parallel computing models.

Data throughput refers to the amount of data that can be processed at a time. Although adding hosts, CPUs and GPUs reduces the runtime, the I/O load and the data transferred among processors incur communication overhead; consequently, the limits of disk I/O or network speed become a major obstacle to parallel computing [[106]]. To increase data throughput, researchers have to pay attention to making full use of hardware features, such as the multi-level caches/memories of CPUs and GPUs and the AVX or SSE instructions. Besides, reducing the input data through encoding or indexing is also an important strategy for improving data throughput.

(ii) Improving optimization algorithms and architectures.

Modern alignment algorithms, such as hash algorithms and suffix trees, have significantly reduced the time complexity compared to earlier short sequence alignment methods, but some algorithms still have to be implemented at the expense of time, as with the popular tool BLAST. The optimization algorithm has an essential influence on the design of the subsequent parallel computation. We have also found that algorithms with the same time complexity may perform differently on different architectures, because each algorithm has its own computational features and memory access patterns. When considering parallel computing, it is necessary to analyze and adjust the algorithms carefully and to apply architecture-aware optimization in parallel programming.

(iii) Improving the scalability and portability of the framework.

Scalability is an important consideration for parallel computing tools. To improve parallel scalability, researchers in bioinformatics use MapReduce-based frameworks to distribute independent computing tasks and employ clusters to run embarrassingly parallel programs. This approach improves the scalability of the tools, while the MapReduce-based framework also enhances their robustness.

Portability refers to the ability of an application to run on different hardware systems, operating system environments, or even different hardware architectures. Portability often conflicts with efficiency, but good portability can greatly reduce the programmers' workload. Researchers can use a cross-platform programming language such as OpenCL, or a scripting language such as Python or Perl, to improve portability.

Although the existing parallel computing technologies have made a significant contribution to biological data research, especially in sequence processing, there is still much room for improvement.

(i) Optimizing the data storage in a distributed system.

Sequence data are usually stored in a disk array that can be shared within a network. In distributed data processing, the I/O cost is a significant factor that should not be ignored: many applications, such as assembling complex genomes and analyzing nanopore sequencing signals, require storing huge amounts of data, and I/O performance is especially important for them. Although the Hadoop framework can use HDFS to distribute data, it has to transfer data from local disks to HDFS in the data preparation stage, and data stored in HDFS come with limitations; for example, the API is not friendly to binary files. Several solutions have been proposed in recent years, such as running Hadoop on Lustre [[107]] or Ceph [[108]], and running Spark without HDFS. With the rapid growth of genome sequence data, these solutions are worth considering in the future.

(ii) Making full use of GPUs in a heterogeneous server.

Promoted by the game and film industries, GPU hardware has evolved rapidly, and in recent years it has been further strengthened by the demands of machine learning. Computing hosts with multiple GPUs can now be found everywhere from research laboratories to cloud service providers. GPUs deliver powerful computing performance, especially those deployed in supercomputing clusters. Taking NVIDIA datacenter GPUs as an example, they provide various precision modes and tensor core units, which deliver computing performance differing by orders of magnitude [[109]]. To take advantage of this feature, researchers can store and process genome sequence data at a suitable precision to improve computing performance. On the other hand, designing memory-saving data structures for sequence data and compressing the data are important for making good use of GPUs in the future.

(iii) Mixing parallel computing and cloud computing rationally.

In recent years, cloud computing has shown good prospects in combination with parallel computing [[110], [112]]. Cloud architectures, flexible virtualization technology and huge cloud data storage will help cloud computing play an important role in the future. Researchers also need to pay attention to issues such as time, cost, privacy and collaboration with different cloud service support teams when applying cloud technology.

In this paper, we provide a comprehensive review of modern parallel computing techniques for genome sequence processing. The future development of parallel computing goes beyond single computational biology services and targets complex bioinformatics applications. A broad range of bioinformatics services, including sequence processing, biomolecular simulation, biological network analysis and protein structure modeling, together with larger, richer datasets, can be orchestrated by powerful parallel computing platforms to enable large-scale, sophisticated biological computation. This is expected to lead to scientific breakthroughs in deciphering genomes and proteomes, exploring disease mechanisms, designing effective and personalized drugs, and understanding evolution.

Key points

Review parallel computing models for genome sequence processing in three categories: the multi-core computing model, the cluster computing model and the heterogeneous computing model. Each category is discussed with two or three common hardware architectures and analyzed in terms of its features.

Summarize four common applications of parallel computing for genome sequence processing: genome sequence alignment, single nucleotide polymorphism calling, genome sequence preprocessing, and pattern detection and searching. For each application, the background is introduced first, and then a list of tools or algorithms is discussed in terms of principle, hardware platform and computing efficiency.

Discuss three directions worth exploring in the future for improving parallel computing performance in bioinformatics.

Funding

This work is supported in part by the National Natural Science Foundation of China under grants Nos U1909208, 61732009 and 61772557, and by the Hunan Provincial Science and Technology Program (No. 2018WK4001).

You Zou is a Ph.D. student in the Hunan Provincial Key Lab of Bioinformatics, School of Computer Science and Engineering at Central South University, Changsha, China. His current research interests include bioinformatics and high-performance computing.

Yuejie Zhu is a graduate student in the Hunan Provincial Key Lab of Bioinformatics, School of Computer Science and Engineering at Central South University, Changsha, China. Her current research interests include bioinformatics and data imputation.

Yaohang Li is an associate professor in computer science at Old Dominion University, USA. His research interests are in protein structure modeling, computational biology, bioinformatics, Monte Carlo methods, big data algorithms, and parallel and distributed computing.

Fang-Xiang Wu is a professor in the College of Engineering and the Department of Computer Science at the University of Saskatchewan, Saskatoon, Canada. His current research interests include bioinformatics and artificial intelligence.

Jianxin Wang is a professor in the School of Computer Science and Engineering at Central South University, Changsha, Hunan, China. His research interests include computational genomics and proteomics.

Author notes

You Zou and Yuejie Zhu contributed equally to this work.

REFERENCES

[1] Schneider GF, Dekker C. DNA sequencing with nanopores. Nat Biotechnol 2012 ; 30 (4): 326. Google Scholar Crossref Search ADS PubMed WorldCat

2 [2] Shendure J, Ji H. Next-generation DNA sequencing. Nat Biotechnol 2008 ; 26 (10): 1135 – 45. Google Scholar Crossref Search ADS PubMed WorldCat

3 [3] Mardis ER. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 2008 ; 9 (1): 387 – 402. Google Scholar Crossref Search ADS PubMed WorldCat

4 [4] Venter JC, Adams MD, Myers EW, et al. The sequence of the human genome. Science 2001 ; 291 (5507): 1304 – 51. Google Scholar Crossref Search ADS PubMed WorldCat

5 [5] Yin Z, Lan H, Tan G, et al. Computing platforms for big biological data analytics: perspectives and challenges. Comput Struct Biotechnol J 2017 ; 15 : 403 – 11. Google Scholar Crossref Search ADS PubMed WorldCat

6 [6] Yelick K, Buluç A, Awan M, et al. The parallelism motifs of genomic data analysis. Phil Trans R Soc A 2020 ; 378 (2166): 20190394. Google Scholar OpenURL Placeholder Text WorldCat

7 [7] Nobile MS, Paolo C, Andrea T, et al. Graphics processing units in bioinformatics, computational biology and systems biology. Brief Bioinform 2017 ;18(5): 870 – 885 Google Scholar OpenURL Placeholder Text WorldCat

8 [8] Quan Z, Xu-Bin L, Wen-Rui J, et al. Survey of MapReduce frame operation in bioinformatics. Brief Bioinform 2014 ;15(4):637–647. Google Scholar OpenURL Placeholder Text WorldCat

9 [9] Alex U, Oswaldo T, Cornejo-García JA, et al. Review: high-performance computing to detect epistasis in genome scale data sets. Brief Bioinform 2016 ; 3 : 368. Google Scholar OpenURL Placeholder Text WorldCat

[10] Krishna BV, Baskaran K. Parallel computing for efficient time-frequency feature extraction of power quality disturbances. IET Signal Processing 2013 ; 7 (4): 312 – 26. Google Scholar Crossref Search ADS WorldCat

[11] Flynn MJ. Some computer organizations and their effectiveness. IEEE Trans Comput 1972 ; 100 (9): 948 – 60. Google Scholar Crossref Search ADS WorldCat

[12] Bhandarkar M. MapReduce programming with apache Hadoop. In: International Parallel and Distributed Processing Symposium, 2010, p. 1 – 1.

[13] Zaharia M, Xin R, Wendell P, et al. Apache spark: a unified engine for big data processing. Commun ACM 2016 ; 59 (11): 56 – 65. Google Scholar Crossref Search ADS WorldCat

[14] He B, Fang W, Luo Q, et al. Mars: a MapReduce framework on graphics processors. In: Proceedings of the 17th international conference on Parallel architectures and compilation techniques (PACT '08), 2008, p. 260 – 269. Association for Computing Machinery, New York, USA.

[15] Shvachko K, Kuang H, Radia S, et al. The Hadoop Distributed File System. In: 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010 p. 1 – 10. Incline Village, NV, USA.

[16] Vavilapalli VK, Murthy AC, Douglas C, et al. Apache hadoop yarn: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing, 2013, p. 1 – 16.

[17] Kashyap H, Ahmed HA, Hoque N, et al. Big data analytics in bioinformatics: a machine learning perspective. Comput Sci 2015 ; 5 (1): 28. Google Scholar OpenURL Placeholder Text WorldCat

[18] McGinnis S, Madden TL. BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res 2004 ; 32(suppl_2): W20–W25. Google Scholar OpenURL Placeholder Text WorldCat

[19] Li R, Li Y, Kristiansen K, et al. SOAP: short oligonucleotide alignment program. Bioinformatics 2008 ; 24 (5): 713 – 4. Google Scholar Crossref Search ADS PubMed WorldCat

[20] Li H, Ruan J, Durbin R, et al. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008 ; 18 (11): 1851 – 8. Google Scholar Crossref Search ADS PubMed WorldCat

[21] Lin H, Zhang Z, Zhang MQ, et al. ZOOM! Zillions of oligos mapped. Bioinformatics 2008 ; 24 (21): 2431 – 7. Google Scholar Crossref Search ADS PubMed WorldCat

[22] Guo R, Zhao Y, Zou Q, et al. Bioinformatics applications on apache spark. GigaScience 2018 ; 7 (8): giy098. Google Scholar OpenURL Placeholder Text WorldCat

[23] Shoro AG, Soomro TR. Big data analysis: apache spark perspective. Global J Comp Sci Technol 2015 ; 15 (1): 1-C. Google Scholar OpenURL Placeholder Text WorldCat

[24] Fouad H, Mahmoud NM, Issawi MS, et al. Distributed and scalable computing framework for improving request processing of wearable IoT assisted medical sensors on pervasive computing system. Comput Commun 2020 ; 151 : 257 – 65. Google Scholar Crossref Search ADS WorldCat

[25] Sur S, Koop MJ, Panda DK. High-performance and scalable MPI over infiniband with reduced memory usage: an in-depth performance analysis. In: Proceedings of the 2006 ACM/IEEE conference on Supercomputing (SC '06). 2006, p.105–es. Association for Computing Machinery, New York, USA.

[26] Luebke D, Harris MJ, Govindaraju NK, et al. GPGPU: general-purpose computation on graphics hardware. In: Proceedings of the 2006 ACM/IEEE conference on Supercomputing (SC '06). 2006, p.208–es. Association for Computing Machinery, New York, NY, USA.

[27] Buck I. GPU computing with NVIDIA CUDA. In: ACM SIGGRAPH 2007 courses (SIGGRAPH '07). 2007, p.6–es. Association for Computing Machinery, New York, USA.

[28] Giansanti V, Beretta S, Cesini D, et al. Parallel computing in deep learning: bioinformatics case studies. In: A parallel, distributed and network-based processing, 2019, p. 329 – 33.

[29] Abadi M, Barham P, Chen J, et al. TensorFlow: a system for large-scale machine learning. In: Operating Systems Design and Implementation, 2016, p. 265 – 83.

[30] Alrfou R, Alain G, Almahairi A, et al. Theano: a Python framework for fast computation of mathematical expressions. arXiv preprint arXiv:1605.02688, 2016. Google Scholar OpenURL Placeholder Text WorldCat

[31] Collobert R, Bengio S, Mariéthoz J. Torch: A Modular Machine Learning Software Library. Idiap, 2002. Google Scholar Google Preview OpenURL Placeholder Text WorldCat COPAC

[32] Stone JE, Gohara D, Shi G. OpenCL: a parallel programming standard for heterogeneous computing systems. Comput Sci Eng 2010 ; 12 (3): 66 – 73. Google Scholar Crossref Search ADS PubMed WorldCat

[33] Rahman R. Intel® Xeon Phi™ Coprocessor Architecture and Tools: The Guide for Application Developers. Apress, 2013. Google Scholar Crossref Search ADS Google Preview WorldCat COPAC

[34] Ni S, Guo R, Liao X, et al. Parallel bloom filter on xeon phi many-core processors. In: Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, 9529: p. 388 – 405. Springer, Cham.

[35] Li H, Durbin R. Fast and accurate short read alignment with Burrows--Wheeler transform. Bioinformatics 2009 ; 25 (14): 1754 – 60. Google Scholar Crossref Search ADS PubMed WorldCat

[36] Li H, Durbin R. Fast and accurate long-read alignment with Burrows--Wheeler transform. Bioinformatics 2010 ; 26 (5): 589 – 95. Google Scholar Crossref Search ADS PubMed WorldCat

[37] Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:1303.3997, 2013. Google Scholar OpenURL Placeholder Text WorldCat

[38] Langmead B, Salzberg SL. Fast gapped-read alignment with bowtie 2. Nat Methods 2012 ; 9 (4): 357 – 9. Google Scholar Crossref Search ADS PubMed WorldCat

[39] Jo H, Koh G. Faster single-end alignment generation utilizing multi-thread for BWA. Biomed Mater Eng 2015 ; 26 (s1): S1791 – 6. Google Scholar PubMed OpenURL Placeholder Text WorldCat

[40] Wang M, Kong L. Pblat: a multithread blat algorithm speeding up aligning sequences to genomes. BMC Bioinformatics 2019 ; 20 (1): 28. Google Scholar Crossref Search ADS PubMed WorldCat

[41] Kent WJ. BLAT—the BLAST-like alignment tool. Genome Res 2002 ; 12 (4): 656 – 64. Google Scholar Crossref Search ADS PubMed WorldCat

[42] Mirarab S, Nguyen N, Guo S, et al. PASTA: ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. J Comput Biol 2015 ; 22 (5): 377 – 86. Google Scholar Crossref Search ADS PubMed WorldCat

[43] Schatz MC. CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics 2009 ; 25 (11): 1363 – 9. Google Scholar Crossref Search ADS PubMed WorldCat

[44] Smith AD, Xuan Z, Zhang MQ. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics 2008 ; 9 (1): 128. Google Scholar Crossref Search ADS PubMed WorldCat

[45] Pireddu L, Leo S, Zanetti G. SEAL: a distributed short read mapping and duplicate removal tool. Bioinformatics 2011 ; 27 (15): 2159 – 60. Google Scholar Crossref Search ADS PubMed WorldCat

[46] Abuín JM, Pichel JC, Pena TF, et al. BigBWA: approaching the burrows--wheeler aligner to big data technologies. Bioinformatics 2015 ; 31 (24): 4003 – 5. Google Scholar PubMed OpenURL Placeholder Text WorldCat

[47] Peters D, Luo X, Qiu K, et al. Speeding up large-scale next generation sequencing data analysis with pBWA. J Biocomput 2012 ; 1 (1): 10 – 4172. Google Scholar OpenURL Placeholder Text WorldCat

[48] Abuín José M, Pichel JC, Pena Tomás F, et al. SparkBWA: speeding up the alignment of high-throughput DNA sequencing data. PloS One 2016 ; 11 (5): e155461. Google Scholar OpenURL Placeholder Text WorldCat

[49] Shi L, Meng X, Tseng E, et al. SpaRC: scalable sequence clustering using apache spark. Bioinformatics 2019 ; 35 (5): 760 – 8. Google Scholar Crossref Search ADS PubMed WorldCat

[50] Abuin JM, Pena TF, Pichel JC. PASTASpark: multiple sequence alignment meets big data. Bioinformatics 2017 ; 33 (18): 2948 – 50. Google Scholar Crossref Search ADS PubMed WorldCat

[51] Klus P, Lam S, Lyberg D, et al. BarraCUDA - a fast short read sequence aligner using graphics processing units. BMC Res Notes 2012 ; 5 (1): 27 – 7. Google Scholar Crossref Search ADS PubMed WorldCat

[52] Liu Y, Schmidt B, Maskell DL. CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows--Wheeler transform. Bioinformatics 2012 ; 28 (14): 1830 – 7. Google Scholar Crossref Search ADS PubMed WorldCat

[53] Li R, Yu C, Li Y, et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 2009 ; 25 (15): 1966 – 7. Google Scholar Crossref Search ADS PubMed WorldCat

[54] Chen X, Wang C, Tang S, et al. CMSA: a heterogeneous CPU/GPU computing system for multiple similar RNA/DNA sequence alignment. BMC Bioinformatics 2017 ; 18 (1): 1–10. Google Scholar OpenURL Placeholder Text WorldCat

[55] Zou Q, Hu Q, Guo M, et al. HAlign: fast multiple similar DNA/RNA sequence alignment based on the Centre star strategy. Bioinformatics 2015 ; 31 (15): 2475 – 81. Google Scholar Crossref Search ADS PubMed WorldCat

[56] Zhang J, Lan H, Chan Y, et al. BGSA: a bit-parallel global sequence alignment toolkit for multi-core and many-core architectures. Bioinformatics 2019 ; 35 (13): 2306 – 8. Google Scholar Crossref Search ADS PubMed WorldCat

[57] NVIDIA Corporation. NVBIO: a library of reusable components designed by NVIDIA corporation to accelerate bioinformatics applications using CUDA. http://nvlabs.github.io/nvbio (28 January 2021, date last accessed).

[58] Rahn R, Budach S, Costanza P, et al. Generic accelerated sequence alignment in SeqAn using vectorization and multi-threading. Bioinformatics 2018 ; 34 (20): 3437 – 45. Google Scholar Crossref Search ADS PubMed WorldCat

[59] Mielczarek M, Szyda J. Review of alignment and SNP calling algorithms for next-generation sequencing data. J Appl Genet 2016 ; 57 (1): 71 – 9. Google Scholar Crossref Search ADS PubMed WorldCat

[60] Li H. FermiKit: assembly-based variant calling for Illumina resequencing data. Bioinformatics 2015 ; 31 (22): 3694 – 6. Google Scholar Crossref Search ADS PubMed WorldCat

[61] Gao S, Zou D, Mao L, et al. BS-SNPer: SNP calling in bisulfite-seq data. Bioinformatics 2015 ; 31 (24): 4006 – 8. Google Scholar PubMed OpenURL Placeholder Text WorldCat

[62] Lin HN, Hsu WL. GSAlign: an efficient sequence alignment tool for intra-species genomes. BMC Genomics 2020 ; 21 (1): 1–10. Google Scholar OpenURL Placeholder Text WorldCat

[63] Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 2018 ; 34 (18): 3094 – 100. Google Scholar Crossref Search ADS PubMed WorldCat

[64] Marçais G, Delcher AL, Phillippy AM, et al. MUMmer4: A fast and versatile genome alignment system. PLoS Comput Biol 2018 ; 14 (1): e1005944. Google Scholar Crossref Search ADS PubMed WorldCat

[65] Kiełbasa SM, Wan R, Sato K, et al. Adaptive seeds tame genomic sequence comparison. Genome Res 2011 ; 21 (3): 487 – 93. Google Scholar Crossref Search ADS PubMed WorldCat

[66] Jin J, Liu J, Yin Y, et al. PVCTools: parallel variation calling tools. Heliyon 2019 ; 5 (10): e2530. Google Scholar Crossref Search ADS WorldCat

[67] Mckenna A, Hanna M, Banks E, et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 2010 ; 20 (9): 1297 – 303. Google Scholar Crossref Search ADS PubMed WorldCat

[68] Decap D, Reumers J, Herzeel C, et al. Halvade: scalable sequence analysis with MapReduce. Bioinformatics 2015 ; 31 (15): 2482 – 8. Google Scholar Crossref Search ADS PubMed WorldCat

[69] Cooper MA, Addison FT, Alvarez R, et al. Basin development and tectonic history of the Llanos Basin, eastern cordillera, and middle Magdalena Valley, Colombia. AAPG Bulletin 1995 ; 79 (10): 1421 – 42. Google Scholar OpenURL Placeholder Text WorldCat

[70] Decap D, Reumers J, Herzeel C, et al. Halvade-RNA: parallel variant calling from transcriptomic data using MapReduce. PloS One 2017 ; 12 (3): e174575. Google Scholar Crossref Search ADS WorldCat

[71] Dobin A, Davis CA, Schlesinger F, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 2013 ; 29 (1): 15 – 21. Google Scholar Crossref Search ADS PubMed WorldCat

[72] Mushtaq H, Liu F, Costa CH, et al. SparkGA: a spark framework for cost effective, fast and accurate DNA analysis at scale. In: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology,and Health Informatics (ACM-BCB '17), 2017, p. 148 – 57. Association for Computing Machinery, New York, NY, USA.

[73] Xiao A, Wu Z, Dong S, et al. ADS-HCSpark: a scalable HaplotypeCaller leveraging adaptive data segmentation to accelerate variant calling on spark. BMC Bioinformatics 2019 ; 20 (1). Google Scholar OpenURL Placeholder Text WorldCat

[74] Luo R, Sedlazeck FJ, Lam TW, et al. A multi-task convolutional deep neural network for variant calling in single molecule sequencing. Nat Commun 2019 ; 10(1): 1–11. Google Scholar OpenURL Placeholder Text WorldCat

[75] Poplin R, Chang PC, Alexander D, et al. A universal SNP and small-indel variant caller using deep neural networks. Nat Biotechnol 2018 ; 36 (10): 983 – 7. Google Scholar Crossref Search ADS PubMed WorldCat

[76] Peng G, Fan Y, Wang W, et al. FamSeq: a variant calling program for family-based sequencing data using graphics processing units. PLoS Comput Biol 2014 ; 10 (10): e1003880. Google Scholar Crossref Search ADS PubMed WorldCat

[77] Cui Y, Peng S, Lu Y, et al. mSNP: a massively parallel algorithm for large-scale SNP detection. IEEE Trans Parallel Distrib Syst 2018 ; 29 (11): 2557 – 67. Google Scholar Crossref Search ADS WorldCat

[78] Li R, Li Y, Fang X, et al. SNP detection for massively parallel whole-genome resequencing. Genome Res 2009; 19(6): 1124–32.

[79] Chen S, Zhou Y, Chen Y, et al. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 2018; 34(17): i884–i890.

[80] Cantu VA, Sadural J, Edwards R. PRINSEQ++, a multi-threading tool for fast and efficient quality control and preprocessing of sequencing datasets. PeerJ Preprints 2019; 7: e27551v3.

[81] Schmieder R, Edwards R. Quality control and preprocessing of metagenomic datasets. Bioinformatics 2011; 27(6): 863–4.

[82] Liu X, Yan Z, Wu C, et al. FastProNGS: fast preprocessing of next-generation sequencing reads. BMC Bioinformatics 2019; 20(1): 1–6.

[83] Patel RK, Jain M. NGS QC Toolkit: a toolkit for quality control of next generation sequencing data. PLoS One 2012; 7(2): e30619.

[84] Alser M, Hassan H, Xin H, et al. GateKeeper: a new hardware architecture for accelerating pre-alignment in DNA short read mapping. Bioinformatics 2017; 33(21): 3355–63.

[85] Alser M, Hassan H, Kumar A, et al. Shouji: a fast and efficient pre-alignment filter for sequence alignment. Bioinformatics 2019; 35(21): 4255–63.

[86] Xin H, Greth J, Emmons J, et al. Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping. Bioinformatics 2015; 31(10): 1553–60.

[87] Alkan C, Kidd JM, Marques-Bonet T, et al. Personalized copy number and segmental duplication maps using next-generation sequencing. Nat Genet 2009; 41(10): 1061.

[88] Alser M, Mutlu O, Alkan C, et al. MAGNET: understanding and improving the accuracy of genome pre-alignment filtering. arXiv preprint arXiv:1707.01631, 2017.

[89] Hashim FA, Mabrouk MS, Al-Atabany W, et al. Review of different sequence motif finding algorithms. Avicenna J Med Biotechnol 2019; 11(2): 130–48.

[90] Tahir M, Sardaraz M, Ikram AA, et al. EPMA: efficient pattern matching algorithm for DNA sequences. Expert Syst Appl 2017; 162–70.

[91] Faro S, Lecroq T. Efficient variants of the backward-oracle-matching algorithm. Int J Found Comput Sci 2009; 20(6): 967–84.

[92] Li T, Zhang X, Luo F, et al. MultiMotifMaker: a multi-thread tool for identifying DNA methylation motifs from PacBio reads. IEEE/ACM Trans Comput Biol Bioinform 2020; 17(1): 220–5.

[93] Pacific Biosciences. MotifMaker. https://github.com/PacificBiosciences/MotifMaker (28 January 2021, date last accessed).

[94] Bandyopadhyay S, Sahni S, Rajasekaran S. PMS6MC: a multicore algorithm for motif discovery. Algorithms 2013; 6(4): 805–23.

[95] Bandyopadhyay S, Sahni S, Rajasekaran S. PMS6: a fast algorithm for motif discovery. In: 2012 IEEE 2nd International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), 2012, p. 1–6. IEEE Press, USA.

[96] Huynh B, Trinh C, Huynh HM, et al. An efficient approach for mining sequential patterns using multiple threads on very large databases. Eng Appl Artif Intell 2018; 74: 242–51.

[97] Fournier-Viger P, Gomariz A, Campos M, et al. Fast vertical mining of sequential patterns using co-occurrence information. In: Advances in Knowledge Discovery and Data Mining (PAKDD 2014), Lecture Notes in Computer Science, vol. 8443, p. 40–52. Springer, Cham.

[98] Sahli M, Mansour E, Kalnis P, et al. Parallel motif extraction from very long sequences. In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management (CIKM '13), 2013, p. 549–58. Association for Computing Machinery, New York, USA.

[99] Grossi R, Pietracaprina A, Pisanti N, et al. MADMX: a strategy for maximal dense motif extraction. J Comput Biol 2011; 18(4): 535–45.

[100] Sharov AA, Ko MS. Exhaustive search for over-represented DNA sequence motifs with CisFinder. DNA Res 2009; 16(5): 261–73.

[101] Soe S, Park Y, Chae H. BiSpark: a Spark-based highly scalable aligner for bisulfite sequencing data. BMC Bioinformatics 2018; 19(1): 472.

[102] Krueger F, Andrews SR. Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics 2011; 27(11): 1571–2.

[103] Liu Y, Schmidt B, Liu W, et al. CUDA-MEME: accelerating motif discovery in biological sequences using CUDA-enabled graphics processing units. Pattern Recognit Lett 2010; 31(14): 2170–7.

[104] Peng S, Cheng M, Huang K, et al. Efficient computation of motif discovery on Intel Many Integrated Core (MIC) architecture. BMC Bioinformatics 2018; 19(9): 101–10.

[105] Zou Q, Li X, Jiang W, et al. Survey of MapReduce frame operation in bioinformatics. Brief Bioinform 2014; 15(4): 637–47.

[106] Krishnajith APD, Kelly W, Hayward R, et al. Managing memory and reducing I/O cost for correlation matrix calculation in bioinformatics. In: 2013 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), IEEE, 2013, p. 36–43.

[107] Cluster File Systems, Inc. Lustre: A Scalable, High Performance File System. https://cse.buffalo.edu/faculty/tkosar/cse710%5fspring15/papers/lustre-whitepaper.pdf (28 January 2021, date last accessed).

[108] Weil SA, Brandt SA, Miller EL, et al. Ceph: a scalable, high-performance distributed file system. In: Proceedings of the 7th Symposium on Operating Systems Design and Implementation, 2006, p. 307–20.

[109] Pujol R, Tabani H, Kosmidis L, et al. Generating and exploiting deep learning variants to increase heterogeneous resource utilization in the NVIDIA Xavier. In: 31st Euromicro Conference on Real-Time Systems (ECRTS 2019), 2019, p. 23. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany.

[110] Langmead B, Nellore A. Cloud computing for genomic data analysis and collaboration. Nat Rev Genet 2018; 19(4): 208.

[111] Ekanayake J, Gunarathne T, Qiu J. Cloud technologies for bioinformatics applications. IEEE Trans Parallel Distrib Syst 2010; 22(6): 998–1011.

[112] Li J, Zhang L, Xiao M, et al. The high performance computing applications for bioinformatics research. In: Proceedings of the 6th International Conference on Bioinformatics and Biomedical Science (ICBBS '17), 2017, p. 70–5. Association for Computing Machinery, New York, USA.

By You Zou; Yuejie Zhu; Yaohang Li; Fang-Xiang Wu and Jianxin Wang

You Zou and Yuejie Zhu contributed equally to this work.

Corresponding author: Jianxin Wang, Hunan Provincial Key Laboratory on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China. Tel.: +86-731-88820212; Fax: +86-731-88877936.