A Hybrid Architecture for the Approximate String Matching on an FPGA

Takuma Wada, Shunji Funasaka, Koji Nakano and Yasuaki Ito
Department of Information Engineering, Hiroshima University
Kagamiyama 1-4-1, Higashi-Hiroshima, Hiroshima, 739-8527 Japan
Email: {takuma, funasaka, nakano, yasuaki}@cs.hiroshima-u.ac.jp

Abstract—The main contribution of this paper is to present a very efficient FPGA implementation, which performs the Approximate String Matching (ASM) for a pattern string and a text string of length $m$ and $n$, respectively. It is well known that the ASM can be done in $O(mn)$ time by the dynamic programming technique. Myers has presented a sophisticated sequential algorithm called bit-vector algorithm, which performs the ASM in $O(n)$ time using $m$-bit addition and bitwise operations. Hoffmann et al. have implemented the bit-vector algorithm in the FPGA and evaluated the performance. However, the performance of the bit-vector circuit implemented in the FPGA is degraded for large $m$ due to a long critical path of length proportional to $m$. We will present a circuit with $O(1)$-length critical path that performs the ASM with very high clock frequency and throughput. Also, to reduce the hardware usage, we present a hybrid circuit of the bit-vector and our ASM circuits. The experimental results show that our hybrid circuit for the ASM is 20 times more efficient than the bit-vector circuit in terms of the performance per circuit resource. To see the potentiality of the ASM computation on the FPGA, we evaluated the performances of the ASM on the latest FPGA, GPU, and CPU. Our hybrid circuit implemented in Xilinx Virtex UltraScale+ XCVU9P FPGA is more than 58 times and 1400 times faster than parallel ASM computation on NVIDIA TITAN X GPU and Core i7-6700K CPU, respectively. Thus, the FPGA is promising as an accelerator of the ASM.

I. INTRODUCTION

An Field Programmable Gate Array (FPGA) is a programmable logic device designed to be configured by customers or designers after manufacturing. Since an FPGA chip maintains relative lower price and programmable features, it is widely used in those fields which need to update architecture or functions frequently such as communication and education. The most common architecture of recent FPGAs is an array of Configurable Logic Blocks (CLBs) [1], block RAMs [2], DSP slices [3], and programmable routing channels connecting them [4]. Although the architecture of the latest FPGAs is targeted for high performance digital signal processing [3], [5], it can be used for other applications and general purpose computing [6]. The main purpose of this section is to present logic circuits for the FPGA that performs the approximate string matching. Our logic circuits are implemented using CLBs in the FPGA, which consists of lookup tables (LUTs), flip-flops (FFs), multiplexers, and carry-chains.

Suppose that two strings $X$ (pattern) and $Y$ (text) of length $m$ and $n$ ($m \leq n$), respectively, are given. The Approximate String Matching (ASM) is a task to find a substring in $Y$ most similar to $X$. The similarity of two strings is measured by the edit distance of them, which is the number of three operations, insertion, deletion, and replacement of characters necessary to change one string into the other. The ASM has a lot of applications in the areas of signal processing, bio-informatics, natural language processing, among others. Thus, many sequential, parallel hardware algorithms for the ASM and the related computations have been presented [7], [8], [9], [10], [11], [12], [13]. It is well known that the ASM can be computed in $O(mn)$ time by the dynamic programming technique, which computes a matrix $d$ of size $(m+1) \times (n+1)$. Figure 1 shows the values of matrix $d$ for $X = ababa$ and $Y = aaabba$. Each element $d[i][j]$ of matrix $d$ is the edit distance of the prefix of $X$ of length $i$ and a substring of $Y$ ending at position $j$. Since the value of $d[i][j]$ can be computed from $d[i-1][j-1]$, $d[i-1][j]$, and $d[i][j-1]$, the ASM can be done by a column-major or a row-major computation of the matrix. Since each element $d[i][j]$ can be computed in $O(1)$ time, this dynamic programming algorithm runs in $O(mn)$ time. Also, it is easy to parallelize this sequential algorithm to run in $O(n)$ time using $m$ processors [11]. Each processor is assigned to a row of the matrix and compute elements of the assigned row from left to right.

<table>
<thead>
<tr>
<th>$X$</th>
<th>$a$</th>
<th>$b$</th>
<th>$b$</th>
<th>$a$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$a$</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>$b$</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>$a$</td>
<td>2</td>
<td>1</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>$b$</td>
<td>3</td>
<td>2</td>
<td>1</td>
<td>1</td>
</tr>
<tr>
<td>$a$</td>
<td>4</td>
<td>3</td>
<td>2</td>
<td>1</td>
</tr>
<tr>
<td>$b$</td>
<td>5</td>
<td>4</td>
<td>3</td>
<td>2</td>
</tr>
<tr>
<td>$a$</td>
<td>6</td>
<td>5</td>
<td>4</td>
<td>3</td>
</tr>
</tbody>
</table>

Figure 1. The values of matrix $d$ computed by the dynamic programming algorithm for the ASM

Myers [15] has presented the bit-vector algorithm, which accelerates the ASM using additions and bitwise operations for $m$-bit words. The bit-vector algorithm computes the differences of neighboring elements of the matrix. More
specifically, difference values $V[i][j] = d[i][j] - d[i][j - 1]$, $H[i][j] = d[i][j] - d[i][j - 1]$, and $D[i][j] = d[i][j] - d[i - 1][j - 1]$ of the matrix are computed for each position $(i, j)$ as shown in Figure 2. Since the values of $V[i][j]$ and $H[i][j]$ are either $-1$, $0$, or $1$, they can be encoded in 2 bits each. Also, the value of $D[i][j]$ is 0 or 1, it can be represented using only 1 bit. Using such bit encoding, the differences for all elements in a column are computed from those in the left column by a sequence of $m$-bit addition and bitwise operations [15]. Since commonly used computing devices such as CPUs and GPUs support 32/64-bit addition and bitwise operations, the bit-vector algorithm works very efficiently for pattern strings of length 32/64. Also, if $m$ is larger than 64, the bit-vector algorithm can be implemented by simulating $m$-bit addition and bitwise operations by iterative 32/64-bit operations in an obvious way. Thus, very roughly speaking, the speed up factor of the bit-vector algorithm can be 32/64 over the $O(mn)$-time dynamic programming algorithm, since 32/64 elements in a column of the matrix can be computed in $O(1)$ operations. Hoffmann et al. [16] have implemented the bit-vector algorithm on the FPGA. We call it bit-vector circuit. Their implementation simply uses an adder and logic gates to simulate the bit-vector algorithm as it is. However, the bit-vector circuit uses an $m$-bit adder and so the performance is degraded for large $m$ due to long carry chain of the adder.

The main contribution of this paper is to present logic circuits for the ASM and to implement them on an FPGA. We first present a small logic circuit $S$, which computes difference values $V[i][j]$, $H[i][j]$, and $D[i][j]$ from $V[i][j - 1]$ and $H[i - 1][j]$. We design circuit $S$ with $m$ logic circuits of $S$, which computes these values for $m$ elements of the matrix in one clock cycle and the ASM can be done in $n + m - 1$ clock cycles. Since the critical path length of circuit $S$ is $O(1)$, it runs in very high clock frequency and the throughput is quite large. However, circuit $S$ needs more flip-flops to store the intermediate values than the bit-vector circuit shown in [16]. Thus, we present a hybrid circuit $H$ to compute the difference values of the matrix. Hybrid circuit $H$ has a design parameter which determines the size of circuit modules. Since that with extreme values of the design parameter correspond to circuit $S$ and the bit-vector circuit, it is a hybrid of them. We can select the best design parameter that maximizes the performance of $H$. We measure the performance of FPGA implementations for the ASM by “elements per CLB-sec”, which is the number of elements in the matrix computed per one CLB in a second. The implementation results show that hybrid circuit $H$ with best parameter for $m = 1024$ runs in $775$MHz using 675 CLBs on Xilinx Virtex UltraScale+ XCVU9P-L2FLGA2104E FPGA. If $n = 1024^2 (=1M)$, the ASM can be done in 1.35 ms. Also, since 1024 elements computed in every clock cycle, the performance is $1.18 \times 10^9$ elements per CLB-sec. Since the XCVU9P FPGA has 147,780 CLBs, we can expect that 175 hybrid circuits can be embedded in 80% of all CLBs. Note that if we use all CLBs, the clock performance is drastically degraded due to the overhead of detour channel routing in the FPGA. Thus, it makes sense to assume that we use 80% of CLBs in the FPGA. If this is the case the total performance of the FPGA with 175 hybrid circuits is about $140 \times 10^{12}$ elements per FPGA-sec.

To see the advantage of FPGA over the CPU and GPU, we have implemented the bit-vector algorithm on NVIDIA TITAN X GPU and Core i7-6700K(4GHz). Our GPU implementation is essentially the same as that shown in [17], and reproduces the experimental results on the latest GPU. The experimental results show that the performance of the GPU is $2.38 \times 10^{12}$ elements per GPU-sec. Thus, the performance of our hybrid circuit implemented in XCVU9P FPGA is 58 times better than the best GPU implementation. We have also evaluated the performance of a sequential bit-vector algorithm [15] on Core i7-6700K. The performance using a single thread is about $0.0125 \times 10^{12}$ elements per CPU-sec. Hence, even if 8 hyper threads in 4 cores work perfectly in parallel, the performance can not be larger than $0.100 \times 10^{12}$. Thus, the performance of our hybrid circuit on the FPGA is more than 1400 times better than Core i7-6700K even if 8 hyper threads works completely in parallel.

This paper is organized as follows. We first define the Edit Distance (ED) and the Approximate String Matching (ASM) and review sequential and parallel algorithms for the ASM in Section II. Section III shows the difference computation technique, which computes the difference values of elements in the matrix of the ASM. We then go on to show logic circuits to compute the difference values in Section IV. Section V shows the experimental results on FPGA, GPU, and CPU. Section VI concludes our work.

II. APPROXIMATE STRING MATCHING AND EDIT DISTANCE

The main purpose of this section is to review Approximate String Matching (ASM) and the Edit Distance (ED). Please see [14], [18] for the details.
As a preliminary, we first define the Edit Distance (ED) of two strings. Suppose that source string $X = x_1x_2 \cdots x_m$ of length $m$ and destination string $Y = y_1y_2 \cdots y_n$ of length $n$ are given. Without loss of generality, we can assume that $m \leq n$. We want to change $X$ into $Y$ using the following three operations:

- insertion of a character,
- deletion of a character, and
- replacement of a character.

For example, $X = ababa$ can be changed into $Y = aaabbb$ in five operations as follows:

- $ababa \rightarrow aab\rightarrow aabb\rightarrow aabbb$. Alternatively, $X$ can be changed into $Y$ in three operations as follows:
- $ababa \rightarrow aaba \rightarrow aaabbb$. The ED of two strings is the minimum number of operations to change one string to the other. For example, the ED of $X$ and $Y$ above is three, because there exists a sequence of three operations to change $X$ into $Y$, and there exists no sequence of less than three operations to do the same thing. For later reference, let $ED(X, Y)$ denote the ED of $X$ and $Y$.

The approximate string matching, a more flexible version of the edit distance, is a task to compute the value of $ASM(X, Y)$ defined as follows:

$$ASM(X, Y) = \min \{ ED(X, Y') \mid Y' \text{ is a substring of } Y \}$$

Clearly, $ASM(X, Y)$ is small if $Y$ has a substring similar to $X$. It should be clear that $ASM(X, Y)$ is always less than or equal to $m$, and $ED(X, Y)$ takes a value between $n - m$ and $n$. For example, if $X$ and $Y$ share no character, then $ED(X, Y) = n$ and $ASM(X, Y) = m$. Also, if the prefix of $Y$ is $X$ then $ED(X, Y) = n - m$ and $ASM(X, Y) = 0$.

We use a matrix $d$ of size $(m + 1) \times (n + 1)$ to compute the ASM. Each $d[i][j]$ $(0 \leq i \leq m, 0 \leq j \leq n)$ is used to store the following value:

$$\min_{1 \leq f \leq j} ED(x_1x_2\cdots x_i, y_1y_2\cdots y_fy_{f+1}\cdots y_j).$$

Note that $x_1x_2\cdots x_i$ is a null string (i.e. string of length 0) if $i = 0$. After all values of $d$ is computed, we can compute the value of $ASM(X, Y)$ by the following formula:

$$ASM(X, Y) = \min_{0 \leq i \leq n} d[m][j]$$

Thus, if we have all values $d[m][j] \ (1 \leq j \leq n)$, we can compute $ASM(X, Y)$ very easily. Hence, in this paper we focus on the computation of these values.

Let us show how we compute all values of $d$. Suppose that $d[i - 1][j - 1], d[i - 1][j], \text{ and } d[i][j - 1]$ are already computed. Let "$x_i \neq y_j$" denote the binary value such that it is 1 if $x_i \neq y_j$ and 0 if $x_i = y_j$. The value of $d[i][j]$ can be computed by the following recursive formulas:

$$d[i][j] = \begin{cases} 0 & \text{if } i = 0 \\ i & \text{if } j = 0 \\ \min(d[i][j - 1] + 1, d[i - 1][j] + 1, d[i - 1][j - 1] + (x_i \neq y_j)) & \text{if } i > 0 \text{ and } j > 0. \end{cases}$$

Using this formula, all values of matrix $d$ can be computed as follows:

[Algorithm ASM]

for $j \leftarrow 1$ to $n$ do $d[0][j] \leftarrow 0$
for $i \leftarrow 0$ to $m$ do $d[i][0] \leftarrow i$
for $j \leftarrow 1$ to $n$ do $d[i][j] \leftarrow \min(d[i][j - 1] + 1, d[i - 1][j] + 1, d[i - 1][j - 1] + (x_i \neq y_j))$;
output($d[m][n]$);

Figure 1 shows the values of $d$ for two strings $X = ababa$ and $Y = aaabbb$.

![Figure 3. The values of matrix $d$ computed for each $k$](image)

Next, we will show a parallel algorithm for the ASM. The key idea is to compute the values of the matrix $d$ from the top-left corner to the bottom-right corner [11], [12]. The details of the parallel algorithm is spelled out as follows:

[Parallel ASM algorithm]

for $j \leftarrow 1$ to $n$ do in parallel $d[0][j] \leftarrow 0$
for $i \leftarrow 0$ to $m$ do in parallel $d[i][0] \leftarrow i$
for $k \leftarrow 1$ to $n + m - 1$ do
for $i \leftarrow 1$ to $m$ do in parallel
for $j \leftarrow k - i + 1$ to $n$ if $1 \leq j \leq n$ then
$$d[i][j] \leftarrow \min(d[i][j - 1] + 1, d[i - 1][j] + 1, d[i - 1][j - 1] + (x_i \neq y_j))$$
if $(j \geq 1)$ output ($d[m][j]$);

In the third for-loop, for each $k$ $(1 \leq k \leq n + m - 1)$, the values $d[1][k], d[2][k - 1], \ldots, d[m][k - m]$ are computed and stored. Figure 3 illustrates the computation performed.
in each $k$ for the same matrix $d$ of Figure 1. It should be clear that this parallel algorithm correctly computes the ASM.

III. DIFFERENCE COMPUTATION TECHNIQUE FOR THE ASM

We will show the difference computation technique, which computes the difference values of elements in the matrix of the ASM. This technique has been invented by Myers and used by the bit-vector algorithm [15].

Let $d$ be a matrix obtained by the dynamic programming algorithm for the ASM. We use three matrices $V$, $H$, and $D$ such that each of them is the difference of adjacent values of $d$ as follows:

\[ V[i][j] = d[i][j] - d[i-1][j] \]
\[ H[i][j] = d[i][j] - d[i][j-1] \]
\[ D[i][j] = d[i][j] - d[i-1][j-1] \]

Figure 2 illustrates the relations of these values. Also, Figure 4 shows the values of $V$, $H$, and $D$ for the same input strings as Figure 1. In Figure 4, each segment with no arrowhead represents 0 and each arrow shows the direction of increment of 1.

We will show the relations of $d$, $D$, $V$, and $H$. We omit “for all $i$ and $j$” to avoid iteration of the same redundant phrase. Since $d[i][j]$ is the minimum of $d[i][j-1] + 1$, $d[i-1][j] + 1$, and $d[i-1][j-1] + (x_i \neq y_j)$, we have $d[i][j] \leq d[i][j-1] + 1$, $d[i][j] \leq d[i-1][j] + 1$, and $d[i][j] \leq d[i-1][j-1] + 1$. Thus, we have $V[i][j] \geq 1$, $H[i][j] \leq 1$, and $D[i][j] \leq 1$. We will prove $V[i][j] \geq -1$, $H[i][j] \geq -1$, and $D[i][j] \geq 0$ by induction. More specifically, we assume that $V[i][j] \geq -1$ and $H[i-1][j] \geq -1$ hold and prove that $V[i][j] \geq -1$, $V[i][j] \geq -1$, and $D[i][j] \geq 0$ are satisfied. If $D[i][j] \leq -1$ then $d[i][j] = d[i][j-1] + 1 + (x_i \neq y_j)$ is not satisfied. Hence, $d[i][j] = d[i-1][j] + 1$ and/or $d[i][j] = d[i][j-1] + 1$ must be satisfied. If $d[i][j] = d[i-1][j] + 1$ holds, then $H[i][j] = 1$. Thus, $D[i][j] = V[i][j-1] + H[i][j] \geq 0$ is a contradiction. Hence, $D[i][j] \geq 0$ holds. Further, from $V[i][j-1] + H[i][j] = D[i][j] \geq 0$ and $V[i][j-1] \leq 1$, we have $H[i][j] \geq -1$. Similarly, we can prove $V[i][j] \geq -1$. Therefore, we have,

**Lemma 1:** For all $i$ and $j$ (1 $\leq i \leq m$ and 1 $\leq j \leq n$), we have $-1 \leq H[i][j] \leq 1$, $-1 \leq V[i][j] \leq 1$, and 0 $\leq D[i][j] \leq 1$.

Actually, in Figure 4, we can see segments and arrows of both directions for horizontal and vertical directions corresponding to $H$ and $V$. On the other hand, there is no diagonal arrows pointing the upper left, because the values of $D[i][j]$ cannot be $-1$.

From the definition of $H$, if we have all values of $H[m][1], H[m][2], \ldots, H[m][n]$, we can compute the value of $d[m][i]$ by the following recursive formula:

\[ d[m][0] = m \]
\[ d[m][i] = d[m][i-1] + H[m][i] \quad (1 \leq i \leq n) \]

In other words, $d[m][i]$ can be computed by the prefix-sums of $m, H[m][1], H[m][2], \ldots, H[m][n]$. Thus, we focus on designing circuits that output $H[m][1], H[m][2], \ldots, H[m][n]$ one by one. A simple accumulator, which increments or decrements the stored value by 1, can compute the prefix-sums for these outputs in an obvious way.

We will show important properties to determine the values of elements in $V$, $H$, and $D$. Since $d[i][0] = i$ for all $i$ (1 $\leq i \leq m$) and $d[0][j] = 0$ for all $j$ (1 $\leq j \leq n$), $V[i][0] = 1$ and $H[0][j] = 0$. Also, since $d[i][j]$ is the minimum of $d[i][j-1] + 1, d[i-1][j] + 1,$ and $d[i-1][j-1] + (x_i \neq y_j)$, $D[i][j]$ takes 1 if and only if all of the following three conditions are satisfied.

- $x_i \neq y_j$.
- $V[i][j-1] \geq 0$ (i.e. $d[i][j-1] \geq d[i-1][j-1]$), and
- $H[i-1][j] \geq 0$ (i.e. $d[i-1][j] \geq d[i-1][j-1]$).

In other words, if at least one of the three is not satisfied, then $D[i][j] = 0$. If $x_i = y_j$, then $d[i][j] \leq d[i-1][j-1]$. Thus, from Lemma 1, $D[i][j] = 0$. If $V[i][j-1] = -1$, then $d[i][j-1] + 1 = d[i-1][j-1]$. Hence, $d[i][j] = d[i-1][j-1]$ and so $D[i][j] = 0$. Similarly, if $H[i-1][j] = -1$ then $D[i][j] = 0$. Thus, we have the following formula: $D[i][j] = (x_i \neq y_j) \land (V[i][j-1] \geq 0) \land (H[i-1][j] \geq 0)$. Also, from the definition of $V$, $H$, and $D$, we have $D[i][j] = H[i][j] + V[i][j] + V[i][j-1] + H[i-1][j]$. We can also confirm these relations from Figure 2. We summarize these properties as follows:

**Lemma 2:** We have

(1) $V[i][0] = 1$ for all $i$ (1 $\leq i \leq m$), and
(2) $H[0][j] = 0$ for all $j$ (1 $\leq j \leq n$).

Also, for all $i$ and $j$ (1 $\leq i \leq m$, 1 $\leq j \leq n$), we have

(3) $D[i][j] = (x_i \neq y_j) \land (V[i][j-1] \geq 0) \land (H[i-1][j] \geq 0)$,
3-4-5 uses painting rules (3), (4), and (5) for \( V \) and \( H \) to apply rules (3), (4), and (5). Thus, we can think that the ASM is a problem of painting segments in Figure 5, such that all segments in \( V \) and \( H \) are painted. In boxes of the figure, combinational circuits to evaluate (3), (4), and (5) in Lemma 2 are implemented. Since each element of \( V \) and \( H \) takes \(-1, 0, \) or 1, two bits are necessary. We can use 2-bit words 01, 00, and 10 to represent the values of \(-1, 0, \) and 1 of each element, respectively.

IV. LOGIC CIRCUITS FOR THE ASM

This section first shows logic circuits based on operations 3-4-5 and 5-3-4. After that, we will present a hybrid circuit of them.

A. ASM circuit by operation 3-4-5

Figure 7 illustrates a combinational logic circuit \( S \) implementing operation 3-4-5 in Figure 6, which computes \( V[i][j] \) and \( H[i][j] \) from \( V[i][j - 1] \) and \( H[i - 1][j] \). In boxes of the figure, combinational circuits to evaluate (3), (4), and (5) in Lemma 2 are implemented. Since each element of \( V \) and \( H \) of \( m + 1 \) bits are performed to compute the values of \( V[1][j], V[2][j], \ldots, V[m][j] \). Since conventional processors supports addition and bitwise operations for 32/64-bit words, Myers’s bit-vector algorithm is efficient for \( m \leq 64 \). Since addition and bitwise operations for long words can be simulated by iteration of those for 32/64-bit word, it also work for very large \( m \). An FPGA implementation that simulates addition and bitwise operations for long words used in Myers’s bit-vector algorithm have been presented in [16].
We use circuit $S$ to design a combinational logic circuit $S$ that simulates Parallel ASM algorithm illustrated in Figure 3. It computes $V[1[k], H[1[k], V[2[k-1], H[2[k-1], \ldots, V[m[k-m+1], H[m[k-m+1] from $V[1[k-1], V[2[k-2], \ldots, V[m[k-m$ and $H[0[k], H[1[k-1], \ldots, H[m-1[k[k-m+1]$. Figure 8 illustrates circuit $S$, which simply has $m$ circuits $T$. The figure also shows how the values of $V$ and $H$ are computed. Clearly, the computation performed for each $k$ by Parallel ASM algorithm is simulated by $S$ in 1 clock cycle. Thus, $S$ runs $m+n-1$ clock cycles for the ASM. Also, the critical path length of circuit $S$ is $O(1)$ and $2 \cdot 2 \cdot m = 4m$ flip-flops are necessary to store the values of $V$ and $H$. Thus, we have,

**Theorem 3:** Circuit $S$ with $4m$ flip-flops of depth $O(1)$ can perform the ASM in $m+n-1$ clock cycles.

### B. ASM circuit by operation 5-3-4

Figure 9 illustrates a combinational logic circuit $T$ implementing operation 5-3-4 in Figure 6, which computes $V[i[j]$ and $D[i[j]$ from $V[i-1[j-1], V[i[j-1], and $D[i-1[j]$. We can design a circuit $T$ for the ASM using $m$ circuits $T$ as illustrated in Figure 10. They computes $V[1[k], V[2[k], \ldots, V[m[k$ from $V[1[k-1], V[2[k-1], \ldots, V[m[k-1$. Note that, to compute $V[1[k$, a circuit $T/(5)$, in which a circuit for (5) is omitted in circuit $T$, is used. Also, a circuit to compute (5) in Lemma 2 is used to output $H[m[k]$, which is necessary to compute $d[m[k]$. Thus, we can say that a circuit $T$ uses $m$ circuits $T$. Circuit $T$ works for the computation for each column, it runs $n$ clock cycles to complete the ASM. Also, it needs $2m$ flip-flops to store the outputs $V[1[k], V[2[k], \ldots, V[m[k$, which are the inputs of $T$ in the following clock cycle. Note that the critical path length of circuit $T$ is $O(m)$ since it has input/output chain corresponding to $V[i[k]$. Thus, we have,

**Theorem 4:** Circuit $T$ with $2m$ flip-flops of depth $O(m)$ can perform the ASM in $n$ clock cycles.

### C. Hybrid ASM circuit

We will show a hybrid ASM circuit, a combination of circuits $S$ and $T$. Recall that circuit $T$ has $m$ circuits of $T$. A hybrid circuit $H$ has $\frac{m}{m}$ circuits $T$ with $r$ circuits of $T$, where $r$ is a parameter of the circuit. If $r = m$ then a hybrid circuit $H$ is equivalent to circuit $T$. Also, $H$ is $S$ when $r = 1$, because a circuit for (5) is attached to $T/(5)$ in $T$ is circuit $S$. Thus, it is a hybrid of $S$ and $T$. Figure 11 illustrates a hybrid circuit $H$, which has 3 circuits $T$ with 3 circuits of $T$ each.

We will show that how hybrid circuit $H$ works. Similarly to circuits $S$ and $T$, it computes the values of $V$ from left to right. Clearly, $2r + n - 1$ clock cycles are necessary to complete the ASM. Also, each $T$ in $H$ needs $2(r+1)$ flip-flops to store the necessary values of $V$ and $H$. Thus, $H$ uses $\frac{m}{m} \cdot 2(r+1) = 2m(1 + \frac{1}{r})$ flip-flops. The depth of the circuit of $H$ is that of $T$, which is equal to $O(r)$. Thus, we have,

**Theorem 5:** Circuit $H$ with $2m(1 + \frac{1}{r})$ flip-flops of depth $O(r)$ can perform the ASM in $\frac{m}{m} + n - 1$ clock cycles.

### D. Bit-vector algorithm

This section briefly explains bit-vector algorithm [15] and the FPGA implementation [16]. Please see [15], [16] for the details. The idea of the bit-vector algorithm is to simulate circuit $T$ by addition and bitwise operations as follows. The values of $V[1[j-1], V[2[j-1], \ldots, V[m[j-1]$ are stored in two $m$ bits word. A certain number of addition and bitwise operations for $m$-bit words are performed to compute the values of $V[1[j], V[2[j], \ldots, V[m[j]$. Since commonly used processors supports addition and bitwise operations for 32/64-bit words, the bit-vector algorithm is efficient if $m = 32$ or 64. Since addition and bitwise operations for long words can be simulated by iteration of those for 32/64-bit words, it also work for very large $m$.

We show how circuit $T$ in Figure 10 is simulated. Let $V^+$ and $V^-$ be $m$-bit words such that the $i$-th bits of them are $1$ if $V[i][j-1] = 1$ and $-1$, respectively. The $i$-th bits of them are $0$ if $V[i][j-1] = 0$. Similarly, we use two $m$-bit words $H^+$ and $H^-$ to store $H[i][j]$. Also, let $D^0$ be an $m$-bit word such that each $i$-th bit is $1$ if $D[i][j] = 0$. Further, let $M$ be an $m$-bit word such that each $i$-th bit is $1$ if $x_i = y_j$. After executing the following computation, $V^+$ and $V^-$ are updated by the values of $V[i][j]$s:

**[Bit-vector algorithm]**

$$D^0 \leftarrow ((M \& V^+) \oplus V^+) \& M \mid V^-;$$

$$H^+ \leftarrow V^- \& (\overline{D^0 \oplus V^+});$$

$$H^- \leftarrow D^0 \& \overline{V^+};$$

$$V^+ \leftarrow (H^+ \ll 1) \mid (\overline{D^0 \oplus H^+ \ll 1});$$

$$V^- \leftarrow D^0 \& (H^- \ll 1);$$

The key of the bit-vector algorithm is addition “+” of $m$ bits to compute $D^0$. The carry chain of this addition simulates cascade connection of circuits $T$ in circuit $T$ illustrated in Figure 10. We can design circuits that simulates the bit-vector algorithm using $m$-bit adder and arrays of logic gates for bitwise operations in an obvious way. We call it bit-vector circuit. Further, we can design a hybrid circuit by replacing each circuit $T$ by the bit-vector circuit.

Essentially, the bit-vector algorithm is designed so that circuit $T$ is simulated by additions and bitwise operations. So, intuitively, it makes no sense to simulate the bit-vector algorithm by a logic circuit, because a circuit simulation algorithm is simulated by a circuit. It is expected that such iterative simulation results increase of hardware resource usage and degrade the clock performance. However, as we will show in Section V, the hardware usage of both modules are not so different. Also, for large $m$, the clock performance of the bit-vector circuit is better. This is because an adder
used in the bit-vector circuit can be implemented in embedded fact carry chain logic in the FPGA very efficiently. The experimental results show that the performance of both circuits is almost the same for small \( m \).

V. EXPERIMENTAL RESULTS

We have implemented circuit \( S \), circuit \( T \), our hybrid circuit \( H \), and the bit-vector circuit [16] on Xilinx Virtex UltraScale+ XCVU9P-L2FLGA2104E FPGA with 147,780 CLBs, and evaluated the performance. This FPGA is a popular high-end FPGA, which is used in Virtex UltraScale FPGA VCU118 evaluation board [19] as well as Amazon cloud computing service AWS EC2 F1 instance. The logic synthesis performed using Vivado Design Suite. For reference, we have used NVIDIA TITAN X and Core i7-6700K (4.00GHz) for evaluating the performance of the bit-vector algorithms on a GPU and a CPU. Also, since the main application of the ASM is analysis of DNA sequences with 4 bases A, T, C, and G [20], we assume that pattern and text are strings of 2-bit characters to encode 4 bases.

Table I shows the performance of our hybrid circuit \( H \) using circuits \( T \) of various sizes for \( m = 1024 \). It shows clock frequency in MHz, the number of LUTs, FFs, and CBs. The performance is evaluated for various size of \( T \) from 1 to 1024 for \( m = 1024 \). Note that, each CLB of Xilinx Virtex UltraScale+ has 8 LUTs and 16 FFs [1]. For example, when the size of \( T \) is 8, in 675 used CLBs with 5400 LUTs and 10800 FFs totally, only 3708 LUTs and 2687 FFs are used. If the size of \( T \) is 1 then our hybrid circuit is equivalent to circuit \( S \). Also, a single \( T \) is used when the size of \( T \) is 1024. We can see that a hybrid circuit with larger \( T \) uses fewer CLBs, because it uses fewer flip-flops to store \( H[i][j] \). Also, since a larger \( T \) has a longer critical path for \( V[i][j] \), the clock performance decreases. We can see that the hardware resource usage and the clock frequency in these tables follow Theorem 5. The tables also show the number of elements computed by a CLB in a second. For example, 128 circuits \( T \) of size 8 uses 675 CLBs and works in 775MHz.

Table I: Performance of hybrid circuit \( H \) using circuits \( T \) of various sizes for \( m = 1024 \).

<table>
<thead>
<tr>
<th>Size of ( T )</th>
<th>Clock Frequency (MHz)</th>
<th>Number of LUTs</th>
<th>Number of FFs</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>793</td>
<td>1024</td>
<td>1024</td>
</tr>
<tr>
<td>2</td>
<td>757</td>
<td>2048</td>
<td>2048</td>
</tr>
<tr>
<td>4</td>
<td>725</td>
<td>4096</td>
<td>4096</td>
</tr>
<tr>
<td>8</td>
<td>694</td>
<td>7424</td>
<td>7424</td>
</tr>
<tr>
<td>16</td>
<td>675</td>
<td>14848</td>
<td>14848</td>
</tr>
<tr>
<td>32</td>
<td>656</td>
<td>29696</td>
<td>29696</td>
</tr>
<tr>
<td>64</td>
<td>641</td>
<td>59376</td>
<td>59376</td>
</tr>
<tr>
<td>128</td>
<td>627</td>
<td>118752</td>
<td>118752</td>
</tr>
<tr>
<td>256</td>
<td>615</td>
<td>237504</td>
<td>237504</td>
</tr>
<tr>
<td>512</td>
<td>603</td>
<td>475104</td>
<td>475104</td>
</tr>
<tr>
<td>1024</td>
<td>600</td>
<td>950208</td>
<td>950208</td>
</tr>
</tbody>
</table>

Figure 8. A combinational logic circuit \( S \) using \( m \) circuits of \( S \)

Figure 9. A combinational logic circuit \( T \) for operation 5-3-4

...
Figure 10. A combinational logic circuit $\mathcal{T}$ using $m$ circuits of $T$

Figure 11. A hybrid circuit $\mathcal{H}$ using circuits of $T$

Table I

<table>
<thead>
<tr>
<th>Size of $T$</th>
<th>clock (MHz)</th>
<th>LUTs</th>
<th>FFs</th>
<th>CLBs</th>
<th>$10^9$ elements per CLB-sec</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>775</td>
<td>3865</td>
<td>7169</td>
<td>1619</td>
<td>0.490</td>
</tr>
<tr>
<td>2</td>
<td>775</td>
<td>3070</td>
<td>4607</td>
<td>938</td>
<td>0.846</td>
</tr>
<tr>
<td>4</td>
<td>775</td>
<td>2815</td>
<td>3327</td>
<td>800</td>
<td>0.992</td>
</tr>
<tr>
<td>8</td>
<td>775</td>
<td>3708</td>
<td>2687</td>
<td>675</td>
<td>1.18</td>
</tr>
<tr>
<td>16</td>
<td>465</td>
<td>3272</td>
<td>2366</td>
<td>535</td>
<td>0.890</td>
</tr>
<tr>
<td>32</td>
<td>408</td>
<td>4279</td>
<td>2207</td>
<td>643</td>
<td>0.650</td>
</tr>
<tr>
<td>64</td>
<td>213</td>
<td>5047</td>
<td>2126</td>
<td>879</td>
<td>0.248</td>
</tr>
<tr>
<td>128</td>
<td>94</td>
<td>4689</td>
<td>2086</td>
<td>791</td>
<td>0.122</td>
</tr>
<tr>
<td>256</td>
<td>77</td>
<td>3768</td>
<td>2105</td>
<td>592</td>
<td>0.133</td>
</tr>
<tr>
<td>512</td>
<td>38</td>
<td>3165</td>
<td>2057</td>
<td>523</td>
<td>0.0744</td>
</tr>
<tr>
<td>1024</td>
<td>19</td>
<td>3121</td>
<td>2051</td>
<td>514</td>
<td>0.0379</td>
</tr>
</tbody>
</table>

Table II

<table>
<thead>
<tr>
<th>Size of $T$</th>
<th>clock (MHz)</th>
<th>LUTs</th>
<th>FFs</th>
<th>CLBs</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>775</td>
<td>3865</td>
<td>7169</td>
<td>1619</td>
</tr>
<tr>
<td>2</td>
<td>775</td>
<td>3070</td>
<td>4607</td>
<td>938</td>
</tr>
<tr>
<td>4</td>
<td>775</td>
<td>2815</td>
<td>3327</td>
<td>800</td>
</tr>
<tr>
<td>8</td>
<td>775</td>
<td>3708</td>
<td>2687</td>
<td>675</td>
</tr>
<tr>
<td>16</td>
<td>465</td>
<td>3272</td>
<td>2366</td>
<td>535</td>
</tr>
<tr>
<td>32</td>
<td>408</td>
<td>4279</td>
<td>2207</td>
<td>643</td>
</tr>
<tr>
<td>64</td>
<td>213</td>
<td>5047</td>
<td>2126</td>
<td>879</td>
</tr>
<tr>
<td>128</td>
<td>94</td>
<td>4689</td>
<td>2086</td>
<td>791</td>
</tr>
<tr>
<td>256</td>
<td>77</td>
<td>3768</td>
<td>2105</td>
<td>592</td>
</tr>
<tr>
<td>512</td>
<td>38</td>
<td>3165</td>
<td>2057</td>
<td>523</td>
</tr>
<tr>
<td>1024</td>
<td>19</td>
<td>3121</td>
<td>2051</td>
<td>514</td>
</tr>
</tbody>
</table>

The performance of our hybrid circuit $\mathcal{H}$ using circuits $\mathcal{T}$ for $m = 1024$

The hybrid circuit is configured as 128 circuits $\mathcal{T}$ of size 8 each. Since XCVU9P FPGA has 147,780 CLBs, we can expect that the total performance of XCVU9P using 80% of CLBs is $1.18 \times 10^9 \times 147780 \times 0.8 = 139 \times 10^{12}$ elements per FPGA-sec. Note that if almost all of CLBs are used, the clock frequency is decreased due to detour connection routing and layout overheads. Thus, we assume that only 80% of CLBs not to overestimate the FPGA performance for the ASM. If this is the case, $\frac{0.8 \times 147780}{675} = 175$ hybrid circuits are implemented in the FPGA works in parallel. Since XCVU9P has 832 I/O pins, we can implement 175 hybrid circuits with 2 input pins to provide 2-bit characters text strings. Hence, it is feasible to embed 175 hybrid circuits in XCVU9P.

Table II shows the performance of our hybrid circuit using circuits $\mathcal{T}$ for $m = 4096$. The maximum performance is obtained when we use 128 circuits $\mathcal{T}$ of size 8 each, and the performance is $1.14 \times 10^9$ elements. Hence the performance evaluated by the number of elements computed per CLB-sec. is almost the same as that for $m = 1024$. Actually, circuits $\mathcal{T}$ of size 8 are used for both $m = 1024$ and 4096, the clock frequency is the same and the number of CLBs are proportional to the number of circuits $\mathcal{T}$. Hence, we can
We first compare the performance of the FPGA and the GPU in terms of the ASM. From Table I, only one hybrid circuit $H$ with 675 CLBs on the FPGA can perform 32K instances of the ASM with $m = 1024$ and $n = 1024^2$ computes $32K \cdot 4096 \cdot 1024^2 = 2^{37} = 1.40 \times 10^{14}$ elements. The GPU implementation runs 59.09 seconds, we can say that the performance is $\frac{1.40 \times 10^{14}}{59.09} = 2.38 \times 10^{12}$ elements per GPU-sec. Also, the CPU performance is about $0.0125 \times 10^{12}$ elements per CPU-sec. Note that, the CPU implementation just repeats the same bit-vector operation, the running time is proportional to the number of elements, and so the performance in terms of elements per CPU-sec is almost the same.

We have implemented the bit-vector algorithm in NVIDIA TITAN X GPU. Our implementation is essentially the same as that shown in [17] and reproduces their results on a latest GPU. Also, the sequential bit-vector algorithm is executed on Core i7-6700K CPU. Although it has multiple cores and multiple threads can run at the same time, we use one thread to execute the sequential bit-vector algorithm.

Tables V and VI show the performance for $m = 1024$ and 4096, respectively. The ASM is computed for text strings of length $n = 1024^2$. Also, the running time is evaluated for multiple input instances from 2048 to 32K pairs of pattern/text. To fully utilize the GPU resources, we should invoke as many threads as possible. Thus, we perform the ASM computation for multiple input instances. Thus, the ASM computation of 32K instances for $m = 4096$ and $n = 1024^2$ computes $32K \cdot 4096 \cdot 1024^2 = 2^{37} = 1.40 \times 10^{14}$ elements. The GPU implementation runs 59.09 seconds, we can say that the performance is $\frac{1.40 \times 10^{14}}{59.09} = 2.38 \times 10^{12}$ elements per GPU-sec. Also, the CPU performance is about $0.0125 \times 10^{12}$ elements per CPU-sec. Note that, the CPU implementation just repeats the same bit-vector operation, the running time is proportional to the number of elements, and so the performance in terms of elements per CPU-sec is almost the same.
times faster. Since we can implement 175 hybrid circuits \( H \) using 80% CLBs in a XCVU9P FPGA, we can say that XCVU9P is more than 1400 times faster than Core i7-6700K for the ASM.

VI. CONCLUSION

The main contribution of this paper is to present a hybrid circuit for the Approximate String Matching (ASM) on the FPGA. Our hybrid circuit in the FPGA is more than 20 times efficient than the previously published implementation based on the bit-vector algorithm. Also, our hybrid circuit implemented in a Xilinx Virtex UltraScale+ XCVU9P FPGA is more than 58 times faster than NVIDIA TITAN X GPU and more than 1400 times faster than Core i7-6700K (all 4 cores are used). Thus, the FPGA is promising as an accelerator of the ASM computation.

ACKNOWLEDGMENTS

The authors would like to thank Mr. Ryota Kadoto, who performed earlier stage of this work as his thesis.

REFERENCES