Energy Efficient FPGA based Hardware Accelerators for Financial Applications

Kenn Toft, Jakob; Nannarelli, Alberto

Published in:
Proceedings of 32nd NORCHIP Conference

Link to article, DOI:
10.1109/NORCHIP.2014.7004741

Publication date:
2014

Jakob Kenn Toft and Alberto Nannarelli
Dept. of Applied Mathematics and Computer Science
Technical University of Denmark
Kongens Lyngby, Denmark

Abstract—Field Programmable Gate Arrays (FPGAs) based accelerators are very suitable to implement application-specific processors using uncommon operations or number systems.

In this work, we design FPGA-based accelerators for two financial computations with different characteristics and we compare the accelerator performance and energy consumption to a software execution of the application.

The experimental results show that significant speed-up and energy savings can be obtained for large data sets by using the accelerator, at the expense of a longer development time.

I. INTRODUCTION

Hardware acceleration provides both speed-up and energy efficiency for computer systems. In recent years, two main classes of accelerators have emerged as the most relevant: Graphics Processing Units, or GPUs, and FPGA (Field Programmable Gate Array) based accelerators.

Modern GPUs are composed of a large array of identical computation cores to exploit parallelism. As GPUs are now fundamental components in supercomputers [1], their cores are designed to support standards that enable general-purpose computing (e.g., the IEEE 754 standard for floating-point arithmetic).

In contrast, FPGA based accelerators are more suitable for implementing Application-Specific Processors, or ASPs, which optimize the hardware for the specific operations the application needs. Examples are applications requiring a specific number system, such as the decimal system, or applications requiring the modular operations used in cryptography.

The major drawback of FPGA based acceleration is the long development time compared to a software implementation, in which the application runs solely on the CPU. There are a few solutions and tool suites that reduce this development time, but the gap with respect to software development is still huge [2]. In addition, the wide variety of FPGA board vendors makes hardware accelerators generally not portable across different platforms.

To summarize, FPGA based accelerators provide more hardware flexibility than GPU based accelerators because the processing cores can be tailored to execute specific operations in a specific (non-standard) number system. This flexibility is exploited to obtain better performance (lower latency, higher throughput, and lower power dissipation). However, the development costs of FPGA accelerators are the highest and the ASP is generally not portable across different platforms.

In this work, we present the results of the implementation of FPGA based accelerators for two financial computing applications. Each accelerator is realized by an ASP communicating with the CPU of the host computer through a standard bus. We measured the actual energy consumption by monitoring the current in both the CPU and the FPGA board.

The results demonstrate that significant speed-ups and power savings are obtained. For future developments, we propose a solution to shorten the design time and enhance the portability of ASPs developed for other platforms.

II. APPLICATIONS

We chose two applications with very different characteristics (number system and communication requirements) for the hardware acceleration.

The first one is a telephone billing application based on the decimal number system. Many financial and accounting applications resort to decimal arithmetic because some decimal fractions (e.g., 0.1) cannot be represented with a finite number of bits in binary and, in some cases, rounding errors arise [3]. These errors are not acceptable in financial and accounting applications. Only a few processors have hardware support for decimal operations, and the rest implement decimal arithmetic in time-consuming software routines. Our accelerator should provide the necessary hardware to execute decimal operations.
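The representation problem can be seen in a few lines of software. The following is a minimal sketch, using Python's standard `decimal` module as a stand-in for decimal arithmetic (it is not part of the accelerator): ten additions of 0.1 do not give exactly 1 in binary floating-point, while they do in decimal.

```python
from decimal import Decimal

# Binary floating-point: 0.1 has no finite binary expansion,
# so the accumulated sum is slightly off.
binary_total = sum([0.1] * 10)

# Decimal arithmetic: 0.1 is represented exactly.
decimal_total = sum([Decimal("0.1")] * 10, Decimal("0"))

print(binary_total == 1.0)            # False
print(decimal_total == Decimal("1"))  # True
```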

The second application is a Monte Carlo simulation for pricing of European options. An increasing number of stock market transactions are made through computer systems exploiting small variations in the price of the asset and buying/selling in a few milliseconds (high frequency trading, or HFT). The choice of buying/selling an asset is made based on the results of batch simulations which statistically determine the price to make a profit. These simulations require a lot of computing power, and they can be easily parallelized.

We give some details on the algorithms implementing the selected applications in the following.

A. Telephone Billing

The “TELCO” benchmark was developed by IBM to investigate the balance between I/O time and calculation time in a telephone company billing application [4]. The benchmark
Algorithm 1 Pseudo-code for the computations in TELCO benchmark.

```plaintext
if (calltype = L) then
  P = duration × Lrate
else
  P = duration × Drate
end if
Pr = RoundToNearestEven(P)
B = Pr × Btax
C = Pr + Trunc(B)
if (calltype = D) then
  D = Pr × Dtax
  C = C + Trunc(D)
end if
```
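As an illustration, Algorithm 1 can be sketched in software with Python's `decimal` module. The function name `telco_call_cost`, the rounding of amounts to two fractional digits, and the rate and tax values in the usage lines are assumptions made for this example only; they are not taken from the accelerator design.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_DOWN

CENT = Decimal("0.01")  # assumed resolution: 2 fractional digits


def telco_call_cost(duration_s, calltype, l_rate, d_rate, b_tax, d_tax):
    """Cost of one call following the steps of Algorithm 1."""
    rate = l_rate if calltype == "L" else d_rate
    p = Decimal(duration_s) * rate
    pr = p.quantize(CENT, rounding=ROUND_HALF_EVEN)   # RoundToNearestEven(P)
    b = pr * b_tax
    c = pr + b.quantize(CENT, rounding=ROUND_DOWN)    # Pr + Trunc(B)
    if calltype == "D":
        d = pr * d_tax
        c += d.quantize(CENT, rounding=ROUND_DOWN)    # C + Trunc(D)
    return c


# Illustrative (hypothetical) rates and taxes.
RATES = dict(l_rate=Decimal("0.0013"), d_rate=Decimal("0.00894"),
             b_tax=Decimal("0.0675"), d_tax=Decimal("0.0341"))

local_cost = telco_call_cost(100, "L", **RATES)
distance_cost = telco_call_cost(1000, "D", **RATES)
half_even_cost = telco_call_cost(50, "L", **RATES)  # 0.0650 rounds to 0.06
```

With these example values, the long-distance call gives Pr = 8.94, Trunc(B) = 0.60 and Trunc(D) = 0.30.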

Algorithm 2 Monte Carlo European option pricing.

```plaintext
VsqrtT = σ√T
drift = (r - \frac{σ^2}{2})T
expRT = e^{-rT}
sum = 0
for i = 1 to n do
  St = S_0 · e^{(drift + VsqrtT · Vrnd)}
  if (St - K > 0) then
    sum = sum + (St - K) · expRT
  end if
end for
S = sum/n
```

to evaluate the Black-Scholes-Merton equation

\[ S(T) = S(0) \cdot e^{\left( \mu - \frac{\sigma^2}{2} \right) T + \sigma \sqrt{T}} \] (1)

for several samples and then by computing the mean value [6]. The Monte Carlo simulation of (1) can be implemented by Algorithm 2 with the following inputs: \( S_0 \) initial security price, \( K \) strike price, \( r \) risk-free interest rate, \( \sigma \) security volatility, \( T \) time to expiration (in years), and \( n \) number of simulations. The parameters \( r \) and \( \sigma \) are constant and, therefore, the variables VsqrtT, drift and expRT are constant for the simulation.

The algorithm requires the generation of normally distributed random numbers in \((-1.0, 1.0)\).

Unlike the TELCO application, the Monte Carlo Option Pricing (MCOP in the following) does not require much communication with the memory for typical simulation runs (\( n > 10^5 \)), and most of the execution time is spent in the computations.
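A software reference for Algorithm 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `mc_european_call` and the seed parameter are hypothetical, the random values are unrestricted standard normal draws from Python's `random.gauss`, and the discount factor is taken as the usual present-value term \( e^{-rT} \).

```python
import math
import random


def mc_european_call(s0, k, r, sigma, t, n, seed=0):
    """Monte Carlo estimate of a European call price (sketch of Algorithm 2)."""
    rng = random.Random(seed)
    vsqrt_t = sigma * math.sqrt(t)          # VsqrtT
    drift = (r - 0.5 * sigma ** 2) * t      # drift
    exp_rt = math.exp(-r * t)               # assumed discount factor e^{-rT}
    total = 0.0
    for _ in range(n):
        vrnd = rng.gauss(0.0, 1.0)          # assumed standard normal draw
        st = s0 * math.exp(drift + vsqrt_t * vrnd)
        if st - k > 0:
            total += (st - k) * exp_rt
    return total / n


price = mc_european_call(s0=100.0, k=100.0, r=0.05, sigma=0.2, t=1.0, n=100_000)
```

The only data the loop needs from outside are the three precomputed constants and the random draws, which is why the computation, and not the memory traffic, dominates the execution time.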

III. IMPLEMENTATION PLATFORM

The hardware accelerator is implemented in the Xilinx Virtex-5 LX330T FPGA. This FPGA is embedded in the Alpha Data ADM-XRC-5T2 board and is connected to the host PC via the PCI bus. The board is also equipped with DRAM that can be accessed by the FPGA and by the host PC via Direct Memory Access (DMA). The CPU of the host PC is the Intel Core2 Duo processor clocked at 3 GHz.

The implementation of the DMA functions, along with others, is included in a Software Development Kit (SDK) provided with the Alpha Data board. The SDK includes an application-programming interface (API), VHDL functions and examples.

Fig. 1 sketches the architecture of the accelerator: CPU, FPGA, and communication. We implemented in the FPGA: the ASP for the application, a front-end processor (FEP) which handles the communication ASP-DRAM-CPU, and the DMA functions (not depicted in Fig. 1).

A. Power Measurements

To evaluate the energy efficiency of the accelerator, we measure the current drawn during the execution of the application in both the CPU and the FPGA board.

The power dissipated in the CPU is monitored by a Hall effect current sensor connected between the power supply and the motherboard's CPU power socket. We connect a digital multimeter to the sensor's terminals and dump the readings (a voltage drop proportional to the current flowing in the CPU) to a file for batch processing. More detail on the measurement set-up is given in [7].

The FPGA board power is monitored by reading the voltage drop across a 7.5 m$\Omega$ shunt resistor in series with the FPGA board power supply $V_{CC}$. Also in this case, the readings $V_{meas}$ are taken with a digital multimeter and stored in a file. The instantaneous power dissipation is then computed as $P = \frac{V_{meas}}{7.5 \times 10^{-3}} V_{CC}$.
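The post-processing of the logged readings can be sketched as below. The shunt value is the $7.5 \times 10^{-3}$ $\Omega$ appearing in the formula; the function names, the supply voltage of 12 V, the sampling interval and the sample values are hypothetical and only serve the example.

```python
R_SHUNT_OHM = 7.5e-3  # shunt resistance used in the power formula


def board_power(v_meas, v_cc):
    """Instantaneous power: P = (V_meas / R_shunt) * V_CC."""
    return v_meas / R_SHUNT_OHM * v_cc


def energy(power_samples_w, dt_s):
    """Energy as the sum of power samples times the sampling interval."""
    return sum(power_samples_w) * dt_s


# Hypothetical log: four readings of 7.5 mV taken every 0.5 s, V_CC = 12 V.
readings_v = [7.5e-3] * 4
powers_w = [board_power(v, v_cc=12.0) for v in readings_v]
run_energy_j = energy(powers_w, dt_s=0.5)
```

A reading of 7.5 mV corresponds to a current of 1 A, so each sample in this example is 12 W.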

The power measured in the FPGA board is inclusive of the FPGA chip itself, the DRAM and all the peripherals on the board. Although Virtex-5 family FPGAs are equipped with pins to monitor the FPGA core power dissipation, these pins are not accessible in the Alpha Data board.

IV. TELCO ACCELERATOR

The architecture of the accelerator for the TELCO application is derived from the one presented in [8] with some important enhancements.

First, the accelerator of [8] could only handle small sets of data (call durations to process) because data were sent from the CPU directly to the ASP (FPGA) through a buffer, and not to the DRAM on the FPGA board. In this version of the accelerator, we designed the front-end processor, implemented on the FPGA alongside the ASP, to handle both the data transfer (via DMA) with the host PC and the CPU-ASP communication (via specific instructions).

Second, we applied some optimizations to the ASP to make the computation more efficient when executed on FPGA platforms, and re-pipelined the unit to work at a clock frequency of 70 MHz.

Third, we perform a comprehensive performance evaluation based on actual measurements, including energy consumption.

range \([0, 10^6]\) (\(10^6\) s corresponds to 11.5 days), and a 5-digit decimal fractional number for the call rate and taxes. The call rate is selected by a multiplexer depending on the type of the call. The cost of each call and the total cost (\(C\) and \(C_{TOT}\), respectively, in Fig. 2) are represented by 8-digit decimal numbers (6 integer and 2 fractional digits) for values up to 999,999.99 (e.g., euro). Details on the blocks in Fig. 2 can be found in [8].

The ASP is pipelined in 11 stages and clocked at a frequency of 70 MHz for a latency of 10 × 14.3 ns = 143 ns, and an ideal\(^1\) throughput of 70 million calls processed per second.

V. MCOP ACCELERATOR

Algorithm 2 is mapped onto the ASP sketched in Fig. 3. The algorithm is easily parallelized by unrolling the loop into \(P\) parallel paths, labeled “MCOP PATH X” in Fig. 3. The \(P = 8\) paths are then recombined by an adder tree. The algorithm is implemented in the IEEE-compliant \textit{binary32} floating-point (FP) format\(^2\).

Instead of performing the multiplication \(S_0 \cdot e^{(...)}\) in each cycle of the loop (Algorithm 2), we divide \(K\) by \(S_0\) as a pre-computation (in the CPU) and compare \(e^{(...)}\) directly to \(k_1 = K/S_0\). Similarly, we move the multiplication by expRT out of the loop. The correct value of \(S\) is restored in the last stage by performing the multiplication by \(k_2 = S_0 \cdot \text{expRT} \cdot \frac{1}{n}\).
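The equivalence of the two formulations can be checked with a short software sketch. It assumes that the final scaling factor is the product \(S_0 \cdot \text{expRT} \cdot \frac{1}{n}\); the function names and the parameter values are hypothetical, and the code is a numerical check, not a model of the hardware.

```python
import math
import random


def mcop_direct(s0, k, drift, vsqrt_t, exp_rt, draws):
    """Loop as written in Algorithm 2: multiply by S0 and expRT per sample."""
    total = 0.0
    for z in draws:
        st = s0 * math.exp(drift + vsqrt_t * z)
        if st - k > 0:
            total += (st - k) * exp_rt
    return total / len(draws)


def mcop_rearranged(s0, k, drift, vsqrt_t, exp_rt, draws):
    """Loop with the multiplications moved out: compare e^(...) against k1."""
    k1 = k / s0                       # pre-computed once
    k2 = s0 * exp_rt / len(draws)     # applied once, after the accumulation
    acc = 0.0
    for z in draws:
        e = math.exp(drift + vsqrt_t * z)
        if e > k1:
            acc += e - k1
    return acc * k2


rng = random.Random(1)
draws = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
args = (100.0, 105.0, 0.03, 0.2, math.exp(-0.05), draws)
direct, rearranged = mcop_direct(*args), mcop_rearranged(*args)
```

The rearranged loop contains one exponential, one comparison, one subtraction and one accumulation per sample, which is what each parallel path has to sustain.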

The most critical FP-unit is the accumulator (at the end of each path). The FP-accumulator is designed to sustain a throughput of one result (per path) per clock cycle. Details on the implementation of the FP-units can be found in [9].

For the MCOP accelerator the FEP provides the parameters: \(n, \sqrt{T}, \text{drift}, k_1\) and \(k_2\). The DRAM is not used.

The ASP is pipelined in 29 stages and clocked at a frequency of 80 MHz for a latency of 29 × 12.5 ns = 363 ns, and an ideal throughput of 22 M elements processed per second.

\(^1\)When the ASP is not slowed down by the I/O.

\(^2\)In the 2008 revision of the IEEE 754 standard, \textit{binary32} replaces the wording \textit{single-precision}.

VI. PERFORMANCE AND POWER MEASUREMENTS

To evaluate the performance and energy consumption of the accelerators, we compared the execution of the application in software, run on the CPU of the PC hosting the FPGA board, with the execution in the accelerator (CPU and FPGA board).

We run several batches of data to derive trends and to see how the execution in the accelerator scales with the data set. Both cores of the Intel Core2 Duo CPU are set to maximum performance, corresponding to a frequency of 3 GHz, for both the software and the accelerator execution.

We performed the measurements by reading timers in the C executable (both software and accelerator executions), and by reading the distance between edges in the waveforms logged by the multimeter.

A. TELCO Benchmark

We run several batches of call data (25,000 to 1,000,000 calls) for the TELCO application.

Fig. 4 shows the plots derived from the actual measurements performed for the 1,000,000 calls case for the accelerator (CPU+FPGA average power dissipation).

For all experiments, the FPGA set-up time, labeled (1) and (2) in Fig. 4, is 1.9 s on average, while the FPGA closing time, labeled (4), is 0.5 s on average. The total overhead for operating the FPGA is about 2.4 s, independent of the data-set size.

In Table II, we list the timing measurements for the software execution (CPU SW) and for the accelerator. For the accelerator, we list separately the latency of the ASP \(t_{ASP}\), labeled (3) in Fig. 4, and the total latency for the run \(t_{acc}\). We also list the average time to process one call in the SW execution \(t_{SW}^C\), and in the ASP \(t_{ASP}^C\). Moreover, we report the speed-up of the execution of the whole run \(t_{SW}^C/t_{acc}\), and the speed-up of the calculation per call \(t_{SW}^C/t_{ASP}^C\).

Table II shows that for 250,000 calls the software and accelerator execution have the same latency. For a smaller number of calls, the FPGA set-up overhead makes the accelerator execution much slower than the software solution.
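The break-even point can be reasoned about with a simple timing model, sketched below under the assumption that the accelerator run time is the fixed 2.4 s FPGA overhead plus a constant time per call. The function names are hypothetical, and the two per-call times in the usage lines are placeholders chosen so that the break-even falls at 250,000 calls; they are not measured values.

```python
FPGA_OVERHEAD_S = 2.4  # set-up plus closing time, independent of the data set


def accelerator_time(n_calls, t_asp_per_call_s, overhead_s=FPGA_OVERHEAD_S):
    """Modelled total run time on the accelerator."""
    return overhead_s + n_calls * t_asp_per_call_s


def break_even_calls(t_sw_per_call_s, t_asp_per_call_s,
                     overhead_s=FPGA_OVERHEAD_S):
    """Number of calls at which software and accelerator take the same time."""
    gain_per_call = t_sw_per_call_s - t_asp_per_call_s
    if gain_per_call <= 0:
        return None  # the accelerator never catches up
    return overhead_s / gain_per_call


# Placeholder per-call times (seconds), for illustration only.
n_star = break_even_calls(t_sw_per_call_s=10.34e-6, t_asp_per_call_s=0.74e-6)
```

Below the break-even size the fixed overhead dominates, so halving the set-up time would halve the data-set size at which the accelerator starts to pay off.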

By looking at the time per call in the ASP, we notice that for 50,000 calls and above, \(t_{ASP}^C\) settles between 0.72 and 0.76 \(\mu\text{s}\). Considering that ideally the ASP can process a call per clock cycle (14.3 ns), the I/O, and not the ASP, sets the throughput of the accelerator. As anticipated, the application execution is “I/O bound”: about 95% of $t_{ASP}$ is spent in I/O.

[Table II: timing measurements for the TELCO benchmark. Columns: number of calls, \(t_{SW}^C\), \(t_{ASP}^C\), \(t_{acc}\), \(t_{SW}^C/t_{acc}\), \(t_{SW}^C/t_{ASP}^C\) (values not recoverable).]

In Table II, we also list the energy consumption measured in the experiments. We show a visual comparison in Fig. 5 for the 1,000,000 calls run. The table and the figure are scaled to have power $P = 0$ when the FPGA is idle and the CPU is not running the application. The idle power of the FPGA board and of the CPU is 9.5 W and 4.0 W, respectively.

Despite the large FPGA set-up overhead, for data sets of 250,000 calls and above, the accelerator is more energy efficient.

B. MCOP Simulation

For the Monte Carlo Option Pricing simulation, we run two batches of data: a small set of $n = 100,000$ random values (called elements in the following) and a large set of $n = 256 \times 10^6$ (256M) elements.

Fig. 6 shows the plots for the 256M elements simulation. Unlike the TELCO case, the DRAM on the FPGA board is not used and the phase labeled (2) is skipped. Because the ASP is larger than in the TELCO case, the programming phase (1) is longer for this processor: 1.6 s on average. The closing time (4) is about 0.5 s on average.

In Table III, we list the timing measurements and the energy consumption for the software execution (CPU SW) and for the accelerator. We list the average time to process one element in the SW execution $t_{SW}$, and in the ASP $t_{ASP}$. Moreover, we report the speed-ups and energy ratios.

Similarly to the TELCO case, for a relatively small number of elements the MCOP execution in the accelerator is totally inefficient as the software execution time is very small compared to the FPGA programming time. For a large number of elements, when millions of floating-point operations must be executed, the parallelism of the ASP and the throughput of 8 elements processed per clock cycle make the accelerator execution significantly faster.

Moreover, as the average power dissipation in the software execution is similar to that in the accelerator execution, lower latency results in lower energy.

TABLE III
MCOP: LATENCY AND ENERGY CONSUMPTION FOR EXPERIMENTS WITH THE TWO DATA SETS.

<table>
<thead>
<tr>
<th></th>
<th>CPU SW</th>
<th>Accelerator</th>
<th>(CPU+FPGA)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>$E_{SW}$ [J]</td>
<td>$P_{ave}$ [W]</td>
<td>$E_{acc}$ [J]</td>
</tr>
<tr>
<td>n</td>
<td>100,000</td>
<td>256M</td>
<td>100,000</td>
</tr>
<tr>
<td></td>
<td>0.54</td>
<td>1013</td>
<td>43.08</td>
</tr>
<tr>
<td></td>
<td>13.58</td>
<td>14.27</td>
<td>13.15</td>
</tr>
</tbody>
</table>

$P_{ave} = E_{run}/t_{run}$.

VII. CONCLUSIONS AND FUTURE WORK

The experimental results of Table II and Table III show that for a sufficiently large number of elements to process, the accelerator execution is faster and more energy efficient. For the TELCO benchmark, from data sets of 500,000 calls, both computation speed-up and energy savings increase almost linearly with the set size.

However, the FPGA set-up and closing times constitute a large overhead and, unless this time is reduced by modifying some of the SDK functions, the software solution is preferable for smaller data sets.

The main drawback is the much longer development time necessary for the accelerator. However, the design of a more advanced front-end processor can drastically reduce this time. The FEP should seamlessly interface the accelerator system with any ASP to realize a design-and-plug ASP paradigm.

We are currently working on the optimization of the FPGA set-up and on the design of such a front-end processor.

REFERENCES