Financial Software on GPUs: Between Haskell and Fortran

Cosmin E. Oancea¹, Christian Andreetta¹, Jost Berthold¹, Alain Frisch², Fritz Henglein¹
HIPERFIT, ¹Department of Computer Science, University of Copenhagen (DIKU) and ²Lexifi
cosmin.oancea@diku.dk, christian.andreetta@diku.dk, berthold@diku.dk, alain.frisch@lexifi.com, henglein@diku.dk

1. Introduction

The financial system is facing fundamental challenges because of their complexity, interconnectedness and speed of interaction. International banking and insurance regulations increasingly focus on analyzing and reducing the systemic effects of financial institutions on the financial system as a whole. For this reason, such institutions are asked to evaluate their reliability and stability in a large number of economic scenarios, with some of the scenarios presenting critical conditions that require large scale modeling efforts. In this context, Monte Carlo simulations, originally developed by physicists to efficiently investigate the stochastic behavior of complex, multidimensional spaces, have emerged as tools of choice in critical applications like risk modeling and pricing of financial contracts. These simulations are paradigmatic Big Compute problems that transcend the domain of embarrassingly parallel problems. From a hardware architecture perspective, they require employing and effectively exploiting massive parallelism. Interesting results have been achieved by efficient management of processes on grid farms and expert use of specialized hardware such as graphic processing units (GPUs) [26]. In particular, the latter unites the advantages of parallelization, low power consumption, and low latency in data transfer to efficiently execute a large number of single instructions on multiple data (SIMD). This kind of massively parallel hardware requires programming practices that differ from conventional imperative von-Neumann-machine-style programming, however.

The desirability of a programming model that supports high-level description of large-scale data transformations for modeling purposes, coupled with the need to target rapidly evolving massively parallel hardware architectures without letting these infiltrate the programs themselves has led us to concentrate on the well-established practices of functional programming. Functional languages are renowned for their good modularity, testability and code reuse [24], which drastically improves maintainability and transparency – crucial properties in areas where the success of a company depends on the correctness and reliability of its software. Furthermore, the purity of functional languages largely facilitates reasoning about the inherent parallelism of an algorithm, and effective parallelizations exist for common higher-order functions [22].

Functional languages are increasingly employed in financial institutions for modeling and high-productivity programming purposes, for instance DSLs for finance [4, 40]. Additionally, functional solutions have demonstrated their ability to exploit novel hardware, such as GPUs and FPGAs, without letting hardware specifics encroach on the programming model [12, 32]. It is the double match of functional programming with modeling in quantitative finance and with naturally expressing data parallelism that motivates our research into architecture-independent parallelization of financial code using a functional approach.

In the remainder of this section we provide a rationale for our case study and an overview of the optimization techniques evaluated. In the following sections we present the functional formulation of the pricing algorithm (Section 2), the optimizations for...
compiling it to OpenCL (Section 3), the empirical evaluation of the optimizations’ impact (Section 4), a review of related work on imperative and functional parallelization (Section 5), and finally our conclusions as to what has been accomplished so far and which future work this suggests (Section 6).

1.1 Notations

Throughout the paper, we denote by $\odot$ a binary-associative operator with neutral element $e_{0}$; \texttt{fold} $\odot$ $[a_{0}, ..., a_{n}] \equiv a_{0} \odot ... \odot a_{n}$, \texttt{scan} $\odot$ $[a_{0}, ..., a_{n}] \equiv [e_{0}, a_{0}, a_{1} \odot a_{2}, ..., a_{n-1} \odot a_{n}]$, and \texttt{map} $f$ $[a_{1}, ..., a_{n}] \equiv [f(a_{1}), ..., f(a_{n})]$. We also write (red $\odot$) as a shortcut for ($\texttt{fold}$ $\odot$ $e_{0}$). We use common-helper functions (i) $\texttt{dist} :: [a] \rightarrow [[a]]$ to split the input list into a list of $p$ lists of nearly equal lengths, and (ii) $\texttt{tile} :: [a] \rightarrow [a]$ to chunk the list into a list of lists containing each roughly $\epsilon$ elements.

1.2 Bird’s Eye View

While speeding up the runtime of financial software by hand-parallelizing the code for GPU execution is in itself of pragmatic importance, this paper takes a broader view, in which we use the gained insights to evaluate the language and compiler infrastructure needed to automate the process. The main objectives are twofold:

\textbf{Language}. We take the perspective that the language should provide what is necessary for the user (i) to express algorithmic invariants explicitly in the language, and, in general, (ii) to write an implementation that comes as close as possible to the “pure” algorithmic form. If the algorithm is inherently parallel, then we expect the implementation to preserve this property. In this sense, without having parallelism in mind, we have written a sequential, functional (Haskell) version of the generic-pricing algorithm to provide a baseline for comparison against the original imperative (C) code.

Not surprisingly, we find that the functional style, with better support for mathematical abstraction, makes parallelism (almost) explicit by means of higher-order functions such as $\texttt{map}$, $\texttt{fold}$ and $\texttt{scan}$ (i.e., $\texttt{do-all}$, reduction and prefix sum). On the other hand, imperative, production code is often optimized for sequential execution but obfuscates the inherent algorithmic parallelism to an extent that makes it difficult to recognize for both programmer and compiler. The latter was observed not only on our case study, but extent that makes it difficult to recognize for both programmer and compiler.

\textbf{Performance}. While we have argued that algorithmic clarity should come first, we also take the view that this should not be achieved by compromising performance. The second objective of this paper, outlined in Section 1.4, is to explore the compiler optimizations that have proved most effective for our case study, although they have been implemented by hand: \textit{First}, we present evidence of how user-specified invariants can drive powerful high-level optimizations (e.g. strength reduction). \textit{Second}, we reveal a rich optimization space that exhibits non-trivial cost models, which are best left in the care of the compiler. \textit{Third}, we discuss several lower-level, GPU-related optimizations that have to be the compiler’s responsibility if we require hardware transparency (i.e. write once - run anywhere).

1.3 Language Perspective

Figure 1 presents two semantically-equivalent functions, written in Fortran77 and Haskell, which are our instances of imperative and functional languages, respectively. The example is telling in that it combines several interesting coding patterns that appear in implementations of Sobol quasi-random sequences [10], and contrasted in that it does not produce random numbers.

\textbf{Haskell Code}. Let us examine first the \texttt{1body} function at lines 22 – 26: Indexes in $0..m-1$ are filtered based on the \texttt{test} predicate, e.g., testing whether index $k \in [0..m-1]$ in the bit-representation of $i$ is set. Next, (i) the $\texttt{xorV}$ function reduces the elements corresponding to the filtered indexes of a $\texttt{dirVs}$’s row with the $\texttt{xor}$ operator (i.e., $\texttt{fold}$ at line 25), and (ii) this is applied to each row of $\texttt{dirVs}$, i.e., the $\texttt{map}$ at line 26. The result of $\texttt{1body}$ is thus a list of the same length (denoted \texttt{d}) as $\texttt{dirVs}$.

The rest of \texttt{example}’s implementation is straightforward: (i) at line 27 \texttt{1body} is mapped to each integer in $[1..n]$, resulting in a list representation of a $n \times d$ matrix, named \texttt{ret}, and finally (ii) prefix-sum with operator $\texttt{xor}$ is applied to aggregate the elements in the same position in each row of \texttt{ret}, i.e., the $\texttt{scan}$ at line 29.

One can observe that parallelism is made (almost) explicit in the implementation by the sequence of $\texttt{map}$ and $\texttt{scan}$ at lines 27 and 29. The latter has depth $\log(n)$, while the former is embarrassingly parallel and exhibits nested\textsuperscript{1} parallelism that could be further optimized via flattening [8, 11].

\textbf{Fortran Code}. Examining the Fortran code, an experienced imperative programmer might recognize that (i) the $\texttt{do k}$ loop at lines 6 – 10 implements the filtering of indexes based on the $\texttt{test}$ predicate, and (ii) the $\texttt{do k}$ loop at lines 13 – 15 corresponds to the $\texttt{fold}$ at line 25. (Note that Fortran uses column-major arrays). The outermost loop and the $\texttt{do} j$ loop at lines 11 – 18 (minus line 17) correspond to the Haskell maps at lines 27 and 26, which compute the result array \texttt{ret}. The code is arguably less obvious than the one in Haskell, due to the lack of higher-order functions such as $\texttt{filter}$, $\texttt{fold}$, and due to the explicit array indexing.

However, even the experienced programmer might have difficulties understanding that in fact, line 17 implements a prefix-sum computation, i.e., the $\texttt{scan}$ at line 29. While the destructive update

\begin{figure}[h]
\centering
\includegraphics[width=\textwidth]{figure1.png}
\caption{Contrived, but illustrative example: Fortran77 vs Haskell}
\end{figure}

1\textsuperscript{1} Since $\texttt{1body}$ is in itself a \texttt{map}, line 27 exhibits the composition of two \texttt{map}, which, if merged, would improve the parallelism degree from $n \to n \times d$. 

\textcopyright ACM, 2012. This is the authors version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in \textit{FHPF’12}. Sep.2012. http://doi.acm.org/10.1145/2364474.2364484
to \( \text{ret}(j,i) \) optimizes\(^2\) the sequential execution time, we note that, at least to some degree, it affects readability.

There are two main impediments to proving parallelism for the outermost loop do \(i\). The first issue refers to array \(ia\): the algorithm’s logic is that each iteration \(i\) works with its own (independent) set of filtered indexes, i.e., \(ia\) should be logically declared/allocated inside the loop. The implementation optimizes the sequential case by promoting \(ia\)’s declaration outside the loop.

However, this results in bogo cross-iteration read-after-write (RAW), write-after-read (WAR) and write-after-write (WAW) dependencies. To enable parallelism, one has to prove the validity of the reverse transformation, known as privatization, which reduces to proving that every read from \(ia\) is covered by a write to \(ia\) from the same iteration. A programmer might observe that loop do \(k\) at line 13 iterates precisely on the set of values computed by loop do \(k\) at line 6. However, most compiler solutions [21, 42] cannot establish this property, as their dependency analysis is restricted to cases where the array subscript can be expressed as a closed-form, typically affine, formula in the loop indexes. In our case, the conditional increment of \(laren\) at line 8 does not satisfy this requirement.

The second issue is even more discouraging: the prefix-sum pattern of line 17 appears as a cross-iteration dependency of constant distance 1, which forms a dependency cycle that cannot be easily broken. Furthermore, prefix-sum can be written imperatively in a number of ways, and we are not aware of compiler technique that would effectively parallelize this pattern. In contrast, parallel reduction is effectively supported by pattern-matching techniques [31].

1.4 Performance Perspective

The previous section hinted that it is significantly more difficult to uncover parallelism from an imperative program than it is to optimize a nearly-parallel functional version via imperative-like optimizations. This section outlines several such optimizations.

Space-Reuse of Map-Reduce Functions. It is well known that \((\text{red } \ominus) \mathit{.} \mathit{(map} \ f)\) can be formally transformed, via list homomorphism (LH) promotion lemma [6], to its equivalent form:

\[
(\text{red } \ominus) \mathit{.} \mathit{(map} \ f) \equiv (\text{red } \ominus) \mathit{.} \mathit{(map} \ (\text{red } \ominus) \mathit{.} \mathit{(map} \ f)) \mathit{.} \mathit{dist}_{p}\tag{1}
\]

i.e., the input list is split into number-of-processor lists, on which each processor performs the original computation (sequentially), and finally, the local results are reduced in parallel across processors. Note that the map-reducer in the middle does not need to instantiate the list result of \(\mathit{map} \ f\), i.e., destructive updates can be used to accumulate each output of \(f\) to the local result. This requires a total memory space proportional to \(p\) rather than \(N\) (the list’s length).

The latter form is typically preferred on massively parallel systems to optimize the communication cost, while the former is preferred on SIMD (vector) systems, which typically exhibit a rather uniform memory and very limited per-processor resources. GPUs are, in a sense, a mix of both: a GPU is pseudo-SIMD, but features a non-homogeneous memory, in which the local memory close to the core is several orders of magnitude faster than the global one. We identify an interesting trade-off: if the result of applying \(f\) fits in the fast (local) memory, then the application becomes compute-bound rather than memory-bound. The downside is that increasing the per-core resources decreases the parallelism degree of the system, and, as such, its effectiveness at hiding various kinds of latencies. Section 3.2 explores this trade-off in detail.

Strength Reduction is a transformation that replaces an expensive operation (*) with a recurrence that uses a cheaper operation (+). In the code snippet below, \(k0 \times 2 \times (i-1)\) has been replaced with the cheaper recurrence \(k = k0 \times 2\). The inverse transformation, induction

\[\mathit{enddo} \mathit{Ind. Var. Subst.} \ k = k0 + \mathit{MAX}(2*N,0)\]

variable substitution, replaces the recurrence with a closed-form formula in the loop index, and thereby enables parallelism extraction: \(i\) it eliminates the cross-iteration RAW dependency on \(k\) and allows the compiler to disprove cross-iteration WAW dependencies on array \(A\), i.e., \(k0 \times 2 \times (i-1) = k0 \times 2 \times (i2) \Rightarrow i1 = i2\).

\[k = k0\]

\[\text{do} \ i = 1, N \quad \text{Strength Red.} \quad \text{doall} \ i = 1, N\]

\[A[k] = . . \quad \text{------------} \quad A[k0 \times 2 \times (i-1)] = . .\]

\[k = k + 2 \quad \text{------------} \quad \mathit{enddo}\]

\[\text{Ind. Var. Subst.} \quad k = k0 + \mathit{MAX}(2*N,0)\]

Comilers typically support simple algebras that, for example, allow replacing multiplication/exponentiation with recurrent addition/multiplication formulas in the sequential case, and the reverse for the parallel case. Section 2.2 shows a more complex example of strength reduction and advocates that such invariants should be captured at language level, since they reveal a nontrivial and impactful optimization space, which is explored in Section 3.3.

Branch Divergence. Consider the target code \(\mathit{map} \ f\), where \(f i\) if(\(\text{test} i\)) then \(f1 i\) else \(f2 i\). When evaluating the parallel application of \(\mathit{map} \ f\) to all elements of an array on a symmetric multiprocessor (SMP), an asymptotic worst-case time-cost estimate is \(C(\mathit{map} \ f) = C(\text{test}) + M.A.X(C(f1), C(f2))\).

In contrast, when the code runs on a SIMD machine, in case the if branch diverges for at least one core (one element of the array), the runtime effectively corresponds to all cores executing both branches, i.e., \(C(\mathit{map} \ f) = C(\text{test}) + C(f1) + C(f2)\).

Our solution is to tile the computation via LH promotion lemma:

\[\mathit{map} \ f \equiv (\text{red } \ominus) . \mathit{(map} \ f) . \mathit{.} \mathit{tile}_{2}\]

where \(\mathit{map} \ f\) in the middle is intended to be executed sequentially, and to replace it (\(\mathit{map} \ f\)) with a semantics-preserving, efficient, imperative code that permutes map’s iteration space \(1 \ldots N\) such that the elements for which \(\text{test} \ succeeds\) are computed via \(f1\) before the ones for which \(\text{test} \ fails\) (via \(f2\)).

To see how this approach optimizes divergence, consider the case in which two GPU cores process tiled lists [1, 2, 2] and [4, 5, 5], where \(\text{test} \equiv \text{odd}\). Without the transformation, the two cores execute different branches for each pair of elements. With the transformation, the lists are processed in the orders [2, 2, 1] and [4, 5, 5], and only the middle elements cause branch divergence.

This technique, described in detail in Section 3.4, is complemented by copying-in and out the tiled lists to and from fast memory in order to not introduce un-coalesced accesses.

Memory Coalescing is achieved on GPU when a group of neighboring cores (e.g., 16), access in the same instruction, a contiguous chunk of memory (e.g., 64 bytes). Since the virtual memory is implemented interleaved on different memory banks, the whole chunk is brought to registers in one memory transfer (and in parallel with the accesses of all such groups of neighboring cores).

As explained in Section 3.5, one can transparently restructure arrays and indexes to enable coalesced accesses. For example, consider the code \(\mathit{map} \ (\text{red } \ominus) . [[[\mathit{Int}] \mapsto [\mathit{Int}]]\). Here, the input list represents a \(N \times 32\) matrix and \(\mathit{map} \ f\) is parallelized. In each instruction, each group of 16 cores accesses addresses 128 bytes apart from each other, which requires 16 memory transfers, resulting in inefficient bandwidth utilization. For example, if 16 divides \(N\), the layout can be changed to a three-dimensional array \(N/16 \times 32 \times 16\), and an access to \(x\) and \(y\) columns is mapped to index \((x \div 16, y, x \mod 16)\), achieving coalesced accesses.

1.5 Main Contributions

We consider the following main contributions of this paper:

- A side-by-side comparison of functional vs imperative code patterns that provides evidence that parallelism is easier to recog-
nize in the former style, while the latter style often requires the compiler to reverse-engineer sequentially-optimized code,

- Four optimizations that (i) take advantage of the map-reduce functional style to derive simple yet powerful imperative-style program transformations, and (ii) seem well-suited for integration into the repertoire of a GPU-optimizing compiler,

- An empirical evaluation on a real-world financial kernel that demonstrates (hints) that (i) the proposed optimizations have significant impact, and that (ii) the rich trade-off space is effectively exploited by the simple (proposed) cost models,

- From a pragmatic perspective, we show speedups as high as $70 \times$ and on average $43 \times$ against the sequential CPU execution on a mobile GPGPU, and $\sim 8 \times$ that on a mid-range GPGPU.

2. Generic Pricing Algorithm and Invariants

Section 2.1 provides the algorithmic background of our generic-pricing software, outlining the Monte Carlo method used and its salient configuration data. We then illustrate how the computational steps in the algorithm translate to a composition of functional basic blocks that expose the inherent parallelism of the algorithm as instances of well-known higher-order functions. We refer the interested reader to Hull (2009) [25] and Glasserman (2004) [18] for a more detailed description of the financial model and the employment of Monte Carlo methods in finance, respectively.

Section 2.2 advocates the need to express high-level invariants at language level: In the context of the Sobol quasi-random-number generator, we identify a strength-reduction pattern and demonstrate that (i) its specification can trigger important performance gains, but (ii) the latter should be compiler’s responsibility.

2.1 A Generic Pricing Kernel for Liquid Markets

Financial Semantics. Financial institutions play a major role in providing stability to economic activities by reallocating capital across economic sectors. Such crucial function is performed by insuring and re-balancing risks deriving from foreseeable future scenarios, quantified by means of (i) a probabilistic description of these (yet unknown) scenarios, and (ii) a method to evaluate at present time their economic impact. Risk management is then performed by allocating capital according to the foreseen value of the available opportunities for investment, while at the same time insuring against outcomes that would invalidate the strategy itself.

Option contracts are among the most common instruments exchanged between two financial actors in this respect. They are typically formulated in terms of trigger conditions on market events, mathematical dependences over a set of assets (underlyings) of the contract, and a set of exercise dates, at which the insuring actor will reward the option holder with a payoff depending on the temporal evolution of the underlyings. A vanilla European call option is an example of contract with one exercise date, where the payoff will be the difference, if positive, between the value of the single underlying at exercise date and a threshold (strike) set at contract issuing. Options with multiple exercise dates may also force the holder to exercise the contract before maturity, in case the underlyings crossed specific barrier levels before one of the exercise dates.

Three option contracts have been used to test our pricing engine: a European vanilla option over a market index, a discrete barrier option over multiple underlyings where a fixed payoff is function of the trigger date, and a barrier option monitored daily with payoff conditioned on the barrier event and the market values of the underlyings at exercise time. The underlyings of the latter contracts are the area indexes Dj Euro Stox 50, Nikkei 225, and S&P 500, while the European option is based on the sole Dj Euro Stox 50.

```haskell
nc_pricing n = sum gains / fromIntegral n
where c = init_pricing n

gains = map (payoff c) . black_scholes c . brownian_bridge c . gaussian c . sobolInd c
\[ \text{init} \to [0,1]^n \to \mathbb{R}^d \\]
```

Figure 2. Functional Basic Blocks and Types

The number of monitored dates is one for the European case, and 5 and 367 for the two barrier contracts, respectively.

Two key components are necessary for the appreciation at current time of the future value of these contracts: (i) a stochastic description of the underlyings, allowing to explore the space of possible trigger events and payoff values, and (ii) a technique to efficiently estimate the expected payoff by aggregating over the stochastic exploration. The kernel studied here uses the quasi-random population Monte Carlo method [18] for the latter. Samples are initially drawn from an equi-probable, homogeneous distribution, and are later mapped to the probability distributions chosen to model the underlyings. Since these exhibit very good liquidity and present no discontinuities, with good approximation they can be independently modeled as continuous stochastic processes following Normal distributions (Brownian motions) [7]. Further, these stochastic processes express an intrinsic regular behavior that can allow sampling of the value of the underlyings at the sole exercise dates. As a final step, correlation between the underlyings is imposed via Cholesky composition, by means of a positive-definite correlation matrix $L$ provided as input parameter.

The stochastic exploration proceeds as following: first, the Sobol multidimensional quasi-random generator [10] draws samples from a homogeneous coverage of the sampling space. Then, these samples are mapped to Normally distributed values (by quantile probability inversion [48]), which model the value of each underlying at the exercise dates. A Brownian Bridge scales these samples to ensure conservation of the properties of the stochastic processes also in non-observed dates, preserving modeling consistency [25]. These samples, once again scaled to express the expected correlations among the underlyings, now mimic a market scenario. They can therefore be provided as input to the payoff function, which returns the future gain from the contract given this particular scenario. This procedure is repeated for a large number of initial random samples, and an average gain calculated, which estimates the future payoff in its first order statistics. Finally, the impact of this aggregated future payoff at present time is estimated using a suitable discount model [25].

Functional Composition. Figure 2 shows how these algorithmic steps directly translate into composition of essential functions. The first step (given last in the function composition) is to generate independent pseudo-random numbers for all underlyings, $n$, and dates, $d$, by means of the Sobol’s quasi-random number algorithm (function sobolInd). This method is known to provide homogenous coverage of the sampling space, and thus to a stochastically efficient exploration of such space with a relatively low number of samples [18]. Additionally, it exhibits a strength-reduction invariant that enables an efficient parallel implementation providing identical semantics to its sequential algorithm. The uniform samples are then mapped to Normally distributed values by quantile probability inversion (function gaussian).

The next step, brownian_bridge, maps the list of random numbers to Brownian bridge samples of dimension $u \cdot d$. This step induces a dependency between the columns of the samples matrix, i.e. in the date dimension $d$. In the following function, black_scholes, the underlyings, stored in the rows of the matrix,
are mapped to their individual stochastic process and correlated by Cholesky composition, inducing a dependency in the row dimension. Finally, the payoff function computes the payoff from the monitored values of the underlyings.

The outer Monte Carlo level, expressed by the map function, repeats this procedure for each of the input samples, and first-order statistics are collected by averaging over the payoff values.

From a developer perspective, this functional outline allows to fully appreciate the composition possibilities and the reusability of this solution. In fact, the only function strictly dependent on the contract type is the payoff function, while all the other modules can be freely employed to price options having underlyings modeled with similar stochastic processes. Furthermore, Figure 2 evidences the inherent possibilities for parallelism: distribution (map) and reduction (sum) are immediately evident, and the functional purity allows to easily reason about partitioning work and dependencies.

The Haskell code shown here has in fact been written as a prototype for reasoning about potential parallelization strategies for a C+GPU version; while at the same time providing the basis for an optimized Haskell version for multicore platforms.

2.2 Algorithmic Invariants: Sobol sequences

Algorithm. A Sobol sequence [10] is an example of a quasi-random or low-discrepancy sequence of values \( \{z_0, x_1, \ldots, x_n, \ldots\} \) from the unit hypercube \([0, 1)^s\). Intuitively this means that any prefix of the sequence is guaranteed to contain a representative number of values from any hyperbox \( \prod_{j=1}^{s} [a_j, b_j) \), so the prefixes of the sequence can be used as successively better representative uniform samples of the unit hypercube. Sobol sequences achieve a discrepancy of \( O \left( \frac{\log^s n}{n} \right) \), which means that there is a constant \( c \) (which may depend on \( s \), but not \( n \)) such that for all \( 0 \leq a_j < b_j \leq 1 \):

\[
\left| \{ x_i \mid x_i \in \prod_{j=1}^{s} [a_j, b_j) \land i < n \} - n \sum_{j=1}^{s} (b_j - a_j) \right| \leq c \log^s n
\]

Let us denote the canonical bit representation of non-negative integer \( n \) by \( B(n) \), with \( B^{-1} \) mapping bit sequences back to numbers. The algorithm for computing a Sobol sequence for some degree \( d \) over the Galois Field \( GF(2) \), with \( a_0 \neq 0, a_d \neq 0 \). The second step is to compute a number of direction vectors \( m_k \) via a recurrent formula that uses \( P \)'s coefficients:

\[
m_k = (\bigoplus_{i=1}^{k} a_{d-i} m_{k-i}) \oplus 2^d m_{k-d}
\]

for \( k \geq d \), where \( m \oplus n = B^{-1}(B(m) \text{xor} B(n)) \) and \( \text{xor} \) denotes the exclusive-or on bit sequences. The values of \( m_i \) for \( 0 \leq i < d \) can be chosen freely such that \( 2^0 \leq m_i < 2^{i+1} \). In the third step, we compute Sobol proxies via the independent (as opposed to recurrent) formula

\[
x_i = \bigoplus_{j=0}^{\infty} B(i)_j m_j
\]

where \( B(i)_j \) denotes the \( j \)-th bit of \( B(i) \). (The 0-th bit is the least significant bit.) Finally, reading the binary representation of Sobol proxies as a fixed point number yields the Sobol number \( x_i \):

\[
x_i = \sum_{j=0}^{\infty} B(i)_j 2^{-j-1}
\]

Instead of using \( B(n) \) in the definition of Sobol proxies we can use the reflected binary Gray code \( G(n) \), which can be computed by taking the exclusive or of \( n \) with itself shifted one bit to the right:

\[\text{reflected gray code} \quad x_i = B(i) \oplus B(i) \ll 1\]

---

2 The nomenclature is misleading since a quasi-random sequence is neither random nor pseudo-random: It makes no claim of being hard to predict.

---

3 Our code actually computes the integers corresponding to the reverse bit representation of Sobol proxies. Functions involved in pricing.

---

Figure 3. Sobol Generator: Independent vs Recurrent Formulas.

\[
\text{REAL zd}(u,d), \text{wf}(u,d)
\]

DO \( i = 1, N \ldots \)

DO \( n = 1, u \)

\( \text{wf}(n, \text{bb}_{bi}(j)-1) = \text{bb}_{sd}(1) \times \text{zd}(n, 1) \);

DO \( j = 2, d \)

\( \text{wk} = \text{wf}(n, \text{bb}_{ri}(j)-1) \);

\( \text{zi} = \text{zd}(n, j) \);

\( \text{wf}(n, \text{bb}_{bi}(j)-1) = \text{bb}_{ru}(j) \times \text{wk} + \text{bb}_{sd}(j) \times \text{zi} \)

IF (\( \text{bb}_{li}(j) - 1 \) NE -1)

\( \text{wf}(n, \text{bb}_{bi}(j)-1) += \text{bb}_{lv}(j) \times \text{wf}(n, \text{bb}_{li}(j)-1) \)

ENDDO

\[\ldots\] res = res + wf(\ldots) \ldots \]

ENDDO

Figure 4. Brownian-Bridge Code Snippet

\[
G(n) = B(n)_j \oplus B(n)_{j+1}. \quad \text{This changes the sequence of numbers produced, but does not affect their asymptotic discrepancy. It enables the following recurrence formula for Sobol proxies:}
\]

\[
x_{n+1} = x_n \oplus m_c
\]

where \( c \) is the position of the least significant zero bit in \( B(n) \).

A Sobol sequence for \( s \)-dimensional values can be constructed by \( s \)-ary zipping of Sobol sequences for \( 1 \)-dimensional values.

Invariants. Figure 3 shows the essential parts of our Haskell implementation for \( s \)-dimensional quasi-random Sobol proxies.\footnote{Our code actually computes the integers corresponding to the reverse bit representation of Sobol proxies. Functions involved in pricing.}

The function \texttt{sobolInd} implements the independent formula with the optimization that \( n \)'s bits set to one are filtered and the result is reduced via \texttt{xor}. The recurrent formula is implemented by \texttt{sobolRec}: the least significant zero bit is used to select the set of direction vectors (\texttt{dirVs}) that are zorred with the corresponding entries of the previous vector (\texttt{zipWith xor prev}).

Section 1.4 has outlined an example of strength reduction, in which a repeated multiplication was replaced via a computationally cheaper, plus-recurrence formula. We observe that \texttt{sobolInd} and \texttt{sobolRec} match the strength reduction pattern: Computing the first \( n \) vectors via \texttt{sobolInd} is embarrassingly parallel, i.e., \texttt{map} in Figure 3, while the strength-reduced \texttt{sobolRec} is significantly cheaper but requires a \( \log n \)-depth algorithm (\texttt{scan}).

The imperative Sobol code, not presented here, exhibits the patterns discussed in Section 1.3 that would preclude parallelism discovery. Another illustrative example corresponds to the Brownian-bridge implementation, shown in Figure 4: each iteration \( i \) reuses the space of array \( \texttt{wf} \) and accumulates the result in \( \texttt{res} \). This space-saving technique, together with the indirect indexing makes it very difficult to prove that each read from \( \texttt{wf} \) in the same iteration \( i \), i.e., the loop \( \texttt{do i} \) can be parallelized by privatizing array \( \texttt{wf} \). The functional
style would likely expand array \( \text{vt} \) with an outermost dimension of size \( N \), and express the loop as an easily-parallelizable map-reduce pattern, in which map’s function is given by the do in loop.

**Discussion.** This paper takes the perspective that the compiler should be the depositary of the knowledge of how best to optimize a program, while the user should primarily focus on the algorithmic invariants that (i) are typically beyond the compiler’s analytical abilities and (ii) would enable the application of such optimizations. There are several reasons that support this view:

First, specifying such invariants requires minimal effort, e.g., \( \text{sobolInd} (i+1) \equiv \text{sobolRec} \left( \text{sobolInd} \cdot i \right) \) documents the strength reduction invariant: the independent formula can be described via a recurrence.

Second, the optimization strategy is often hardware-dependent, hence it is impossible for the user to write an optimal hardware-agnostic program. For instance, \( \text{scan sobolRec} \) is well suited to the sequential case, while map \( \text{sobolInd} \) can be better on a massively parallel machine that exhibits high communication costs.

Finally, program-level transformations are often nontrivial, and at least tedious even for the experienced user to do by hand: e.g., Section 3.3 presents how to optimize both the parallelism depth and time overhead: the computation is tiled via a factor \( t \), where the tile amortizes the cost of one \( \text{sobolInd} \) over \( t-1 \) (fast) executions of \( \text{sobolRec} \). Another good example is flattening [8].

### 3. Optimizations

This section describes in detail several compiler optimizations that had a strong impact on the pricing algorithm, and that we believe are likely to prove effective in a general context. Sections 3.2 and 3.3 describe optimizations and trade-offs related to exploiting coarse-grained parallelism and strength-reduction invariants. These are high-level transformations demonstrated using functional code snippets. Sections 3.4 and 3.5 present lower-level optimizations, related to branch divergence and memory coalescing, that are demonstrated on a Fortran intermediate representation.

#### 3.1 Language Assumptions

Throughout the paper, we use Haskell to illustrate the functional programming style, but disregard laziness issues and use lists instead of performance-oriented special types like vectors or arrays for the sake of clarity.

When discussing the imperative programming model, we use Fortran77 uniformly, because: (i) it accurately illustrates the original C code of the pricing algorithm, and, if anything, (ii) it eliminates the maybe-aliasing issue, which is a major hindrance to automatic parallelization. Furthermore, (iii) a vast amount of work in autoparallelization targets Fortran77. As a fourth point, Fortran77 code resembles the GPU API OpenCL which we use, in that it supports neither recurrence nor dynamic allocation (static arrays only).

Another aspect to be taken into account when discussing optimizations is data locality and thread grouping on GPUs. A GPU operates in thread blocks, and threads are grouped to SIMD groups (so-called warps) executed on one SIMD unit comprising multiple cores. To simplify our argument, we consider that each SIMD unit comprises 32 hardware cores. Technically this is not correct, as a warp resides on only 8 cores, which execute four-cycle instructions and need four threads to amortize the cost, but the analogy is valid for the points we are making. A block of size \( B \) yields \( B/32 \) hardware threads per core, which we call “virtual cores”.

#### 3.2 Vectorized vs Coarse-Grained parallelism

Section 1.4 has outlined the tradeoff related to selecting one of (at least) two possible implementations of a map-reduce computation. Figure 5 illustrates these two choices in the context of the generic-pricing algorithm shown in Figure 2.

The **vectorized version** distributes the outer map across each of the basic-block kernels, and reduces the result vector in parallel via the plus operator. (This transformation is the inverse of fusion and is known as loop distribution in the imperative context.)

On GPU, vectorization exhibits the advantage that each kernel needs fewer resources per virtual core than the fused version. This potentially increases the parallelism degree, which can be used for hiding latencies. In addition, vectorization enables each kernel to be further optimized, e.g., the gaussian kernel applies function map gaussian_elem, hence map gaussian exhibits nested parallelism that can be flattened to increase the degree of parallelism.

The downside is that the memory complexity is nonoptimal, i.e., proportional to \( n \), because all intermediate vectors need to be instantiated. It follows that \( \text{vt1}\ldots\text{vt5} \) have to be allocated in global storage, which is several order of magnitude slower than the local memory. (The superior parallelism degree hides to a certain level, but typically does not eliminate memory latency, i.e., spawning more computation may stress too much the memory system.)

The **coarse-grained version** is obtained via the transformation:

\[
\text{map } \circ \text{map } f \equiv \text{map } (\text{map } (\text{map } (\text{map } (\text{map } f))).dist, \text{dist}),
\]

that distributes the input list among processors, performs the original computation \( \text{map } f \) sequentially on all processors and post-reduces the local results in parallel. Space consumption is optimized via privatization: \( \text{vt1}\ldots\text{vt5} \) are allocated per virtual-core, and memory is reused via destructive updates for both the privatized variables and the (accumulated) result \( \text{res} \). Note that (i) \( \text{res} \) needs to be replicated for each sub-list before a final (parallel) reduction, and (ii) the iteration scheduling policy, i.e., the list distribution, is omitted in Figure 5, since it is handled automatically by GPU’s programming interface (OpenCL compiler).

The main advantage of the coarse-grained version is that the memory consumption is (asymptotically) optimal: its size is proportional to the number of virtual cores rather than to the data size. When all local variables fit in fast memory this leads to a computational, rather than memory-bound behavior (our example eliminates global-memory latency by encoding the input list via an affine formula on the loop index; this is not always possible). The downside is that (i) it requires more per-virtual-core resources than vectorization, hence exhibits a lower parallelism degree, and (ii) it is not applicable when the local resources do not fit in fast memory.

The **Cost Model** must be able to compute a maximum size of per-virtual-core resources, as an upper limit from which the benefits of using local memory are eliminated by the reduced parallelism degree failing to optimize other kinds of latency (e.g., cache and instruction latencies, register dependencies). An accurate model is difficult to implement because latencies are in general both program and data sensitive, e.g., global-memory latency depends on whether memory accesses are coalesced. In principle, this could be addressed via machine-learning and/or profile-guided techniques, but that study is beyond the scope of this paper.

---

**Figure 5.** Vectorized (Haskell) vs Coarse parallelism (Fortran)
A simple heuristic is to define the cutoff point by computing the per-virtual-core resources associated to a reasonably-minimal concurrency ratio $CR_{\text{min}}$. Since the technique eliminates the global-memory latency, $CR_{\text{min}}$ is related to arithmetic latency, which, on our GPU hardware requires a ratio of virtual to hardware cores between 9 and 18, depending on the existence of register dependencies. We choose $CR_{\text{min}} = (9+18)/2 = 14$, and compute the associated per-virtual-core resources $R_{th} = M_{\text{fast}}^{in}/(CR_{\text{min}}^{32})$, where $M_{\text{fast}}^{in}$ and the denominator denote the fast-memory size and the number of virtual cores per multiprocessor, respectively.

Our hardware exhibits $M_{\text{fast}}^{in} = 112$ bytes, thus $R_{th} = 256$ bytes. In our example, each virtual core (iteration) requires storage for three vectors, each of (flattened) size $u \cdot d$: the first two are necessary because some kernels cannot do the computation in-place and the third is necessary to record the previous quasi-random vector. Since $R_{th}$ is the strength-reduction optimization. In addition we need about 16 integers to store various scalars, such as loop vectors.

It follows that the cutoff point is $u \cdot d = 16$, which is close to the optimal in our case, but warrants a systematic validation. The cost model is implemented via a runtime test, and we observe speedups as high as $2x$ when the coarse-grained version is selected.

### 3.4 Branch-Divergence Optimization

**Intuition.** On SIMD hardware, branches that are not taken in the same direction by all cores exhibit a runtime equivalent to each core executing both targets of the branch. This section proposes an inspector-executor approach to alleviate this overhead: (i) the (parallel) loop is tiled, then (ii) the inspector computes a permutation of the iteration space of a tile that groups iterations corresponding to the true (false) branches together, and, finally, (iii) the executor processes the tile in the new (permuted) order. As outlined in introductory Section 1.4, organizing the (sequential) execution of a tile in this way minimizes the branch divergence across different tiles, which are processed in parallel (SIMD).

Consider the Haskell code map \( f \) ginp, where \( f \) is defined as:

\[
\text{f a = if (cond a) then (fun1 a) else let m = a * a in if (cond m) then (fun2 m) else (fun3 m)}
\]

The top-left part of Figure 7 shows the Fortran version of this code, where the outer loop has been tiled, and, for simplicity we assume that \( T \) divides \( N \). The bottom-right part of Figure 7 shows the inspector, \( \text{itPerm} \), associated to one branch target. The inspector executes the slice of the original code, i.e., \( \text{cloned_code} \), that is necessary to find the direction taken by the original branch, and replaces the bodies of the branch with code that aligns the indexes of true/false iterations contiguously in the first/last part of \( \sigma \), respectively. Finally, the split index is returned. Note that the input \( \sigma \) is not required to be ordered, any input permutation of the iteration space will be transformed in a permutation that groups the

---

**Figure 6.** Sobol Generator: Independent vs Recurrent Formulas.
true and false iterations continguously, hence σ, once initialized, does not need to be reset in the program.

The bottom-left part of Figure 7 shows the transformed code: The global-memory input associated to the tile is first copied to private space inp. Then, inspector itPerm is called to compute the permutation of the iteration space. Loop DO j = 1, s1 executes the true iterations of the outer if, and a similar loop was intermediary generated for the false iterations. The latter loop was recursively transformed to disambiguate its (inner) if branch.

This corresponds to the second call to itPerm on the remaining indexes σ(a1+1..TILE), in which the cloned code refers to the square-root computation of m in the original code that is used in branch condition cond(m). Finally, the loop is distributed across the true and false iterations of the inner if, and the result is copied out to global memory. (Without the copy-in/out to and from private storage, the permutation of the iteration-space may introduce non-coalesced, global-memory accesses).

**Implementation.** We observe that the transformation is valid only on independent loops (i.e., parallel, no cross-iteration dependencies), otherwise the iteration-space permutation is not guaranteed to preserve the original program semantics.

Consider the case when an independent loop contains only one outermost if branch. To apply the transformation: First, inline the code after the if-then-else construct inside each branch, or separate that code via loop-distribution to form another loop.

Second, extract the inspector by computing the transitive closure of loop statements necessary to compute the branch condition, and by inserting the code that computes the permutation.

Third, generate the (distributed) loops corresponding to the true/false iterations by cloning the loop, replacing the if construct with the body of the true/false branch, substituting the loop index j with σ(j), and simplifying, e.g., dead-code elimination. The procedure can be repeated to optimize inner branches in the two formed loops, where each loop further refines its iteration space recorded in its corresponding (contiguous) part of σ.

**Figure 7. Branch Divergence Example**

If the independent loop contains two branches at the same level, then one can distribute the loop around the two branches and apply the procedure for each branch, i.e., map (f₁, f₂) can be rewritten as (map f₁), (map f₂), and the if branches of f₁ and f₂ can be treated individually. Furthermore, the transformation can be applied uniformly via a top-down traversal of the control-flow (and call) graph of the original loop, where each if-branch target is transformed in the context of its enclosing loop.

**Cost Model.** One can observe that optimizing branch divergence exhibits both fast-memory and instructional overhead: The memory overhead is related to the size of the tile, which typically dictates the size of the private input and output buffers, and the size of σ. Splitting the computation into an inspector-executor fashion may introduce instructional overhead because both the if condition and the if body may be data-dependent on the same statements, e.g., the statement that computes m in Figure 7. Enabling transformations, such as loop distribution, may also require either statement cloning or array expansion to fix potential data-dependencies between the two distributed loops.

To determine the profitability of this transformation, static analysis should first identify good branch candidates, i.e., if statements that exhibit high computational granularity for at least one of their true and false branches (fun1 and fun2). Then, similar to Section 6, the maximal tile size can be computed so that the associated fast-memory overhead does not significantly affect latency hiding.

To improve precision, runtime profiling can be used to measure the divergence ratio and to what degree the transformation would reduce divergence. Finally, the instructional overhead should be taken into account to determine whether this optimization is profitable for the target branch. With our case study, this optimization exhibits speedups (slowdowns) as high (low) as 1.3× (0.95×).

### 3.5 Memory-Coalescing Optimization

This section presents a transformation that fixes potential non-coalesced accesses of a map construct, such as map fun inp, where the elements of inp are arrays of similar dimensionality.

One can observe that since map hides the iteration space, any array indexing inside fun would likely be invariant to the loop that implements map. For example, in the left side of Figure 8, DOALL i corresponds to the original map, and the D0 j loop implements fun, which processes an inner array of dimension M, indexed by σ(j). Executing the i loop on CPU leads to the access pattern depicted in the left-bottom part of Figure 8, in which B cores in a SIMD-group access in one instruction elements that are 4 bytes apart from each other, where we assumed for simplicity σ ≡ id.

We fix this behavior by reshaping uniformly such arrays via transformation T(x,y) = [x/B,y,x%B], in which x and y correspond to the row and column index in the original matrix (since Fortran uses column order, we would write AR(y,x)). Since B is a power of two the new index is computed using fast arithmetic.

In essence, we have trimmed the outermost dimension and added an innermost (row) dimension of size B, the size of the SIMD
group, such that one SIMD instruction exhibits coalesced access. The top-right part of Figure 8 shows the transformed code, where we made explicit the SIMD grouping via the DQALL k loop, while the outer DOALL 1 loop expresses the parallelism among SIMD groups. The bottom-right part of Figure 8 demonstrates that after transformation B consecutive cores access contiguous locations.

We observe that this transformation is effective for arrays of any dimensions, as long as the internal indexing is map-loop invariant. For example, assuming that the Brownian-bridge code of Figure 4 is written in map-reduce style, i.e., array expansion is applied to $u$ and $z$, this transformation results in coalesced accesses for arrays $uv$ and $z$, despite the indirect indexing exhibited on the dates (d) dimension. Finally, assuming that all computational-intensive kernels are executed on GPU, it is beneficial to reshape all relevant arrays in this fashion, since the potential overheads of the CPU-executed code are in this case negligible.

We conclude by observing that this technique (i) transparently solves any uncoalesced accesses introduced by other compiler optimizations such as tiling, and (ii) yields speed-ups as high as 28×.

4. Experimental Results

Experimental Setup. We study the impact of our optimizations on two heterogenous commodity systems: a desktop and an integrated mobile solution. We compile (i) the sequential-CPU kernel with the gcc compiler versions 4.6.1 and 4.4.3, respectively, with compiler option -O3, and (ii) a very similar version of the CPU code with NVIDIA’s nvcc compiler for OpenCL version 4.2 and 4.1, respectively, with default compiler options. Reported speed-ups were averaged among 20 independent runs.

We estimate the three contracts described in Section 2.1: (i) an European option, named Simple, (ii) a discrete barrier option, named Medium, and (iii) a daily-monitored barrier option, named Complex. These contracts are written in terms of a number of underlyings, $u$, and dates, $d$: $1 \times 1, 3 \times 5$ and $3 \times 367$, respectively. This amounts to very different runtime behavior, since $u$ and $d$ dictate (i) the amount of data processed per iteration and (ii) the weight each basic-block kernel has in the overall computation.

In addition, we estimate the contracts with both single precision (SimpleF) and double precision (SimpleD) floating points. From a compute perspective this accentuates the different runtime behavior, as double is more expensive than float operations (and require twice the space). From a financial perspective we note that the results of our parallel versions are equal to the sequential one, with precision higher than 0.001%. This is a consequence of the Sobol quasi-random generator being modeled as described in Section 2.2, where the parallel implementation preserves the modulo associativity semantics exhibited by the sequential version.

Figures 9, 10 and 11 show the speed-up measurements for the described contracts under different optimization conditions. Readings for the gaming system are reported as vertical labels over plain area bars, while readings for the mobile solution are reported as horizontal, white labels over crossed regions. All histograms present error bars indicating the standard deviation of the measurements, which seem mostly affected by bus transfer delays between host system and GPU. Missing histograms in the ComplexF and ComplexD cases are due to the Complex contract model exceeding the available fast memory. The remaining of this section discusses the impact of the proposed optimizations.

Vectorization vs Coarse-Grained. One of the main optimization choices the compiler has to make is whether to employ coarse-grained parallelism over vectorization, as described in Section 3.2. Figure 9, in which the reader should ignore for the moment the SR OFF bars, demonstrates the tradeoff: the coarse-grained version on Simple contract exhibits a small $u \cdot d$ value ($1.0\cdot$double), which results in (all) data fitting well in fast memory, while still allowing a good parallelism degree. It follows that the coarse-grained SimpleF/D is significantly faster than its vectorized analog.

As the per-core fast-memory consumption, i.e., $u \cdot d$, increases, latency is less efficiently hidden: (i) MediumF/D ($u \cdot d = 15$) is very close to the cutoff point between the two versions, and (ii) ComplexF/D cannot run the coarse-grained version simply because $u \cdot d = 3 \cdot 365$ does not fit in fast memory.

We remark that the cutoff point is (surprisingly) well estimated by the simple cost model of Section 3.2, and that, albeit tested (only) on the same application, it is consistent among the two hardware. At large, the top-end hardware exhibits similar behavior but superior speedups for coarse-grained and vectorized versions. The rest of this section evaluates the impact of the other three optimizations for both the coarse-grained and vectorized code.

Strength Reduction. The SR OFF bars in Figure 9 correspond to the obtained speed-up when all but the strength-reduction optimization were used. Comparing the SR OFF bars with their left neighbor, which correspond to the fully-optimized code, one can observe speed-ups as high as $3-4\times$ for Simple’s coarse-grained and vectorized code, respectively. As $u \cdot d$ increases, i.e., in Medium and Complex contracts, the optimization’s impact decreases because: (i) the weight of the Sobol kernel in the overall computation decreases and (ii) the tile sizes computed by the cost model also decrease. The latter corresponds to how many times we apply the recurrence formula to amortize the more expensive independent formula, and also explains the smaller impact on the code version that uses doubles. For ComplexF/D the ratio is four and two, respectively, and the gain is smaller. We remark that the empirical data seem to validate the cost model in that sequentializing some computations via the recurrent formula never generates slowdowns.

Branch Divergence. The results shown in Figure 10 correspond to optimizing the divergence of the only if branch, located in the gaussian kernel, that exhibits enough computational-granularity to trigger the branch-divergence (BD) optimization. Simple ex-


GPU SpeedUp: gaming vs. mobile... has been found to solve uniformly a number of difficult cases under negligible runtime overhead. While these im-
hibits little divergence overhead, i.e., less than 2% of the\textit{gaussian} kernel runtime, and thus the optimization results in about 5% slowdown due to the computational and memory overheads introduced by BD.\textit{Medium} and\textit{Complex} exhibit roughly 61% divergence overhead at the level of the gaussian kernel. Examining\textit{MediumF} we observe an interesting behavior: while the coarse-grained version exhibits slowdown, the vectorized version exhibits either no change, or a speedup. The reason is that the extra memory needed for the application of BD exacerbates the reduced degree of parallelism of the coarse-grained version, which is already using a significant amount of fast memory. Finally, we note that the use of\texttt{double} increases the computational granularity of the\texttt{if} branch, and, as such, the\texttt{double} version exhibits significantly better speedups (e.g., 1.3×), than the\texttt{float}-based one.

\textbf{Memory Coalescing.} Figure 11 demonstrates that achieving coalesced accesses is fundamental for extracting reasonable performance from the GPU hardware, and that the proposed transformation is effective in enabling well-coalesced accesses. We observe that in the case of\texttt{SimpleF/D} the uncoalesced accesses refer to the ones introduced by the strength-reduction optimization.

\textbf{Discussion.} Figure 11 shows that Memory Coalescing has by far the largest impact, enabling a factor 5×-28× speedup. This is followed by Strength Reduction, with a factor 1.3×-4×. Choosing between Vectorized and Core-Grained approaches delivers a speed increase up to 2×. The simple cost models implemented here are effective for this application; we would however like to proceed to more extensive empirical evaluations to assess opportunities of generalization. In particular, Branch Divergence seems to express its potential in applications with larger computation granularity, like the double case in the\textit{Medium} and\textit{Complex} contracts.

For a fixed set of optimizations, the ratio between the speedups obtained on the two hardware platforms is at large in the interval 5.7×-9×, which correlates well with the ratio of the number of available cores in the GeForce GTX 680 and Quadro 2000M GPUs (1536 and 192 respectively). Full utilization of these computing units is achieved by instanciating the entire algorithm on GPU, with little data transfer between host system and GPU. As result of this, the discussed hardwares allow to obtain speedups as high as 70× and 540× compared to the sequential CPU case.

\section{Related Work}

A considerable amount of work has been published on parallelizing financial computations on GPUs, reporting impressive speedups (see Joshi [26] or Giles [27], for example), or focusing on production integration in large banks’ IT infrastructure [36]. Our work differs in that we aim at systematizing and eventually\textit{automating} low-level implementation and optimization by taking an architecture-independent functional language and compilation approach.

\textbf{Imperative Auto-Parallelization.} Classical static dependency analysis [1, 16] examines an entire loop nest at a time and accurately models both the memory dependencies and the flow of values between every pair of read-write accesses, but the analysis is restricted to the simpler affine domain. Dependencies are represented via systems of linear (in)equations, disambiguated via Gaussian-like elimination. These solutions drive powerful code transformations to extract and optimize parallelism [41], e.g., loop distribution, interchange, skewing, tiling, etc., but they are most effective when applied to relatively small intra-procedural loop nests exhibiting simple control flow and affine accesses.

Issues become more complicated with larger loops, where symbolic constants, complex control flow, array-reshaping at call sites, quadratic array indexing, induction variables with no closed-form solutions hinder parallelism extraction [21, 39]. Various techniques have been proposed to partially address these issues: Idiom-recognition techniques [29] disambiguate a class of subscripted subscripts and push-back arrays, such as array $a$ in Figure 1, which is indexed by the conditionally-incremented variable $len$. The weakness of such techniques is that small code perturbations may render the access pattern unrecognizable and yield very conservative results. Another direction has been to refine the dependency test to qualify some non-affine patterns: for example Range Test [9] exploits the monotonicity of polynomial indexing, and similarly, extensions of Presburger arithmetic [42] may solve a class of irregular accesses and control flow.

The next step has been to extend analysis to program level by using various set algebras to summarize array indexes interprocedurally, where loop independence is modeled via an equation on (set) summaries of shape $S = \emptyset$. The set representation has taken the form of either (i) an array abstraction [21, 39], e.g., systems of affine constraints, or (ii) a richer language [43] in which irreducible set operations are represented via explicit $\cap, \setminus, \cup$ constructors. Array abstractions have been refined further to exploit (simple) control-flow predicates [34, 42] (i) to increase summary precision or (ii) to predicate optimistic results for undecidable summaries. The language-representation approach allows an accurate classification of loop independence at runtime, e.g., it can prove that array $a$ in Figure 4 is write first, hence privatizable, but the runtime cost may be prohibitive in the general case. This issue has been further addressed by a translation $T$ to an equally-rich language of predicates [37], i.e., $T(S) = S = \emptyset$, where the extracted predicates $T(S)$ has been found to solve uniformly a number of difficult cases under negligible runtime overhead. While these im-

\begin{figure}[h]
\centering
\includegraphics[width=\textwidth]{Figure10.png}
\caption{Impact of Branch-Divergence Optimization.}
\end{figure}

\begin{figure}[h]
\centering
\includegraphics[width=\textwidth]{Figure11.png}
\caption{Impact of Memory-Coalescing Optimization.}
\end{figure}
portant techniques are successful in disambiguating a large number of imperative (Fortran) loops, there still remain enough parallel benchmarks that are too difficult to solve statically [3]. In these cases, techniques that track dependencies at runtime [14, 38] may extract parallelism on multi-core systems, albeit at significant run-time overhead, but they have not been validated (yet) on GPU.

**Imperative GPGPU** work follows two main directions. The first one aims at ease of programming: CUDA Lite [47] abstracts over the complex GPGPU memory hierarchy, Lime [15] extends the type system of a subset of Java to express desirable invariants such as isolation and immutability, and finally, the popular OpenMP-annotated loops are translated [28] to CUDA, to mention only a few.

The second direction refers to GPGPU performance. Main principles are [44]: (i) achieving memory-coalescing via block tiling, where threads cooperatively copy-in/out the data block to/from on-chip (fast) memory, (ii) optimizing register usage via loop unrolling and (iii) prefetching data to hide memory latency at the expense of register pressure. Implementation of these principles as compiler optimizations ranges from (i) heuristics based on pattern-matching of code or array indexes [15, 47, 49], to (ii) the more formal modeling of affine transformations via the polyhedral model [2, 5], e,g., multi-level tiling code generation, or host-to-accelerator memory-transfer optimizations, to (iii) aggressive techniques that may be applicable even for irregular control-flow/accesses [28], e.g., loop collapsing/interchange exploits a statically-assumed and runtime-verified monotonicity of array values.

In comparison, we take the view that a (hardware-neutral) functional language presents opportunities for both automatic GPU translation and optimization, due to the better preservation of algorithmic invariants. For example, all our optimizations rely on properties of the map-reduce constructs, which appear naturally in the functional code and drive our higher-level (and perhaps simpler) code transformations: Memory-coalescing exploits the fact that the array-indexing used inside the mapped function is likely invariant across the mapped elements. Our (novel) technique is complementary to the thread-cooperating block tiling, in that it requires neither block-level synchronization nor the use of shared memory, but it exhibits restructuring overhead when the same (nested) array is traversed on different axes in different kernels.

Similarly, optimizing branch-divergence relies on map’s parallelism to ensure the validity of the employed iteration-space permutation. The closest related work is Strout’s inspector-executor technique [45] that improves cache locality by permitting the array layout to match the order in which elements are accesses at runtime. We have not found stated elsewhere the trade-offs related to the strength-reduction invariant and to the coarse-grained vs vectorized code, albeit they show significant impact in our case study.

**Functional Parallelization.** Functional languages and their mathematical abstraction allow for more expressive algorithm implementations, in which data parallelism appears naturally by means of higher-order functions like map, fold, and scan, for which efficient parallel implementations are known. Consequently, research work has focused less on completely automating the process, but rather on studying in a formal manner what classes of algorithms allow asymptotically efficient (parallel) implementations.

Previous research we draw upon here is the Bird-Meertens Formalism (BMF) [6], Functions \( f : (x+y) → (x + y) \) are homomorphisms between (i) the monoid of lists with concatenation operator and empty list as neutral element, and (ii) the monoid of the result type with operator \( ⊙ \) and neutral element \( ⊙ \), and can be rewritten in the map-reduce form which provides an efficient parallel implementation (at least when the reduction does not involve concatenation). List invariants like the promotion lemmas \( (\text{map } f) \cdot \text{(red ++)} ≡ \text{(red ++)} \cdot \text{(map } f) \) and \( (\text{red } ⊙) \cdot \text{(red ++)} ≡ \text{(red ++)} \cdot \text{(map } \text{(red } ⊙)) \), can be used to transform programs to a higher degree of parallelism and load balancing (↔ direction), or to distribute computations to available processors for reduced communication overhead (→ direction).

Other work follows this research strand and (i) studies how to extend a class of functions [13] to become list-homomorphisms (LH), or (ii) show how to use the third LH theorem [17] to formally derive the LH definition from its associated (and simpler) leftwards and rightwards forms [19, 35], or (iii) formulate a class of functions [20], such as scan, for which an asymptotically-optimal hypercube implementation can be formally derived, despite the fact that concatenation appears inside \( ⊙ \), or (iv) extend the applicability of BMF theory to cover programs of more general form [23].

All these techniques rely on a functional computation language, where referential transparency and the absence of side-effects allow for vast transformations and rewriting. Such program transformations (with known operators) play a major role in compilation of functional programs, for example in the implementation of data-parallel Haskell [11]. Other work targets GPU platforms [12] using two-stage execution techniques and JIT compilation. All Haskell’s data-parallel approaches rely heavily on fusion to adjust task granularity and to justify parallel overhead for the particular platform.

Our work is informed by the same reasoning for the high-level optimization, e.g., coarse-grain vs. vectorised code, strength reduction, but also addresses other important hardware-specific optimizations, e.g., memory coalescing and branch divergence. It is a general problem that functional approaches can lead to excess parallelism and too fine-grained tasks. More task-oriented parallelization techniques today follow a semi-explicit programming model of annotations (GPh [46]), or make parallelism completely explicit (Eden [30] and the Par monad [33]). Automatic parallelism in these approaches relates mainly to runtime system management, and to functional libraries that capture algorithmic patterns at a high abstraction level. Our work is more narrow in the algorithmic selection, and thereby allows for very specific optimizations.

### 6. Conclusions and Future Work

This paper (i) has shown evidence that real-world financial software exhibits computationally-intensive kernels that can be expressed in a list-homomorphic, map-reduce fashion, and (ii) has presented and demonstrated four relatively-simple optimizations that allowed substantial speedups to be extracted on commodity GPUs.

While functional languages have often been considered elegant but slow, GPU’s enticing parallelism and this paper’s results motivates a systematic investigation of what is necessary to transparently and efficiently extract parallelism from functional programs.

As future work we plan to implement and explore such imperative-style optimizations and their cost models, in the context of an array-calculus functional language, such as Single Assignment C.

We believe that code transformations can be guided by interprocedural analysis that summarizes array read/write accesses, in which the trade-off (cost model) can be modeled as equations on these summaries. When the trade-off cannot be answered statically, (higher-order) predicates (i) can be derived as sufficient-satisfiability conditions of the corresponding equation, and (ii) can be evaluated in parallel on GPU to select the most efficient offline-generated kernel. Such an approach has been validated for the (more difficult) problem of classifying loop parallelism in the Fortran context [37], and we believe it also suits our context well.

### Acknowledgments

This research has been partially supported by the Danish Strategic Research Council, Program Committee for Strategic Growth Technologies, for the research center ‘HIPERFIT: Functional High Performance Computing for Financial Information Technology’ (http://hiperfit.dk) under contract number 10-092299.
References


