%!TEX root = paper.tex
%

\section{Implementation}
\label{sec:impl}

\lang{} separates the concerns of defining the underlying dataset representation,
deriving continuous fields from it by applying convolution kernels, and extracting
interesting features from the field.
We believe that this separation will provide a mathematically well-founded programming
model for designers of image-analysis algorithms, but we are also keen to exploit
parallelism to make the resulting implementations more efficient.
Furthermore, our goal is \emph{portable parallel performance}, by which we mean that a
given \lang{} program should perform well across a variety of parallel targets (\eg{},
clusters or GPUs).

Internally, we plan to treat \lang{} as a first-order functional language with a small
number of built-in higher-order operators (\eg{}, composition).
Because the actor methods are programmed in a language without pointers or mutable
global state, it will be straightforward to translate them into this functional
representation (similar to converting to SSA form~\cite{efficiently-computing-ssa}).
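
To illustrate the kind of translation we have in mind, here is a minimal sketch in C
(not \lang{} syntax; the method body and the \texttt{energy} and \texttt{gradient}
functions are hypothetical) of how an imperative actor-method body maps to a functional
form by renaming each assignment, as in SSA conversion:

\begin{verbatim}
/* Hypothetical imperative actor-method body.  Each assignment
 * to 'e' gets a fresh name in the functional form, yielding
 * nested let-bindings:
 *     let e1 = energy(x) in
 *     let e2 = e1 + h * gradient(x) in
 *     e2
 */
extern float energy (float x);
extern float gradient (float x);

float update (float x, float h)
{
    float e = energy (x);         /* e1 */
    e = e + h * gradient (x);     /* e2 */
    return e;                     /* result is e2 */
}
\end{verbatim}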

We expect that there will be many opportunities for optimization in the actor update
methods.
Performance profiling on examples of these algorithms shows that the dominant cost lies
in computing the continuous-field values by probing the discrete datasets.
Thus, an important part of the implementation will be optimizing these probes by
identifying common subexpressions and by organizing memory accesses to improve locality.
While the former optimization is portable across targets, the latter depends on the
target's memory organization, which varies significantly across the range of parallel
targets.
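
As a concrete (if simplified) example, the following C sketch probes a 1D field for both
its value and its derivative at a position \texttt{x}; the names and the kernel support
\texttt{S} are our own assumptions, and boundary handling is omitted.  The index
computation and the dataset fetches are shared between the two probes, which is exactly
the kind of common subexpression we intend the compiler to identify:

\begin{verbatim}
#include <math.h>

#define S 4                         /* kernel support (assumed) */
extern float kernel  (float t);     /* reconstruction kernel    */
extern float kernel1 (float t);     /* its first derivative     */

/* Probe the field value and derivative at position x. */
float probe (const float *img, float x, float *deriv)
{
    int   base = (int) floorf (x);  /* computed once and shared */
    float f    = x - base;          /* by both probes           */
    float v = 0.0f, d = 0.0f;
    for (int i = 1 - S/2; i <= S/2; i++) {
        float t = img[base + i];    /* one fetch serves both    */
        v += t * kernel  (f - i);
        d += t * kernel1 (f - i);
    }
    *deriv = d;
    return v;
}
\end{verbatim}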

The main research challenge of our implementation effort will be developing effective
techniques for compiling the features of \lang{} to run on a variety of parallel
platforms.
We plan to build at least three backends for our compiler over the course of the project.
These will include a sequential reference implementation, a parallel implementation that
generates OpenCL code to run on GPUs, and a parallel implementation that runs on
multicore SMPs.
We discuss some of the technical challenges of the GPU implementation below.

\subsection{Targeting GPUs}

Graphics Processing Units (GPUs) are highly parallel multiprocessor chips that are
designed for real-time graphics.
Their design makes them especially suitable for programs that have high arithmetic
intensity and many independent computations (as is the case in the graphics pipeline),
and they provide very high performance at relatively low cost\footnote{
  For example, an Nvidia Tesla C1060 has an estimated peak double-precision
  floating-point performance of 78 GigaFLOPS at a cost of about \$1,200.
}.
The use of GPUs for applications other than graphics has led to the development of
programming languages, such as Nvidia's CUDA~\cite{cuda-ref-v23} and the industry
standard OpenCL~\cite{opencl-spec-v1}.
Unfortunately, while these languages support a more general-purpose programming model
than graphics shader languages, they are still difficult to use and very hardware
specific.
For example, an OpenCL computation is organized into a multidimensional array of
\emph{work-items}, which, in turn, are divided into \emph{work-groups} that share
local memory.
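
To make the model concrete, here is a minimal OpenCL~C kernel sketch (our own example,
not taken from the OpenCL specification) showing work-items, work-groups, and
work-group local memory:

\begin{verbatim}
/* Each work-item scales one element; each work-group stages its
 * slice of the input in local memory first.  The local buffer is
 * supplied by the host via clSetKernelArg.
 */
__kernel void scale (__global const float *in,
                     __global float *out,
                     __local float *tile,
                     const float s)
{
    size_t gid = get_global_id (0);  /* index into global array */
    size_t lid = get_local_id (0);   /* index within work-group */
    tile[lid] = in[gid];
    barrier (CLK_LOCAL_MEM_FENCE);   /* sync the work-group     */
    out[gid] = s * tile[lid];
}
\end{verbatim}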

There are a number of features of \lang{} that make mapping it onto GPUs challenging.
The global coordination phases between iterations pose a particular challenge.
Since there are no true global synchronization mechanisms available on the GPU, global
operations that combine data across an array have to be carefully constructed, as
straightforward implementations have very poor
performance~\cite{Sengupta07scanprimitives}.
Alternatively, the data can be copied from the GPU back to the CPU for the global
binning process, but that incurs a latency penalty that can be significant if done
frequently.
The implementation of \lang{} will need to determine when it is profitable to run code
on the CPU, GPU, or both, taking into account the costs of data motion and of launching
programs on the GPU.
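
The standard construction works around the lack of global synchronization by reducing
within work-groups and then combining the per-group results in a separate step.  A
sketch of one such stage in OpenCL~C (our own example; it assumes the work-group size
is a power of two and the input length is a multiple of it):

\begin{verbatim}
/* One stage of a global sum: each work-group reduces its tile in
 * local memory; the per-group partial sums must be combined by a
 * second kernel launch (or on the CPU), since work-groups cannot
 * synchronize with each other within a single launch.
 */
__kernel void reduce_stage (__global const float *in,
                            __global float *partial,
                            __local float *scratch)
{
    size_t lid = get_local_id (0);
    size_t lsz = get_local_size (0);
    scratch[lid] = in[get_global_id (0)];
    barrier (CLK_LOCAL_MEM_FENCE);
    for (size_t stride = lsz / 2; stride > 0; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier (CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        partial[get_group_id (0)] = scratch[0];
}
\end{verbatim}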

To get maximum performance, we need to keep both the CPU and GPU busy.
One approach to this problem may be to partition the space in which the actors exist
into regions that can be processed independently.
This division would allow the global and local computations to be pipelined: \ie{},
the CPU could be computing neighbor information for one region while the GPU is
computing new actor states for a different region.
Actors in the boundary areas between regions might have to be computed multiple times,
but since our model is functional, that poses no problem.
Making this technique work will require compiler analyses that can determine how actors
are spatially organized.
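
A host-side sketch in C of this pipelining idea follows; \texttt{bin\_region} and the
buffer setup are hypothetical stand-ins for the CPU-side binning phase, not part of any
real API, and error checking is omitted:

\begin{verbatim}
#include <CL/cl.h>

extern void bin_region (int i);  /* CPU: neighbor info, region i */

void run_regions (cl_command_queue q, cl_kernel update,
                  cl_mem state[], size_t nactors[], int nregions)
{
    bin_region (0);
    for (int i = 0;  i < nregions;  i++) {
        clSetKernelArg (update, 0, sizeof(cl_mem), &state[i]);
        clEnqueueNDRangeKernel (q, update, 1, NULL,
                                &nactors[i], NULL, 0, NULL, NULL);
        clFlush (q);             /* start GPU work on region i    */
        if (i + 1 < nregions)
            bin_region (i + 1);  /* CPU binning overlaps GPU work */
        clFinish (q);            /* wait for region i to finish   */
    }
}
\end{verbatim}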

For the particle-system algorithms (\secref{sec:particles}), computing a particle's
energy requires access to the properties of neighboring particles.
The GPU does not have a traditional cached memory architecture.
Instead, the programmer is responsible for the layout of data and the partitioning of
work so as to minimize both the number of fetches from global video memory and the
number of memory-bank access conflicts generated by the parallel
workers~\cite{cuda-ref-v23}.
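
The standard idiom for this access pattern stages blocks of particle data through local
memory, so that each position is fetched from global memory once per work-group rather
than once per work-item.  A sketch in OpenCL~C (our own example; the energy term is a
placeholder, and the particle count is assumed to be a multiple of the work-group size):

\begin{verbatim}
/* Placeholder pair-potential: zero beyond unit radius and for
 * the particle itself.
 */
float pair_energy (float4 a, float4 b)
{
    float4 d = a - b;
    float r = sqrt (d.x*d.x + d.y*d.y + d.z*d.z);
    return (r > 0.0f && r < 1.0f) ? (1.0f - r)*(1.0f - r) : 0.0f;
}

__kernel void energies (__global const float4 *pos,
                        __global float *energy,
                        __local float4 *tile,
                        const int n)
{
    size_t lid = get_local_id (0);
    size_t lsz = get_local_size (0);
    float4 p = pos[get_global_id (0)];
    float  e = 0.0f;
    for (int base = 0;  base < n;  base += lsz) {
        tile[lid] = pos[base + lid];     /* one coalesced fetch */
        barrier (CLK_LOCAL_MEM_FENCE);
        for (size_t j = 0;  j < lsz;  j++)
            e += pair_energy (p, tile[j]);
        barrier (CLK_LOCAL_MEM_FENCE);
    }
    energy[get_global_id (0)] = e;
}
\end{verbatim}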

The GPU APIs allow programs to query properties such as the number of threads that will
run on the same processor and the maximum data-set sizes supported by the
processor~\cite{opencl-spec-v1}.
These APIs also allow runtime generation of programs.
Using these APIs and properties of the \lang{} program, we can generate code that takes
advantage of the specific hardware on which the program is executed.
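
For example, a host-side C sketch that queries device limits and then JIT-compiles
OpenCL source produced at runtime might look like the following (error checking is
omitted, and the \texttt{src} string stands in for compiler-generated code):

\begin{verbatim}
#include <CL/cl.h>

cl_program build_for_device (cl_context ctx, cl_device_id dev,
                             const char *src)
{
    size_t   max_wg;     /* largest supported work-group size    */
    cl_ulong local_mem;  /* bytes of local memory per work-group */
    clGetDeviceInfo (dev, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                     sizeof(max_wg), &max_wg, NULL);
    clGetDeviceInfo (dev, CL_DEVICE_LOCAL_MEM_SIZE,
                     sizeof(local_mem), &local_mem, NULL);
    /* ... choose tile sizes, etc., from max_wg and local_mem,
     * then emit 'src' accordingly ... */
    cl_program prog =
        clCreateProgramWithSource (ctx, 1, &src, NULL, NULL);
    clBuildProgram (prog, 1, &dev, "", NULL, NULL);
    return prog;
}
\end{verbatim}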

%Another challenge is the representation of variable-sized data, such as the poly-lines
%that are computed by fiber tractography.

% implementation strategies: embedded vs. standalone language
% type inference

%\subsection{Automatic differentiation}
%\cite{rall:auto-diff}