% Thu Aug 5 18:01:53 2010 UTC, by jhr
% File size: 6033 byte(s)
% A Diderot whitepaper based on the December 2009 NSF proposal

%!TEX root = paper.tex
%

\section{Implementation}
\label{sec:impl}

\lang{} separates the concerns of defining the underlying dataset representation,
deriving continuous fields from it by applying convolution kernels, and extracting interesting
features from the field.
We believe that this separation will provide a mathematically well-founded programming model for designers
of image-analysis algorithms, but we are also keen to exploit parallelism to make the resulting
implementations more efficient.
Furthermore, our goal is \emph{portable parallel performance}, by which we mean that a given
\lang{} program should perform well across a variety of parallel targets (\eg{}, clusters or
GPUs).

Internally, we plan to treat \lang{} as a first-order functional language with a small number of
built-in higher-order operators (\eg{}, composition).
Because the actor methods are programmed in a language without pointers or mutable global state,
it will be straightforward to translate them into this functional representation (similar to converting
to SSA form~\cite{efficiently-computing-ssa}).
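As a toy illustration of that conversion (a hypothetical intermediate representation, not the actual compiler), the renaming step that makes every variable single-assignment can be sketched as:

```python
# Sketch: SSA-style renaming for straight-line code, on a toy IR of
# (lhs, [rhs tokens]) pairs.  Each assignment gets a fresh version of
# its variable, and uses refer to the latest version, which makes the
# code purely functional (every name is defined exactly once).
def to_ssa(stmts):
    version = {}  # variable -> current version number
    out = []
    for lhs, rhs in stmts:
        # rewrite right-hand-side uses to their current versions
        new_rhs = [f"{t}_{version[t]}" if t in version else t for t in rhs]
        version[lhs] = version.get(lhs, 0) + 1
        out.append((f"{lhs}_{version[lhs]}", new_rhs))
    return out

# x = x + y; x = x * 2   ==>   x_1 = x + y; x_2 = x_1 * 2
ssa = to_ssa([("x", ["x", "+", "y"]), ("x", ["x", "*", "2"])])
```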

We expect that there will be many opportunities for optimization on the actor update
methods.
Performance profiling of examples of these algorithms shows that the dominant cost lies in
computing the continuous-field values by probing the discrete datasets.
Thus, an important part of the implementation will be optimizing these probes by identifying
common subexpressions and by organizing memory accesses to improve locality.
While the former optimization is portable across targets, the latter depends on the
target's memory organization, which varies significantly across the range of parallel
targets.
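A minimal sketch of the common-subexpression step, assuming hypothetical tuple-shaped probe expressions rather than the compiler's real IR:

```python
# Sketch: identifying common subexpressions by hash-consing.
# Expressions are nested tuples, e.g. ('probe', 'F', 'x') for a field
# probe; structurally identical subtrees are mapped to one shared node,
# so a repeated probe would be computed only once.
def hash_cons(expr, table):
    if not isinstance(expr, tuple):
        return expr
    node = tuple(hash_cons(e, table) for e in expr)
    return table.setdefault(node, node)

table = {}
e = hash_cons(('+', ('probe', 'F', 'x'), ('probe', 'F', 'x')), table)
# the two textually repeated probes are now the same shared object
```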

The main research challenge of our implementation effort will be developing effective
techniques for compiling the features of \lang{} to run on a variety of parallel platforms.
We plan to build at least three backends to our compiler over the course of the project.
These will include a sequential reference implementation, a parallel implementation that
generates OpenCL code to run on GPUs, and a parallel implementation that runs on multicore
SMPs.
We discuss some of the technical challenges of the GPU implementation below.

\subsection{Targeting GPUs}

Graphics Processing Units (GPUs) are highly parallel multiprocessor chips that
are designed for real-time graphics.
Their design makes them especially well suited to programs that have high arithmetic
intensity and many independent computations (as is the case in the graphics pipeline),
and they provide very high performance at a relatively low price\footnote{
For example, an Nvidia Tesla C1060 has an estimated peak double-precision
floating-point performance of 78 GigaFLOPS at a cost of about \$1,200.
}.
The use of GPUs for applications other than graphics has led to the development
of programming languages, such as Nvidia's CUDA~\cite{cuda-ref-v23} and the industry-standard
OpenCL~\cite{opencl-spec-v1}.
Unfortunately, while these languages support a more general-purpose programming
model than graphics shader languages, they are still difficult to
use and very hardware specific.
For example, an OpenCL computation is organized into a multidimensional array of \emph{work-items},
which, in turn, are divided into \emph{work-groups}, which have locally shared
memory.
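This decomposition can be sketched numerically (a toy sketch; real OpenCL kernels obtain these indices with the \texttt{get\_group\_id} and \texttt{get\_local\_id} built-ins):

```python
# Sketch: a 1-D global work-item index decomposes into a
# (work-group id, local id within the group) pair for a given
# work-group size, as in OpenCL's execution model.
def decompose(global_id, group_size):
    return global_id // group_size, global_id % group_size

# with 64-item work-groups, global item 130 is local item 2 of group 2
group, local_id = decompose(130, 64)
```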

There are a number of features of \lang{} that make mapping it onto GPUs challenging.
A particular challenge is the global coordination phases between iterations.
Since there are no real global synchronization mechanisms available on the GPU,
global operations that combine data across an array have to be carefully constructed,
as straightforward implementations have
very poor performance~\cite{Sengupta07scanprimitives}.
Alternatively, the data can be copied from the GPU back to the CPU for the global binning
process, but that incurs a latency penalty that can be significant if done frequently.
The implementation of \lang{} will need to determine when it is profitable to run code on the
CPU, the GPU, or both, taking into account the costs of data motion and of launching
programs on the GPU.
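One carefully constructed pattern for such combining operations is the log-step tree reduction performed within work-groups; a serial sketch of its access pattern (illustrative only, not GPU code):

```python
# Sketch: log-step tree reduction.  At each step, elements a stride
# apart are combined in parallel-friendly fashion; after log2(n)
# steps the result sits in position 0.  On a GPU this is the pattern
# a work-group uses to reduce in shared memory without global sync.
def tree_reduce(values, op):
    vals = list(values)
    stride = 1
    while stride < len(vals):
        for i in range(0, len(vals) - stride, 2 * stride):
            vals[i] = op(vals[i], vals[i + stride])
        stride *= 2
    return vals[0]

total = tree_reduce([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b)
```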

To get maximum performance, we need to keep both the CPU and the GPU busy.  One approach to this
problem may be to partition the space in which the actors exist into regions that can
be processed independently.
This division would allow the global and local computations to be pipelined: \ie{},
the CPU could be computing neighbor information for one region, while the GPU is computing
new actor states on a different region.
Actors in the boundary areas between regions might have to be computed multiple times,
but since our model is functional, that poses no problem.
To make this technique work will require compiler analyses that can determine how actors
are spatially organized.
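A minimal sketch of such a spatial partitioning, assuming hypothetical 2-D actor positions and a uniform grid of regions:

```python
# Sketch: bin actors into uniform grid regions so that each region
# (plus its boundary overlap) can be processed independently and
# pipelined between the CPU and the GPU.
from collections import defaultdict

def bin_actors(positions, region_size):
    regions = defaultdict(list)
    for i, (x, y) in enumerate(positions):
        key = (int(x // region_size), int(y // region_size))
        regions[key].append(i)
    return dict(regions)

# actors at (0.5, 0.5) and (2.5, 0.5) fall into different 1x1 regions
regions = bin_actors([(0.5, 0.5), (2.5, 0.5)], 1.0)
```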

For the particle-system algorithms (\secref{sec:particles}),
computing a particle's energy requires access to properties of its neighboring particles.
The GPU does not have a traditional cached memory architecture.
Instead, the programmer is responsible for handling the layout of data and partitioning of
work to minimize both the number of fetches required from global video memory and the
number of memory bank access conflicts generated by the parallel workers~\cite{cuda-ref-v23}.
The GPU APIs allow programs to query properties such as the number of threads that will
run on the same processor and the maximum data set sizes supported by the processor~\cite{opencl-spec-v1}.
These APIs also allow runtime generation of programs.
Using these APIs and properties of the \lang{} program, we can generate code that
takes advantage of the specific hardware on which the program is executed.
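As an illustrative sketch of such hardware-adaptive code generation (in OpenCL the limit would come from \texttt{clGetDeviceInfo} with \texttt{CL\_DEVICE\_MAX\_WORK\_GROUP\_SIZE}; the selection heuristic below is invented for illustration):

```python
# Sketch: choose a work-group size from a queried device limit.
# max_group_size stands in for the value a runtime query would return;
# the heuristic shrinks to a power of two that evenly divides the
# problem so no work-group is left partially filled.
def pick_group_size(problem_size, max_group_size):
    size = max_group_size
    while size > 1 and (size > problem_size or problem_size % size != 0):
        size //= 2
    return size

a = pick_group_size(1024, 256)  # fits evenly: use the full limit
b = pick_group_size(96, 256)    # shrink until the size divides 96
```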

%Another challenge is the representation of variable-sized data, such as the poly-lines
%that are computed by fiber tractography.

% implementation strategies: embedded vs. standalone language
% type inference

%\subsection{Automatic differentiation}
%\cite{rall:auto-diff}