%!TEX root = paper.tex
%

\section{Implementation}
\label{sec:impl}

\lang{} separates the concerns of defining the underlying dataset representation,
deriving continuous fields from it by applying convolution kernels, and extracting interesting
features from the field.
We believe that this separation will provide a mathematically well-founded programming model for designers
of image-analysis algorithms, but we are also keen to exploit parallelism to make the resulting
implementations more efficient.
Furthermore, our goal is \emph{portable parallel performance}, by which we mean that a given
\lang{} program should perform well across a variety of parallel targets (\eg{}, clusters or
GPUs).

Internally, we plan to treat \lang{} as a first-order functional language with a small number of
built-in higher-order operators (\eg{}, composition).
Because the actor methods are programmed in a language without pointers or mutable global state,
it will be straightforward to translate them into this functional representation (similar to converting
to SSA form~\cite{efficiently-computing-ssa}).

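For example, a hypothetical actor-method fragment (written here in C-like notation,
not \lang{} syntax; \texttt{probe}, \texttt{F}, and \texttt{pos} are illustrative names)
would be converted so that each name is bound exactly once:
\begin{verbatim}
/* Imperative form: the variable x is re-assigned. */
float x = probe(F, pos);   /* sample the field F at pos */
x = x * x;
x = x + 1.0f;

/* Single-assignment (functional) form: each re-assignment
 * becomes a fresh binding. */
float x0 = probe(F, pos);
float x1 = x0 * x0;
float x2 = x1 + 1.0f;
\end{verbatim}
Once every name is bound exactly once, the method body can be treated as a set of pure
expressions, which simplifies the optimizations described below.
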
We expect that there will be many opportunities for optimization of the actor update
methods.
Performance profiling on examples of these algorithms shows that the dominant cost is
the computation of continuous-field values by probing the discrete datasets.
Thus, an important part of the implementation will be optimizing these probes by identifying
common subexpressions and by organizing memory accesses to improve locality.
While the former optimization is portable across targets, the latter depends on the
target's memory organization, which varies significantly across the range of parallel
targets.

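As a sketch of the first of these optimizations, consider probing both a field value and its
derivative at the same position in a one-dimensional dataset (the two-sample reconstruction
kernel \texttt{h} and its derivative \texttt{dh} are assumptions for illustration).
The mapping from world position to voxel index is a common subexpression shared by the
two probes:
\begin{verbatim}
#include <math.h>

extern float h(float t);   /* reconstruction kernel (assumed) */
extern float dh(float t);  /* its first derivative (assumed)  */

/* Probe the field value and derivative at x; the voxel index n
 * and fractional offset f are computed once and shared. */
void probe_both(const float *img, float x, float *v, float *dv)
{
    int   n = (int)floorf(x);   /* base voxel index (shared)  */
    float f = x - (float)n;     /* fractional offset (shared) */
    *v  = h(f)  * img[n] + h(f - 1.0f)  * img[n + 1];
    *dv = dh(f) * img[n] + dh(f - 1.0f) * img[n + 1];
}
\end{verbatim}
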
The main research challenge of our implementation effort will be developing effective
techniques for compiling the features of \lang{} to run on a variety of parallel platforms.
We plan to build at least three backends to our compiler over the course of the project.
These will include a sequential reference implementation, a parallel implementation that
generates OpenCL code to run on GPUs, and a parallel implementation that runs on multicore
SMPs.
We discuss some of the technical challenges of the GPU implementation below.

\subsection{Targeting GPUs}

Graphics Processing Units (GPUs) are highly parallel multiprocessor chips that
are designed for real-time graphics.
Their design makes them especially suitable for programs that have high arithmetic
intensity and lots of independent computations (as is the case in the graphics pipeline),
and they provide very high performance at relatively low cost\footnote{
For example, an Nvidia Tesla C1060 has an estimated peak double-precision
floating-point performance of 78 GigaFLOPS at a cost of about \$1,200.
}.
The use of GPUs for applications other than graphics has led to the development
of programming languages, such as Nvidia's CUDA~\cite{cuda-ref-v23} and the industry-standard
OpenCL~\cite{opencl-spec-v1}.
Unfortunately, while these languages support a more general-purpose programming
model than graphics shader languages, they are still difficult to
use and very hardware-specific.
For example, an OpenCL computation is organized into a multidimensional array of \emph{work-items},
which, in turn, are divided into \emph{work-groups}, which have locally shared
memory.

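As a concrete sketch of this hierarchy, the following kernel (hypothetical; one work-item
per actor) stages its input in the work-group's locally shared memory:
\begin{verbatim}
/* One work-item per actor.  get_global_id and get_local_id expose
 * the work-item/work-group hierarchy; __local memory is shared
 * only within a single work-group. */
__kernel void update_actors(__global const float *state_in,
                            __global float       *state_out,
                            __local  float       *scratch)
{
    size_t gid = get_global_id(0);  /* index in the global array   */
    size_t lid = get_local_id(0);   /* index within the work-group */

    scratch[lid] = state_in[gid];   /* stage into local memory     */
    barrier(CLK_LOCAL_MEM_FENCE);   /* work-group-local barrier    */

    state_out[gid] = scratch[lid];  /* placeholder actor update    */
}
\end{verbatim}
Note that the barrier synchronizes only the work-items within one work-group; there is no
corresponding synchronization across work-groups, which is the source of the difficulties
discussed next.
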
There are a number of features of \lang{} that make mapping it onto GPUs challenging.
A particular challenge is the global coordination phase between iterations.
Since there are no true global synchronization mechanisms available on the GPU,
global operations that combine data across an array have to be carefully constructed,
as straightforward implementations have
very poor performance~\cite{Sengupta07scanprimitives}.
Alternatively, the data can be copied from the GPU back to the CPU for the global binning
process, but that incurs a latency penalty that can be significant if done frequently.
The implementation of \lang{} will need to determine when it is profitable to run code on the
CPU, the GPU, or both, taking into account any costs due to data motion and to launching
programs on the GPU.

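A sketch of how such an operation might be constructed, in the style of the tree-shaped
reductions of~\cite{Sengupta07scanprimitives} (illustrative, not our planned implementation):
each work-group reduces its slice of the array in local memory, and the per-group partial
results are combined by a second kernel launch or by the host, since there is no global
barrier within a launch.
\begin{verbatim}
/* One stage of a global sum over an actor property.  Each work-group
 * produces one partial sum; combining the partials requires a second
 * pass, because work-groups cannot synchronize with each other. */
__kernel void reduce_stage(__global const float *in,
                           __global float       *partial, /* one per group */
                           __local  float       *tmp)
{
    size_t lid = get_local_id(0);
    tmp[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            tmp[lid] += tmp[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        partial[get_group_id(0)] = tmp[0];
}
\end{verbatim}
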
To get maximum performance, we need to keep both the CPU and the GPU busy. One approach to this
problem may be to partition the space in which the actors exist into regions that can
be processed independently.
This division would allow the global and local computations to be pipelined: \ie{},
the CPU could be computing neighbor information for one region while the GPU is computing
new actor states for a different region.
Actors in the boundary areas between regions might have to be computed multiple times,
but since our model is functional, that poses no problem.
Making this technique work will require compiler analyses that can determine how actors
are spatially organized.

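A host-side sketch of this pipeline is given below; the region data structure and the helper
routines (\texttt{enqueue\_update\_kernel}, \texttt{bin\_neighbors\_on\_cpu}) are assumptions,
while the overlap itself relies only on the fact that OpenCL kernel launches are asynchronous.
\begin{verbatim}
/* While the GPU updates the actors in region r, the CPU bins
 * neighbor information for region r+1 (helpers are hypothetical). */
for (int r = 0; r < nregions; r++) {
    enqueue_update_kernel(queue, &regions[r]);  /* async GPU work */
    if (r + 1 < nregions)
        bin_neighbors_on_cpu(&regions[r + 1]);  /* overlapped CPU */
    clFinish(queue);  /* wait for region r before reusing buffers */
}
\end{verbatim}
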
For the particle-system algorithms (\secref{sec:particles}),
each particle requires access to the properties of its neighbors to compute its energy.
The GPU does not have a traditional cached memory architecture.
Instead, the programmer is responsible for handling the layout of data and the partitioning of
work to minimize both the number of fetches required from global video memory and the
number of memory-bank access conflicts generated by the parallel workers~\cite{cuda-ref-v23}.
The GPU APIs allow programs to query properties such as the number of threads that will
run on the same processor and the maximum data-set sizes supported by the processor~\cite{opencl-spec-v1}.
These APIs also allow runtime generation of programs.
Using these APIs and properties of the \lang{} program, we can generate code that
takes advantage of the specific hardware on which the program is executed.

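For example, a host program might query a device limit and bake it into the generated source
before compiling it at runtime.
The calls below (\texttt{clGetDeviceInfo}, \texttt{clCreateProgramWithSource},
\texttt{clBuildProgram}) are standard OpenCL API functions, while the surrounding setup
(\texttt{ctx}, \texttt{dev}, \texttt{src}) is assumed:
\begin{verbatim}
/* Query the device's maximum work-group size and pass it to the
 * generated program as a compile-time constant. */
size_t max_wg;
clGetDeviceInfo(dev, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(max_wg), &max_wg, NULL);

char options[64];
snprintf(options, sizeof(options), "-DWGSIZE=%zu", max_wg);

/* src holds kernel source generated by our compiler at runtime. */
cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
clBuildProgram(prog, 1, &dev, options, NULL, NULL);
\end{verbatim}
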
%Another challenge is the representation of variable-sized data, such as the poly-lines
%that are computed by fiber tractography.

% implementation strategies: embedded vs. standalone language
% type inference

%\subsection{Automatic differentiation}
%\cite{rall:auto-diff}
