TY - RPRT
T1 - Automatic Loop Parallelization via Compiler Guided Refactoring
AU - Larsen, Per
AU - Ladelsky, Razya
AU - Lidman, Jacob
AU - McKee, Sally A.
AU - Karlsson, Sven
AU - Zaks, Ayal
PY - 2011
Y1 - 2011
N2 - For many parallel applications, performance relies
not on instruction-level parallelism, but on loop-level parallelism.
Unfortunately, many modern applications are written in ways
that obstruct automatic loop parallelization. Since we cannot
identify sufficient parallelization opportunities for these codes in
a static, off-line compiler, we developed an interactive compilation
feedback system that guides the programmer in iteratively
modifying application source, thereby improving the compiler’s
ability to generate loop-parallel code. We use this compilation
system to modify two sequential benchmarks, finding that the
code parallelized in this way runs up to 8.3 times faster on an
octo-core Intel Xeon 5570 system and up to 12.5 times faster on
a quad-core IBM POWER6 system.
Benchmark performance varies significantly between the systems.
This suggests that semi-automatic parallelization should be
combined with target-specific optimizations. Furthermore, comparing
the first benchmark to hand-parallelized, hand-optimized
pthreads and OpenMP versions, we find that code generated
using our approach typically outperforms the pthreads code
(within 93-339%). It also performs competitively against the
OpenMP code (within 75-111%). The second benchmark outperforms
hand-parallelized and optimized OpenMP code (within
109-242%).
M3 - Report
T3 - IMM-Technical Report-2011
BT - Automatic Loop Parallelization via Compiler Guided Refactoring
PB - Technical University of Denmark
CY - Kgs. Lyngby, Denmark
ER -