A Performance-Prediction Model for PIC Applications on Clusters of Symmetric MultiProcessors: Validation with Hierarchical HPF
Scientific Programming 11 (2003) 159–176
IOS Press
A performance-prediction model for PIC applications on clusters of Symmetric MultiProcessors: Validation with hierarchical HPF+OpenMP implementation
Sergio Briguglio (a), Beniamino Di Martino (b) and Gregorio Vlad (a)
(a) Associazione EURATOM-ENEA sulla Fusione, C.R. Frascati, C.P. 65, 00044, Frascati, Rome, Italy
E-mail: {briguglio,vlad}@frascati.enea.it
(b) Dip. Ingegneria dell’Informazione, Second University of Naples, Italy
E-mail:
Accepted July 16, 2002
Abstract. A performance-prediction model is presented, which describes different hierarchical workload decomposition strategies
for particle-in-cell (PIC) codes on Clusters of Symmetric MultiProcessors. The devised workload decomposition is hierarchically
structured: a higher-level decomposition among the computational nodes, and a lower-level one among the processors of each
computational node. Several decomposition strategies are evaluated by means of the prediction model, with respect to the
memory occupancy, the parallelization efficiency and the required programming effort. Such strategies have been implemented
by integrating the high-level languages High Performance Fortran (at the inter-node stage) and OpenMP (at the intra-node one).
The details of these implementations are presented, and the experimental values of parallelization efficiency are compared with
the predicted results.
1. Introduction
Particle-in-cell (PIC) simulation consists [2] in
evolving the phase-space coordinates of a particle population in certain fields, which are computed (in terms of particle
contributions) only at the points of a discrete spatial
grid and then interpolated at each particle's (continuous)
position. Two main strategies have been developed
for the workload decomposition involved in porting PIC
codes to distributed-memory parallel systems: domain decomposition and particle decomposition. Standard domain decomposition techniques [1,6,7,10,11]
assign different portions of the physical
domain, and the corresponding portions of the grid, to
different computational nodes, along with the particles
that reside on them. The distribution of all the arrays
among the computational nodes gives this method an
intrinsic scalability of the maximum domain size (that
is, the maximum spatial resolution) that can be simulated with the number of nodes. This makes the domain decomposition approach very attractive, in principle. Two important problems with these techniques are,
however, the communication overhead
and the need for dynamic load balancing, both associated with particle migration from one portion of the domain to another. While the former problem may
affect the parallelization efficiency, depending
on the actual amount of particle migration per time
step, the latter can be bypassed, at the expense of
a deep restructuring of the original serial code and the
adoption of a message-passing approach. It is generally accepted, however, that such an approach, based
on manual partitioning of data, insertion of communication library calls and handling of boundary cases, is very
complicated, time consuming and error prone, and affects the portability of the resulting program. In order
to avoid these drawbacks, it is worth resorting, for distributed architectures, to the particle decomposition technique [5],
which lends itself to implementation, with relatively little effort, in high-level programming languages such as High Performance Fortran
(HPF) [8]. Particle decomposition consists in statically
distributing the particle population among the computational nodes, while replicating the data relative to grid
quantities. As no particle has to be transferred (reassigned) from one computational node to another, the
communication and load-balancing problems associated with particle migration are automatically overcome.
The implementation of such a strategy with high-level
languages is then, in principle, relatively straightforward. On the other hand, an overhead on memory
occupancy, given by the replication of data related to
the domain, and a computation overhead related to the
updating of the fields (each node manages only the partial updating associated with its portion of the particle population) prevent good scalability of the maximum domain
size with the number of nodes, and limit the efficiency
of such a technique to cases in which both memory and
computational loads on each node are dominated by the
particle-related ones.
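
The particle decomposition idea described above can be illustrated with a minimal sketch. It is not taken from HMGC or from the HPF implementation discussed in this paper: the one-dimensional periodic grid, the nearest-grid-point weighting, the unit particle charge and the function names (deposit, particle_decomposition_deposit) are all assumptions made for illustration, and the "nodes" are simulated serially in Python. Each simulated node holds a full replica of the grid, deposits only its own statically assigned particles, and a final sum over the replicas plays the role of the global reduction.

```python
def deposit(positions, n_cells, length):
    """Nearest-grid-point deposition of unit-charge particles
    on a periodic 1D grid (an illustrative weighting choice)."""
    grid = [0.0] * n_cells
    dx = length / n_cells
    for x in positions:
        grid[int(x / dx) % n_cells] += 1.0
    return grid


def particle_decomposition_deposit(positions, n_cells, length, n_nodes):
    """Simulate particle decomposition among n_nodes computational nodes.

    Returns the per-node partial grids and their element-wise sum.
    """
    # Static distribution: particle i is assigned once and for all to
    # node i % n_nodes, so no particle ever migrates between nodes.
    chunks = [positions[rank::n_nodes] for rank in range(n_nodes)]

    # Every node owns a full replica of the grid (memory overhead) and
    # performs only the partial update due to its own particles.
    partial = [deposit(chunk, n_cells, length) for chunk in chunks]

    # Summing the replicas stands in for the inter-node reduction
    # (computation and communication overhead on grid quantities).
    total = [sum(column) for column in zip(*partial)]
    return partial, total
```

In this sketch the grid storage grows as n_nodes times n_cells while the particle storage per node shrinks as 1/n_nodes, which is the trade-off that restricts the technique to particle-dominated cases.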
When porting a PIC code to a hierarchical distributed-shared
memory system such as a cluster of SMPs, a
two-stage workload decomposition can be envisaged:
a distributed-memory level decomposition (among
the computational nodes), and a shared-memory one
(among the processors of each node). The latter
decomposition differs qualitatively from that at the
distributed-memory level. Indeed, the choice between particle and domain decomposition no longer
corresponds to the choice between high-level and
low-level languages: even in the framework of a domain decomposition approach, particle migration from
one processor to another does not require communication, and a high-level parallel programming language
such as OpenMP [12] can still be used. Both the domain decomposition strategy and the particle decomposition one can then be implemented within a high-level programming language and integrated with the particle decomposition strategy devised
at the distributed-memory level, looking for an optimal
balance of merits and defects.
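
The two-stage decomposition can be sketched as follows. This is an illustrative serial simulation in Python, not the HPF+OpenMP implementation presented later in the paper; the one-dimensional periodic grid, the nearest-grid-point weighting and every function name are assumptions. Particles are always split statically among nodes, while inside each node the grid update is done either by particle decomposition (one private grid copy per processor, then summed) or by domain decomposition (each processor owns a contiguous block of cells of a single shared grid).

```python
def cell_index(x, n_cells, length):
    """Index of the cell containing position x on a periodic 1D grid."""
    return int(x / (length / n_cells)) % n_cells


def node_particle_decomp(node_particles, n_cells, length, n_procs):
    """Intra-node particle decomposition: each processor deposits its own
    particles on a private grid copy; the copies are then summed."""
    private = []
    for proc in range(n_procs):
        grid = [0.0] * n_cells
        for x in node_particles[proc::n_procs]:
            grid[cell_index(x, n_cells, length)] += 1.0
        private.append(grid)
    return [sum(column) for column in zip(*private)]


def node_domain_decomp(node_particles, n_cells, length, n_procs):
    """Intra-node domain decomposition: each processor owns a contiguous
    block of cells and handles the particles that fall inside it."""
    shared = [0.0] * n_cells
    cells_per_proc = -(-n_cells // n_procs)  # ceiling division
    owned = [[] for _ in range(n_procs)]
    # Assigning a particle to the owner of its cell needs no message
    # passing here: all processors of a node see the same memory.
    for x in node_particles:
        c = cell_index(x, n_cells, length)
        owned[c // cells_per_proc].append(c)
    # With nearest-grid-point weighting, each processor writes only to
    # cells it owns, so a single shared grid suffices.
    for cells in owned:
        for c in cells:
            shared[c] += 1.0
    return shared


def hierarchical_deposit(positions, n_cells, length, n_nodes, n_procs,
                         intra_node):
    """Particle decomposition among nodes, then the chosen intra-node
    strategy; node grids are summed as a stand-in for the reduction."""
    node_grids = [
        intra_node(positions[rank::n_nodes], n_cells, length, n_procs)
        for rank in range(n_nodes)
    ]
    return [sum(column) for column in zip(*node_grids)]
```

Both intra-node strategies yield the same grid in this sketch; they differ in cost. The particle variant stores n_procs grid copies per node, whereas the domain variant stores one but must reassign particles to processors as they move.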
In this paper we present a performance-prediction
model describing the above-mentioned hierarchical workload decomposition strategies in terms
of efficiency and memory occupancy. The prediction-model results are compared with the experimental
results from a high-level-language-based porting (obtained by integrating HPF and OpenMP) of the Hybrid MHD-Gyrokinetic Code (HMGC) [3], which includes all the relevant properties of general PIC codes.
The paper is structured as follows. Section 2 describes the main computational aspects of the chosen
application. It introduces the performance-prediction
model and its application to the different decomposition strategies devised, both on distributed-memory architectures and on distributed-shared memory ones, analytically modeling and predicting their main features
in terms of the expected parallelization efficiency and
memory requirements. The implementation of such strategies, based on integrating the HPF and OpenMP programming environments by means of the EXTRINSIC
feature of the HPF language, is presented in Section 3.
Section 4 reports the experimental results obtained by
running the corresponding parallel versions of HMGC
on an IBM SP. Final (...truncated)