A Performance-Prediction Model for PIC Applications on Clusters of Symmetric MultiProcessors: Validation with Hierarchical HPF

http://downloads.hindawi.com/journals/sp/2003/691573.pdf

Scientific Programming 11 (2003) 159–176
IOS Press

A performance-prediction model for PIC applications on clusters of Symmetric MultiProcessors: Validation with hierarchical HPF+OpenMP implementation

Sergio Briguglio (a), Beniamino Di Martino (b) and Gregorio Vlad (a)

(a) Associazione EURATOM-ENEA sulla Fusione, C.R. Frascati, C.P. 65, 00044, Frascati, Rome, Italy
E-mail: {briguglio,vlad}@frascati.enea.it
(b) Dip. Ingegneria dell'Informazione, Second University of Naples, Italy
E-mail:

Accepted July 16, 2002

Abstract. A performance-prediction model is presented, which describes different hierarchical workload decomposition strategies for particle-in-cell (PIC) codes on clusters of Symmetric MultiProcessors. The devised workload decomposition is hierarchically structured: a higher-level decomposition among the computational nodes, and a lower-level one among the processors of each computational node. Several decomposition strategies are evaluated by means of the prediction model, with respect to memory occupancy, parallelization efficiency and the required programming effort. These strategies have been implemented by integrating the high-level languages High Performance Fortran (at the inter-node stage) and OpenMP (at the intra-node stage). The details of these implementations are presented, and the experimental values of parallelization efficiency are compared with the predicted results.

1. Introduction

Particle-in-cell (PIC) simulation consists [2] of evolving the phase-space coordinates of a particle population in certain fields that are computed (in terms of particle contributions) only at the points of a discrete spatial grid and then interpolated at each particle's (continuous) position. Two main strategies have been developed for the workload decomposition involved in porting PIC codes to distributed-memory parallel systems: domain decomposition and particle decomposition.
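The PIC cycle described above (particle contributions deposited on the grid, fields computed at the grid points, fields interpolated back to the particle positions, particles advanced) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper or from HMGC: the one-dimensional periodic grid, the linear weighting, unit particle weights, and the names `deposit`, `gather`, `pic_step` and the user-supplied `solve_field` are all hypothetical.

```python
import math


def deposit(positions, n_cells, dx):
    """Scatter unit-weight particles onto a periodic 1D grid with linear weights."""
    density = [0.0] * n_cells
    for x in positions:
        s = x / dx
        cell = int(math.floor(s)) % n_cells
        w = s - math.floor(s)              # fractional distance from the left grid point
        density[cell] += 1.0 - w
        density[(cell + 1) % n_cells] += w
    return density


def gather(field, x, dx):
    """Interpolate a grid field at the continuous particle position x."""
    n_cells = len(field)
    s = x / dx
    cell = int(math.floor(s)) % n_cells
    w = s - math.floor(s)
    return (1.0 - w) * field[cell] + w * field[(cell + 1) % n_cells]


def pic_step(positions, velocities, n_cells, dx, dt, solve_field):
    """One time step: deposit, solve on the grid, gather, push.

    solve_field is a placeholder for the field solver (assumption): it maps
    the grid density to a grid field of the same length.
    """
    density = deposit(positions, n_cells, dx)
    field = solve_field(density)
    length = n_cells * dx
    # Update velocities with the field at the old positions, then move the particles.
    new_v = [v + dt * gather(field, x, dx) for x, v in zip(positions, velocities)]
    new_x = [(x + dt * v) % length for x, v in zip(positions, new_v)]
    return new_x, new_v, density
```

The grid quantities (`density`, `field`) are touched by every particle, which is why the choice of how to split particles and grid among nodes drives both memory occupancy and communication.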
Standard domain decomposition [1,6,7,10,11] techniques assign different portions of the physical domain and the corresponding portions of the grid to different computational nodes, along with the particles that reside on them. The distribution of all the arrays among the computational nodes gives this method an intrinsic scalability of the maximum domain size (that is, the maximum spatial resolution) that can be simulated with the number of nodes. This makes the domain decomposition approach very attractive, in principle. Two important problems with these techniques are, however, the communication overhead and the need for dynamic load balancing, both associated with particle migration from one portion of the domain to another. While the former problem could possibly affect the parallelization efficiency, depending on the effective amount of particle migration per time step, the latter can be bypassed, at the expense of a deep restructuring of the original serial code and the adoption of a message-passing approach.

ISSN 1058-9244/03/$8.00 © 2003 – IOS Press. All rights reserved

It is generally accepted, however, that such an approach, based on manual partitioning of data, insertion of communication library calls and handling of boundary cases, is very complicated, time consuming and error prone, and affects the portability of the resulting program. To avoid these drawbacks, it is worth resorting, for distributed architectures, to the particle decomposition [5] technique, which can be implemented, with relatively little effort, using high-level programming languages such as High Performance Fortran (HPF) [8]. Particle decomposition consists of statically distributing the particle population among the computational nodes, while replicating the data relative to grid quantities.
As no particle has to be transferred (reassigned) from one computational node to another, the communication and load-balancing problems associated with particle migration are automatically overcome. The implementation of such a strategy with high-level languages is then, in principle, relatively straightforward. On the other hand, a memory overhead, due to the replication of the domain-related data, and a computation overhead related to the updating of the fields (each node performs only the partial update associated with its portion of the particle population) prevent good scalability of the maximum domain size with the number of nodes, and limit the efficiency of this technique to cases in which both the memory and the computational loads on each node are dominated by the particle-related ones.

When porting a PIC code to a hierarchical distributed-shared memory system such as a cluster of SMPs, a two-stage workload decomposition can be envisaged: a distributed-memory level decomposition (among the computational nodes), and a shared-memory one (among the processors of each node). The latter decomposition differs qualitatively from that at the distributed-memory level. Indeed, the choice between particle and domain decomposition no longer corresponds to the choice between high-level and low-level languages: even in the framework of a domain decomposition approach, particle migration from one processor to another does not require communication, and a high-level parallel programming language such as OpenMP [12] can still be used. Both the domain decomposition strategy and the particle decomposition one can then be implemented within a high-level programming language and integrated with the particle decomposition strategy devised at the distributed-memory level, in search of an optimal balance of advantages and drawbacks.
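The trade-off of particle decomposition discussed in this section can be made concrete with a small sketch. It is illustrative only and rests on stated assumptions: the "nodes" are simulated serially in a single process, the deposition uses the nearest lower grid point, and the function names and the per-particle and per-cell word counts are hypothetical parameters, not the quantities of the paper's prediction model.

```python
def deposit_nearest(positions, n_cells, dx=1.0):
    """Count unit-weight particles per cell of a periodic 1D grid."""
    density = [0.0] * n_cells
    for x in positions:
        density[int(x // dx) % n_cells] += 1.0
    return density


def particle_decomposition_deposit(positions, n_cells, n_nodes):
    """Static particle decomposition, simulated serially (assumption).

    Each node owns a fixed subset of the particles and a full copy of the
    grid; it performs only a partial update, and the partial grids are then
    summed over the nodes (a reduction).
    """
    chunks = [positions[k::n_nodes] for k in range(n_nodes)]
    partial_grids = [deposit_nearest(chunk, n_cells) for chunk in chunks]
    return [sum(values) for values in zip(*partial_grids)]


def memory_per_node(n_particles, n_cells, n_nodes,
                    words_per_particle, words_per_cell):
    """Illustrative memory count: particle data is distributed, grid data replicated."""
    return words_per_particle * n_particles / n_nodes + words_per_cell * n_cells
```

The reduced grid equals the serial result, and no particle ever changes owner. In the memory count only the particle term shrinks as nodes are added, while the replicated grid term stays constant: with the hypothetical values `memory_per_node(1000, 100, 4, 6, 3)` and `memory_per_node(1000, 100, 8, 6, 3)`, doubling the nodes halves the particle part but leaves the 300 grid words per node untouched.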
In this paper we present a performance-prediction model describing the above-mentioned hierarchical workload decomposition strategies, in terms of efficiency and memory occupancy. The prediction-model results are compared with the experimental results from a high-level-language-based porting (obtained by integrating HPF and OpenMP) of the Hybrid MHD-Gyrokinetic Code (HMGC) [3], which includes all the relevant properties of general PIC codes.

The paper is structured as follows. Section 2 describes the main computational aspects of the chosen application. It introduces the performance-prediction model and its application to the different decomposition strategies devised, both on distributed-memory architectures and on distributed-shared memory ones, analytically modeling and predicting their main features in terms of the expected parallelization efficiency and memory requirements. The implementation of these strategies, based on integrating the HPF and OpenMP programming environments by means of the EXTRINSIC feature of the HPF language, is presented in Section 3. Section 4 reports the experimental results obtained by running the corresponding parallel versions of HMGC on an IBM SP. Final (...truncated)