Topology-Aware Strategy for MPI-IO Operations in Clusters
Hindawi
Journal of Optimization
Volume 2018, Article ID 2068490, 13 pages
https://doi.org/10.1155/2018/2068490
Research Article
Weifeng Liu,1 Jie Zhou,2 and Meng Guo3
1 Institute of Applied Physics and Computational Mathematics, Beijing 100088, China
2 State Grid Shandong Electric Power Company, Information and Communication Company, China
3 Shandong Computer Science Center (National Supercomputer Center in Jinan), Shandong Provincial Key Laboratory of
Computer Networks, Qilu University of Technology (Shandong Academy of Sciences), China
Correspondence should be addressed to Meng Guo;
Received 2 March 2018; Revised 30 August 2018; Accepted 16 October 2018; Published 19 November 2018
Academic Editor: Wlodzimierz Ogryczak
Copyright © 2018 Weifeng Liu et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
This paper presents topology-aware two-phase I/O (TATP), which optimizes ROMIO, the most popular collective MPI-IO implementation. To improve the hop-bytes metric during file access, topology-aware two-phase I/O formulates
the assignment of file domains to aggregators as a Linear Assignment Problem (LAP) and solves it for an optimal assignment, an aspect that
most two-phase I/O implementations do not consider. The distribution is based on the local data stored by each process, and its main purpose is
to reduce the total hop-bytes of the collective I/O operation, which in turn can improve the global execution time. In most of the
scenarios considered, topology-aware two-phase I/O achieves significant improvements over the original two-phase
I/O implementations.
1. Introduction
A large class of scientific applications access a high volume
of data frequently during their execution. Scalable solutions
for efficient and concurrent access to storage are offered
by parallel file systems such as Lustre, PVFS, and GPFS.
The scientific applications access these parallel file systems
through interfaces such as POSIX and MPI-IO [1], or through high-level libraries built on MPI-IO. In this paper we
aim to optimize the implementation of the MPI-IO interface
inside ROMIO, which is the most popular MPI-IO distribution.
Most parallel applications alternate between computation and
I/O. During the I/O phase, each process often issues
a large number of small noncontiguous I/O requests to access
a common data set. These requests usually cause severe
degradation of overall I/O performance. To optimize
the performance of the I/O system, the two-phase I/O algorithm merges small individual requests into larger
contiguous requests. In this work we focus on improving the two-phase I/O technique. We have designed and evaluated the
topology-aware two-phase I/O technique, in which file data
access depends not only on the data distribution of each
process but also on the mapping of processes to
computing resources. A comparison with other versions of
two-phase I/O shows that our technique can significantly reduce the run
time.
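The merging step can be illustrated with a minimal sketch. It is not ROMIO's code: the function name `coalesce` and the interleaved access pattern are assumptions chosen for illustration, and the sketch ignores aggregators, file domains, and the data exchange between processes. It only shows how many small (offset, length) requests collapse into a few contiguous extents.

```python
def coalesce(requests):
    """Merge (offset, length) requests that overlap or touch into contiguous extents."""
    merged = []
    for offset, length in sorted(requests):
        end = offset + length
        if merged and offset <= merged[-1][1]:
            # The request starts inside, or exactly at the end of, the previous
            # extent, so the previous extent is extended instead of adding a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([offset, end])
    return [(start, end - start) for start, end in merged]


# Hypothetical pattern: four processes each own interleaved 4-byte pieces
# of a 32-byte file region, giving eight small noncontiguous requests.
requests = [(rank * 4 + stride * 16, 4) for rank in range(4) for stride in range(2)]
extents = coalesce(requests)
print(len(requests), "requests ->", extents)
```

In this example the eight small requests become a single 32-byte extent, which is the kind of access a parallel file system serves far more efficiently.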
Cluster systems are now moving towards exascale with
high-performance interconnection networks and many-core architectures. Such systems are becoming increasingly
hierarchical in both their interconnection network and their node
architecture, and communication between processes performs differently
at different levels of the hierarchy. It is therefore
critical for MPI-IO libraries to account for this hierarchy when serving the
communication demands of the I/O procedures of high
performance computing (HPC) applications. MPI-IO is the predominant I/O standard
for HPC applications in clusters. During the collective I/O
procedure defined in MPI-IO, multiple aggregators exchange
data with specific processes. However, current MPI-IO optimization strategies do not take the communication pattern
and network topology into consideration. In this work, we
have designed topology-aware two-phase I/O, which
improves the shuffle phase of collective I/O operations
by carefully placing the aggregators on suitable nodes. We
have integrated the physical node architecture with the network
topology and used graph theory inside the MPI-IO library to
replace the current trivial implementation.
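As a minimal sketch of the assignment step, the following code solves a tiny Linear Assignment Problem by exhaustive search. The cost matrix, the function name `assign_file_domains`, and the brute-force approach are all assumptions made for illustration; they are not taken from the TATP implementation, and a practical solver would use a polynomial-time method such as the Hungarian algorithm. Each entry `cost[d][a]` stands for an assumed communication cost of letting aggregator `a` serve file domain `d`.

```python
from itertools import permutations


def assign_file_domains(cost):
    """Return (best_total, assignment), where assignment[d] is the aggregator for domain d.

    cost[d][a] is the assumed cost of letting aggregator a serve file domain d.
    Exhaustive search is only suitable for tiny illustrative inputs.
    """
    n = len(cost)
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(cost[d][perm[d]] for d in range(n))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return best_total, list(best_perm)


# Hypothetical 3x3 cost matrix (rows: file domains, columns: aggregators).
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
total, assignment = assign_file_domains(cost)

# The trivial mapping "domain d -> aggregator d" for comparison.
identity_total = sum(cost[d][d] for d in range(3))
print(assignment, total, identity_total)
```

Here the trivial identity mapping costs 6, while the optimal assignment costs 5, which shows why the order in which file domains are handed to aggregators matters.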
On massively parallel clusters, parallel jobs typically
acquire a fraction of the available nodes, which are noncontiguous and do not correspond to any regular topology, even when the cluster does. Moreover, on
modern machines, contention on specific links limits
communication performance. By suitably assigning processes
to the nodes of a cluster, substantial communication and
performance improvements can be achieved on large parallel
machines. Recently the hop-bytes metric [2, 3], defined
as the sum over all messages of the product of the number
of hops a message has to traverse and the message size,
has attracted much attention. For a cluster, this equals the total
communication volume. The reason for using the hop-bytes
metric is that if the total communication volume is high,
contention for specific links is also more likely
to increase, and those links then become communication bottlenecks. During the MPI-IO procedure, selecting
aggregators and assigning file domains in a way that takes hop-bytes into account can significantly reduce communication
overhead. Although the communication bottleneck caused
by link contention is not directly measured by this metric,
low values of this metric mean smaller communication overheads. When using this metric, we only have to measure the
machine topology; the routing information is not necessary.
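The definition of hop-bytes translates directly into a few lines of code. The sketch below is illustrative only: the 4x4 mesh, the node numbering, the message sizes, and the helper names `mesh_hops` and `total_for` are assumptions, not values or functions from the paper. It computes the metric from the topology alone, without routing information, and compares two candidate aggregator placements.

```python
def hop_bytes(messages, hops):
    """Sum of hops(src, dst) * size over all (src, dst, size) messages."""
    return sum(hops(src, dst) * size for src, dst, size in messages)


def mesh_hops(width):
    """Hop count between node ids on a hypothetical 2D mesh `width` nodes wide."""
    def hops(a, b):
        ax, ay = a % width, a // width
        bx, by = b % width, b // width
        return abs(ax - bx) + abs(ay - by)
    return hops


# Hypothetical shuffle phase on a 4x4 mesh: processes on nodes 0, 5, and 15
# send their data to a single aggregator.
sizes = {0: 1024, 5: 2048, 15: 512}  # sender node -> bytes sent
hops = mesh_hops(4)


def total_for(aggregator):
    """Hop-bytes of the shuffle if the aggregator runs on the given node."""
    return hop_bytes([(src, aggregator, n) for src, n in sizes.items()], hops)


at_corner = total_for(15)  # aggregator on the corner node
at_center = total_for(5)   # aggregator next to the largest sender
print(at_corner, at_center)
```

With these assumed values, placing the aggregator on node 5 gives 4096 hop-bytes against 14336 for node 15, so the same data volume costs much less when the aggregator sits close to the processes that send the most data.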
This paper is structured as follows. Section 2 reviews
related work. Section 3 describes the implementation details of
two-phase I/O. Section 4 describes
topology-aware two-phase I/O. Section 5 gives an overview of the
evaluated application and presents the evaluation results,
which compare topology-aware two-phase I/O with the
original version of two-phase I/O.
2. Related Work
Due to the increasing requirements of applications for data
movement to memory or storage, parallel I/O is an active
research topic. From the file system perspective,
highly scalable parallel file systems such as GPFS [4] and
Lustre [5] are widely used. At the application level, the
MPI-IO parallel I/O interface, which is part of the MPI-2 standard,
is commonly used. With MPI-IO, collective I/O allows
improved performance to be achieved. Various collective I/O
write algorithms are evaluated by Chaarawi et al. [6]. Some
studies try to optimize collective I/O with techniques such
as automatic collective I/O tuning with machine learning [7]
and process placement based on the I/O pattern [8]. Two-phase I/O is the de facto collective I/O algorithm [9]. It adds a
sh (...truncated)