Cluster Cache Monitor: Leveraging the Proximity Data in CMP

International Journal of Parallel Programming, Nov 2014

As the number of cores and the working sets of parallel workloads increase, shared L2 caches exhibit fewer misses than private L2 caches by making better use of the total available cache capacity, but they also induce higher overall L1 miss latencies because of the longer average distance between two nodes and the potential congestion at certain nodes. One of the main causes of these long L1 miss latencies is accesses to the home nodes of the directory. However, we have observed that there is a high probability that the target data of an L1 miss resides in the L1 cache of a neighbor node. In such cases, these long-distance accesses to the home nodes can potentially be avoided. We organize the multi-core into clusters of 2 × 2 nodes and, in order to leverage the aforementioned property, we introduce the Cluster Cache Monitor (CCM). The CCM is a hardware structure in charge of detecting whether an L1 miss can be served by one of the cluster L1 caches, and two cluster-related states are added to the coherence protocol in order to avoid long-distance accesses to home nodes upon hits in the cluster L1 caches. We evaluate this approach on a 64-node multi-core using the SPLASH-2 and PARSEC benchmarks, and we find that the CCM can reduce the execution time by 15 % and the energy by 14 %, while saving 28 % of the directory storage area compared to a standard multi-core with a shared L2. We also show that the CCM outperforms recent mechanisms such as ASR, DCC and RNUCA.
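
As a rough illustration of the organization described above, the sketch below maps the 64 nodes of the evaluated design onto 2 × 2 clusters and shows where a CCM-style check could sit on the L1 miss path. The 8 × 8 layout, the node-ID encoding and the ccm_probe_cluster() helper are assumptions made for illustration only; the actual CCM hardware and the added coherence states are defined in the paper itself.

    /* Sketch only: maps the 64 nodes of an assumed 8 x 8 layout onto 2 x 2
     * clusters and shows where a CCM-like check could intercept an L1 miss
     * before it is forwarded to the directory home node. The layout and the
     * helpers below are illustrative assumptions, not the paper's interface. */
    #include <stdbool.h>
    #include <stdio.h>

    #define MESH_DIM 8   /* 8 x 8 = 64 nodes (assumed layout) */

    /* Cluster index of a node: clusters are 2 x 2 tiles of the layout. */
    static int cluster_of(int node)
    {
        int x = node % MESH_DIM, y = node / MESH_DIM;
        return (y / 2) * (MESH_DIM / 2) + (x / 2);
    }

    /* Hypothetical probe of the CCM: in hardware this would consult the
     * monitor's record of the other L1 caches of the cluster. */
    static bool ccm_probe_cluster(int requester, unsigned long block_addr)
    {
        (void)requester;
        (void)block_addr;
        return false;   /* placeholder: the real answer comes from CCM state */
    }

    /* Simplified L1-miss path: serve locally within the cluster on a CCM hit,
     * otherwise fall back to the usual long-distance request to the home node. */
    static void l1_miss(int requester, unsigned long block_addr, int home_node)
    {
        if (ccm_probe_cluster(requester, block_addr))
            printf("node %d: miss served inside cluster %d\n",
                   requester, cluster_of(requester));
        else
            printf("node %d: miss forwarded to home node %d\n",
                   requester, home_node);
    }

    int main(void)
    {
        l1_miss(27, 0x1000UL, 63);   /* arbitrary requester, block and home node */
        return 0;
    }

In the paper's design, the detection is performed by a hardware structure together with two extra coherence states; the stub above only marks where that decision fits logically on the miss path.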


Cluster Cache Monitor: Leveraging the Proximity Data in CMP

Guohong Li, Olivier Temam, Zhenyu Liu, Sanchuan Guo, Dongsheng Wang

O. Temam: INRIA Saclay, Orsay, France
S. Guo: National Computer Network Emergency Response Technical Team/Coordination Center of China, Beijing, China

1 Introduction and Motivation

Fig. 1 Baseline architecture

Each node of a multi-core usually contains a core, a private L1 cache, and L2 storage [4,7,28], see Fig. 1. This L2 storage can take two forms: a private L2 cache, or part of a distributed L2 cache shared by all nodes. For a large number of cores and large workloads, a shared L2 outperforms private L2s, in which too much cache capacity is used for storing redundant data. We illustrate that point by comparing, in Fig. 2, the performance of two 64-core architectures with respectively a distributed shared L2 and private L2s of the same total storage, for different benchmark suites; we show the ratio of execution time (private over shared) and the memory footprint of the benchmarks.¹

¹ The methodology for these experiments is described in Sect. 4.

Beyond performance, there is also an increasingly stringent yield issue: because foundries have difficulty producing chips that contain a large amount of SRAM cells with a low enough total number of faults [2,27,30], the total SRAM capacity available on-chip may not increase as fast as the number of cores. In that context, it will become more efficient to share all the available SRAM capacity among cores rather than to use small private L2 caches. It is interesting to note, for instance, that the SPARC T3 architecture is composed of 16 cores and 6 MB of L2 storage distributed among private caches [28], while the recent Intel Xeon Phi™ Coprocessor 5110P has 60 cores and a 30 MB distributed L2 cache.

However, as the number of cores increases, the increased latency of L1 misses can cancel out the capacity advantage of the shared L2 [32]. With standard private L2 caches, an L1 miss is serviced by the local L2. With a shared L2, the L1 miss can be serviced by any of the L2 cache banks spread among all nodes; thus, as the number of nodes increases, the average L1 miss latency increases.

Fig. 2 Shared versus private L2s (ratio of execution time, private over shared)
Fig. 3 (a) Left: long-distance access to the home node. (b) Right: congestion at the home node

The root cause of long latency is usually the access to the memory block's home node. Multi-cores with a large number of cores favor distributed directory coherence, where each memory block has a home node. Upon an L1 miss, this home node must be accessed by the requesting node; it updates the coherence state of the memory block and directs the node currently owning a copy of the data to send it back to the requesting node. As the number of nodes increases, the average latency to access the home nodes can increase for two reasons: (1) the increased average distance between the requesting node and the home node, see Fig. 3a, and (2) the congestion at the home node, see Fig. 3b. This long access latency to the home node is all the more wasteful, both time-wise and energy-wise, if the requested data itself is located in a nearby node, as illustrated in Fig. 3 (see Owner node). We observe that this case is actually frequent: upon an L1 miss in one node, there is a high probability that the requested data is located in the L1 cache of a nearby node, a form of node-level spatial locality. In Fig. 4, we show the fraction of all L1 misses which can be serviced by nodes from the same cluster (the (...truncated)
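
To make the latency argument concrete, the following sketch compares hop counts for the two situations discussed above: the usual directory path (requester to home node, home node forwarding to the owner, owner returning the data) versus a direct exchange with a neighbouring node of the same 2 × 2 cluster. The 8 × 8 mesh, the particular node positions and the one-hop-per-link cost model are assumptions for illustration; they are not taken from the paper's evaluation.

    /* Sketch: hop-count comparison on an assumed 8 x 8 mesh between the
     * long-distance directory path and a cluster-local transfer. */
    #include <stdio.h>
    #include <stdlib.h>

    #define MESH_DIM 8

    /* Manhattan distance between two nodes, in link hops. */
    static int hops(int a, int b)
    {
        int ax = a % MESH_DIM, ay = a / MESH_DIM;
        int bx = b % MESH_DIM, by = b / MESH_DIM;
        return abs(ax - bx) + abs(ay - by);
    }

    int main(void)
    {
        int requester = 9;    /* node (1,1) */
        int owner     = 8;    /* node (0,1), same 2 x 2 cluster as the requester */
        int home      = 54;   /* node (6,6), distant home node of the block */

        /* Directory path: request to the home node, forward to the owner,
         * data back to the requester. */
        int directory_path = hops(requester, home)
                           + hops(home, owner)
                           + hops(owner, requester);

        /* Cluster-local path: probe the neighbour and receive the data. */
        int cluster_path = 2 * hops(requester, owner);

        printf("directory path: %d hops\n", directory_path);   /* 10 + 11 + 1 = 22 */
        printf("cluster path:   %d hops\n", cluster_path);     /* 2 */
        return 0;
    }

Under these assumed positions the cluster-local hit costs 2 hops against 22 for the directory path, which is the kind of saving the CCM aims to capture when the data happens to reside in a neighbouring L1.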


This is a preview of a remote PDF: https://link.springer.com/content/pdf/10.1007%2Fs10766-014-0339-0.pdf

Guohong Li, Olivier Temam, Zhenyu Liu, Sanchuan Guo, Dongsheng Wang. Cluster Cache Monitor: Leveraging the Proximity Data in CMP. International Journal of Parallel Programming, Volume 43, Issue 6, pp. 1054-1077, 2014. DOI: 10.1007/s10766-014-0339-0