On the analysis of random replacement caches using static probabilistic timing methods for multi-path programs

Real-Time Systems, Dec 2017

Probabilistic hard real-time systems, based on hardware architectures that use a random replacement cache, provide a potential means of reducing the hardware over-provision required to accommodate pathological scenarios and the associated extremely rare, but excessively long, worst-case execution times that can occur in deterministic systems. Timing analysis for probabilistic hard real-time systems requires the provision of probabilistic worst-case execution time (pWCET) estimates. The pWCET distribution can be described as an exceedance function which gives an upper bound on the probability that the execution time of a task will exceed any given execution time budget on any particular run. This paper introduces a more effective static probabilistic timing analysis (SPTA) for multi-path programs. The analysis estimates the temporal contribution of an evict-on-miss, random replacement cache to the pWCET distribution of multi-path programs. The analysis uses a conservative join function that provides a proper over-approximation of the possible cache contents and the pWCET distribution on path convergence, irrespective of the actual path followed during execution. Simple program transformations are introduced that reduce the impact of path indeterminism while ensuring sound pWCET estimates. Evaluation shows that the proposed method is efficient at capturing locality in the cache, and substantially outperforms the only prior approach to SPTA for multi-path programs based on path merging. The evaluation results show incomparability with analysis for an equivalent deterministic system using an LRU cache. For some benchmarks the performance of LRU is better, while for others, the new analysis techniques show that random replacement has provably better performance.



Benjamin Lesage, David Griffin, Sebastian Altmeyer, Liliana Cucu-Grosjean, Robert I. Davis

INRIA, Paris, France; University of Amsterdam, Science Park 904, Room C3.101, 1098 XH, Amsterdam, Netherlands; University of York, York, UK

Extensions This paper builds upon previous work published in RTSS 2015 (Lesage et al. 2015a) with the following extensions:

– we introduce and prove additional properties relevant to the comparison of the contribution of different cache states to the probabilistic worst-case execution time of tasks in Sect. 3;
– an improved join transfer function, used to safely merge states from converging paths, is introduced in Sect. 5 and by construction dominates the simple join introduced in Lesage et al. (2015a);
– we present and prove the validity of path renaming in Sect. 6, which allows the definition of additional transformations to reduce the set of paths considered during analysis;
– our evaluation explores new configurations in terms of both the analysis methods used and the benchmarks considered (see Sect. 7).

1 Introduction

Real-time systems such as those deployed in space, aerospace, automotive and railway applications require guarantees that the probability of the system failing to meet its timing constraints is below an acceptable threshold (e.g. a failure rate of less than $10^{-9}$ per hour for some aerospace and automotive applications).
Advances in hardware technology and the large gap between processor and memory speeds, bridged by the use of cache, make it difficult to provide such guarantees without significant over-provision of hardware resources. The use of deterministic cache replacement policies means that pathological worst-case behaviours need to be accounted for, even when in practice they may have a vanishingly small probability of actually occurring. The use of cache with a random replacement policy means that the probability of pathological worst-case behaviours can be upper bounded at quantifiably extremely low levels, for example well below the maximum permissible failure rate (e.g. $10^{-9}$ per hour) for the system. This allows the extreme worst-case behaviours to be safely ignored, instead of always being included in the estimated worst-case execution times. The random replacement policy further offers a trade-off between performance and cost thanks to a minimal hardware cost (Al-Zoubi et al. 2004). The policy and variants have been implemented in a selection of embedded processors (Hennessy and Patterson 2011) such as the ARM Cortex series (2010) or the Freescale MPC8641D (2008).

Randomisation further offers some level of protection against side-channel attacks which allow the leakage of information regarding the running tasks. While methods relying solely on the random replacement policy may still be circumvented (Spreitzer and Plos 2013), the definition of probabilistic timing analysis is a step towards the analysis of other approaches such as randomised placement policies (Wang and Lee 2007, 2008).

The timing behaviour of programs running on a processor with a cache using a random replacement policy can be determined using static probabilistic timing analysis (SPTA). SPTA computes an upper bound on the probabilistic Worst-Case Execution Time (pWCET) in terms of an exceedance function. This exceedance function gives the probability, as a function of all possible values for an execution time budget x, that the execution time of the program will exceed that budget on any single run. The reader is referred to Davis et al. (2013) for examples of pWCET distributions, and to Cucu-Grosjean (2013) for a detailed discussion of what is meant by a pWCET distribution.

This paper introduces an effective SPTA for multi-path programs running on hardware that uses an evict-on-miss, random replacement cache. Prior work on SPTA for multi-path programs by Davis et al. (2013) used a path merging approach to compute cache hit probabilities based on reuse distances. The analysis derived in this paper builds upon more sophisticated SPTA techniques for the analysis of single path programs given by Altmeyer and Davis (2014, 2015). This new analysis provides substantially improved results compared to the path merging approach. To allow the analysis of the behaviour of caches in isolation, we assume the existence of a valid decomposition of the architecture with regards to cache effects, with bounded hit and miss latencies (Hahn et al. 2015).
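As an illustration of the exceedance-function view of a pWCET, the following minimal sketch (ours, not from the paper; names and values are illustrative) tabulates P(execution time ≥ x) from a discrete execution-time PMF:

```python
# Sketch: turn a discrete execution-time PMF into an exceedance function,
# i.e. the 1-CDF P(execution time >= x) used to describe a pWCET estimate.
def exceedance(pmf):
    """pmf: {execution_time: probability}, summing to 1.
    Returns sorted (budget x, P(time >= x)) pairs."""
    assert abs(sum(pmf.values()) - 1.0) < 1e-9, "PMF must sum to 1"
    tail = 1.0
    out = []
    for x in sorted(pmf):
        out.append((x, tail))   # all remaining mass lies at or above x
        tail -= pmf[x]
    return out

# A task taking 10 cycles with probability 0.9 and 100 cycles with 0.1:
print(exceedance({10: 0.9, 100: 0.1}))   # [(10, 1.0), (100, 0.1)]
```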
1.1 Related work

We now set the work on SPTA in context with respect to related work on both probabilistic hard real-time systems and cache analysis for deterministic replacement policies. The methods introduced in this paper belong to the realm of analyses that estimate bounds on the execution time of a program. These bounds may be classified as either a worst-case probability distribution (pWCET) or a worst-case value (WCET).

The first class is a more recent research area, with the first work on providing bounds described by probability distributions published by Edgar and Burns (2000, 2001). The methods for obtaining such distributions can be categorised into three different families: measurement-based probabilistic timing analyses, static probabilistic timing analyses, and hybrid probabilistic timing analyses. The second class is a mature area of research, and the interested reader may refer to Wilhelm et al. (2008) for an overview of these methods. A specific overview of cache analysis for deterministic replacement policies, together with a comparison between deterministic and random cache replacement policies, is provided at the end of this section.

1.1.1 Probabilistic timing analyses

Measurement-based probabilistic timing analyses (Bernat et al. 2002; Cucu-Grosjean et al. 2012) collect observations on the execution time of the task under study on the target hardware. These observations are then combined, e.g. through the use of extreme value theory (Cucu-Grosjean et al. 2012), to produce the desired worst-case probabilistic timing estimate. Extreme value theory may potentially underestimate the pWCET of a program, as shown by Griffin and Burns (2010). The work of Cucu-Grosjean et al. (2012) overcomes this limitation and also introduces the appropriate statistical tests required to treat worst-case execution times as rare events. The soundness of the results produced by such methods is tied to the observed execution times, which should be representative of those at runtime. This implies a responsibility on the user, who is expected to provide input data to exercise the worst-case paths, lest the analysis result in unsound estimates (Lesage et al. 2015b). These methods nonetheless exhibit the benefits of time-randomised architectures. The occurrence probability of pathological temporal cases can be bounded and safely ignored provided they meet requirements expressed in terms of failure rates.

Path upper-bounding (Kosmidis et al. 2014) defines a set of program transformations to alleviate the responsibility of the user to provide inputs which cover all execution paths. The alternative paths of conditional constructs are padded with semantic-preserving instructions and memory accesses such that any path followed in the modified program upper-bounds any of the original alternatives. Measurement-based analyses can then be performed on the modified program, as the paths exercised at runtime upper-bound any alternative in the original application. Hence, upper-bounding creates a distinction between the original code and the measured one. It may also result in paths which are the sum of the original alternatives.

Hybrid probabilistic timing analyses are methods that apply measurement-based methods at the level of sub-programs or blocks of code, and then use operations such as convolution to combine these bounds to obtain a pWCET for the entire program. The main principles of hybrid analysis were introduced by Bernat et al. (2002, 2003), with execution time probability distributions estimated at the level of sub-programs. Here, dependencies may exist among the probability distributions of the sub-programs, and copulas are used to describe them (Bernat et al. 2005). By contrast, SPTAs derive the pWCET distribution for a program by analysing the structure of the program and modelling the behaviour of the hardware it runs on.
Existing work on SPTA has primarily focussed on randomised architectures containing caches with random replacement policies. Initial results for the evict-on-miss (Quinones et al. 2009) and evict-on-access (Cucu-Grosjean et al. 2012; Cazorla et al. 2013) policies were derived for single-path programs. These methods use the reuse distance of each access to determine its probability of being a cache hit. These results were superseded by later work by Davis et al. (2013), who derived an optimal lower bound on the probability of a cache hit under the evict-on-miss policy, and showed that evict-on-miss dominates evict-on-access. Altmeyer and Davis (2014) proved the correctness of the lower bound derived in Davis et al. (2013), and its optimality with regards to the limited information that it uses (i.e. the reuse distance). They also showed that the probability functions previously given in Kosmidis et al. (2013) and Quinones et al. (2009) are unsound (optimistic) for use in SPTA.

In 2013, a simple SPTA for multi-path programs was introduced by Davis et al. (2013), based on path merging. With this method, accesses are represented by their reuse distances. The program is then virtually reduced to a single sequence which upper-bounds all possible paths with regards to the reuse distance of their accesses.

In 2014, more sophisticated SPTA methods for single path programs were derived by Altmeyer and Davis (2014). They introduced the notion of cache contention, which, combined with reuse distance, enables the computation of a more precise bound on the probability that a given access is a cache hit. Altmeyer and Davis (2014) also introduced a significantly more effective method based on combining exhaustive evaluation of the cache behaviour, for a limited number of relevant memory blocks, with cache contention. This method provides an effective trade-off between analysis precision and tractability. Griffin et al. (2014a) introduce orthogonal lossy compression methods on top of the cache state enumeration to improve the trade-off between complexity and precision. Altmeyer and Davis further refined their approach to SPTA for single path programs in 2015 (Altmeyer et al. 2015), bridging the gap between contention and enumeration-based analyses. The method relies on simulation of the behaviour of a random replacement cache. As opposed to exhaustive state analyses, however, focus is set at each step on a single cache state to capture the outcome across all possible states. The resulting approach offers improved precision over contention-based methods, at a lower complexity than exhaustive state analyses.

In this paper, we build upon the state-of-the-art approach (Altmeyer and Davis 2014), extending it to multi-path programs. The techniques introduced in the following notably allow for the identification, on control flow convergence, of relevant cache contents, i.e. the identification of the outcomes in multi-path programs. The approach focuses on the enumeration of possible cache states at each point in the program. To reduce the complexity of such an approach, only a few blocks, identified as the most relevant, are analysed at a given time.

1.1.2 Deterministic architectures and analyses

Static timing analysis for deterministic caches (Wilhelm et al. 2008) relies on a two-step approach, with a low-level analysis to classify the cache accesses into hits and misses (Theiling et al. 1999), and a high-level analysis to determine the length of the worst-case path (Li and Malik 2006).
The most common deterministic replacement policies are least-recently used (LRU), first-in first-out (FIFO) and pseudo-LRU (PLRU). Due to the high predictability of the LRU policy, academic research typically focuses on LRU caches, with a well-established LRU cache analysis based on abstract interpretation (Alt et al. 1996; Theiling et al. 1999). Only recently have analyses for FIFO (Grund and Reineke 2010) and PLRU (Grund and Reineke 2010; Griffin et al. 2014b) been proposed, both with a higher complexity and lower precision than the LRU analysis, due to specific features of these replacement policies. Despite the focus on LRU caches and their analysability, FIFO and PLRU are often preferred in processor designs due to their lower implementation costs, which enable higher associativities.

Recently, Reineke (2014) observed that SPTA based on reuse distances (Davis et al. 2013) results, by construction, in less precise bounds than existing analyses based on stack distance for an equivalent system with an LRU cache (Wilhelm et al. 2008). However, this does not hold for the more sophisticated SPTA based on cache contention and collecting semantics given by Altmeyer and Davis (2014). Analyses for deterministic LRU caches are incomparable with these analyses for random replacement caches. This is illustrated by our evaluation results. It can also be seen by considering simple examples, such as a repeated sequence of accesses to five memory blocks a, b, c, d, e, a, b, c, d, e with a four-way associative cache. With LRU, no hits can be predicted. By contrast, with a random replacement cache and SPTA based on cache contention, four out of the last five accesses can be assumed to have a non-zero probability of being a cache hit (as shown in Table 1 of Altmeyer and Davis 2014); hence SPTA for a random replacement cache outperforms analysis of LRU in this case. We note that, in spite of recent efforts (de Dinechin et al. 2014), the stateless random replacement policies have lower silicon costs than LRU, and so can potentially provide improved real-time performance at lower hardware cost.

Early work (David and Puaut 2004; Liang and Mitra 2008) in the domain of SPTA for deterministic architectures relied for its correctness on knowledge of the probability that a specific path would be taken or that specific input data would be encountered; however, in general such assumptions may not be available. The analysis given in this paper does not require any assumption about the probability distribution of different paths or inputs. It relies only on the random selection of cache lines for replacement.

1.2 Organisation

In this paper, we introduce a set of methods that are required for the application of SPTA to multi-path programs. Section 2 recaps the assumptions and methods upon which we build. These were used in previous work (Altmeyer and Davis 2014) to upper-bound the pWCET distribution of a trace corresponding to a single path program. We then proceed by defining key properties which allow the ordering of cache states w.r.t. their contribution to the pWCET of a program (Sect. 3). We address the issue of multi-path programs in the context of SPTA in Sect. 4. This includes the definition of conservative (over-approximate) join functions to collect information regarding cache contention, possible cache contents, and the pWCET distribution at each program point, irrespective of the path followed during execution. Further improvements on cache state conservation at control flow convergence are introduced in Sect. 5.
Section 6 introduces simple program transformations which improve the precision of the analysis while ensuring that the pWCET distribution of the transformed program remains sound (i.e. upper-bounds that of the original). Multi-path SPTA is applied to a selection of benchmarks in Sect. 7, and the precision and run-time of the different approaches are compared. Section 8 concludes with a summary of the main contributions of the paper and a discussion of future work.

2 Static probabilistic timing analysis

In this section, we recap state-of-the-art SPTA techniques for single path programs (Altmeyer and Davis 2014). We first give an overview of the system model assumed throughout the paper in Sect. 2.1. We further recap the existing methods (Altmeyer and Davis 2014) to evaluate the pWCET of a single path trace using a collecting approach (Sect. 2.2) supplemented by a contention one. The pertinence of the model is discussed at the end of this section. The notations introduced in the present contributions are summarised in Table 1.

We assume an architecture for which a valid decomposition exists with regards to the cache, such that its timing contribution can be analysed in isolation from other components (Hahn et al. 2015). Further, the overall execution time penalties emanating from cache misses and hits are assumed to be bounded by the latencies assumed by the analysis. Thus a local worst-case, a miss in the context of the cache, can be added to the local worst-case for other components to obtain a bound on the global worst case (Reineke et al. 2006). This enables analysis of the impact of the cache in isolation from other architectural features.

2.1 Cache model

We assume a single level, private, N-way fully-associative cache with an evict-on-miss random replacement policy. On an access, should the requested memory block be absent from the cache, then the contents of a randomly selected cache line are evicted. The requested memory block is then loaded into the selected location. Given that there are N ways, the probability of any given cache line being selected by the replacement policy is $\frac{1}{N}$. We assume a fixed upper-bound on the hit and miss latencies, denoted by H and M respectively, such that H < M. (We note that the restriction to a fully-associative cache can easily be lifted for a set-associative cache through the analysis of each cache set as an independent fully-associative cache.)
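The cache model above can be made concrete with a small simulator; the sketch below (our illustration, with assumed toy latencies) implements the evict-on-miss behaviour: a hit leaves the cache untouched, while a miss overwrites a uniformly chosen line, empty or not:

```python
import random

# Sketch of the assumed cache model: single-level, private, N-way
# fully-associative, evict-on-miss random replacement. One call is one
# random run; repeated runs sample the execution time distribution.
def run_trace(trace, n_ways, hit_latency, miss_latency, rng=random):
    lines = [None] * n_ways          # None marks an empty line
    time = 0
    for block in trace:
        if block in lines:
            time += hit_latency      # hit: contents left unchanged
        else:
            time += miss_latency     # miss: any of the N lines is
            lines[rng.randrange(n_ways)] = block   # selected w.p. 1/N
    return time

print(run_trace("abcbdfabcdf", n_ways=4, hit_latency=1, miss_latency=10))
```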
2.2 Collecting semantics

We now recap the collecting semantics introduced by Altmeyer and Davis (2014) as a more precise but more complex alternative to the contention-based method of computing pWCET estimates. This approach performs exhaustive cache state enumeration for a selection of relevant accesses, hence providing tight analysis results for those accesses. To prevent state explosion, at each point in the program no more than R memory blocks are relevant at the same time. The relevant accesses are the ones heuristically identified as benefiting the most from a precise analysis.

A trace t is defined as an ordered sequence $[e_1, \ldots, e_n]$ of n accesses to memory blocks, such that $e_i = e_j$ if and only if accesses $e_i$ and $e_j$ target the same memory block. If access $e_i$ is relevant, the block it accesses will be considered relevant until the next non-relevant access to the same block. The precise approach is only applied to relevant accesses, while the contention-based method outlined in Sect. 2.2.1 is used for the others, identified as $\bot$ in the trace of relevant blocks. The set of elements in a trace becomes $E_\bot = E \cup \{\bot\}$.

The abstract domain of the analysis is a set of cache states. A cache state is a triplet $CS = (C, P, D)$ with cache contents C, a corresponding probability $P \in \mathbb{R}$, $0 < P \le 1$, and a miss distribution $D : \mathbb{N} \to \mathbb{R}$ when the cache is in state C. C is a set of at most N memory blocks picked from E. A cache state which holds fewer than N memory blocks represents partial knowledge about the cache contents, without any distinction between empty lines or unknown contents (this suits evict-on-miss caches, which do not prioritise empty lines when filling the cache). The set of all cache states is denoted by $\mathbb{CS}$. The miss distribution D captures, for each possible number of misses n, the probability that n misses occurred from the beginning of the program up to the current point in the program. The method computes all possible behaviours of the random cache with the associated probabilities. It is thus correct by construction, as it simply enumerates all states exhaustively.

The update function u describes the update of a single cache state upon access to element $e \in E_\bot$. Upon accessing a relevant element $e \neq \bot$, if e is present in the cache, its contents are left unchanged. Otherwise, new cache states need to be generated, considering that each element may be evicted with the same probability $\frac{1}{N}$ (in the evict function). A miss is accounted for in the resulting distributions $D'$ only upon misses on a relevant access. Formally:

$$ u : \mathbb{CS} \times E_\bot \to 2^{\mathbb{CS}} \quad (1) $$

$$ u((C, P, D), e) = \begin{cases} \{(C, P, D)\} & \text{if } e \in C \wedge e \neq \bot \\ evict((C, P, D), e) & \text{otherwise} \end{cases} \quad (2) $$

$$ evict((C, P, D), e) = \begin{cases} \{(C \setminus \{e'\} \cup \{e\},\ P \cdot \frac{1}{N},\ D') \mid e' \in C\} \cup \{(C \cup \{e\},\ P \cdot \frac{N - |C|}{N},\ D')\} & \text{if } e \neq \bot \\ \{(C \setminus \{e'\},\ P \cdot \frac{1}{N},\ D') \mid e' \in C\} \cup \{(C,\ P \cdot \frac{N - |C|}{N},\ D')\} & \text{if } e = \bot \end{cases} \quad (3) $$

$$ D'(x) = \begin{cases} D(x) & \text{if } e = \bot \\ 0 & \text{if } x = 0 \\ D(x - 1) & \text{otherwise} \end{cases} \quad (4) $$

The evict(s, e) function creates N different cache states, one per possible evicted element, some of which might represent the same cache contents. To reduce the state space, a merge operation combines cache states if they contain exactly the same memory blocks. If merging occurs, each distribution is weighted by its probability:

$$ Merge : 2^{\mathbb{CS}} \to 2^{\mathbb{CS}} \quad (5) $$

$$ Merge\left(\left\{ (C_0, P_0, D_0), \ldots, (C_n, P_n, D_n) \right\}\right) = \left\{ \left( C_0,\ \sum_{i=0}^{n} P_i,\ \sum_{i=0}^{n} \frac{P_i}{\sum_{k=0}^{n} P_k} \cdot D_i \right) \right\} \ \text{if } C_i = C_j\ \forall\, 0 \le i, j \le n \quad (6) $$

where $p \cdot D$ denotes the multiplication of the elements of distribution D, $(p \cdot D)(x) = p \cdot D(x)$, and $D_1 + D_2$ is the summation of two distributions, $(D_1 + D_2)(x) = D_1(x) + D_2(x)$.

The analysis starts from the empty cache state $\{(\emptyset, 1, D_{init})\}$ where

$$ D_{init}(x) = \begin{cases} 1 & \text{if } x = 0 \\ 0 & \text{otherwise} \end{cases} \quad (7) $$

The update function can be defined for a set of cache states using the update function u for a single cache state and the merge operator as follows:

$$ U : 2^{\mathbb{CS}} \times E_\bot \to 2^{\mathbb{CS}}, \qquad U(S, e) = Merge\left( \bigcup_{CS \in S} u(CS, e) \right) \quad (8) $$

Given $S_{res}$, the set of cache states at the end of the execution of a trace t, the miss distribution $\hat{D}_{miss}$ of the relevant blocks in t is the sum of the individual distributions of each cache state, weighted by their probability of occurrence:

$$ \hat{D}_{miss} = \sum \left\{ P \cdot D \mid (C, P, D) \in S_{res} \right\} \quad (9) $$

The corresponding execution time distribution $\hat{D}$ can then be derived, for a trace of n accesses, as follows:

$$ \hat{D}(m \times M + (n - m) \times H) = \hat{D}_{miss}(m) \quad (10) $$
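The collecting semantics translates almost directly into code. The sketch below is our illustration of (1)-(8) under an assumed data layout: a cache state is a tuple (contents, P, D), with contents a frozenset, D a dict from miss counts to probabilities, and ⊥ represented as None:

```python
from collections import defaultdict

BOTTOM = None        # the non-relevant element, written ⊥ above
N = 4                # cache ways

def shift(D):        # D'(x) = D(x - 1): record one more relevant miss
    return {x + 1: p for x, p in D.items()}

def update(state, e):                # the update function u
    C, P, D = state
    if e is not BOTTOM and e in C:
        return [state]               # relevant hit: state unchanged
    D2 = D if e is BOTTOM else shift(D)
    new = {e} if e is not BOTTOM else set()
    out = [(frozenset((C - {v}) | new), P / N, D2) for v in C]
    if len(C) < N:                   # an empty/unknown line may be chosen
        out.append((frozenset(C | new), P * (N - len(C)) / N, D2))
    return out

def merge(states):   # Merge: combine states with identical contents
    groups = defaultdict(list)
    for C, P, D in states:
        groups[C].append((P, D))
    result = []
    for C, group in groups.items():
        total = sum(p for p, _ in group)
        D = defaultdict(float)
        for p, d in group:
            for x, q in d.items():
                D[x] += (p / total) * q    # distributions weighted by P
        result.append((C, total, dict(D)))
    return result

def U(states, e):    # update over a set of cache states
    return merge([s for cs in states for s in update(cs, e)])

states = [(frozenset(), 1.0, {0: 1.0})]    # empty cache, D_init
for e in ["a", "b", BOTTOM, "a"]:          # a toy trace of accesses
    states = U(states, e)
```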
2.2.1 Non-relevant blocks analysis

One possible naive approach for non-relevant blocks would be to classify them as misses in the cache and add the resulting latency to the previously computed distributions. The collecting approach proposed by Altmeyer and Davis (2014) instead relies on the application of the contention methods to estimate the behaviour of the non-relevant blocks in a trace. Each access in a trace has a probability of being a cache hit, $P(e_i^{hit})$, and of being a cache miss, $P(e_i^{miss}) = 1 - P(e_i^{hit})$. These methods rely on different metrics to lower-bound the hit probability of each access, such that the derived bounds can be soundly convolved.

The reuse distance rd(e) of element e is the maximum number of accesses to consecutively different blocks since the last access to the same block. It captures an upper-bound on the maximum number of possible evictions between two accesses to the same block, similarly to the stack distance for LRU caches. It differs from the stack distance in that accesses to the same intermediate block may be accounted for multiple times, since that block may have been evicted during the access sequence. Should there be no prior access to the same block, the reuse distance is defined as $\infty$. Given the set of all traces T and of all elements E, the reuse distance is formally defined as:

$$ rd : E \times T \to \mathbb{N} \cup \{\infty\} $$

$$ rd(e_i, [e_1, \ldots, e_{i-1}]) = \begin{cases} |\{k \mid j < k < i \wedge e_k \neq e_{k-1}\}| & \text{if } \exists j : e_i = e_j \wedge \forall k : j < k < i,\ e_i \neq e_k \\ \infty & \text{otherwise} \end{cases} \quad (13) $$

Note that this definition of the reuse distance is a variation of the one proposed in earlier work. The revised equation (13) computes the same property, but has to discard successive accesses to the same block. Successive accesses to the same memory block lead to guaranteed cache hits under an evict-on-miss cache replacement policy. Traces are thus collapsed in Altmeyer et al. (2015) to remove all successive accesses to the same memory block. The number of cache misses is not impacted, and cache hits can later be accounted for as an additional contribution to the trace. This last step is not straightforward for multi-path programs, as the number of guaranteed hits varies on different paths.

Conversely, we define the forward reuse distance frd(e) of an element e as the maximum number of possible evictions before the next access to the same block. If its block is not reused before the end of the trace, the forward reuse distance of an access is defined as $\infty$:

$$ frd : E \times T \to \mathbb{N} \cup \{\infty\} $$

$$ frd(e_i, [e_{i+1}, \ldots, e_m]) = \begin{cases} |\{k \mid i < k < j \wedge e_k \neq e_{k-1}\}| & \text{if } \exists j : e_i = e_j \wedge \forall k : i < k < j,\ e_i \neq e_k \\ \infty & \text{otherwise} \end{cases} \quad (14) $$
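Both metrics can be computed by a direct scan of the trace; a minimal sketch of (13) and (14) (ours; the trace encoding is assumed):

```python
INF = float("inf")

def reuse_distances(trace):
    """rd per (13): accesses to consecutively different blocks since the
    last access to the same block, or infinity if there is none."""
    out = []
    for i, e in enumerate(trace):
        prev = [k for k in range(i) if trace[k] == e]
        if not prev:
            out.append(INF)
            continue
        j = prev[-1]
        out.append(sum(1 for k in range(j + 1, i)
                       if trace[k] != trace[k - 1]))
    return out

def forward_reuse_distances(trace):
    """frd per (14): the same measure up to the next access to the block."""
    out = []
    for i, e in enumerate(trace):
        nxt = [k for k in range(i + 1, len(trace)) if trace[k] == e]
        if not nxt:
            out.append(INF)
            continue
        j = nxt[0]
        out.append(sum(1 for k in range(i + 1, j)
                       if trace[k] != trace[k - 1]))
    return out

# The example sequence used just below:
print(reuse_distances(list("abcbdfabcdf")))
# [inf, inf, inf, 1, inf, inf, 5, 3, 5, 4, 4]
```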
The probability of $e_i$ being a hit is set to 0 if more blocks contend for cache space since the last access to the same block than the N available lines. This is captured by the cache contention $con(e_i, t)$ (Altmeyer and Davis 2014) of element $e_i$ in trace t. The definition of $\hat{P}(e_i^{hit})$, which denotes a lower bound on the actual probability $P(e_i^{hit})$ of a cache hit, is as follows:

$$ \hat{P}(e_i^{hit}) = \begin{cases} 0 & \text{if } con(e_i, t) \ge N \\ \left(\frac{N-1}{N}\right)^{rd(e_i, t)} & \text{otherwise} \end{cases} \quad (15) $$

The cache contention con(e) (Altmeyer and Davis 2014) of element e captures the number of cache blocks which contend with e for space in the cache. It includes all potential hits and the R relevant blocks, denoted relevant_blocks, since we have to assume they occupy a separate location in the cache. Contention depends on, and contributes to, the potential hits captured by $\hat{P}(e_j^{hit})$, $j < i$, and is computed from the first accesses, where $rd(e_i, t) = \infty$, to the last. The contention also accounts for the first miss $e_r$ which follows the previous access to the same memory block as $e_i$ and hence contends with $e_i$; the replacement policy means that $e_r$ always contends for space. The cache contention is formally defined as:

$$ con : E \times T \to \mathbb{N} \cup \{\infty\} $$

$$ con(e_i, t) = \begin{cases} \infty & \text{if } rd(e_i, t) = \infty \\ |\{e_k \mid k \in conS(e_i, t) \wedge e_k \notin relevant\_blocks\}| + R & \text{otherwise} \end{cases} \quad (16) $$

where $conS(e_i, t)$ (17) denotes the set of accesses which potentially contend with $e_i$: the potential cache hits, and the first miss $e_r$, since the previous access to the same block.

Example We now illustrate the distinction between cache contention and reuse distance in identifying accesses with a null hit probability in (15). Consider the following sequence of accesses, on a 4-line fully-associative cache, where the reuse distance of each access is given as a superscript:

$$ a, b, c, b^1, d, f, a^5, b^3, c^5, d^4, f^4 $$

All second accesses to blocks a, b, c, d, and f have a non-zero chance to hit when considered in isolation. However, as highlighted in Altmeyer and Davis (2014), these cannot simply be combined, as the hit probability of a block depends on the behaviour of other blocks; the last 5 accesses of the sequence, each accessing a different block, cannot all hit at the same time assuming a 4-line cache. The hit probability of an access needs to be set to 0 in (15) if enough blocks are inserted in the cache since the last access to the same block. Should the reuse distance alone be considered to identify whether or not an access is a potential hit, the last occurrences of a, c, d, and f would be considered as misses. Using cache contention, some accesses are assumed to be potential hits, occupying cache space to the detriment of others. Cache contention captures a specific but potential hit/miss scenario, the occurrence of which is bounded using each access hit probability in (15). As proven in Altmeyer and Davis (2014), the estimated hit probability of the overall sequence holds. In our example, contention identifies that a, b, and c can be kept in the cache simultaneously. Using the contention as a superscript, we have:

$$ a, b, c, b^1, d, f, a^2, b^2, c^3, d^4, f^4 $$

$c^3$ implies that c may be present in cache, assuming only three other blocks may have been kept alongside it: a and b as potential cache hits, and d then replaced by f. This assumption regarding d and f is an important difference between contention and the stack distance metric used in LRU cache analysis. Using the stack distance, i.e. the number of different blocks accessed since the last access to c, d and f would be regarded as occupying a different line in cache, resulting in a guaranteed miss for c. $d^4$ is classified as a miss: $a^2$, $b^2$ and $c^3$ have been identified as potential misses, and f is a miss resulting in the eviction of the fourth and only cache line where d could be held. $f^4$ is similarly classified as a miss.

Note that this definition of contention is an improvement on the one proposed in earlier work. Instead of accounting for each access independently, we account for the accessed blocks instead. The reasoning behind this optimisation is that if an accessed block hits more than once, it does not occupy additional lines. In the previous example, b is only accounted for once in the contention of $a^2$ and $c^3$. The subtle difference lies in (17), where the blocks $e_j$ are accounted for instead of each access j individually ($e_i = e_j$ if they access the same block).

The execution time of an element $e_i$ can be approximated with the help of the discrete random variable $\hat{\xi}_i$, which has a probability mass function (PMF) defined as:

$$ \hat{\xi}_i(x) = \begin{cases} \hat{P}(e_i^{hit}) & \text{if } x = H \\ 1 - \hat{P}(e_i^{hit}) & \text{if } x = M \\ 0 & \text{otherwise} \end{cases} \quad (18) $$

An estimated pWCET (Cucu-Grosjean 2013) distribution $\hat{D}$ of a trace is an upper-bound on the execution time distribution D induced by the randomised cache for the trace (the precise execution time distribution is effectively that which would be observed by executing the trace an infinite number of times), such that $\forall v, P(\hat{D} \ge v) \ge P(D \ge v)$. In other words, the distribution $\hat{D}$ is greater than D (López et al. 2008), denoted $\hat{D} \ge D$.
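To see how (15) and the per-access PMFs fit together, here is a toy end-to-end sketch (ours; toy latencies assumed): each access contributes a two-point PMF over {H, M}, and the trace estimate is their convolution (the ⊗ operation formalised just below):

```python
INF = float("inf")
N, H, M = 4, 1, 10        # assumed toy parameters

def p_hat(rd, con):
    """Lower bound (15): zero if enough blocks contend for the N lines."""
    if rd == INF or con >= N:
        return 0.0
    return ((N - 1) / N) ** rd

def convolve(d1, d2):
    out = {}
    for x1, p1 in d1.items():
        for x2, p2 in d2.items():
            out[x1 + x2] = out.get(x1 + x2, 0.0) + p1 * p2
    return out

def contention_spta(accesses):
    """accesses: (reuse distance, contention) per access in the trace."""
    est = {0: 1.0}
    for rd, con in accesses:
        p = p_hat(rd, con)
        est = convolve(est, {H: p, M: 1.0 - p})
    return est            # an upper-bound execution time distribution

# The running example with its (rd, con) pairs from the superscripts:
example = [(INF, INF), (INF, INF), (INF, INF), (1, 1), (INF, INF),
           (INF, INF), (5, 2), (3, 2), (5, 3), (4, 4), (4, 4)]
print(contention_spta(example))
```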
The probability mass functions $\hat{\xi}_i$ are independent upper-bounds on the behaviour of the corresponding accesses $e_i$. An estimate for trace t can be derived by combining the probability mass functions $\hat{\xi}_i$ of each of its composing memory accesses $e_i$:

$$ \hat{D}(t) = \bigotimes_{e_i \in t} \hat{\xi}_i \quad (19) $$

where $\otimes$ represents the convolution of PMFs:

$$ (\hat{\xi}_i \otimes \hat{\xi}_j)(x) = \sum_{k=-\infty}^{+\infty} \hat{\xi}_i(k) \cdot \hat{\xi}_j(x - k) \quad (20) $$

The resulting distribution for non-relevant accesses is independent of the relevant blocks considered in the cache during the collecting analysis step. A worst-case is assumed where the R relevant blocks are always kept in the cache. The distributions resulting from the two analysis steps, collecting and contention, can therefore be soundly convolved to estimate the execution time of a trace. The pWCET of a trace can then be derived by convolving the execution time distributions produced by the contention and collecting approaches, as derived from $\hat{D}_{miss}$.

2.3 Discussion: relevance of the model

The SPTA techniques described apply whether the contents of the memory blocks are instructions, data, or both. While address computation (Huynh et al. 2011) may not be able to pinpoint the exact target of an access, e.g. for data-dependent requests, relational analysis (Hahn and Grund 2012), introduced in the context of deterministic systems, can be used to identify accesses which map to the same or different sets, and access the same or different blocks. Two accesses which obey the same block relation can then be replaced by accesses to the same unique element, hence improving the precision of the analysis.

The methods assume that there are no inter-task cache conflicts due to preemption, i.e. a run-to-completion semantics with non-preemptable program execution. Concurrent cache accesses are also precluded, i.e. we assume a private cache or appropriate isolation (Chiou et al. 2000). In practice, a detailed analysis could potentially distinguish between different latencies for each access, beyond M and H, but such precise estimation of the miss latency requires additional analysis steps, e.g. analysis of the main memory (Bourgade et al. 2008). Further, to reduce the pessimism inherent in using a simple bound, particularly for the miss latency, events such as memory refresh can be accounted for as part of higher-level schedulability analyses (Atanassov and Puschner 2001; Bhat and Mueller 2011).

3 Comparing cache contents

The execution time distribution of a trace in our model depends solely on the behaviour of the cache. The contribution of a cache state to the execution time of a trace thus solely depends on its initial contents. The characterisation of the relation between the initial contents of different caches allows for a comparison of their temporal contribution to the same trace. This section introduces properties and conditions that allow this comparison. They are used in later techniques to improve the selection of cache contents on path convergence, and to identify paths with the worst impact on execution time.

An N-tuple represents the concrete contents of an N-way cache, such that each element corresponds to the block held by a single line. The symbol _ is used to denote an empty line. For each such concrete cache s, there is a corresponding abstract cache contents C which holds the exact same set of blocks. C might also capture uncertainty regarding the contents of some lines.
Given cache state $s = \langle l_1, \ldots, l_N \rangle$ (we assume a fully-associative cache, but this restriction can be lifted to set-associative caches through the independent analysis of each set), $s[l_i = b]$ represents the replacement of memory block or line $l_i$ in the cache by memory block b. Note that b can only be present once in the cache, $b \in s \Rightarrow s[l_i = b] = s$. $s[-l_i]$ is a shorthand for $s[l_i = \_]$ and identifies the eviction of memory block $l_i$ from the cache. $s[l_i = b][l_j = e]$ denotes a sequence of replacements where b first replaces $l_i$ in s, then e replaces $l_j$. Two cache states s and s', although not strictly identical, may exhibit the same behaviour if they hold the exact same contents, e.g. $\langle a, \_ \rangle$ and $\langle \_, a \rangle$ are represented using the same abstract contents {a}. Under the evict-on-miss random replacement policy, there is no correlation between the physical and logical position of a block with respect to the eviction policy.

We denote by D(t, s) the execution time distribution of trace t using input cache state s. The execution time distribution of the sequence [[b], t], the concatenation of access [b] to trace t, can be expressed as follows:

$$ D([[b], t], s = \langle l_1, \ldots, l_N \rangle) = \begin{cases} H + D(t, s) & \text{if } b \in s \\ M + \sum_{i \in [1,N]} \frac{1}{N} \cdot D(t, s[l_i = b]) & \text{otherwise} \end{cases} \quad (21) $$

where the sum of distributions and the product of a distribution with $\frac{1}{N}$ are defined as per (6), and $L + D$ denotes the shift of distribution D by latency L, $(L + D)(x) = D(x - L)$. Upon a hit, the input cache state s is left unchanged, while evictions occur to make room for the accessed block upon a miss.

The extension of this definition to the concatenation of traces requires the identification of the outcomes of an execution, i.e. the cache state C corresponding to each possible sequence of events, along with its occurrence probability P and execution time distribution D:

$$ D([t_p, t_s], s) = \sum_{(C, P, D) \in outcomes(t_p, s)} P \cdot (D \otimes D(t_s, C)) \quad (22) $$

where $outcomes(t_p, s)$ is the set of cache states produced by the execution of $t_p$ from input cache state s, and $\otimes$ is the convolution of distributions.
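Definition (21) can be executed directly for small traces; the following sketch (ours; toy parameters, exponential in trace length) computes the exact execution time distribution D(t, s):

```python
H, M, N = 1, 10, 2        # assumed toy latencies and associativity

def add_latency(L, dist):                  # (L + D)(x) = D(x - L)
    return {x + L: p for x, p in dist.items()}

def acc(target, weight, dist):
    for x, p in dist.items():
        target[x] = target.get(x, 0.0) + weight * p

def D(t, s):
    """t: tuple of blocks; s: tuple of N lines, '_' marking empty ones."""
    if not t:
        return {0: 1.0}
    b, rest = t[0], t[1:]
    if b in s:
        return add_latency(H, D(rest, s))  # hit: s left unchanged
    out = {}
    for i in range(N):                     # each line evicted w.p. 1/N
        acc(out, 1.0 / N,
            add_latency(M, D(rest, s[:i] + (b,) + s[i + 1:])))
    return out

# Exact distribution of a tiny trace from an empty 2-way cache:
print(D(("a", "b", "a"), ("_",) * N))      # {21: 0.5, 30: 0.5}
```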
Theorem 1 The eviction of a block from any input cache state s cannot decrease the execution time distribution of any trace t, $D(t, s) \le D(t, s[-e])$.

Corollary 1 In the context of evict-on-miss randomised caches, for any trace, the empty state is the worst initial state over any other input cache state s, $D(t, s) \le D(t, \emptyset)$.

The eviction of a block might trigger additional misses, resulting in a distribution that is no less than the one where the cache contents are left untouched. This provides evidence that the assumption, upon a non-relevant access, that a block in cache is evicted, as per the update function in (3), is sound. Similarly, the replacement of a block in the cache might trigger additional misses, but might also result in additional hits instead upon reuse of the replacing block. The impact of such behaviour is however bounded.

Theorem 2 The replacement of a random block in cache triggers at most one additional hit. The distribution for any trace t from any cache state s is upper-bounded by the distribution for trace t after the replacement of a random block in s by some block e, assuming a single hit turns into a miss:

$$ H + D(t, s) \le M + \sum_{i \in [1,N]} \frac{1}{N} \cdot D(t, s[l_i = e]) \quad (23) $$

The block selected for eviction impacts the likelihood of those additional latencies suffered during the execution of the subsequent trace. Intuitively, the closer the evicted block is to reuse, the worse the impact of the eviction. We use the forward reuse distance of blocks at the beginning of trace t, $frd(b, t)$ as defined in (14), to identify the blocks which are closer to reuse than others.

Theorem 3 The replacement of a block in input cache state s by one which is reused later in trace t cannot result in a decreased execution time distribution:

$$ frd(b, t) \le frd(e, t) \le \infty \wedge b \in s \wedge e \notin s \Rightarrow D(t, s) \le D(t, s[b = e]) $$

4 Application of SPTA to multi-path programs

In this section, we improve upon the state-of-the-art SPTA techniques for traces (Altmeyer and Davis 2014) recapitulated in Sect. 2 and present methods for multi-path programs, that is complete control-flow graphs. A naive approach would be to compute all possible traces T of a task, analyse each independently, and combine their distributions. However, there are two significant problems with such an approach. Firstly, while the merge operation (6) could be used to provide a weighted combination given the probability of each path being taken at runtime, such assumptions about path probability do not hold in general. This issue can however be resolved by taking the maximum of the resulting execution-time distributions for each trace, $\bigsqcup_{t \in T} D(t)$, where we define the $\sqcup$ operation as follows:

$$ D_a \sqcup D_b := D \ \text{such that} \ \forall v : P(D \ge v) = \max(P(D_a \ge v), P(D_b \ge v)) $$

The $\sqcup$ operator computes the least upper-bound of the complementary cumulative distribution functions (1-CDF) of all its operands (similar to the upper-bound depicted in Fig. 1), a maximum of distributions which is valid irrespective of the path executed at runtime. By construction, the following properties hold:

$$ D_a \sqcup D_b \ge D_a \wedge D_a \sqcup D_b \ge D_b $$

$$ D_a \le D_b \Rightarrow D_a \sqcup D_b = D_b $$
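The ⊔ operator is straightforward to compute on discrete distributions: take the pointwise maximum of the 1-CDFs and convert back to a PMF. A sketch (ours; names and layout assumed):

```python
def lub(d1, d2):
    """Least upper-bound (⊔) of two discrete distributions, computed on
    their 1-CDFs F(x) = P(D >= x)."""
    xs = sorted(set(d1) | set(d2), reverse=True)
    def tails(d):              # P(D >= x), accumulated from large x down
        total, out = 0.0, {}
        for x in xs:
            total += d.get(x, 0.0)
            out[x] = total
        return out
    t1, t2 = tails(d1), tails(d2)
    bound = {x: max(t1[x], t2[x]) for x in xs}  # pointwise max of 1-CDFs
    pmf, prev = {}, 0.0
    for x in xs:               # recover the PMF from the bounding 1-CDF
        if bound[x] - prev > 0:
            pmf[x] = bound[x] - prev
        prev = bound[x]
    return pmf

# Example: neither operand dominates the other; lub upper-bounds both.
print(lub({5: 0.5, 20: 0.5}, {10: 0.9, 30: 0.1}))
# {30: 0.1, 20: 0.4, 10: 0.5}
```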
Secondly, the number of distinct traces is exponential in the number of control flow divergences, conditional constructs and loop iterations, which means that this naive approach is computationally intractable. A standard data-flow analysis is also problematic, since it is not possible to assign to each instruction a corresponding contribution to the execution time distribution. Our analysis on control-flow graphs resolves these problems. It relies on the collecting and contention approaches for relevant and non-relevant blocks respectively, as per the cache collecting approach on traces given by Altmeyer and Davis (2014). First, the loops in the control-flow graph are unrolled. This allows the implementation of the following steps, the computation of cache contention, the identification of relevant blocks and the cache collection, to be performed as simple forward traversals of the control flow graph. Approximation of the possible incoming states on path convergence keeps the analysis tractable. Finally, the contention and collecting distributions are combined using convolution.

4.1 Program representation

We represent the possible paths in a program using a control-flow graph (CFG), that is a directed graph $G = (V, L, v_s, v_e)$ with a finite set V of nodes, a set $L \subseteq V \times V$ of edges, a start node $v_s \in V$ and an end node $v_e \in V$. Each node v corresponds to an element in E accessed at node v. A path $\pi$ from node $v_1$ to node $v_k$ is a sequence of nodes $\pi = [v_1, v_2, \ldots, v_{k-1}, v_k]$ where $\forall i : (v_i, v_{i+1}) \in L$, and defines a corresponding trace. By extension, $[\pi, \pi']$ denotes the path composed of path $\pi$ followed by path $\pi'$. Given a set of nodes V, the symbol $\Pi(V)$ denotes the set of all paths with nodes that are included exclusively in V, and $\Pi(G) \subseteq \Pi(V)$ the set of all paths of CFG G from $v_s$ to $v_e$.

Similarly to traces, the pWCET $\hat{D}(G)$ of a program is the least upper-bound on the execution time distributions (pET) of all possible paths. Hence, $\forall \pi \in \Pi(G), \hat{D}(G) \ge D(\pi)$. Figure 1 illustrates this relation using the 1-CDF ($F(x) = P(D \ge x)$) of different execution time distributions and a valid pWCET.

We say that a node $v_d$ dominates $v_n$ in the control-flow graph G if every path from the start node $v_s$ to $v_n$ goes through $v_d$, $v_s \to^* v_n = v_s \to^* v_d \to^* v_n$, where $v_s \to^* v_d \to^* v_n$ is the set of paths from $v_s$ to $v_n$ through $v_d$. Similarly, a node $v_p$ post-dominates $v_n$ if every path from $v_n$ to the end node $v_e$ goes through $v_p$, $v_n \to^* v_e = v_n \to^* v_p \to^* v_e$. We refer to the set of dominators and post-dominators of node $v_n$ as $dom(v_n)$ and $post\text{-}dom(v_n)$ respectively.

We assume that the program always terminates. Bounded recursion and loop iterations are requirements to ensure this termination property of the analysed application.

Fig. 2 Simple do-while loop structure with an embedded conditional. b is the loop head, with its body comprising {b, c, d, e} and the e to b edge as the back-edge. e and c are both valid exits

The additional restrictions described below are for the most part tied to the WCET analysis framework (Wilhelm et al. 2008) and not exclusive to the new method. These are reasonable assumptions for the software in critical real-time systems. Any cycle in the CFG must be part of a natural loop. We define a natural loop $l = (v_h, V_l)$ in G with a header $v_h \in V$ and a finite set of nodes $V_l \subseteq V$. Considering the example in Fig. 2, b is the head of the loop composed of accesses $V_l = \{b, c, d, e\}$. The header is the single entry-point of the loop, $\forall v_n \in V_l, v_h \in dom(v_n)$. Conversely, a natural loop may exhibit multiple exits, e.g. as a result of break constructs. Loop l contains at least one back edge to $v_h$, an edge whose end is a dominator of its source, $\exists v_b \in V_l, (v_b, v_h) \in L$. All nodes in the loop can reach one of its back edges without going through the header $v_h$.

The transition from the header $v_h$ of loop l to one of its nodes $v_n \in V_l$ begins an iteration of the loop. The maximum number of consecutive iterations of each loop, iterations which are not separated by the traversal of a node outside $V_l$, is assumed to be upper-bounded by max-iter(l, ctx). The value of max-iter(l, ctx) might change depending on the context ctx, call stack and loop iteration, of loop l, e.g. to capture triangular loops. This guarantees a finite number of paths in the program.

Calls are also subject to a small set of restrictions to guarantee the termination of the program. Recursion is assumed to be bounded; that is, cycles or repetitions in the call graph of the analysed application must have a maximum number of iterations, similarly to loops in the control flow. Function pointers can be represented as multiple targets attached to a single call. Here, the set of target functions must be exact or an over-estimate of the actual ones, so as to avoid unsound estimates which do not take all valid paths into account.

4.2 Complete loop unrolling

In the first analysis step, we conceptually transform the control-flow graph into a directed acyclic graph by loop unrolling and function inlining (Muchnick 1997). In contrast to the naive approach of enumerating all possible traces, analysis through complete loop unrolling has linear rather than exponential complexity with the number of loop iterations. Loop unrolling and function inlining are well-known techniques to improve the precision of data-flow analyses.
A complete physical unrolling that removes all back-edges significantly increases the size of the control-flow graph. A virtual unrolling and inlining is instead performed during analysis, such that calls and iterations are processed as required by the control flow. The analysis then distinguishes the different call and iteration contexts of a vertex. In either case, the size of the graph explored during analysis, and hence its complexity, scales with the number of accesses in the program under consideration. Unrolling simplifies the analysis and significantly improves its precision. As opposed to state-of-the-art analyses for deterministic replacement policies (Alt et al. 1996), the analysis of random caches through cache state enumeration does not rely on the computation of a fixpoint. The abstract domain for the analysis is by nature growing with every access, since it includes the estimated distribution of misses. Successive iterations increase the probability of blocks in the loop's working set being in the cache, and in turn increase the likelihood of hits in the next iteration. The exhaustive analysis, if not supplemented by other methods, must process all accesses in the program. We assume in the following that unrolling is performed on all analysed programs. Section 6.4.2 discusses preliminary work to bypass this restriction.

The analysis of large loops, with many predicted iterations, can be broken down into the analysis of a single iteration or groups thereof, provided a sound upper-bound of the input state is used. The contributions of different segments are then combined to compute that of the complete loop or program. Such an upper-bound input can be derived, as an example, using cache state compression (Griffin et al. 2014a) to remove low-value information. The definition of techniques to exploit the resulting trade-off between precision and analysis complexity is left as future work.

4.3 Reuse distance/cache contention on CFG

To extend the concept of reuse distance to control-flow graphs, we lift the definition from a single trace to all traces and take the maximal reuse distance over all possible paths ending in the node v:

$$ rd_G : V \to \mathbb{N} \cup \{\infty\} \quad (30) $$

$$ rd_G(v) = \max_{\pi = [v_s, \ldots, v]} rd(v, \pi) \quad (31) $$

The cache contention is extended accordingly:

$$ con_G : V \to \mathbb{N} \quad (32) $$

$$ con_G(v) = \max_{\pi = [v_s, \ldots, v]} con(v, \pi) \quad (33) $$

An upper-bound of both metrics for each access can be computed through a forward data flow analysis. The reuse distance analysis uses the maximum of the possible values on path convergence. Similarly, we lift the definition of the forward reuse distance to control-flow graphs; it can be computed through a backward data flow analysis. The contention for each block at each point in the program is computed through a forward data flow analysis. The computation of the contention relies on the estimation of the set of contending cache blocks. Its analysis domain is more complex than that of the reuse distance, as different sets of contending blocks may arise on different paths. The analysis tracks all such sets from incoming paths.
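As a rough illustration of the forward data-flow lifting in (30)-(33), the sketch below (ours; names assumed) propagates, over a DAG in topological order, the distance since the last access to each block, taking the maximum on path convergence. It counts every intervening access rather than only consecutively different ones, so it yields an upper bound on rd_G, as required for soundness:

```python
import math
from collections import defaultdict

def rd_on_cfg(access, edges, topo_order):
    """access: {node: accessed block}; edges: iterable of (u, v).
    Returns an upper bound on rd_G(v) for each node of a DAG."""
    preds = defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
    dist = {}                 # per node: {block: max distance so far}
    rd = {}
    for v in topo_order:
        env = {}
        for p in preds[v]:    # maximum of the values on convergence
            for blk, d in dist[p].items():
                env[blk] = max(env.get(blk, 0), d)
        b = access[v]
        rd[v] = env.get(b, math.inf)  # infinity: no access on any path
        env = {blk: d + 1 for blk, d in env.items()}
        env[b] = 0            # the distance to b's last access resets here
        dist[v] = env
    return rd

# Diamond CFG: access a; then b or c; then a again.
print(rd_on_cfg({1: "a", 2: "b", 3: "c", 4: "a"},
                [(1, 2), (1, 3), (2, 4), (3, 4)], [1, 2, 3, 4]))
# {1: inf, 2: inf, 3: inf, 4: 1}
```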
[...]

From the induction hypothesis, the states reached after a miss in $s[-e]$ cannot decrease the execution time distributions over their counterparts for s, thus:

$$ D(t', s[l_i = b]) \le D(t', s[-e][l_i = b]) \quad (65) $$

Hence, the sum of these distributions, in (21), cannot result in a decrease of the execution time distributions from $s[-e]$ over s:

$$ M + \sum_{i \in [1,N]} \frac{1}{N} \cdot D(t', s[l_i = b]) \le M + \sum_{i \in [1,N]} \frac{1}{N} \cdot D(t', s[-e][l_i = b]) \quad (66) $$

– b = e: The first access [b] is a hit in s, and $t'$ executes from input state s. From $s[-e] = s[-b]$, the first access [b] is a miss. As in the previous case, the resulting cache states may be the same should the lines selected for eviction and replacement match, $s[-b][l_j = b] = s$. Alternatively, another block is evicted from the cache to insert b again, and the resulting state $s[-b][l_j = b]$ holds the same contents as $s[-l_j]$. From the induction hypothesis, we know that:

$$ D(t', s) \le D(t', s[-l_j]) \quad (67) $$

As a consequence, for any j, $s[-b][l_j = b]$ holds the same contents as either s or $s[-l_j]$, and we have:

$$ D(t', s) \le D(t', s[-b][l_j = b]) \quad (68) $$

This can be extended to the execution time distribution of $t = [[b], t']$. Since the property holds for any j, we expand the equation to a weighted sum across values of j; since the selection of j has no impact on the left-hand term, we have:

$$ D(t', s) \le \sum_{j \in [1,N]} \frac{1}{N} \cdot D(t', s[-b][l_j = b]) \quad (69) $$

Because of the ordering between the hit and miss latencies, we can expand the equation by adding a hit and a miss latency respectively on the left and right hand-sides:

$$ H + D(t', s) \le M + \sum_{j \in [1,N]} \frac{1}{N} \cdot D(t', s[-b][l_j = b]) \quad (70) $$

$H + D(t', s)$ corresponds to the execution time of trace t after an initial hit in cache state s, as per (21). Similarly, the right-hand term corresponds to an initial first miss from the state $s[-b]$ before the execution of $t'$. An access to b fits these behaviours and can be incorporated into both terms:

$$ D([[b], t'], s) \le D([[b], t'], s[-b]) \quad (71) $$

Hence, $\forall t, \forall s, D(t, s) \le D(t, s[-e])$. □

Theorem 2 The replacement of a random block in cache triggers at most one additional hit. The distribution for any trace t from any cache state s is upper-bounded by the distribution for trace t after the replacement of a random block in s by some block e, assuming a single hit turns into a miss, as per (23).

Proof The property trivially holds if e is already present in the cache, as the replacement then has no impact on the cache state, $s[l_i = e] = s$. We only consider input states s where e is absent, and prove this property by induction.

– Base case $t = \emptyset$: The empty trace yields the same distribution from any input state; since H < M:

$$ H + D(\emptyset, s) \le M + D(\emptyset, s[l_i = e]) \quad (72) $$

The property holds for any i and can be extended to the weighted sum over i:

$$ \sum_{i \in [1,N]} \frac{1}{N} \cdot (H + D(\emptyset, s)) \le \sum_{i \in [1,N]} \frac{1}{N} \cdot (M + D(\emptyset, s[l_i = e])) \quad (73) $$

The same distribution is weighted and summed N times on the left-hand term, so it can be simplified as such:

$$ H + D(\emptyset, s) \le M + \sum_{i \in [1,N]} \frac{1}{N} \cdot D(\emptyset, s[l_i = e]) \quad (74) $$

– Inductive case $t = [[b], t']$: Assume the property holds for any trace $t'$ and any block x:

$$ H + D(t', s) \le M + \sum_{j \in [1,N]} \frac{1}{N} \cdot D(t', s[l_j = x]) \quad (77) $$

The execution time distribution of $t = [[b], t']$ from either s or one of the $s[l_i = e]$ depends first on the presence or absence of the first accessed block b in the cache. We consider all alternatives and expand the execution time distribution of the trace as per (21): $b = e$, $b \neq e \wedge b \in s$, and $b \neq e \wedge b \notin s$.

– b = e: The block is absent from the input cache state s, and results in a miss and the eviction of a line $l_j$:

$$ H + D([[e], t'], s) = H + M + \sum_{j \in [1,N]} \frac{1}{N} \cdot D(t', s[l_j = e]) \quad (78) $$

When e is randomly inserted in the cache before the execution of the same sequence, its presence in $s[l_i = e]$ results in a guaranteed hit from any of the N possible states. The resulting cache states are left unchanged:

$$ M + \sum_{i \in [1,N]} \frac{1}{N} \cdot D([[e], t'], s[l_i = e]) = M + H + \sum_{i \in [1,N]} \frac{1}{N} \cdot D(t', s[l_i = e]) \quad (79) $$

The two expanded distributions (78) and (79) obviously have the same behaviour. The additional miss and hit latencies respectively balance the guaranteed hit and miss, while the resulting cache states are the same. Hence:

$$ H + D([[e], t'], s) = M + \sum_{i \in [1,N]} \frac{1}{N} \cdot D([[e], t'], s[l_i = e]) \quad (80) $$

– $b \neq e$, $b \in s$: The first access [b] is a guaranteed hit from s:

$$ D(t, s) = H + D(t', s) \quad (81) $$

From the input state $s[l_i = e]$, different cases have to be considered depending on whether e replaced b or not, that is respectively a guaranteed miss or a hit.
Upon a hit in particular, the cache state is left unchanged, and replacement of line $l_i$ can occur after or before the access without incidence:

$$ D(t, s[l_i = e]) = \begin{cases} M + \sum_{j \in [1,N]} \frac{1}{N} \cdot D(t', s[b = e][l_j = b]) & \text{if } l_i = b \\ H + D(t', s[l_i = e]) & \text{otherwise} \end{cases} \quad (82) $$

We expand the definition (82) of the contribution of t assuming a line $l_i$ was first replaced by e in the cache. We sum the N different terms resulting from the replacement of b or one of the other blocks, as follows:

$$ \sum_{i \in [1,N]} \frac{1}{N} \cdot D(t, s[l_i = e]) = \frac{1}{N} \cdot \left( M + \sum_{j \in [1,N]} \frac{1}{N} \cdot D(t', s[b = e][l_j = b]) \right) + \sum_{l_i \neq b} \frac{1}{N} \cdot (H + D(t', s[l_i = e])) \quad (83) $$

We deduce from the induction hypothesis (77), with e as the replacing block, an upper bound on the execution time distribution of $t'$ from s:

$$ H + D(t', s) \le M + \sum_{i \in [1,N]} \frac{1}{N} \cdot D(t', s[l_i = e]) \quad (84) $$

With the addition of a hit latency H on both sides:

$$ H + H + D(t', s) \le H + M + \sum_{i \in [1,N]} \frac{1}{N} \cdot D(t', s[l_i = e]) \quad (85) $$

$H + D(t', s)$ is equivalent to the execution time of t from s as expressed in (81); defining $U = H + \sum_{i \in [1,N]} \frac{1}{N} \cdot D(t', s[l_i = e])$, we have:

$$ H + D(t, s) \le M + U \quad (87) $$

We further distinguish in U the cases where e specifically replaces b in the cache or any other line:

$$ U = \frac{1}{N} \cdot (H + D(t', s[b = e])) + \sum_{l_i \neq b} \frac{1}{N} \cdot (H + D(t', s[l_i = e])) \quad (88) $$

Thanks to the induction hypothesis (77), we can define an upper-bound on the left-most term, where e replaces b in the cache:

$$ H + D(t', s[b = e]) \le M + \sum_{j \in [1,N]} \frac{1}{N} \cdot D(t', s[b = e][l_j = b]) \quad (89) $$

Multiplying both sides by $\frac{1}{N}$, we have:

$$ \frac{1}{N} \cdot (H + D(t', s[b = e])) \le \frac{1}{N} \cdot \left( M + \sum_{j \in [1,N]} \frac{1}{N} \cdot D(t', s[b = e][l_j = b]) \right) \quad (90) $$

From (88) and (90), we can then bound U by the execution time distribution of t when e first replaces a random line $l_i$ in s:

$$ U \le \frac{1}{N} \cdot \left( M + \sum_{j \in [1,N]} \frac{1}{N} \cdot D(t', s[b = e][l_j = b]) \right) + \sum_{l_i \neq b} \frac{1}{N} \cdot (H + D(t', s[l_i = e])) \quad (91) $$

From (83), it follows that:

$$ U \le \sum_{i \in [1,N]} \frac{1}{N} \cdot D(t, s[l_i = e]) \quad (92) $$

This can be combined with (87), such that M + U is an intermediate bound between (81) and (82):

$$ H + D(t, s) \le M + U \le M + \sum_{i \in [1,N]} \frac{1}{N} \cdot D(t, s[l_i = e]) \quad (93) $$

– $b \neq e$, $b \notin s$: We need to consider two subcases depending on whether the random insertion of b or e results in the higher execution time distribution for $t'$, i.e. the comparison between $D_b$ and $D_e$:

$$ D_b = \sum_{i \in [1,N]} \frac{1}{N} \cdot D(t', s[l_i = b]) \quad (95) $$

$$ D_e = \sum_{i \in [1,N]} \frac{1}{N} \cdot D(t', s[l_i = e]) \quad (96) $$

• $D_e \le D_b$: The induction hypothesis (77), applied to each state $s[l_j = b]$ with e as the replacing block, gives:

$$ H + D(t', s[l_j = b]) \le M + U_j \quad (97) $$

where

$$ U_j = \sum_{i \in [1,N]} \frac{1}{N} \cdot D(t', s[l_j = b][l_i = e]) \quad (98) $$

The property further holds for any j and extends to the weighted sum over j of the terms on each side of the inequality:

$$ H + \sum_{j \in [1,N]} \frac{1}{N} \cdot D(t', s[l_j = b]) \le M + \sum_{j \in [1,N]} \frac{1}{N} \cdot U_j \quad (100) $$

Using the definition of $U_j$ (98), the right-hand term can be expanded by distinguishing the cases where j and i denote the same line, that is when e replaces the randomly inserted b, $s[l_j = b][l_j = e] = s[l_j = e]$; by substituting $D_e$ (96), we get:

$$ \sum_{j \in [1,N]} \frac{1}{N} \cdot U_j = \frac{1}{N} \cdot D_e + \sum_{j \in [1,N]} \sum_{l_i \neq l_j} \frac{1}{N} \cdot \frac{1}{N} \cdot D(t', s[l_j = b][l_i = e]) \quad (102) $$

Similarly, as b is absent from each $s[l_i = e]$, the first access of t from $s[l_i = e]$ is a miss, and the same distinction applies to the case $l_j = l_i$, where b replaces the randomly inserted e, $s[l_i = e][l_i = b] = s[l_i = b]$; by substituting $D_b$ (95) in the resulting equation, we get:

$$ \sum_{i \in [1,N]} \frac{1}{N} \cdot D([[b], t'], s[l_i = e]) = M + \frac{1}{N} \cdot D_b + \sum_{i \in [1,N]} \sum_{l_j \neq l_i} \frac{1}{N} \cdot \frac{1}{N} \cdot D(t', s[l_i = e][l_j = b]) \quad (104) $$

When lines i and j do not match, the ordering of the replacement of $l_i$ and $l_j$ by e and b respectively is irrelevant, $s[l_i = e][l_j = b] = s[l_j = b][l_i = e]$. Hence the difference between (104) and the sum of $U_j$ (102) depends on the ordering between respectively $D_b$ and $D_e$. Since $D_e \le D_b$, it follows from (104) and (102) that:

$$ M + \sum_{j \in [1,N]} \frac{1}{N} \cdot U_j \le \sum_{i \in [1,N]} \frac{1}{N} \cdot D([[b], t'], s[l_i = e]) \quad (105) $$

As a consequence of (105) and (100), it follows that:

$$ H + \sum_{j \in [1,N]} \frac{1}{N} \cdot D(t', s[l_j = b]) \le \sum_{i \in [1,N]} \frac{1}{N} \cdot D([[b], t'], s[l_i = e]) \quad (106) $$

Adding a miss latency M on both sides of the inequality:

$$ M + H + \sum_{j \in [1,N]} \frac{1}{N} \cdot D(t', s[l_j = b]) \le M + \sum_{i \in [1,N]} \frac{1}{N} \cdot D([[b], t'], s[l_i = e]) \quad (107) $$

The left-hand term collapses to the execution time distribution of $t = [[b], t']$ from s, as per (21), as b is absent from the input cache state s.
Hence, the property holds:

H + D([[b], t′], s) ≤ M + (1/N) · Σ_{i∈[1,N]} D([[b], t′], s[l_i = e]) (108)

• Db ≤ De: The induction hypothesis (77) gives us the following relationship:

H + D(t′, s[l_i = e]) ≤ M + (1/N) · Σ_{j∈[1,N]} D(t′, s[l_i = e][l_j = b]) (109)

We can reduce the right-hand term as per (21), given that b is absent from the initial cache state s:

H + D(t′, s[l_i = e]) ≤ D([[b], t′], s[l_i = e]) (110)

The property, valid for any line l_i, holds for the summation below:

H + (1/N) · Σ_{i∈[1,N]} D(t′, s[l_i = e]) ≤ (1/N) · Σ_{i∈[1,N]} D([[b], t′], s[l_i = e]) (111)

Considering the ordering Db ≤ De between Db (95) and De (96), we conclude that:

H + Db ≤ H + De ≤ (1/N) · Σ_{i∈[1,N]} D([[b], t′], s[l_i = e]) (112)

Through the expansion of Db (95) and the addition of a miss latency M on both sides of the inequality, we have:

M + H + (1/N) · Σ_{i∈[1,N]} D(t′, s[l_i = b]) ≤ M + (1/N) · Σ_{i∈[1,N]} D([[b], t′], s[l_i = e]) (113)

The left-hand term again collapses, as per (21), to H + D([[b], t′], s), since b is absent from s, so the property also holds in this subcase. The same distribution might not be the dominant one on the whole input domain: there might be segments where De is greater than Db while the converse is true on the rest of the input domain. However, the property holds on each segment in either case, hence the theorem still holds overall.

The property holds in all scenarios, whether b = e, or block b is absent from or present in the input cache state s. The random replacement of a line l_i by e can trigger an additional hit on the first subsequent access to e; the additional miss latency compensates for this potential hit. From the original cache state, this access is a guaranteed miss. The resulting cache states, and the behaviour of the rest of the sequence, match whether this first access to e results in a cache hit (from s[l_i = e]) or a miss (from s). □

Lemma 3 The replacement in input cache state s of a block by another one in trace t has no impact, timing- and cache contents-wise, up to the first access to either block. The replacement can occur indifferently before trace t or before the first access to either block:

frd(b, t) ≤ frd(e, t) ≤ ∞ ∧ t = [t_p, [b], t_s] ∧ b ∉ t_p ⇒ D(t, s[b = e]) = Σ_{(C′,P′,D′)∈outcomes(t_p, s[b = e])} P′ · (D′ ⊗ D([[b], t_s], C′))

where each outcome (C′, P′, D′) matches an outcome (C, P, D) ∈ outcomes(t_p, s) such that C′ = C[b = e]; the same holds for the converse replacement s[e = b].

Proof The property trivially holds if the input cache state s holds both b and e, or neither, as the replacements are then ineffective. We focus on states which hold either one but not both. s′ denotes the input cache where the replacement occurred, s′ = s[b = e] or s′ = s[e = b]. The trace t can be divided as t = [t_p, [b], t_s], where [b] is the first reference to b in t. The subtrace t_p holds no reference to b, nor to e, as a consequence of the ordering between their forward reuse distances. The execution time distribution of trace t as per (21) is:

D(t, s) = Σ_{(C,P,D)∈outcomes(t_p, s)} P · (D ⊗ D([[b], t_s], C)) (117)

Accesses in t_p are not impacted by the presence of either b or e in the input cache. The sequence of evictions from s which leads to cache state C with probability P and execution time distribution D is matched starting from s′: it results in a cache state C′ with the same probability P and execution time distribution D. If the replaced block is absent from C, it has been evicted by accesses in t_p, and the replacing block has similarly been evicted in C′. If the replaced block is still present in C, the replacing block is similarly present in C′. The other lines hold the same contents, since we consider the same fixed sequence of evictions on t_p from s and s′. □
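To make the objects manipulated by these proofs concrete, the following sketch computes the exact execution time distribution D(t, s) of a short trace on a small evict-on-miss, random replacement cache, and checks Theorem 2 numerically. It is a minimal illustration under our own assumptions, not the paper's tooling: the latencies H and M, the cache size N, the dict-based encoding of distributions, and all identifiers are ours.

```python
from collections import defaultdict

H, M = 1, 10   # illustrative hit and miss latencies, H <= M (assumed values)
N = 2          # lines in a fully associative, evict-on-miss random cache

def D(trace, state):
    """Exact execution time distribution of `trace` (a tuple of block names)
    from cache `state` (a tuple of N blocks), following the recursive shape
    of (21): a hit adds H and keeps the state; a miss adds M and inserts the
    block in each of the N lines with probability 1/N."""
    if not trace:
        return {0: 1.0}
    b, rest = trace[0], trace[1:]
    dist = defaultdict(float)
    if b in state:                                  # guaranteed hit
        for v, p in D(rest, state).items():
            dist[v + H] += p
    else:                                           # miss, random replacement
        for i in range(N):
            for v, p in D(rest, state[:i] + (b,) + state[i+1:]).items():
                dist[v + M] += p / N
    return dict(dist)

def exceedance(dist, v):
    """P(dist >= v); distributions are ordered point-wise on these values."""
    return sum(p for x, p in dist.items() if x >= v)

def shift(dist, k):
    """Add a constant latency k to every value of a distribution."""
    return {v + k: p for v, p in dist.items()}

# Theorem 2: H + D(t, s) <= M + (1/N) * sum_i D(t, s[l_i = e])
t, s, e = ('a', 'b', 'a', 'e'), ('a', 'b'), 'e'
lhs = shift(D(t, s), H)
rhs = defaultdict(float)
for i in range(N):
    for v, p in shift(D(t, s[:i] + (e,) + s[i+1:]), M).items():
        rhs[v] += p / N
assert all(exceedance(lhs, v) <= exceedance(rhs, v) for v in range(5 * M))
```

The assertion holds on this example: the random insertion of e turns the final access to e into a possible hit, but the added miss latency M on the right-hand side compensates for it, exactly as the proof of Theorem 2 argues.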
Theorem 3 The replacement of a block in input cache state s by one which is reused later in trace t cannot result in a decreased execution time distribution:

frd(b, t) ≤ frd(e, t) ≤ ∞ ∧ b ∈ s ∧ e ∉ s ⇒ D(t, s) ≤ D(t, s[b = e])

Proof If there is no reference to memory block e in the considered trace t, the replacement of b by e in input cache state s is equivalent to the eviction of b from the cache, e ∉ t ⇒ s[b = e] = s[−b]. The theorem then holds as per Theorem 1. We therefore focus on the case where e is accessed in t, frd(e, t) ≠ ∞. We cut the trace t into segments, t = [t_p, [b], t_m, [e], t_s], such that t_p holds no reference to b nor e, as a consequence of their forward reuse distances. Similarly, we define t_m such that it holds no reference to e. The first references to b and e in trace t are thus located after t_p and t_m respectively. Because of Lemma 3, we know that the replacement has no impact on t_p, which holds no reference to either the replaced block b or the replacing one e. We therefore focus on the execution time distribution of the trace t′ = [[b], t_m, [e], t_s] from a state C_b which holds b but not e. We further prove by induction that:

D(t′, C_b) ≤ D(t′, C_b[b = e]) (118)

Base case t_m = t_s = ∅, t′ = [b, e]: The property trivially holds, as the execution of t′ from C_b results in a hit then a miss, D(t′, C_b) = H + M, whereas from input C_b[b = e] it misses then may hit or miss, D(t′, C_b[b = e]) ≥ M + H.

Inductive case t′ = [[b], t_m, [e], t_s]: Suppose the property D(t″, C) ≤ D(t″, C[x = y]) holds for any trace t″ = [[x], t_m′, [y], t_s′], where t_m′ does not access y, and for any input state C which does not hold y. From Lemma 3, this hypothesis applies to arbitrary prefixes t_p′ as long as they hold neither x nor y:

x ∉ t_p′ ∧ y ∉ t_p′ ∧ y ∉ t_m′ ∧ y ∉ C ⇒ D([t_p′, [x], t_m′, [y], t_s′], C) ≤ D([t_p′, [x], t_m′, [y], t_s′], C[x = y]) (119)

The first access to b in t′ is a guaranteed hit from C_b and a miss from C_b[b = e]. The resulting execution time distributions can be expressed as per (21):

D(t′, C_b) = H + D([t_m, [e], t_s], C_b) (120)

D(t′, C_b[b = e]) = M + (1/N) · Σ_{i∈[1,N]} D([t_m, [e], t_s], C_b[b = e][l_i = b]) (121)

– If b is present in t_m, there is an access to b prior to the first access to e in the remaining trace. [t_m, [e], t_s] can be further split into [t_m′, [b], t_m″, [e], t_s], such that t_m = [t_m′, [b], t_m″] and t_m′ holds no reference to b nor e: there is a reference to b before the next access to e. From the induction hypothesis (119), substituting b for x and e for y, we have:

D([t_m, [e], t_s], C_b) ≤ D([t_m, [e], t_s], C_b[b = e]) (122)

From Theorem 2, we know that:

H + D([t_m, [e], t_s], C_b[b = e]) ≤ M + (1/N) · Σ_{i∈[1,N]} D([t_m, [e], t_s], C_b[b = e][l_i = b]) (123)

Hence, from (122) and (123), the property trivially holds when b is in t_m, using D([t_m, [e], t_s], C_b[b = e]) as an intermediate bound:

H + D([t_m, [e], t_s], C_b) ≤ H + D([t_m, [e], t_s], C_b[b = e]) ≤ M + (1/N) · Σ_{i∈[1,N]} D([t_m, [e], t_s], C_b[b = e][l_i = b]) (124)

The leftmost and rightmost terms reduce to the property of interest using respectively (120) and (121):

D(t′, C_b) ≤ D(t′, C_b[b = e])

– Now consider the case where b is absent from the trace t_m, as is e. We distinguish the case where the first miss from C_b[b = e] in t′ = [[b], t_m, [e], t_s] selects the line that originally held b, C_b[b = e][l_i = b] = C_b, from the ones where a different line is selected; the latter results in a cache state whose contents are equivalent to those of C_b[l_i = e].
By separating those cases in (21), we have:

D(t′, C_b[b = e]) = M + (1/N) · D([t_m, [e], t_s], C_b) + (1/N) · Σ_{i∈[1,N] ∧ l_i ≠ b} D([t_m, [e], t_s], C_b[l_i = e]) (125)

This allows the definition of a lower bound U on the contribution of the complete trace from C_b[b = e]. We distinguish in U the case where l_i holds b from the others:

U = (1/N) · D([t_m, [e], t_s], C_b[b = e]) + (1/N) · Σ_{i∈[1,N] ∧ l_i ≠ b} D([t_m, [e], t_s], C_b[l_i = e]) (128)

The induction hypothesis (119) applies to [t_m, [e], t_s] from C_b[b = e], which holds e but not b:

D([t_m, [e], t_s], C_b[b = e]) ≤ D([t_m, [e], t_s], C_b[b = e][e = b]) (129)

Replacing b by e then e by b has no impact on the cache contents, C_b[b = e][e = b] = C_b:

D([t_m, [e], t_s], C_b[b = e]) ≤ D([t_m, [e], t_s], C_b) (130)

Dividing both sides by N, we get:

(1/N) · D([t_m, [e], t_s], C_b[b = e]) ≤ (1/N) · D([t_m, [e], t_s], C_b) (131)

We add the same factor, the random replacement of a line other than that of b, on both sides:

(1/N) · D([t_m, [e], t_s], C_b[b = e]) + (1/N) · Σ_{i∈[1,N] ∧ l_i ≠ b} D([t_m, [e], t_s], C_b[l_i = e]) ≤ (1/N) · D([t_m, [e], t_s], C_b) + (1/N) · Σ_{i∈[1,N] ∧ l_i ≠ b} D([t_m, [e], t_s], C_b[l_i = e]) (132)

The left-hand term is U (128), and the right-hand term, increased by M, is D(t′, C_b[b = e]) as expanded in (125). Hence:

M + U ≤ D(t′, C_b[b = e]) (133)

From Theorem 2, we can compare the bound U to the execution time distribution of t′ from C_b:

H + D([t_m, [e], t_s], C_b) ≤ M + (1/N) · Σ_{i∈[1,N]} D([t_m, [e], t_s], C_b[l_i = e]) (134)

The rightmost term collapses to M + U from (128), since C_b[l_i = e] with l_i the line of b is C_b[b = e]:

H + D([t_m, [e], t_s], C_b) ≤ M + U (135)

Hence, from this equation, (120) and (133), the property holds:

D(t′, C_b) = H + D([t_m, [e], t_s], C_b) ≤ M + U ≤ D(t′, C_b[b = e])

In all cases, D(t′, C_b) ≤ D(t′, C_b[b = e]); combined through Lemma 3 with the unchanged behaviour of the prefix t_p, this yields D(t, s) ≤ D(t, s[b = e]). □

Lemma 1 The convolution operation preserves the ordering between execution time distributions:

D ≤ D′ ⇒ D ⊗ A ≤ D′ ⊗ A (136)

Proof Let us first assume that D ≤ D′. This relation implies that D′ is greater than D, more formally:

∀v, P(D ≥ v) ≤ P(D′ ≥ v) (137)

This property applies to the sum of probabilities for all values greater than or equal to v:

∀v, Σ_{x=v}^{+∞} D(x) ≤ Σ_{x=v}^{+∞} D′(x) (138)

It can in particular be extended to any value (v − k):

∀v, ∀k, Σ_{x=v−k}^{+∞} D(x) ≤ Σ_{x=v−k}^{+∞} D′(x) (139)

As we are considering the sum to infinity of the values D(x), k can be subtracted indifferently from x or from its lower bound v:

∀v, ∀k, Σ_{x=v−k}^{+∞} D(x) = Σ_{x=v}^{+∞} D(x − k) (140)

From the two previous equations, we have:

∀v, ∀k, Σ_{x=v}^{+∞} D(x − k) ≤ Σ_{x=v}^{+∞} D′(x − k) (141)

The occurrence probability of any element k in a distribution A is by definition a positive number, A(k) ≥ 0. We can hence factor both sides of the inequality with the same values A(k):

∀v, ∀k, Σ_{x=v}^{+∞} A(k) · D(x − k) ≤ Σ_{x=v}^{+∞} A(k) · D′(x − k) (142)

As the inequality holds for any element k, it holds for the overall sum over k:

∀v, Σ_{k=−∞}^{+∞} Σ_{x=v}^{+∞} A(k) · D(x − k) ≤ Σ_{k=−∞}^{+∞} Σ_{x=v}^{+∞} A(k) · D′(x − k) (143)

Thanks to the commutativity of the sum operands, we have:

∀v, Σ_{x=v}^{+∞} Σ_{k=−∞}^{+∞} A(k) · D(x − k) ≤ Σ_{x=v}^{+∞} Σ_{k=−∞}^{+∞} A(k) · D′(x − k) (144)

Both terms of the inequality correspond to the convolution of distributions as defined in (20):

∀v, Σ_{x=v}^{+∞} (A ⊗ D)(x) ≤ Σ_{x=v}^{+∞} (A ⊗ D′)(x) (145)

This defines an order between the results of the convolution of D and D′ with distribution A:

∀v, P((A ⊗ D) ≥ v) ≤ P((A ⊗ D′) ≥ v) (146)

Per commutativity of the convolution operator ⊗, we have:

D ≤ D′ ⇒ D ⊗ A ≤ D′ ⊗ A (147)
□

Lemma 2 The contribution of merged sets of cache states S and A is the sum of their individual contributions:

∀t, D(t, S) + D(t, A) = D(t, S ⊎ A) (148)

(Small numerical sketches illustrating this lemma and Lemma 1 are given at the end of this section.)

Proof S ⊎ A can be divided into three categories of cache states C: those that exist only in S, only in A, or in both, denoted respectively Only_S, Only_A, and Com_(S,A). The contribution of states in Only_S and Only_A is unchanged by the merge operation; only states in Com_(S,A) are subject to the weighted merge in (6). We focus on proving the equivalence between the contribution of Com_(S,A) and that of the original states from S and A, respectively Com_S and Com_A, with Com_(S,A) = Com_S ⊎ Com_A. Each state in Com_(S,A) is the combination of the corresponding states from Com_S and Com_A.
Without loss of generality, we assume there is a single matching state in Com_S and Com_A for each merged one in Com_(S,A):

∀(C, P, D) ∈ Com_(S,A), ∃(C, P_A, D_A) ∈ Com_A, ∃(C, P_S, D_S) ∈ Com_S, P = P_A + P_S ∧ D = (P_A/P) · D_A + (P_S/P) · D_S (149)

We can express the execution time contribution of Com_(S,A) as per (37):

D(t, Com_(S,A)) = Σ_{(C,P,D)∈Com_(S,A)} P · (D ⊗ D(t, C)) (150)

By replacing each merged distribution D with the original distributions and probabilities from S and A, we have:

D(t, Com_(S,A)) = Σ_{(C,P,D)∈Com_(S,A)} P · (((P_A/P) · D_A + (P_S/P) · D_S) ⊗ D(t, C)) (151)

By definition, the convolution of distributions and the multiplication of a distribution by a constant are associative operations, P · (D ⊗ D′) = (P · D) ⊗ D′. We can therefore factor P inside the merged distributions:

D(t, Com_(S,A)) = Σ_{(C,P,D)∈Com_(S,A)} ((P_A · D_A) + (P_S · D_S)) ⊗ D(t, C) (152)

This equation can be refined into the contributions of the states in Com_S and Com_A as follows:

D(t, Com_(S,A)) = Σ_{(C,P,D)∈Com_(S,A)} (P_A · D_A) ⊗ D(t, C) + Σ_{(C,P,D)∈Com_(S,A)} (P_S · D_S) ⊗ D(t, C) = D(t, Com_A) + D(t, Com_S) (158)
□

Theorem 8 (Renamed path ordering) Given a path π divided into three sub-paths, π = [π_S, π_V, π_E], where π_V = [e, v_1, ..., v_k, e], the pWCET of π is smaller than or equal to that of the renamed sequence π_r = [π_S, π_V(e → b), π_E], D(π) ≤ D(π_r), if:
– there is no access to b in π_V;
– (Prefix ordering) the reuse distance of e before π_V is smaller than that of b at this point;
– (Suffix ordering) the forward reuse distance of e at the end of π_V is smaller than that of b at this point.

Proof We focus on the behaviour of the execution time distribution of the path π and of its renamed alternative starting from the empty cache state ∅, since the empty state is known to result in the worst execution time distribution over any other input state. Any valid pWCET must upper-bound this distribution; hence D(π, ∅) is a tight pWCET for path π. The execution of path π_S generates an ensemble outcomes(π_S, ∅) of cache states C; to each is attached an associated execution time distribution D, corresponding to the hit and miss latencies of prior accesses, and an occurrence probability P. The renaming does not impact the behaviour of accesses in π_S, therefore outcomes(π_S, ∅) is left unchanged.

The execution time distribution of the renamed segment π_V(e → b) is neither greater nor smaller than that of π_V from the cache states that hold neither b nor e, or that hold both, (e ∈ s ∧ b ∈ s) ∨ (e ∉ s ∧ b ∉ s) ⇒ D(π_V(e → b), s) = D(π_V, s). States that hold neither b nor e result in the same hit and miss events on both paths, except that b replaces e on the renamed path; this also produces the same cache states, with b in place of e, after the renamed segment. As for states that hold both b and e, events which impact the line where b is held on the original path are as likely to impact that of e on the renamed one, and vice versa; e.g. the eviction of b on the original path corresponds to that of e on the renamed one. This also results in the same cache states, with b in place of e, on the renamed path. The outcomes on π_V thus match the ones on π_V(e → b), where b replaces e in the cache. When both blocks b and e are present in the outcomes on π_V, they match the ones on the renamed path π_V(e → b), and the execution time distribution of the last segment π_E is the same in either case.

When e is in the cache without b after π_V, it is matched by a state after π_V(e → b) where b replaces e. From the Suffix ordering condition, the first access to e in π_E is before the first access to b, frd(e, π_E) < frd(b, π_E).
Theorem 3 applies: the execution of π_E after π_V, when e is in cache but not b, results in an execution time distribution that is no greater than the one after π_V(e → b), where b replaces e in the cache. Note that b cannot be in the cache without e after π_V, since the last access in π_V targets e.

We now focus on the contribution of the states which hold one of e or b but not both, and prove that their contribution to the renamed path outweighs that of the original. R_e and R_b respectively distinguish those states of outcomes(π_S, ∅) which hold one of e or b.

Base case π_V = [e]: Every state in R_b shares a common ancestor state with a state in R_e, such that the two hold the same contents except that e replaces b. Indeed, because of the Prefix ordering condition, all states in R_b come from states where b and e were held in the cache simultaneously, after the last access to e in the prefix π_S. There is then a sequence of events which evicts e from the cache while conserving b, hence resulting in a state belonging to R_b; and there is a matching sequence of events from this common ancestor which conserves e instead of b: simply assume that evictions on the line of e target that of b, and vice versa. The two sequences of events are exactly as likely to occur, as there is no other access to either b or e from their common ancestor up to the renamed segment. Consider the following four scenarios for a state s of R_b:

1. [e] executes from s, hence resulting in a miss and N output cache states s[l_i = e].
2. [e] executes from the equally likely s[b = e] of R_e, hence resulting in a hit and the output cache state s[b = e].
3. [b], the renamed sequence, executes from s, hence resulting in a hit and the output cache state s.
4. [b] executes from s[b = e], hence resulting in a miss and N output cache states s[b = e][l_i = b].

Scenarios 3 and 2 balance each other, resulting in a worse behaviour on the renamed sequence. Both suffer the same execution latency for π_V. Because of the Suffix ordering condition on the forward reuse distances of b and e, and of Theorem 3, the execution time distribution of π_E is worse starting from s than from s[b = e]. A similar argument can be made for scenarios 1 and 4. Each line l_i has the same probability of being selected for eviction in each scenario. If l_i is the line that held b in s, the output cache states in scenarios 1 and 4 respectively are s[b = e] and s[b = e][l_i = b] = s[b = e][e = b] = s; as per Theorem 3, this results in execution time distributions that are no lower for the renamed path than for the original one. If l_i is another line, the resulting cache states s[l_i = e] and s[b = e][l_i = b] hold the same contents, s[b = e][l_i = b] = s[l_i = e], and result in the same execution time distribution for π_E.

As for the remaining states C_e in R_e, the ones which do not mirror a state in R_b, they respectively result in a hit on the original segment and a miss on the renamed one. On the renamed path, this results in the replacement of a line l_i by b. In other words, D([π_V, π_E], C_e) = H + D(π_E, C_e) and D([π_V(e → b), π_E], C_e) = M + (1/N) · Σ_{i∈[1,N]} D(π_E, C_e[l_i = b]). The execution time distribution of the original path from C_e is therefore no greater than that of the renamed path, according to Theorem 2. Overall, the execution of the renamed path [b] results in execution time distributions that are no lower than those obtained through the execution of the original one [e].

General case π_V = [e, v_1, ..., v_k, e]:
The arguments for the base case can be extended to the general case where π_V holds multiple accesses. The key observation is that the renaming has no impact on the reuse distance of accesses within π_V, except for the first one. As in the base case, we focus on the contribution of the states which hold one of b or e. Consider the same four scenarios for an input state s ∈ R_b of π_V = [e, v_1, ..., v_k, e]:

1. [e, v_1, ..., v_k, e] executes from s, hence resulting in a first miss and N cache states s[l_i = e].
2. [e, v_1, ..., v_k, e] executes from s[b = e] of R_e, hence resulting in a first hit and the cache state s[b = e].
3. [b, v_1, ..., v_k, b] executes from s, hence resulting in a first hit and the cache state s.
4. [b, v_1, ..., v_k, b] executes from s[b = e], hence resulting in a first miss and N cache states s[b = e][l_i = b].

From scenario 1 to 4, and from scenario 2 to 3, b simply replaces e in both the cache contents and the trace of accesses π_V. The behaviour of the first access in π_V is the same on either the original or the renamed path, and there are no more misses on the original than on the renamed path, since they have the same initial contents and traces where b simply replaces e. The reuse distance of all accesses but the first is left unchanged between these pairs of scenarios, D(π_V, s) = D(π_V(e → b), s[b = e]) and D(π_V, s[b = e]) = D(π_V(e → b), s). The resulting cache states after π_V(e → b) also match the ones after π_V, with b replacing e in the cache. Because of the Suffix ordering condition, the first access to b in π_E is preceded by an earlier access to e. Hence, from Theorem 3, the execution of π_E after π_V results in an execution time distribution that is no greater than the one starting from the matching input state s′[e = b] after the renamed path π_V(e → b).

Some cache states C_e in R_e, the input states of π_V which hold e but not b, do not mirror a state in R_b. The first access in π_V(e → b) is a miss from C_e. This intuitively increases the reuse distance of the remaining accesses in the renamed [π_V(e → b), π_E] over the original trace [π_V, π_E]. We prove by induction that:

D([π_V, π_E], C_e) ≤ D([π_V(e → b), π_E], C_e) (159)

The base case, when π_V holds a single access to e, has already been proved thanks to Theorem 2. Our induction hypothesis is, with π_V′ = [v_1, ..., v_k, e] the subtrace of π_V without its first access:

D([π_V′, π_E], C_e) ≤ D([π_V′(e → b), π_E], C_e) (160)

From Theorem 2, we have:

H + D([[v_1, ..., v_k, e](e → b), π_E], C_e) ≤ M + (1/N) · Σ_{i∈[1,N]} D([[v_1, ..., v_k, e](e → b), π_E], C_e[l_i = b]) (161)

The left-hand term corresponds to the execution of a trace where the renaming from block e to b occurs on π_V′ = [v_1, ..., v_k, e], after the first access to e in π_V. The right-hand term simply exhibits the first miss on the execution of the renamed trace [π_V(e → b), π_E] from C_e as per (21):

H + D([[v_1, ..., v_k, e](e → b), π_E], C_e) ≤ D([π_V(e → b), π_E], C_e) (162)

From the induction hypothesis (160), we have:

D([[v_1, ..., v_k, e], π_E], C_e) ≤ D([[v_1, ..., v_k, e](e → b), π_E], C_e) (163)

By inserting a hit latency on both sides, this equation becomes:

H + D([[v_1, ..., v_k, e], π_E], C_e) ≤ H + D([[v_1, ..., v_k, e](e → b), π_E], C_e) (164)

The first access in π_V = [e, v_1, ..., v_k, e] from C_e is a cache hit and leaves the cache state unchanged; the term on the left-hand side is thus D([π_V, π_E], C_e):

D([π_V, π_E], C_e) ≤ H + D([[v_1, ..., v_k, e](e → b), π_E], C_e) (165)

Hence, H + D([[v_1, ..., v_k, e](e → b), π_E], C_e) is an intermediate bound between the execution time distributions D([π_V, π_E], C_e) and D([π_V(e → b), π_E], C_e).
From (165) and (162), we have:

D([π_V, π_E], C_e) ≤ H + D([[v_1, ..., v_k, e](e → b), π_E], C_e) ≤ D([π_V(e → b), π_E], C_e) (166)

Each possible input cache state s′ to the renamed segment thus has an equally likely match s in the original trace, such that the execution time distribution of the renamed segment from s′ is no lower than that of the original from s. □

Benjamin Lesage is a Research Associate in the Real-Time Systems Research Group at the University of York, UK. Benjamin received his PhD in Computer Science in 2013 from the University of Rennes, France. He has since been at the University of York as a Research Associate. He is currently working in the context of a Knowledge Transfer Partnership, in collaboration with industrial partners, to put into practice his knowledge of real-time systems' timing analyses.

David Griffin is currently a member of the Real-Time Systems Group at the University of York, UK. His research has primarily been in the application of non-standard techniques to real-time problems, utilising techniques from various other fields such as lossy compression, statistics and machine learning.

Sebastian Altmeyer is Assistant Professor (Universitair Docent) at the University of Amsterdam. He received his PhD in Computer Science in 2012 from Saarland University, Germany, with a thesis on the analysis of preemptively scheduled hard real-time systems. From 2013 to 2015 he was a postdoctoral researcher at the University of Amsterdam, and from 2015 to 2016 at the University of Luxembourg. In 2015, he received an NWO Veni grant on the timing verification of real-time multicore systems, and he is program chair of the Euromicro Conference on Real-Time Systems (ECRTS) 2018. His research targets various aspects of the design, analysis and verification of hard real-time systems, with a particular interest in timing verification and multicore architectures.

Liliana Cucu-Grosjean Photograph and Biography not available.
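The renaming transformation of Theorem 8 can be checked on a small example by exhaustive computation, reusing D, exceedance and M from the first sketch above. The concrete path [π_S, π_V, π_E] below is our own illustrative choice, picked so that the theorem's prefix and suffix ordering conditions hold; it is a sanity check, not a proof.

```python
def rename(trace, e, b):
    """pi_V(e -> b): substitute every access to block e by block b."""
    return tuple(b if x == e else x for x in trace)

# pi = [pi_S, pi_V, pi_E] with pi_V = [e, v_1, ..., v_k, e]: no access to b
# in pi_V; e accessed more recently than b before pi_V (prefix ordering);
# e re-accessed sooner than b after pi_V (suffix ordering).
pi_S = ('b', 'e')
pi_V = ('e', 'v', 'e')
pi_E = ('e', 'b')
empty = (None, None)            # empty cache, worst-case input state

original = pi_S + pi_V + pi_E
renamed = pi_S + rename(pi_V, 'e', 'b') + pi_E
d_orig, d_ren = D(original, empty), D(renamed, empty)
# Theorem 8: D(pi) <= D(pi_r) in the exceedance sense
assert all(exceedance(d_orig, v) <= exceedance(d_ren, v)
           for v in range(10 * M))
```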
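The following sketch illustrates Lemma 1 numerically; it is an illustration under our own assumptions (a dict-based encoding of discrete distributions and hand-picked sample values), not the paper's tooling. It implements the convolution ⊗ of (20) and checks that convolving two ordered distributions with a common distribution A preserves their exceedance ordering.

```python
from collections import defaultdict

def convolve(a, b):
    """Discrete convolution (20): (a ⊗ b)(x) = sum_k a(k) * b(x - k)."""
    out = defaultdict(float)
    for ka, pa in a.items():
        for kb, pb in b.items():
            out[ka + kb] += pa * pb
    return dict(out)

def leq(d1, d2):
    """d1 <= d2 in the exceedance sense: for all v, P(d1 >= v) <= P(d2 >= v)."""
    vs = set(d1) | set(d2)
    return all(sum(p for x, p in d1.items() if x >= v)
               <= sum(p for x, p in d2.items() if x >= v) + 1e-12
               for v in vs)

# D' upper-bounds D; convolving both with the same A preserves the order.
D1 = {1: 0.9, 10: 0.1}    # sample distribution D
D2 = {1: 0.5, 10: 0.5}    # sample distribution D', D <= D'
A  = {0: 0.25, 5: 0.75}   # sample distribution A
assert leq(D1, D2)
assert leq(convolve(A, D1), convolve(A, D2))
```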


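In the same spirit, a sketch of the weighted merge behind Lemma 2; the encoding of a set of cache states as a list of (contents, P, D) outcomes and all sample values are our own illustrative assumptions, and convolve is reused from the previous sketch. Merging two outcomes with the same contents sums their probabilities and mixes their distributions with weights P_S/P and P_A/P, so that the contribution of the merged set equals the sum of the individual contributions.

```python
from collections import defaultdict

def merge(S, A):
    """Weighted merge of two lists of (contents, P, D) outcomes, as in (6).
    States with the same contents are combined: probabilities add, and
    distributions are mixed with weights proportional to their probabilities."""
    by_contents = {}
    for contents, p, d in S + A:
        if contents not in by_contents:
            by_contents[contents] = (p, dict(d))
        else:
            p0, d0 = by_contents[contents]
            p1 = p0 + p
            mixed = defaultdict(float)
            for v, q in d0.items():
                mixed[v] += (p0 / p1) * q
            for v, q in d.items():
                mixed[v] += (p / p1) * q
            by_contents[contents] = (p1, dict(mixed))
    return [(c, p, d) for c, (p, d) in by_contents.items()]

def contribution(outcomes, cont):
    """D(t, S) = sum P * (D ⊗ D(t, C)); `cont` maps contents C to D(t, C)."""
    total = defaultdict(float)
    for contents, p, d in outcomes:
        for v, q in convolve(d, cont[contents]).items():
            total[v] += p * q
    return dict(total)

# Lemma 2 check: D(t, S) + D(t, A) = D(t, S merged with A)
S = [(('a', 'b'), 0.5, {2: 1.0})]
A = [(('a', 'b'), 0.3, {4: 1.0}), (('a', 'c'), 0.2, {2: 1.0})]
cont = {('a', 'b'): {1: 1.0}, ('a', 'c'): {10: 1.0}}
lhs = contribution(merge(S, A), cont)
rhs = defaultdict(float)
for part in (contribution(S, cont), contribution(A, cont)):
    for v, q in part.items():
        rhs[v] += q
assert all(abs(lhs.get(v, 0.0) - rhs[v]) < 1e-12 for v in set(lhs) | set(rhs))
```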
