Scientific workflows for process mining: building blocks, scenarios, and implementation
Int J Softw Tools Technol Transfer
DOI 10.1007/s10009-015-0399-5
Alfredo Bolt1 · Massimiliano de Leoni1 · Wil M. P. van der Aalst1
© The Author(s) 2015. This article is published with open access at Springerlink.com
Abstract Over the past decade process mining has emerged
as a new analytical discipline able to answer a variety of
questions based on event data. Event logs have a very particular structure; events have timestamps, refer to activities
and resources, and need to be correlated to form process
instances. Process mining results tend to be very different
from classical data mining results, e.g., process discovery
may yield end-to-end process models capturing different perspectives rather than decision trees or frequent patterns. A
process-mining tool like ProM provides hundreds of different process mining techniques ranging from discovery and
conformance checking to filtering and prediction. Typically, a
combination of techniques is needed and, for every step, there
are different techniques that may be very sensitive to parameter settings. Moreover, event logs may be huge and may
need to be decomposed and distributed for analysis. These
aspects make it very cumbersome to analyze event logs manually. Process mining should be repeatable and automated.
Therefore, we propose a framework to support the analysis of
process mining workflows. Existing scientific workflow systems and data mining tools are not tailored towards process
mining and the artifacts used for analysis (process models
and event logs). This paper structures the basic building
blocks needed for process mining and describes various
analysis scenarios. Based on these requirements we implemented RapidProM, a tool supporting scientific workflows
for process mining. Examples illustrating the different scenarios are provided to show the feasibility of the approach.
Wil M. P. van der Aalst
1 Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands
Keywords Scientific workflows · Process mining · Large
scale process analysis · RapidProM
1 Introduction
Scientific Workflow Management (SWFM) systems help
users to design, compose, execute, archive, and share workflows that represent some type of analysis or experiment.
Scientific workflows are often represented as directed graphs
where the nodes represent “work” and the edges represent paths along which data and results can flow between
nodes. Next to “classical” SWFM systems such as Taverna
[23], Kepler [33], Galaxy [20], ClowdFlows [27], and jABC
[40], one can also see the uptake of integrated environments
for data mining, predictive analytics, business analytics,
machine learning, text mining, reporting, etc. Notable examples are RapidMiner [22] and KNIME [4]. These can be
viewed as SWFM systems tailored towards the needs of data
scientists.
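The graph view of a workflow described above can be made concrete with a minimal sketch. It is not taken from any of the cited systems: the function run_workflow, the node names, and the toy computations are hypothetical, and the sketch assumes a workflow without cycles in which every node receives the results of its predecessors as a list.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(nodes, edges):
    """Execute a workflow given as a directed acyclic graph.

    nodes: dict mapping a node name to a function that takes a list of
           upstream results (an empty list for nodes without inputs).
    edges: list of (source, target) pairs along which results flow.
    """
    predecessors = {name: [] for name in nodes}
    for source, target in edges:
        predecessors[target].append(source)

    results = {}
    # static_order() yields every node after all of its predecessors.
    for name in TopologicalSorter(predecessors).static_order():
        upstream = [results[p] for p in predecessors[name]]
        results[name] = nodes[name](upstream)
    return results


# Hypothetical toy workflow: one data source, two analyses, one report.
nodes = {
    "load": lambda ups: [3, 1, 2],
    "sort": lambda ups: sorted(ups[0]),
    "total": lambda ups: sum(ups[0]),
    "report": lambda ups: {"sorted": ups[0], "total": ups[1]},
}
edges = [
    ("load", "sort"),
    ("load", "total"),
    ("sort", "report"),
    ("total", "report"),
]

results = run_workflow(nodes, edges)
print(results["report"])
```

Each node runs exactly once and only after its inputs are available, which is the property that makes such a workflow repeatable.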
Traditional data-driven analysis techniques do not consider end-to-end processes. People make process models by
hand [e.g., Petri nets, UML activity diagrams, or Business
Process Modeling Notation (BPMN) models], but this modeled behavior is seldom aligned with real-life event data.
Process mining aims to bridge this gap by connecting end-to-end process models to the raw events that have been recorded.
Process-mining techniques enable the analysis of a wide
variety of processes using event data. For example, event
logs can be used to automatically learn a process model
(e.g., a Petri net or BPMN model). Next to the automated
discovery of the real underlying process, there are process-mining techniques to analyze bottlenecks, to uncover hidden
inefficiencies, to check compliance, to explain deviations,
to predict performance, and to guide users towards “better”
processes. Hundreds of process-mining techniques are available and their value has been proven in many case studies. See
for example the twenty case studies on the webpage of the
IEEE Task Force on Process Mining [24]. The open source
process mining framework ProM [58] provides hundreds of
plug-ins and has been downloaded over 100,000 times. The
growing number of commercial process mining tools (Disco,
Perceptive Process Mining, Celonis Process Mining, QPR
ProcessAnalyzer, Software AG/ARIS PPM, Fujitsu Interstage Automated Process Discovery, etc.) further illustrates
the uptake of process mining.
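To illustrate how event data relate to a process model, the following minimal sketch correlates events into process instances and counts which activity directly follows which. The directly-follows relation is used here only as a simple, commonly used abstraction; it is not the discovery technique of this paper or of any specific ProM plug-in. The field names (case, activity, resource, time) and the example events are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical event data: each event refers to a case, an activity,
# a resource, and a timestamp.
events = [
    {"case": "c1", "activity": "register", "resource": "Ann", "time": "2015-01-05T09:00"},
    {"case": "c2", "activity": "register", "resource": "Bob", "time": "2015-01-05T09:10"},
    {"case": "c1", "activity": "check", "resource": "Bob", "time": "2015-01-05T10:00"},
    {"case": "c2", "activity": "check", "resource": "Ann", "time": "2015-01-05T11:00"},
    {"case": "c1", "activity": "pay", "resource": "Ann", "time": "2015-01-06T08:30"},
    {"case": "c2", "activity": "reject", "resource": "Bob", "time": "2015-01-07T14:00"},
]


def build_traces(events):
    """Correlate events by case and order them by timestamp."""
    by_case = defaultdict(list)
    for event in events:
        by_case[event["case"]].append(event)
    return {
        case: [
            e["activity"]
            for e in sorted(evs, key=lambda e: datetime.fromisoformat(e["time"]))
        ]
        for case, evs in by_case.items()
    }


def directly_follows(traces):
    """Count how often activity a is immediately followed by activity b."""
    counts = Counter()
    for trace in traces.values():
        counts.update(zip(trace, trace[1:]))
    return counts


traces = build_traces(events)
df = directly_follows(traces)
for (a, b), n in sorted(df.items()):
    print(f"{a} -> {b}: {n}")
```

The resulting counts can be read as the weighted edges of a simple process graph; actual discovery techniques produce richer models such as Petri nets or BPMN models.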
Process mining typically requires many analysis steps
to be chained together. Existing process mining tools do not
support such analysis workflows. As a result, analysis may be
tedious and error-prone. Manually executing more involved
process mining workflows jeopardizes repeatability and provenance.
This paper is motivated by the observation that tool support for process mining workflows is missing. None of the
process mining tools (ProM, Disco, Perceptive, Celonis,
QPR, etc.) provides a facility to design and execute analysis
workflows. None of the scientific workflow management systems, including analytics suites like RapidMiner and KNIME,
supports process mining. Yet, process models and event logs
are very different from the artifacts typically considered.
Therefore, we propose the framework to support process mining workflows depicted in Fig. 1.
Fig. 1 Overview of the framework to support process mining workflows
This paper considers four analysis scenarios where
process mining workflows are essential:
– Result (sub-)optimality Often different process mining
techniques can be applied and a priori it is not clear
which one is most suitable. By modeling the analysis
workflow, one can just perform all candidate techniques
on the data, evaluate the different analysis results, and
pick the result with the highest quality (e.g., the process
model best describing the observed behavior).
– Parameter sensitivity Different parameter settings and
alternative ways of filtering can have unexpected effects.
Therefore, it is important to see how sensitive the results
are (e.g., leaving out some data or changing a parameter
setting a bit should not change the results dramatically).
Analysis results should not simply be shown without some indication of confidence.
– Large-scale experiments Each year new process mining
techniques become available and larger data sets need to
be tackled. For example, novel discovery techniques need
to be evaluated through massive testing and larger event
[Fig. 1 content: analysis scenarios for process mining (Result (sub-)optimality, Large-scale experiments, Repeating questions); building-block groups (Event data transformation, Process model extraction, Process model and event data analysis); building blocks: Add data to event data (AddED), Import process model (ImportM), Analyze process model (AnalyzeM), Filter event data (FilterED), Discover process model from event data (DiscM), Evaluate process model using event data (EvaluaM), Select process model from collection (SelectM), Compare process mode (...truncated)]