Visual software analytics for the build optimization of large-scale software systems
Alexandru Telea
Lucian Voinea
L. Voinea SolidSource BV, Eindhoven,
The Netherlands
Visual analytics is the science of analytical reasoning facilitated by interactive visual interfaces. In this paper, we present an adaptation of the visual analytics framework to the context of software understanding for maintenance. We discuss the similarities and differences of the general visual analytics context with the software maintenance context, and present in detail an instance of a visual software analytics application for the build optimization of large-scale code bases. Our application combines and adapts several data mining and information visualization techniques in answering several questions that help developers in assessing and reducing the build cost of such code bases by means of user-driven, interactive analysis techniques.
1 Introduction
Software is everywhere. It is continuously being developed by an estimated 15 million
engineers worldwide (Booch 2006), in a hierarchy of activities, ranging from
requirement gathering, specification, and design, to implementation, debugging, testing, and
maintenance. Understanding software is a towering task. Nowadays, software systems
are huge: the Mozilla browser has over 2 million lines of code (MLOC) in over 5,000
files (Mozilla Inc. 2008). Banking, telecom, and industrial applications are an order of
magnitude larger. Software code is structured in many ways, e.g., as a file hierarchy;
a network of components or packages; a set of design patterns (Gamma et al. 1995),
or aspects (Elrad et al. 2003). No single hierarchy suffices for understanding, and the
inter-hierarchy relations are complex. If we add design, architecture, documentation,
dynamic and profiling data to source code, the understanding challenge explodes.
Finally, software continuously evolves, which only increases complexity, as described
by the so-called laws of software evolution (Belady and Lehman 1976; Godfrey and
Tu 2000). Overall, understanding software is hard, as it is large, complex, abstract,
and changing (Klemola and Rilling 2000).
Given the large amount of complex legacy software, maintenance is the most
effort-consuming activity in the software life-cycle. Studies over 15 years, from
Standish (1984) to Corbi (1999), estimate that over 80% of software life-cycle
cost goes into maintenance. A significant share (40%) of this cost
is spent on software understanding. Hence, it is of crucial importance for software
professionals to be empowered with tools that enable them to reduce the
understanding cost efficiently and effectively. In the following, we shall focus on
understanding static software source code, which is a major component of the maintenance
process.
From a data modelling perspective, software code is similar to a database: it consists
of a set of entities ranging from code lines to functions, classes, files, and components;
and relationships, such as containment, data, call, and build dependencies. Entities and
relationships have multiple attributes of numerical, ordinal, or textual type, e.g., quality
and complexity metrics, types of data access, and the source code itself.
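The entity-relationship view described above can be illustrated with a minimal sketch. All names used here (`Entity`, `Relation`, `CodeBase`, `build_example`, and the sample files and metric values) are hypothetical and chosen only for illustration; they are not the data model of any particular tool discussed in this paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# Attributes may be numerical, ordinal, or textual.
AttrValue = Union[int, float, str]


@dataclass
class Entity:
    name: str
    kind: str  # e.g. "line", "function", "class", "file", "component"
    attrs: Dict[str, AttrValue] = field(default_factory=dict)


@dataclass
class Relation:
    source: str
    target: str
    kind: str  # e.g. "containment", "data", "call", "build"
    attrs: Dict[str, AttrValue] = field(default_factory=dict)


@dataclass
class CodeBase:
    entities: Dict[str, Entity] = field(default_factory=dict)
    relations: List[Relation] = field(default_factory=list)

    def add_entity(self, entity: Entity) -> None:
        self.entities[entity.name] = entity

    def add_relation(self, relation: Relation) -> None:
        # Reject relations whose endpoints are not known entities.
        for endpoint in (relation.source, relation.target):
            if endpoint not in self.entities:
                raise KeyError(f"unknown entity: {endpoint}")
        self.relations.append(relation)

    def related(self, name: str, kind: str) -> List[str]:
        # Targets reachable from `name` via one relation of the given kind,
        # in the order the relations were added.
        return [r.target for r in self.relations
                if r.source == name and r.kind == kind]


def build_example() -> CodeBase:
    # Hypothetical sample data: one component, two files, one function.
    cb = CodeBase()
    cb.add_entity(Entity("core", "component"))
    cb.add_entity(Entity("parser.c", "file", {"loc": 1200}))
    cb.add_entity(Entity("parser.h", "file", {"loc": 80}))
    cb.add_entity(Entity("parse_expr", "function", {"complexity": 7}))
    cb.add_relation(Relation("core", "parser.c", "containment"))
    cb.add_relation(Relation("core", "parser.h", "containment"))
    cb.add_relation(Relation("parser.c", "parse_expr", "containment"))
    cb.add_relation(Relation("parser.c", "parser.h", "build", {"via": "#include"}))
    return cb
```

With this sketch, a query such as `build_example().related("parser.c", "build")` lists the build dependencies of one file, which is the kind of relational lookup that the database analogy suggests.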
Understanding large relational databases involves activities such as data mining,
exploration, and presentation. A rapidly growing field addressing this goal is visual
analytics, which combines data mining and information visualization techniques to
help users extract and reason about the information contained in such data collections
(Wong and Thomas 2004; Thomas and Cook 2005). Visual analytics has been
successfully applied in several domains such as network monitoring, banking, traffic control,
and homeland security. Central to the application of visual analytics in a particular
domain is the customized design of tools and techniques to reflect the questions to be
answered about the data at hand. To be time-effective, such tools should reflect the
way their intended users reason about their data and questions, and also be scalable,
integrated, and interactive.
Although many visual methods and tools have been created for software
understanding and maintenance, few of them have gained wide acceptance in the software
industry. On the other hand, many data mining methods exist and are used in
software engineering, but few support the visual analytical reasoning advocated above.
This opens new opportunities but also poses several questions and challenges. In
this paper, we explore the application of visual analytics principles and techniques
to software maintenance, in what we call visual software analytics. First, we
analyze the specific requirements and constraints of software maintenance. Next, we
detail how the principles of visual analytics can best be put to use in light of these
requirements. Finally, we demonstrate our model by applying visual software
analytics to a concrete problem on industrial software systems: the
optimization of build performance of large code bases. Our application of visual software
analytics demonstrates the high applicability of visual analytics principles to software
understanding, from data collection and mining to hypothesis forming, validation, and
presentation.
This paper is structured as follows. In Sect. 2 we briefly overview the basic principles
of visual analytics. Section 3 details the specific requirements and challenges of source
code understanding in software maintenance. Section 4 presents our model for a visual
software analytics framework, outlining the elements needed for its success. Section 5
presents an instance of a visual software analytics framework for a concrete problem
from the software industry, the build analysis and refactoring of a large code base.
Section 6 discusses our results, based on actual feedback from users of our systems.
Section 7 concludes the paper.
2 Visual analytics: an overview
Visual analytics is defined as the science of analytical reasoning facilitated by
interactive visual interfaces (Wong and Thomas 2004). Its main ingredient is a tight
combination of data mining and visualization techniques that supports
reasoning about phenomena captured in a given set of data. Visual analytics differs
from pure data mining, as the involved reasoning cannot be captured in simple data
queries. Also, visual analytics is more than data visualization, as the questions asked
often require reinterpretation of the data at hand and the generation of multiple
visualizations showing different aspects.
Operationally, visual analytics involves a pipeline of activities that refine and enrich
basic data with semantics related to the questions to be answered (Thomas and Cook
2005) (see Fig. 1). First, data is searched and filtered, and elements of interest are
extracted, in the so-called data foraging loop. This is mainly a data mining step, e.g.,
extracting all modules and module dependencies from a software code base. Secondly,
a hypothesis is formed. A refin (...truncated)