Visual software analytics for the build optimization of large-scale software systems

https://link.springer.com/content/pdf/10.1007%2Fs00180-011-0248-2.pdf

Alexandru Telea, Lucian Voinea

L. Voinea: SolidSource BV, Eindhoven, The Netherlands

Visual analytics is the science of analytical reasoning facilitated by interactive visual interfaces. In this paper, we present an adaptation of the visual analytics framework to the context of software understanding for maintenance. We discuss the similarities and differences between the general visual analytics context and the software maintenance context, and present in detail an instance of a visual software analytics application for the build optimization of large-scale code bases. Our application combines and adapts several data mining and information visualization techniques to answer questions that help developers assess and reduce the build cost of such code bases by means of user-driven, interactive analysis techniques.

1 Introduction

Software is everywhere. It is continuously being developed by an estimated 15 million engineers worldwide (Booch 2006), in a hierarchy of activities, ranging from requirement gathering, specification, and design, to implementation, debugging, testing, and maintenance.

Understanding software is a towering task. Nowadays software systems are huge: the Mozilla browser has over 2 million lines of code (MLOC) in over 5,000 files (Mozilla Inc. 2008). Banking, telecom, and industrial applications are an order of magnitude larger. Software code is structured in many ways, e.g., as a file hierarchy, a network of components or packages, a set of design patterns (Gamma et al. 1995), or aspects (Elrad et al. 2003). No single hierarchy suffices for understanding, and the inter-hierarchy relations are complex. If we add design, architecture, documentation, dynamic and profiling data to source code, the understanding challenge explodes. Finally, software continuously evolves, which only increases complexity, as described by the so-called laws of software evolution (Belady and Lehman 1976; Godfrey and Tu 2000).
Overall, understanding software is hard, as it is large, complex, abstract, and changing (Klemola and Rilling 2000). Given the large amount of complex legacy software, maintenance is the most effort-consuming activity in the software life-cycle. Studies over 15 years, from Standish (1984) to Corbi (1999), estimate that over 80% of the cost spent in the software life-cycle goes into maintenance. A significant share (40%) of this cost goes into software understanding. Hence, it is crucial for software professionals to have tools that let them reduce the understanding cost efficiently and effectively.

In the following, we shall focus on understanding static software source code, which is a major component of the maintenance process. From a data modelling perspective, software code is similar to a database: it consists of a set of entities, ranging from code lines to functions, classes, files, and components, and of relationships, such as containment, data, call, and build dependencies. Entities and relationships have multiple attributes of numerical, ordinal, or textual type, e.g., quality and complexity metrics, types of data access, and the source code itself.

Understanding large relational databases involves activities such as data mining, exploration, and presentation. A rapidly growing field addressing this goal is visual analytics, which combines data mining and information visualization techniques to help users extract and reason about the information contained in such data collections (Wong and Thomas 2004; Thomas and Cook 2005). Visual analytics has been successfully applied in several domains such as network monitoring, banking, traffic control, and homeland security. Central to the application of visual analytics in a particular domain is the customized design of tools and techniques to reflect the questions to be answered about the data at hand.
To be time-effective, such tools should reflect the way their intended users reason about their data and questions, and also be scalable, integrated, and interactive.

Although many visual methods and tools have been created for software understanding and maintenance, few of them have gained wide acceptance in the software industry. On the other hand, many data mining methods exist and are used in software engineering, but few support the visual analytical reasoning advocated above. This opens new opportunities but also poses several questions and challenges.

In this paper, we explore the application of visual analytics principles and techniques to software maintenance, in what we call visual software analytics. First, we analyze the specific requirements and constraints of software maintenance. Next, we detail how the principles of visual analytics can best be put to use in light of these requirements. Finally, we demonstrate our model by applying visual software analytics to a concrete problem on industrial software systems: the optimization of build performance of large code bases. This application demonstrates the high applicability of visual analytics principles to software understanding, from data collection and mining to hypothesis forming, validation, and presentation.

This paper is structured as follows. In Sect. 2 we briefly overview the basic principles of visual analytics. Section 3 details the specific requirements and challenges of source code understanding in software maintenance. Section 4 presents our model for a visual software analytics framework, outlining the elements needed for its success. Section 5 presents an instance of a visual software analytics framework for a concrete problem from the software industry, the build analysis and refactoring of a large code base. Section 6 discusses our results, based on actual feedback from users of our systems. Section 7 concludes the paper.
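
The database-like view of source code outlined in this introduction (entities such as files and components, linked by relationships such as containment and build dependencies, and carrying attributes such as metrics) can be made concrete with a minimal Python sketch. All names used here (`CodeModel`, `add_entity`, `relate`, `dependents`, the example file names, and the `lines_of_code` attribute) are hypothetical and chosen for illustration only; this is not the data model of the tools discussed in this paper. For brevity, the sketch stores attributes on entities only, although the text notes that relationships can carry attributes too.

```python
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    name: str
    kind: str  # e.g. "component", "file", "function" (illustrative kinds)


@dataclass
class CodeModel:
    """Toy entity-relationship store for a code base (assumed design)."""
    entities: dict = field(default_factory=dict)
    attributes: dict = field(default_factory=lambda: defaultdict(dict))
    relations: dict = field(default_factory=lambda: defaultdict(set))

    def add_entity(self, name, kind, **attrs):
        # Register an entity and attach its numerical/textual attributes.
        entity = Entity(name, kind)
        self.entities[name] = entity
        self.attributes[name].update(attrs)
        return entity

    def relate(self, kind, source, target):
        # Record a directed relationship, e.g. ("build_dep", "a.c", "a.h")
        # meaning that a.c depends on a.h at build time.
        self.relations[kind].add((source, target))

    def dependents(self, kind, target):
        # Breadth-first search over reversed edges: every entity that
        # reaches `target`, directly or transitively, via `kind`.
        reverse = defaultdict(set)
        for src, dst in self.relations[kind]:
            reverse[dst].add(src)
        seen, queue = set(), deque([target])
        while queue:
            node = queue.popleft()
            for src in reverse[node]:
                if src not in seen:
                    seen.add(src)
                    queue.append(src)
        return seen


model = CodeModel()
model.add_entity("core", "component")
model.add_entity("util.h", "file", lines_of_code=120)
model.add_entity("net.h", "file", lines_of_code=300)
model.add_entity("main.c", "file", lines_of_code=950)
model.relate("contains", "core", "util.h")
model.relate("build_dep", "net.h", "util.h")
model.relate("build_dep", "main.c", "net.h")

# Files that would need rebuilding if util.h changed.
impacted = sorted(model.dependents("build_dep", "util.h"))
print(impacted)
```

A query such as `dependents` is the kind of data mining step that a visual front end could then present interactively; keeping relationship kinds in separate sets lets the same store answer containment, call, and build-dependency questions without changing its structure.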

2 Visual analytics: an overview

Visual analytics is defined as the science of analytical reasoning facilitated by interactive visual interfaces (Wong and Thomas 2004). Its main ingredient is a tight combination of data mining and visualization techniques that supports reasoning about phenomena captured in a given set of data. Visual analytics differs from pure data mining, as the reasoning involved cannot be captured in simple data queries. Visual analytics is also more than data visualization, as the questions asked often require reinterpreting the data at hand and generating multiple visualizations that show different aspects.

Operationally, visual analytics involves a pipeline of activities that refine and enrich basic data with semantics related to the questions to be answered (Thomas and Cook 2005) (see Fig. 1). First, data is searched and filtered, and elements of interest are extracted in the so-called data foraging loop. This is mainly a data mining step, e.g., extracting all modules and module dependencies in a software code base. Secondly, a hypothesis is formed. A refin (...truncated)