Data-Intensive Modelling and Simulation in Life Sciences and Socio-economical and Physical Sciences

Data Science and Engineering, Sep 2017

Andrea Bracciali, Elisabeth Larsson


Andrea Bracciali, Computing Science and Mathematics, Stirling University, Stirling FK9 4LA, UK
Elisabeth Larsson, Department of Information Technology, Uppsala University, Box 337, 751 05 Uppsala, Sweden

This special issue arises from cHiPSet, the COST Action on High-Performance Modelling and Simulation for Big Data Applications. Unfortunately, modelling and simulation (MS) of big data problems do not always lend themselves naturally to efficient high-performance computing (HPC) solutions. MS communities often lack the detailed expertise required to exploit the full potential of HPC solutions, and HPC architects may not be fully aware of specific MS requirements. cHiPSet is an opportunity to coordinate European research, with the support of overseas colleagues, and to facilitate interactions among data-intensive MS and HPC experts, from both research and industry. The Action aims to support the development of the field, which is strategic and of long-standing interest.

cHiPSet is organised around four working groups: WG1 and WG2 address HPC infrastructures and programming models for MS, while WG3 and WG4 are thematic umbrellas for data-intensive MS in the Life Sciences and in the Socio-economical and Physical Sciences, respectively. This DSE special issue collects contributions originating from the coordinated work of the "modelling" working groups WG3 and WG4 across the first two years of the Action. Andrea Bracciali and Elisabeth Larsson, guest editors of this volume and chairs of WG3 and WG4, respectively, have compiled papers presenting general examples of modelling and simulation for big data problems in the context of the two working groups.
Within the Action, these and other approaches are of strong interest as paradigmatic case studies for the coordinated development of HPC solutions and for fostering collaboration across the various scientific communities in the cHiPSet network. Likewise, readers interested in state-of-the-art examples of efficient big data modelling will find the contributions of this volume of interest.

The special issue contains five papers. Below, we give a brief overview of the type of data central to each paper and the kind of outcomes expected from the methods presented.

"Trust-based Modelling of Multi-criteria Crowd-sourced Data", by Fátima Leal, Benedita Malheiro, Horacio González-Vélez, and Juan Carlos Burguillo, considers the problem of providing high-quality personalised recommendations for travellers, based on crowd-sourced data available from sites such as Expedia and TripAdvisor. When an individual rates, e.g., a hotel, there are different aspects that can be rated, such as cleanliness or comfort (multi-criteria). Furthermore, the ratings from different individuals may be more or less relevant for a specific recommendation, and/or more or less trustworthy considering the overall rating behaviour of that individual. These characteristics are taken into account in the novel approaches presented here.

"Tracking time evolving data streams for short-term traffic forecasting", by Amr Abdullatif, Francesco Masulli, and Stefano Rovetta, addresses the problem of processing large volumes of traffic flow data collected in a real-time setting. A challenging feature of the data is that it is non-stationary: traffic behaviour can change both abruptly and with a gradual drift over time. In order to perform short-term traffic forecasts, the forecaster needs to continuously learn from the data stream, adapting to the current situation as well as detecting anomalous data points.
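To make the last point concrete, a minimal sketch of continuous learning on a non-stationary stream is given below. It is not the authors' actual model: it simply uses an exponentially weighted mean and variance, so that older observations are gradually forgotten as the traffic regime drifts, and flags observations far from the current forecast as anomalous. All class names and parameter values are hypothetical.

```python
# Illustrative online forecaster for a drifting data stream (a sketch,
# not the method from the paper). The forgetting factor alpha controls
# how quickly the model adapts to drift; the threshold flags anomalies.

class OnlineForecaster:
    def __init__(self, alpha=0.3, anomaly_threshold=3.0):
        self.alpha = alpha                  # higher alpha = faster adaptation
        self.threshold = anomaly_threshold  # in units of standard deviations
        self.mean = None
        self.var = 1.0

    def update(self, x):
        """Consume one observation; return (current forecast, is_anomaly)."""
        if self.mean is None:               # first observation initialises the model
            self.mean = x
            return x, False
        error = x - self.mean
        is_anomaly = abs(error) > self.threshold * self.var ** 0.5
        # Learn continuously: the mean tracks drift, the variance tracks spread.
        self.mean += self.alpha * error
        self.var = (1 - self.alpha) * self.var + self.alpha * error ** 2
        return self.mean, is_anomaly

f = OnlineForecaster()
stream = [10.0, 10.5, 9.8, 10.2, 30.0, 10.1]   # 30.0 simulates an abrupt spike
flags = [f.update(x)[1] for x in stream]        # only the spike is flagged
```

Real short-term traffic forecasters are of course far richer (seasonal patterns, spatial correlation between sensors), but the same tension is visible even here: a small forgetting factor smooths out noise yet reacts slowly to genuine regime changes, while a large one adapts quickly but mistakes noise for drift.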
"Using GUHA data mining method in analysing road traffic accidents occurred in the years 2004–2008 in Finland", by Esko Turunen, presents another application in the traffic domain. This paper investigates how traffic accident data collected over time can be used to learn which (risk) factors are strongly correlated with, for example, single-vehicle accidents or accidents leading to severe injuries. Knowledge about these relations can then be used to inform preventive work at the societal level.

"Robust cross-platform workflows: how technical and scientific communities collaborate to develop, test and share best practices for data analysis", by Steffen Möller, Stuart W. Prescott, Lars Wirzenius, Petter Reinholdtsen, Brad Chapman, Pjotr Prins, Stian Soiland-Reyes, Fabian Klötzl, Andrea Bagnacani, Matúš Kalaš, Andreas Tille, and Michael R. Crusoe, lies in the context of bioinformatics, a rapidly evolving area where data volumes are becoming very large (whole genomes of multiple individuals), and where evolving data collection methods, improved data quality, and novel data analysis methods imply that software must adapt and provide new functionality on quite short time scales. Bioinformaticians need to spend significant amounts of time adjusting workflows to the current state of data and software. This paper discusses how to integrate and coordinate open-source packages, e.g. through the emerging Common Workflow Language, to realise such workflows efficiently.

"A review of scalable bioinformatics pipelines", by Bjørn Fjukstad and Lars Ailo Bongo, is a review paper focusing on scalability, an important open question for bioinformatics workflows. Many types of analyses are provided as web services. This means that portals need to be scalable with respect to the number of users accessing them simultaneously, and the service needs to be scalable with respect to increasing data volumes.
The backend system providing the service can be a cloud system or a high-performance computing (HPC) cluster, leading to different issues. An advantage of the cloud is that resources can be provided elastically, depending on the load situation. An HPC system cannot do that in general, and it instead becomes important that the service scales over the hardware with respect to nodes and cores.

As guest editors of this special issue, we would like to thank the journal editors and the editorial staff for their support during the publication process. We would also like to thank the anonymous reviewers who supported the review process for this special issue. Lastly, we want to acknowledge the support of COST (European Cooperation in Science and Technology), which, through the cHiPSet Action, has provided a collaborative environment fostering interaction amongst European, and overseas, researchers across a particularly large number of countries.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


Andrea Bracciali, Elisabeth Larsson. Data-Intensive Modelling and Simulation in Life Sciences and Socio-economical and Physical Sciences. Data Science and Engineering, 2017, 197–198. DOI: 10.1007/s41019-017-0049-x