Scikick: A sidekick for workflow clarity and reproducibility during extensive data analysis

PLOS ONE, Jul 2023

Reproducibility is crucial for scientific progress, yet a clear research data analysis workflow is challenging to implement and maintain. As a result, a record of computational steps performed on the data to arrive at the key research findings is often missing. We developed Scikick, a tool that eases the configuration, execution, and presentation of scientific computational analyses. Scikick allows for workflow configurations with notebooks as the units of execution, defines a standard structure for the project, automatically tracks the defined interdependencies between the data analysis steps, and implements methods to compile all research results into a cohesive final report. Utilities provided by Scikick help turn the complicated management of transparent data analysis workflows into a standardized and feasible practice. Scikick version 0.2.1 code and documentation are available as supplementary material. The Scikick software is available on GitHub (https://github.com/matthewcarlucci/scikick) and is distributed through PyPI (https://pypi.org/project/scikick/) under a GPL-3 license.


Matthew Carlucci (1,2), Tadas Bareikis (1,2), Karolis Koncevičius (2), Povilas Gibas (1,2), Algimantas Kriščiūnas (2), Art Petronis (1,2), Gabriel Oh (1,2,3)*

(1) The Krembil Family Epigenetics Laboratory, The Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, Ontario, Canada
(2) Institute of Biotechnology, Life Sciences Center, Vilnius University, Vilnius, Lithuania
(3) Stanford University School of Medicine, Stanford, California, United States of America

Citation: Carlucci M, Bareikis T, Koncevičius K, Gibas P, Kriščiūnas A, Petronis A, et al. (2023) Scikick: A sidekick for workflow clarity and reproducibility during extensive data analysis. PLoS ONE 18(7): e0289171. https://doi.org/10.1371/journal.pone.0289171

Editor: Anna Bernasconi, Politecnico di Milano, ITALY

Received: April 12, 2023
Accepted: July 13, 2023
Published: July 27, 2023

Peer Review History: PLOS recognizes the benefits of transparency in the peer review process; therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. The editorial history of this article is available here: https://doi.org/10.1371/journal.pone.0289171

Copyright: © 2023 Carlucci et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: All relevant data are within the paper and its Supporting information files.

Funding: This project was supported by the European Social Fund, ec.europa.eu/esf (project No 09.3.3-LMT-K-712-17-0008) under grant agreement with the Research Council of Lithuania (LMTLT; lmt.lt) awarded to A.P. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

1. Introduction

Research reproducibility, in the many forms it takes [1–3], is essential to the scientific method. Multiple insights can often be gained from a single large dataset; however, the breadth of such investigations has placed a heavy burden on researchers who aim to practice full computational transparency. It is essential that analytical procedures are clearly documented throughout an investigation, including details of their intent, background rationale, implementation, and analysis outputs [4, 5]. Without such documentation, investigative decisions, assumptions, and results can lose their context, which in turn lowers the quality of research communication [6].

Computational notebook formats (e.g., Jupyter Notebooks [7] and Rmarkdown [8]) and their associated development environments have paved a path for streamlined generation and sharing of computational results. Notebooks enable investigators to compile the analytical context (i.e., text), implementation (code), and results (figures) within a single document. This results in a report that reflects the entire process of analysis and serves as a transparent lens into how computations unfolded.

Larger projects create a demand for multiple notebooks within the same analysis. To this end, in addition to best practice guidelines [9] and improvements to the notebook format [10], tools have been designed for specific project types: generating reading materials on computational topics (e.g., bookdown [11] and Jupyter Book [12]) or developing software packages fully within notebooks (e.g., nbdev [13]). However, these solutions do not emphasize the ordered and interdependent execution of notebooks common to computational research projects. As such, reproducibility is compromised when projects are not configured to execute notebooks in the correct order, and transparency is compromised when projects do not clearly document this execution order.

The clarity of notebook outputs gives notebooks an advantage over tools whose main purpose is to configure ordered computations (e.g., GNU Make [14], Snakemake [15], and Nextflow [16]). To benefit from both toolsets, researchers often use them in tandem to reproducibly configure the execution of a notebook collection. Further, researchers can produce graphical representations of this configuration to transparently represent the execution to the reader.
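The idea of an ordered, interdependent notebook collection can be illustrated with a minimal sketch. It is not Scikick's implementation, nor that of any tool cited above; the notebook names, the dependency map, and both helper functions are hypothetical and chosen only for illustration. The sketch derives a valid execution order from declared dependencies and lists which notebooks become stale when one of them changes, using only the Python standard library (graphlib, Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical notebook collection: each notebook maps to the set of
# notebooks whose outputs it depends on. These names are illustrative only.
dependencies = {
    "import_data.Rmd": set(),
    "normalize.Rmd": {"import_data.Rmd"},
    "qc_report.Rmd": {"normalize.Rmd"},
    "differential_analysis.Rmd": {"normalize.Rmd"},
    "figures.Rmd": {"qc_report.Rmd", "differential_analysis.Rmd"},
}


def execution_order(deps):
    """Return the notebooks so that each one follows all of its dependencies.

    TopologicalSorter raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(deps).static_order())


def needs_rerun(deps, changed):
    """Return the changed notebook together with every notebook downstream of it."""
    stale = {changed}
    # Walking in execution order guarantees that a notebook's dependencies
    # have already been classified before the notebook itself is checked.
    for notebook in execution_order(deps):
        if deps.get(notebook, set()) & stale:
            stale.add(notebook)
    return stale


order = execution_order(dependencies)
print("Execution order:", order)
print("Stale after editing normalize.Rmd:",
      sorted(needs_rerun(dependencies, "normalize.Rmd")))
```

Declaring only the direct dependencies of each notebook keeps the configuration small; the full execution order and the set of notebooks to re-execute are then derived from it rather than maintained by hand.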
However, assembling and maintaining these configurations throughout evolving projects is cumbersome. Therefore, many rapidly developing projects cannot dedicate the resources necessary for this level of transparency and reproducibility (Fig 1a).

Fig 1. Scikick workflow development use-case, practices, and features. a) An illustration of the problem Scikick aims to address. Left) A schematic of a rendered computational notebook with contextual descriptions accompanying code and results, demonstrating the clarity of the notebook format. Centre) A minimal “notebook collection” where the execution order of notebooks is undocumented and not configured, compromising both transparency and reproducibility. Right) A graphical representation of a workflow management configuration which supplements the notebook collection to execute the notebooks in the specified order. b) The illustration shows the main Scikick features used to manage a collection of notebooks throughout a project. An unstructured collection of notebooks is initially executed by Scikick to generate a structured report. New content inside the workflow, including modifications to (...truncated)


Article home page: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0289171

Matthew Carlucci, Tadas Bareikis, Karolis Koncevičius, Povilas Gibas, Algimantas Kriščiūnas, Art Petronis, Gabriel Oh. Scikick: A sidekick for workflow clarity and reproducibility during extensive data analysis, PLOS ONE, 2023, Volume 18, Issue 7, DOI: 10.1371/journal.pone.0289171