Scikick: A sidekick for workflow clarity and reproducibility during extensive data analysis
PLOS ONE
RESEARCH ARTICLE
Matthew Carlucci1,2, Tadas Bareikis1,2, Karolis Koncevičius2, Povilas Gibas1,2,
Algimantas Kriščiūnas2, Art Petronis1,2, Gabriel Oh1,2,3*
1 The Krembil Family Epigenetics Laboratory, The Campbell Family Mental Health Research Institute,
Centre for Addiction and Mental Health, Toronto, Ontario, Canada, 2 Institute of Biotechnology, Life Sciences
Center, Vilnius University, Vilnius, Lithuania, 3 Stanford University School of Medicine, Stanford, California,
United States of America
OPEN ACCESS
Citation: Carlucci M, Bareikis T, Koncevičius K,
Gibas P, Kriščiūnas A, Petronis A, et al. (2023)
Scikick: A sidekick for workflow clarity and
reproducibility during extensive data analysis.
PLoS ONE 18(7): e0289171. https://doi.org/10.1371/journal.pone.0289171
Editor: Anna Bernasconi, Politecnico di Milano,
ITALY
Received: April 12, 2023
Abstract
Reproducibility is crucial for scientific progress, yet a clear research data analysis workflow
is challenging to implement and maintain. As a result, a record of computational steps performed on the data to arrive at the key research findings is often missing. We developed Scikick, a tool that eases the configuration, execution, and presentation of scientific
computational analyses. Scikick allows for workflow configurations with notebooks as the
units of execution, defines a standard structure for the project, automatically tracks the
defined interdependencies between the data analysis steps, and implements methods to
compile all research results into a cohesive final report. Utilities provided by Scikick help turn
the complicated management of transparent data analysis workflows into a standardized
and feasible practice. Scikick version 0.2.1 code and documentation are available as supplementary material. The Scikick software is available on GitHub (https://github.com/matthewcarlucci/scikick) and is distributed through PyPI (https://pypi.org/project/scikick/) under
a GPL-3 license.
Accepted: July 13, 2023
Published: July 27, 2023
Peer Review History: PLOS recognizes the
benefits of transparency in the peer review
process; therefore, we enable the publication of
all of the content of peer review and author
responses alongside final, published articles. The
editorial history of this article is available here:
https://doi.org/10.1371/journal.pone.0289171
Copyright: © 2023 Carlucci et al. This is an open
access article distributed under the terms of the
Creative Commons Attribution License, which
permits unrestricted use, distribution, and
reproduction in any medium, provided the original
author and source are credited.
Data Availability Statement: All relevant data are
within the paper and its Supporting information
files.
1. Introduction
Research reproducibility, in the many forms it takes [1–3], is essential to the scientific method.
Multiple insights can often be gained from a single large dataset; however, the breadth of such
investigations has placed a heavy burden on researchers who aim to practice full computational transparency. It is essential that analytical procedures are clearly documented throughout an investigation, including details of their intent, background rationale, implementation,
and analysis outputs [4, 5]. Without such documentation, investigative decisions, assumptions, and results can
lose their context and, in turn, lower the quality of research communication [6].
Computational notebook formats (e.g., Jupyter Notebooks [7] and Rmarkdown [8]) and
their associated development environments have paved a path for streamlined generation and
sharing of computational results. Notebooks enable investigators to compile the analytical context (i.e., text), implementation (code), and results (figures) within a single document. This
results in a report that reflects the entire process of analysis and serves as a transparent lens
into how computations unfolded.
Funding: This project was supported by the
European Social Fund, ec.europa.eu/esf (project No
09.3.3-LMT-K-712-17-0008) under grant
agreement with the Research Council of Lithuania
(LMTLT; lmt.lt) awarded to A.P. The funders had
no role in study design, data collection and
analysis, decision to publish, or preparation of the
manuscript.
Competing interests: The authors have declared
that no competing interests exist.
Developing larger projects creates a demand for using multiple notebooks within the
same analysis. To this end, in addition to best practice guidelines [9] and improvements to the
notebook format [10], tools have been designed for some specific project types: generating
reading materials on computational topics (e.g., bookdown [11] and Jupyter Book [12]) or
developing software packages fully within notebooks (e.g., nbdev [13]). However, these solutions do not emphasize the ordered and interdependent execution of notebooks common to
computational research projects. As such, reproducibility is compromised when projects are
not configured to execute notebooks in the correct order, and transparency is compromised
when projects do not clearly document this execution order.
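To make the notion of ordered, interdependent execution concrete, the following minimal sketch derives a valid execution order from declared notebook dependencies using a topological sort. It is an illustration only and not Scikick's implementation; the notebook names and the `execution_order` helper are hypothetical, and the standard-library `graphlib` module requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Hypothetical project: each notebook maps to the notebooks it depends on.
dependencies = {
    "import.Rmd": set(),
    "normalize.Rmd": {"import.Rmd"},
    "qc.Rmd": {"import.Rmd"},
    "model.Rmd": {"normalize.Rmd", "qc.Rmd"},
}


def execution_order(deps):
    """Return one valid execution order in which every notebook
    appears after all of the notebooks it depends on.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Recording the dependencies explicitly, rather than relying on file naming or memory, is what allows the order to be both recomputed (reproducibility) and shown to a reader (transparency).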
The clarity of notebook outputs gives notebooks an advantage over tools whose main purpose is to configure ordered computations (e.g., GNU Make [14], Snakemake [15], and Nextflow [16]). To
benefit from both toolsets, researchers often use them in tandem to reproducibly configure the
execution of a notebook collection. Further, researchers can produce graphical representations
of this configuration to transparently represent the execution to the reader. However, assembling and maintaining these configurations throughout evolving projects is cumbersome.
Therefore, many rapidly developing projects cannot dedicate the resources necessary for this
level of transparency and reproducibility (Fig 1a).
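As a rough illustration of the graphical representation mentioned above, the sketch below turns a dependency mapping into Graphviz DOT text, which can then be rendered by any DOT-compatible viewer. This is an assumption-laden example and not the output format of Scikick or of the workflow managers cited; the notebook names and the `to_dot` helper are hypothetical.

```python
# Hypothetical project: each notebook maps to the notebooks it depends on.
dependencies = {
    "import.Rmd": set(),
    "normalize.Rmd": {"import.Rmd"},
    "qc.Rmd": {"import.Rmd"},
    "model.Rmd": {"normalize.Rmd", "qc.Rmd"},
}


def to_dot(deps):
    """Render a dependency mapping as Graphviz DOT text.

    Each edge points from an upstream notebook to the notebook
    that must run after it.
    """
    lines = ["digraph workflow {"]
    for notebook in sorted(deps):
        # Declare every node so notebooks without edges still appear.
        lines.append(f'  "{notebook}";')
        for upstream in sorted(deps[notebook]):
            lines.append(f'  "{upstream}" -> "{notebook}";')
    lines.append("}")
    return "\n".join(lines)


dot = to_dot(dependencies)
print(dot)
```

Keeping such a graph in sync by hand as notebooks are added or renamed is the maintenance burden the paragraph above describes.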
Fig 1. Scikick workflow development use-case, practices, and features. a) An illustration of the problem Scikick aims to address. Left) A schematic of a
rendered computational notebook with contextual descriptions accompanying code and results demonstrating the clarity of the notebook format. Centre) A
minimal “notebook collection” where execution order of notebooks is undocumented and not configured, compromising both transparency and
reproducibility. Right) A graphical representation of a workflow management configuration which supplements the notebook collection to execute the
notebooks in the specified order. b) The illustration shows the main Scikick features used to manage a collection of notebooks throughout a project. An
unstructured collection of notebooks is initially executed by Scikick to generate a structured report. New content inside the workflow, including
modifications to (...truncated)