Statistical distortion of supervised learning predictions in optical microscopy induced by image compression
www.nature.com/scientificreports
Enrico Pomarico1*, Cédric Schmidt1, Florian Chays1, David Nguyen2, Arielle Planchette2,
Audrey Tissot3, Adrien Roux1, Stéphane Pagès3,4, Laura Batti3, Christoph Clausen5,
Theo Lasser6, Aleksandra Radenovic2, Bruno Sanguinetti5 & Jérôme Extermann1
The growth of data throughput in optical microscopy has triggered the extensive use of supervised
learning (SL) models on compressed datasets for automated analysis. Investigating the effects of
image compression on SL predictions is therefore pivotal to assess their reliability, especially for
clinical use. We quantify the statistical distortions induced by compression through the comparison
of predictions on compressed data to the raw predictive uncertainty, numerically estimated from the
raw noise statistics measured via sensor calibration. Predictions on cell segmentation parameters are
altered by up to 15% and more than 10 standard deviations after 16-to-8 bit pixel depth reduction
and 10:1 JPEG compression. JPEG formats with higher compression ratios show significantly larger
distortions. Interestingly, a recent metrologically accurate algorithm, offering compression ratios of
up to 10:1, provides a prediction spread equivalent to that stemming from raw noise. The method described
here allows one to set a lower bound on the predictive uncertainty of an SL task and can be generalized to
determine the statistical distortions originating from a variety of processing pipelines in AI-assisted
fields.
In recent years, an ever-growing community of optical microscopists has been facing massive data throughput, long-term storage costs, data transfer limitations and, more importantly, the need for automated quantitative data
analysis, which has paved the way for the extensive use of artificial intelligence (AI) methods. Supervised learning
(SL) algorithms are routinely adopted to automate classification, segmentation, and artificial labelling of cellular
or sub-cellular structures1–3, biological tissues4–6, as well as material defects7–9. SL approaches have reported
remarkable results in various fields, such as medical screening10,11, single molecule localization12,13 and drug
discovery14,15. Deep-learning (DL) algorithms have also been successfully employed for micrograph restoration,
in particular for de-noising and spatial resolution enhancement16–18.
However, to deal with large training datasets and computational power constraints, SL models are ubiquitously executed on compressed imaging datasets. Despite producing visually faithful images, lossy compression algorithms are known to remove an unpredictable amount of information from the raw image. Moreover,
compressed data often undergo additional processing before being used to train or test an SL model. Therefore,
image compression can change SL predictions relative to those obtained on raw datasets and lead to unreliable
scientific outcomes, depending on how much the statistical distribution of the final predictions is altered. For this
reason, the statistical distortions induced by compression need to be quantified to investigate the tolerability of
image compression methods for SL applications.
To this end, it is crucial to measure the statistical distribution of the SL outcomes in the absence of compression, in other words the prediction uncertainty associated with raw data. According to Begoli et al.19, the lifecycle of
an AI process from the physical sample to the AI-assisted decisions is affected by multiple sources of uncertainty.
Understanding how image compression affects the statistical distribution of SL outcomes can be framed as
investigating the representational uncertainty of the AI pipeline, which consists of errors due to the data representation adopted for training or testing the SL model.
1HEPIA, HES-SO, University of Applied Sciences and Arts Western Switzerland, Rue de la Prairie 4, 1202 Geneva,
Switzerland. 2Laboratory of Nanoscale Biology, School of Engineering, École Polytechnique Fédérale de Lausanne,
1015 Lausanne, Switzerland. 3Wyss Center for Bio- and Neuroengineering, Geneva, Switzerland. 4Department
of Basic Neurosciences, Geneva Neuroscience Center, Faculty of Medicine, University of Geneva, Geneva,
Switzerland. 5Dotphoton SA, Zeughausgasse 17, 6300 Zug, Switzerland. 6Max-Planck Institute for Polymer
Research, Ackermannweg 10, 55128 Mainz, Germany. *email:
Scientific Reports | (2022) 12:3464 | https://doi.org/10.1038/s41598-022-07445-4
Here, we propose a method for quantifying the statistical distortions induced by compression on SL predictions. We first determine the predictive uncertainty of a trained SL model from the statistical noise of raw
imaging data. Raw noise is measured via sensor calibration. As raw noise is unavoidable, our approach allows
one to estimate the minimal level of representational uncertainty in SL predictions. Then, we compare outcomes
obtained on compressed datasets to the raw predictive uncertainty by using a specific figure of merit, which
indicates how strongly the statistics of the SL outcomes are altered.
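The two steps just described can be illustrated with a rough Python sketch: a Monte Carlo estimate of the raw predictive spread, followed by a comparison of a single prediction against that spread. All names below are assumptions made for illustration. `predict_area` is a toy stand-in for a trained SL model, the noise model is a Gaussian approximation rather than a calibrated sensor model, and the distance in standard deviations is only one plausible figure of merit, not necessarily the one adopted in this work.

```python
import random
import statistics


def predict_area(image, threshold=100.0):
    """Hypothetical stand-in for a trained SL model: counts bright pixels."""
    return sum(1 for value in image if value > threshold)


def noisy_realization(mean_image, rng, read_noise=2.0):
    """Draw one raw-like image (assumed model: variance = mean + read_noise**2)."""
    return [rng.gauss(mu, (mu + read_noise ** 2) ** 0.5) for mu in mean_image]


def raw_prediction_stats(mean_image, n_draws=200, seed=0):
    """Mean and standard deviation of the prediction over simulated raw noise."""
    rng = random.Random(seed)
    predictions = [
        predict_area(noisy_realization(mean_image, rng)) for _ in range(n_draws)
    ]
    return statistics.mean(predictions), statistics.stdev(predictions)


def distortion_in_sigmas(prediction, raw_mean, raw_std):
    """Distance of a prediction from the raw mean, in raw standard deviations."""
    if raw_std <= 0:
        raise ValueError("raw_std must be positive")
    return abs(prediction - raw_mean) / raw_std


# Illustrative mean image: a 1-D intensity ramp crossing the threshold.
mean_image = [80.0 + 0.4 * i for i in range(100)]
raw_mean, raw_std = raw_prediction_stats(mean_image)
```

A prediction obtained on a compressed version of the same image would then be passed to `distortion_in_sigmas` together with `raw_mean` and `raw_std`; values well above 1 would suggest that compression alters the outcome more than raw noise alone.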
We implement this method to investigate the impact of image compression on the outcomes of cell segmentation tasks. To this end, we will consider three types of operations aimed at reducing data volume: pixel depth
reduction, JPEG compression, as well as a metrologically accurate compression technique developed by the
Dotphoton (DP) company (www.dotphoton.com). The DP method reports compression ratios from 5:1 to 10:1
after an initial image preparation step, in which image noise is replaced with a pseudo-random noise that closely
mimics the statistical distribution of the raw pixel values. Although the noise replacement reduces the signal-to-noise ratio of each pixel by 1.2 dB, it enables high compression factors, as the pseudo-noise can be
computed and makes the subsequent application of a standard lossless compression algorithm more efficient20.
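To make two of the quantities above concrete, the following Python sketch shows a simple 16-to-8 bit pixel depth reduction and converts a 1.2 dB signal-to-noise loss into a linear factor. Both parts rest on assumptions: the depth reduction is implemented as plain truncation of the eight least significant bits, which may differ from the scheme used here, and since the text does not state whether the 1.2 dB figure refers to an amplitude or a power ratio, both conventions are shown. The function names are hypothetical.

```python
def reduce_depth_16_to_8(pixel):
    """Map a 16-bit value (0..65535) to 8 bits by dropping the 8 low bits.

    Assumption: plain truncation; rescaling or rounding schemes also exist.
    """
    if not 0 <= pixel <= 0xFFFF:
        raise ValueError("expected a 16-bit pixel value")
    return pixel >> 8


def snr_factor(delta_db, amplitude=True):
    """Linear factor corresponding to an SNR change of delta_db decibels."""
    divisor = 20.0 if amplitude else 10.0
    return 10.0 ** (delta_db / divisor)


row_16bit = [0, 255, 256, 1000, 65535]
row_8bit = [reduce_depth_16_to_8(p) for p in row_16bit]
amplitude_factor = snr_factor(-1.2)               # about 0.87
power_factor = snr_factor(-1.2, amplitude=False)  # about 0.76
```

With truncation, each 8-bit level groups 256 consecutive 16-bit levels, so small intensity differences in the raw data can vanish after the reduction.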
Results
Raw data statistical noise. Raw data, typically obtained through a digitization operation on a physical sample via an acquisition instrument, are intrinsically affected by the noise associated with the acquisition
process. When an optical sensor is used, raw data variability is mainly caused by the quantum noise of the
photons hitting the sensor, as well as by the electronic noise21. Hence, if one performs a sample acquisition under
stabilized illumination conditions, as shown in Fig. 1a, the acquired raw images are not identical and the raw
pixel values display a statistical distribution of average μ and width σ (the standard deviation associated with the
per-pixel noise).
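As a minimal sketch of this behaviour, the Python snippet below simulates repeated readings of a single pixel under constant illumination and estimates μ and σ from them. The model is an assumption chosen for illustration: Poisson photon shot noise plus zero-mean Gaussian electronic noise, with unit gain and no offset. The mean photon count and the read-noise level are arbitrary illustrative values, not measured sensor parameters.

```python
import math
import random
import statistics


def poisson(lam, rng):
    """Sample a Poisson variate with Knuth's method (fine for modest lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def acquire_pixel(mean_photons, rng, read_noise=1.5):
    """One reading of one pixel: photon shot noise plus Gaussian electronic noise."""
    return poisson(mean_photons, rng) + rng.gauss(0.0, read_noise)


def pixel_statistics(mean_photons, n_frames=2000, seed=1):
    """Estimate the average and the width of the pixel value distribution."""
    rng = random.Random(seed)
    readings = [acquire_pixel(mean_photons, rng) for _ in range(n_frames)]
    return statistics.mean(readings), statistics.stdev(readings)


mu, sigma = pixel_statistics(mean_photons=50.0)
# Under the assumed model the two noise variances add.
expected_sigma = math.sqrt(50.0 + 1.5 ** 2)
```

Under these assumptions, `sigma` should come out close to `expected_sigma`, since the shot-noise variance equals the mean photon count and the electronic noise adds in quadrature.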
Raw statistics could in principle be determined by repeating and averaging the acquisition of the same image
several times. However, these tests are often hard to be carried out in a micros (...truncated)