Statistical distortion of supervised learning predictions in optical microscopy induced by image compression



www.nature.com/scientificreports OPEN
Scientific Reports | (2022) 12:3464 | https://doi.org/10.1038/s41598-022-07445-4

Enrico Pomarico1*, Cédric Schmidt1, Florian Chays1, David Nguyen2, Arielle Planchette2, Audrey Tissot3, Adrien Roux1, Stéphane Pagès3,4, Laura Batti3, Christoph Clausen5, Theo Lasser6, Aleksandra Radenovic2, Bruno Sanguinetti5 & Jérôme Extermann1

1HEPIA, HES-SO, University of Applied Sciences and Arts Western Switzerland, Rue de la Prairie 4, 1202 Geneva, Switzerland. 2Laboratory of Nanoscale Biology, School of Engineering, École Polytechnique Fédérale de Lausanne, 1015 Lausanne, Switzerland. 3Wyss Center for Bio- and Neuroengineering, Geneva, Switzerland. 4Department of Basic Neurosciences, Geneva Neuroscience Center, Faculty of Medicine, University of Geneva, Geneva, Switzerland. 5Dotphoton SA, Zeughausgasse 17, 6300 Zug, Switzerland. 6Max-Planck Institute for Polymer Research, Ackermannweg 10, 55128 Mainz, Germany. *email:

The growth of data throughput in optical microscopy has triggered the extensive use of supervised learning (SL) models on compressed datasets for automated analysis. Investigating the effects of image compression on SL predictions is therefore pivotal to assess their reliability, especially for clinical use. We quantify the statistical distortions induced by compression by comparing predictions on compressed data to the raw predictive uncertainty, which is numerically estimated from the raw noise statistics measured via sensor calibration. Predictions on cell segmentation parameters are altered by up to 15% and by more than 10 standard deviations after 16-to-8 bit pixel depth reduction and 10:1 JPEG compression. JPEG formats with higher compression ratios show significantly larger distortions. Interestingly, a recent metrologically accurate algorithm, offering up to 10:1 compression ratio, provides a prediction spread equivalent to that stemming from raw noise. The method described here makes it possible to set a lower bound on the predictive uncertainty of an SL task and can be generalized to determine the statistical distortions originating from a variety of processing pipelines in AI-assisted fields.

In recent years, an ever-growing community of optical microscopists has been facing massive data throughput, long-term storage costs, data transfer limitations and, more importantly, the need for automated quantitative data analysis, which has paved the way for extensive use of artificial intelligence (AI) methods. Supervised learning (SL) algorithms are routinely adopted to automate classification, segmentation, and artificial labelling of cellular or sub-cellular structures1–3, biological tissues4–6, as well as material defects7–9. SL approaches have reported remarkable results in various fields, such as medical screening10,11, single molecule localization12,13 and drug discovery14,15. Deep-learning (DL) algorithms have also been successfully employed for micrograph restoration, in particular for de-noising and spatial resolution enhancement16–18.

However, to deal with large training datasets and computational power constraints, SL models are ubiquitously executed on compressed imaging datasets. Despite producing visually faithful images, lossy compression algorithms are known to remove an unpredictable amount of information from the raw image. Moreover, compressed data often undergo additional processing before being used to train or test an SL model. Image compression can therefore modify SL predictions relative to those obtained on raw datasets and lead to unreliable scientific outcomes, depending on how much the statistical distribution of the final predictions is altered. For this reason, the statistical distortions induced by compression need to be quantified to investigate the tolerability of image compression methods for SL applications. To this end, it is crucial to measure the statistical distribution of the SL outcomes in the absence of compression, in other words the prediction uncertainty associated with raw data. According to Begoli et al.19, the lifecycle of an AI process from the physical sample to the AI-assisted decisions is affected by multiple sources of uncertainty. Understanding how image compression affects the statistical distribution of SL outcomes amounts to investigating the representational uncertainty of the AI pipeline, which consists of errors due to the data representation adopted for training or testing the SL model.

Here, we propose a method for quantifying the statistical distortions induced by compression on SL predictions. We first determine the predictive uncertainty of a trained SL model from the statistical noise of raw imaging data. Raw noise is measured via sensor calibration. As raw noise is unavoidable, our approach allows one to estimate the minimal level of representational uncertainty in SL predictions. Then, we compare outcomes obtained on compressed datasets to the raw predictive uncertainty by using a specific figure of merit, which indicates how strongly the statistics of the SL outcomes are altered.

We implement this method to investigate the impact of image compression on the outcomes of cell segmentation tasks. To this end, we consider three types of operations aimed at reducing data volume: pixel depth reduction, JPEG compression, and a metrologically accurate compression technique developed by the Dotphoton (DP) company (www.dotphoton.com). The DP method reports compression ratios from 5:1 to 10:1 after an initial image preparation step, in which image noise is replaced with a pseudo-random noise that closely mimics the statistical distribution of the raw pixel values.
Although the noise replacement reduces the signal-to-noise ratio of each pixel by 1.2 dB, it enables high compression factors, because the pseudo-noise can be computed and makes the subsequent application of a standard lossless compression algorithm more efficient20.

Results

Raw data statistical noise. Raw data, typically obtained through a digitization operation on a physical sample via an acquisition instrument, are intrinsically affected by the noise associated with the acquisition process. When an optical sensor is used, raw data variability arises mainly from the quantum noise of the photons hitting the sensor, as well as from the electronic noise21. Hence, if one performs a sample acquisition under stabilized illumination conditions, as shown in Fig. 1a, the acquired raw images are not identical and the raw pixel values display a statistical distribution of average μ and width σ (the standard deviation associated with the per-pixel noise). Raw statistics could in principle be determined by repeating the acquisition of the same image several times and averaging. However, these tests are often hard to carry out in a micros (...truncated)
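
As a rough illustration of the per-pixel statistics discussed in this subsection, the sketch below simulates repeated acquisitions of a static scene and estimates μ and σ for each pixel. It assumes a simple sensor model that is not taken from the paper: Poisson-distributed photo-electron counts plus zero-mean Gaussian electronic noise, scaled by a gain and shifted by an offset. The function names and all parameter values (`gain`, `read_noise`, `offset`) are hypothetical and chosen only for the example.

```python
import numpy as np


def simulate_acquisitions(photons, n_frames, gain=1.0, read_noise=2.0,
                          offset=100.0, seed=0):
    """Simulate repeated raw frames of a static scene (assumed toy model).

    photons    : 2D array of expected photo-electron counts per pixel
    n_frames   : number of repeated acquisitions
    gain       : hypothetical conversion factor, digital numbers per electron
    read_noise : hypothetical electronic noise, standard deviation in electrons
    offset     : hypothetical constant dark offset, in digital numbers
    """
    rng = np.random.default_rng(seed)
    shape = (n_frames,) + photons.shape
    # Quantum (shot) noise: photo-electron counts follow a Poisson law.
    shot = rng.poisson(photons, size=shape)
    # Electronic noise: modelled here as zero-mean Gaussian.
    electronic = rng.normal(0.0, read_noise, size=shape)
    return offset + gain * (shot + electronic)


def pixel_statistics(frames):
    """Return the per-pixel average (mu) and width (sigma) over the frames."""
    mu = frames.mean(axis=0)
    sigma = frames.std(axis=0, ddof=1)  # unbiased sample standard deviation
    return mu, sigma


if __name__ == "__main__":
    # Uniform scene with 400 expected photo-electrons per pixel.
    scene = np.full((8, 8), 400.0)
    stack = simulate_acquisitions(scene, n_frames=2000)
    mu, sigma = pixel_statistics(stack)
    print("mean of mu:", mu.mean())
    print("mean of sigma:", sigma.mean())
```

Under this assumed model the per-pixel variance is gain² × (photons + read_noise²), so for 400 photo-electrons, unit gain and a read noise of 2 electrons the estimated σ should lie close to √404 ≈ 20.1. Repeating acquisitions in this way is exactly the brute-force route that the text describes as possible in principle.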