Fully automatic segmentation of glottis and vocal folds in endoscopic laryngeal high-speed videos using a deep Convolutional LSTM Network

PLOS ONE, Feb 2020

The objective investigation of the dynamic properties of vocal fold vibrations demands the recording and quantitative analysis of laryngeal high-speed video (HSV). Quantifying vocal fold vibration patterns requires, as a first step, the segmentation of the glottal area within each video frame, from which the vibrating edges of the vocal folds are usually derived. Consequently, the outcome of any further vibration analysis depends on the quality of this initial segmentation. In this work we propose, for the first time, a procedure to fully automatically segment not only the time-varying glottal area but also the vocal fold tissue directly from HSV using a deep Convolutional Neural Network (CNN) approach. Eighteen different CNN configurations were trained and evaluated on a total of 13,000 HSV frames obtained from 56 healthy and 74 pathologic subjects. The segmentation quality of the best-performing CNN model, which uses Long Short-Term Memory (LSTM) cells to also take the temporal context into account, was investigated in depth on 15 test video sequences comprising 100 consecutive images each. The Dice Coefficient (DC) and the precision of four anatomical landmark positions were used as performance measures. Over all test data, a mean DC of 0.85 was obtained for the glottis, and 0.91 and 0.90 for the right and left vocal fold (VF), respectively. The grand average precision of the identified landmarks amounts to 2.2 pixels and is in the same range as comparable manual expert segmentations, which can be regarded as the gold standard. The proposed method requires no user interaction and overcomes the limitations of current semiautomatic or computationally expensive approaches. Thus, it also allows for the analysis of long HSV sequences and holds the promise of facilitating the objective analysis of vocal fold vibrations in clinical routine. The dataset used here, including the ground truth, will be provided freely to all scientific groups to allow quantitative benchmarking of segmentation approaches in the future.
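
The two performance measures named in the abstract can be illustrated with a small sketch. It is not the authors' evaluation code: the Dice Coefficient follows its standard definition, 2|A∩B| / (|A| + |B|), while the landmark measure assumes that precision means the mean Euclidean pixel distance between predicted and reference positions. The function names, the toy masks and the landmark coordinates are all hypothetical.

```python
import math


def dice_coefficient(pred, truth):
    """Dice Coefficient 2*|A & B| / (|A| + |B|) for two binary masks.

    Masks are equally sized nested lists of 0/1 (or False/True) values.
    """
    overlap = pred_area = truth_area = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            pred_area += 1 if p else 0
            truth_area += 1 if t else 0
            overlap += 1 if (p and t) else 0
    if pred_area + truth_area == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return 2.0 * overlap / (pred_area + truth_area)


def mean_landmark_error(pred_points, truth_points):
    """Mean Euclidean distance in pixels between paired (x, y) landmarks.

    Assumption: this is how landmark precision is measured.
    """
    distances = [
        math.hypot(p[0] - t[0], p[1] - t[1])
        for p, t in zip(pred_points, truth_points)
    ]
    return sum(distances) / len(distances)


# Hypothetical 4x4 ground-truth mask of one structure (8 foreground pixels).
truth_mask = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
# Hypothetical prediction that misses the top row (6 foreground pixels).
pred_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]

# Four hypothetical landmark coordinates (x, y) in pixels.
truth_points = [(10, 5), (10, 40), (4, 22), (16, 22)]
pred_points = [(10, 5), (13, 44), (4, 24), (15, 22)]

print(round(dice_coefficient(pred_mask, truth_mask), 3))  # 0.857
print(mean_landmark_error(pred_points, truth_points))     # 2.0
```

In the toy example the prediction covers 6 of the 8 reference pixels with no false positives, so the Dice Coefficient is 12/14 ≈ 0.857. A full evaluation would compute this per class (glottis, right VF, left VF) and per frame before averaging.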

RESEARCH ARTICLE

Mona Kirstin Fehling1*, Fabian Grosch1, Maria Elke Schuster2, Bernhard Schick3, Jörg Lohscheller1

1 Department of Computer Science, Trier University of Applied Sciences, Schneidershof, Trier, Germany
2 Department of Otorhinolaryngology and Head and Neck Surgery, University of Munich, Campus Grosshadern, München, Germany
3 Department of Otorhinolaryngology, Saarland University Hospital, Homburg/Saar, Germany

OPEN ACCESS

Citation: Fehling MK, Grosch F, Schuster ME, Schick B, Lohscheller J (2020) Fully automatic segmentation of glottis and vocal folds in endoscopic laryngeal high-speed videos using a deep Convolutional LSTM Network. PLoS ONE 15(2): e0227791. https://doi.org/10.1371/journal.pone.0227791

Editor: Yuanquan Wang, Beijing University of Technology, CHINA

Received: February 21, 2019
Accepted: December 25, 2019
Published: February 10, 2020

Copyright: © 2020 Fehling et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability Statement: The underlying data set is available on Zenodo under the DOI 10.5281/zenodo.3603185.

Funding: This work was supported by the German Research Foundation (DFG), LO-1413/2-2. Computational resources were provided by the High Performance Compute Cluster ‘Elwetritsch’ at the University of Kaiserslautern, which is part of the ‘Alliance of High Performance Computing Rheinland-Pfalz’ (AHRP). We kindly acknowledge the support.

Competing interests: The authors have declared that no competing interests exist.

Introduction

In current post-industrial societies, a major part of the working population relies on well-functioning communication skills. A prerequisite for efficient verbal communication is the production of a proper voice signal, which constitutes the carrier signal of speech. Any impairment of the voice production process directly affects the perceivability of speech and thus the ability to communicate. A cross-sectional survey study carried out by Roy et al. in 2005 showed a lifetime prevalence of voice disorders interfering with verbal communication of up to 29.9% [1]. Work-related absences due to voice disorders, as well as medical consultations, cause significant socioeconomic costs. Therefore, the early diagnosis and effective therapy of voice disorders are of great importance.

The two opposing vocal folds within the larynx serve as the voice-generating structures. During voice production (phonation) they constitute a constriction for the exhaled respiratory airflow provided by the lungs. The interaction between the driving aerodynamic forces and the myoelastic restoring forces of the tissue provokes oscillations of the vocal folds. Although the vocal fold vibration itself is a passive process, its vibration characteristics, e.g. the fundamental frequency f0 (pitch) and intensity, can be altered by adapting the provided air pressure and laryngeal muscle activities [2].
Due to the different sizes of the laryngeal structures in males and females, the fundamental frequency f0 of vocal fold vibrations depends on gender. The mean f0 is around 120 Hz for men and around 200 Hz for women [3].

Understanding the underlying formation mechanisms of voice disorders requires an in-depth investigation and analysis of vocal fold vibration patterns. In healthy subjects, vocal fold vibrations are characterized by symmetric and highly periodic oscillations [2, 4, 5]. In contrast, in the presence of voice disorders, disturbances of the symmetric and periodic oscillation patterns arise, induced by morphological asymmetries or inappropriate muscle tensions [6–8]. To quantify the degree of vibration disturbance, the vocal fold (VF) oscillation patterns need to be investigated during phonation using laryngeal imaging techniques. In clinical practice, videostroboscopy is widely used for the examination of VF vibrations [9]. However, since the sampling rate of videostroboscopic systems is far below the fundamental frequency of voice signals, they fail to adequately capture the real VF vibration characteristics. Currently, laryngeal high-speed videoendoscopy (HSV) is the only technique to record the true intracyclic vibratory behavior of (...truncated)


Article home page: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0227791

Mona Kirstin Fehling, Fabian Grosch, Maria Elke Schuster, Bernhard Schick, Jörg Lohscheller. Fully automatic segmentation of glottis and vocal folds in endoscopic laryngeal high-speed videos using a deep Convolutional LSTM Network, PLOS ONE, 2020, Volume 15, Issue 2, DOI: 10.1371/journal.pone.0227791