Comparison of Image Transform-Based Features for Visual Speech Recognition in Clean and Corrupted Videos

EURASIP Journal on Image and Video Processing, Dec 2007

We present the results of a study into the performance of a variety of image transform-based feature types for speaker-independent visual speech recognition of isolated digits. This includes the first reported use of features extracted using a discrete curvelet transform. The study compares several methods for selecting features of each feature type and shows the relative benefits of both static and dynamic visual features. The performance of the features is tested on clean video data and on video data corrupted in a variety of ways, to assess each feature type's robustness to potential real-world conditions. One of the test conditions involves a novel form of video corruption, which we call jitter, that simulates camera and/or head movement during recording.
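For concreteness, one simple way to simulate the kind of jitter corruption mentioned above is to translate each video frame by a small random offset, mimicking camera shake or head movement. The Python/OpenCV sketch below is only an illustration under assumptions of this example (the shift range, border handling, and random seeding are choices made here, not the procedure used in the paper).

import cv2
import numpy as np

def jitter_frames(frames, max_shift=4, seed=0):
    """Shift each frame by an independent random (dx, dy) offset.

    Illustrative only: max_shift and BORDER_REPLICATE are assumptions of
    this sketch, not the paper's exact jitter corruption.
    """
    rng = np.random.default_rng(seed)
    jittered = []
    for frame in frames:
        dx, dy = rng.integers(-max_shift, max_shift + 1, size=2)
        h, w = frame.shape[:2]
        # 2x3 affine matrix for a pure translation by (dx, dy)
        m = np.float32([[1, 0, dx], [0, 1, dy]])
        jittered.append(cv2.warpAffine(frame, m, (w, h),
                                       borderMode=cv2.BORDER_REPLICATE))
    return jittered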


Rowan Seymour, Darryl Stewart, Ji Ming
School of Electronics, Electrical Engineering and Computer Science, Queen's University of Belfast, Belfast BT7 1NN, Northern Ireland, UK

1. INTRODUCTION

Speech is one of the most natural and important means of communication between people. Automatic speech recognition (ASR) can be described as the process of converting an audio speech signal into a sequence of words by computer. This allows people to interact with computers in a way which may be more natural than through interfaces such as keyboards and mice, and has already enabled many real-world applications such as dictation systems and voice-controlled systems.

A weakness of most modern ASR systems is their inability to cope robustly with audio corruption, which can arise from various sources, for example, environmental noises such as engine noise or other people speaking, reverberation effects, or transmission channel distortions caused by the hardware used to capture the audio signal. Thus one of the main challenges facing ASR researchers is how to develop ASR systems which are more robust to the kinds of corruption typically encountered in real-world situations.

One approach to this problem is to introduce another modality to complement the acoustic speech information, one which is invariant to these sources of corruption. It has long been known that humans use available visual information when trying to understand speech, especially in noisy conditions [1]. The integral role of visual information in speech perception is demonstrated by the McGurk effect [2], where a person is shown a video recording of one phoneme being spoken, but the sound of a different phoneme being spoken is dubbed over it. This often results in the person perceiving that they have heard a third, intermediate phoneme. For example, a visual /ga/ combined with an acoustic /ba/ is often heard as /da/.

A video signal capturing a speaker's lip movements is unaffected by the types of corruption outlined above, and so it makes an intuitive choice as a complementary modality to audio. Indeed, as early as 1984, Petajan [3] demonstrated that the addition of visual information can enable improved speech recognition accuracy over purely acoustic systems, as visual speech provides information which is not always present in the audio signal.
Of course it is important that the new modality provides information which is as accurate as possible, and so numerous studies have been carried out to assess and improve the performance of visual speech recognition. In parallel with this, researchers have been investigating effective methods for integrating the two modalities so that maximum benefit can be gained from their combination.

A visual speech recognition system is very similar to a standard audio speech recognition system. Figure 1 shows the different stages of the typical recognition process. Before the recognition process can begin, the speech models must be constructed. This is usually performed by analyzing a training set of suitable video examples, so that the model parameters for the speech units can be estimated. The speech models are usually hidden Markov models (HMMs) or artificial neural networks (ANNs). Once the models are constructed, the classifier can use them to calculate the most probable speech unit when given some input video.

Visual features will usually be extracted from the video frames using a process similar to that shown in Figure 2 (a rough sketch of such a pipeline is given below). Depending on the content of the video (i.e., whether it contains more than one speaker's face), it may be necessary to start with a face detection stage which returns the most likely location of the speaker's face in the video frame. The consecutive stages of face localization and mouth localization provide a cropped image of the speaker's mouth. The lip parameterization stage may be geometric based or image transform based. Petajan's original system [3] is an example of geometric-based (...truncated)
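As a rough illustration of the Figure 2 pipeline, the sketch below locates a face with an OpenCV Haar cascade, crops an approximate mouth region, and parameterizes it with a 2D discrete cosine transform as one example of an image transform. The cascade file, the crude mouth-region heuristic, the ROI size, and the number of retained coefficients are all assumptions of this sketch, not the configuration reported in the paper.

import cv2
import numpy as np

# Haar cascade shipped with OpenCV; used here purely for illustration.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def mouth_dct_features(frame, roi_size=(32, 32), n_coeffs=30):
    """Return a low-frequency DCT feature vector for the mouth region,
    or None if no face is found. All sizes and thresholds are illustrative."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    # Crude mouth localization: lower-middle portion of the detected face box.
    mouth = gray[y + 2 * h // 3 : y + h, x + w // 4 : x + 3 * w // 4]
    mouth = cv2.resize(mouth, roi_size).astype(np.float32)
    coeffs = cv2.dct(mouth)  # 2D DCT of the mouth ROI
    # Keep a block of low-frequency coefficients (zig-zag scanning omitted).
    k = int(np.ceil(np.sqrt(n_coeffs)))
    return coeffs[:k, :k].flatten()[:n_coeffs]

In a full system of the kind described above, a per-frame sequence of such feature vectors, optionally augmented with temporal derivatives to give dynamic features, would then be passed to the HMM-based classifier.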


Full text (PDF): https://link.springer.com/content/pdf/10.1155%2F2008%2F810362.pdf

Rowan Seymour, Darryl Stewart, Ji Ming. Comparison of Image Transform-Based Features for Visual Speech Recognition in Clean and Corrupted Videos. EURASIP Journal on Image and Video Processing, vol. 2008, no. 1, article ID 810362, 2007. DOI: 10.1155/2008/810362