Comparison of Image Transform-Based Features for Visual Speech Recognition in Clean and Corrupted Videos
EURASIP Journal on Image and Video Processing
Hindawi Publishing Corporation
Rowan Seymour, Darryl Stewart, and Ji Ming
School of Electronics, Electrical Engineering and Computer Science, Queen's University of Belfast, Belfast BT7 1NN, Northern Ireland, UK
We present the results of a study into the performance of a variety of different image transform-based feature types for speaker-independent visual speech recognition of isolated digits. This includes the first reported use of features extracted using a discrete curvelet transform. The study compares several methods for selecting features of each feature type and shows the relative benefits of both static and dynamic visual features. The performance of the features is tested on both clean video data and video data corrupted in a variety of ways, to assess each feature type's robustness to potential real-world conditions. One of the test conditions involves a novel form of video corruption we call jitter, which simulates camera and/or head movement during recording.
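The jitter corruption is described in detail later in the paper; purely as an illustrative sketch (the per-frame shift range, the wrap-around behaviour of np.roll, and the use of NumPy are assumptions, not the authors' implementation), such a corruption could be simulated by translating each frame by a small random offset:

```python
import numpy as np

def jitter_frames(frames, max_shift=4, seed=0):
    """Apply a small random translation to each frame to mimic camera/head
    movement during recording (illustrative sketch only).

    frames    : array of shape (num_frames, height, width)
    max_shift : largest horizontal/vertical displacement in pixels (assumed value)
    """
    rng = np.random.default_rng(seed)
    jittered = np.empty_like(frames)
    for i, frame in enumerate(frames):
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        # np.roll shifts the image; the edge wrap-around is a simplification
        jittered[i] = np.roll(frame, shift=(dy, dx), axis=(0, 1))
    return jittered
```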
1. INTRODUCTION
Speech is one of the most natural and important means of
communication between people. Automatic speech
recognition (ASR) can be described as the process of converting
an audio speech signal into a sequence of words by
computer. This allows people to interact with computers in a
way which may be more natural than through interfaces
such as keyboards and mice, and has already enabled many
real-world applications such as dictation systems and
voice-controlled systems. A weakness of most modern ASR
systems is their inability to cope robustly with audio corruption
which can arise from various sources, for example,
environmental noises such as engine noise or other people
speaking, reverberation effects, or transmission channel
distortions caused by the hardware used to capture the audio
signal. Thus one of the main challenges facing ASR researchers
is how to develop ASR systems which are more robust to
these kinds of corruptions that are typically encountered
in real-world situations. One approach to this problem is
to introduce another modality to complement the acoustic
speech information which will be invariant to these sources
of corruption.
It has long been known that humans use available visual
information when trying to understand speech, especially in
noisy conditions [1]. The integral role of visual
information in speech perception is demonstrated by the McGurk
effect [2], where a person is shown a video recording of
one phoneme being spoken, but the sound of a different
phoneme being spoken is dubbed over it. This often results
in the person perceiving that he has heard a third
intermediate phoneme. For example, a visual /ga/ combined with an
acoustic /ba/ is often heard as /da/. A video signal capturing
a speaker’s lip movements is unaffected by the types of
corruptions outlined above and so it makes an intuitive choice
as a complementary modality with audio.
Indeed, as early as 1984, Petajan [3] demonstrated that
the addition of visual information can enable improved
speech recognition accuracy over purely acoustic systems,
as visual speech provides information which is not always
present in the audio signal. Of course it is important that
the new modality provides information which is as
accurate as possible and so there have been numerous studies
carried out to assess and improve the performance of
visual speech recognition. In parallel with this, researchers have
been investigating effective methods for integrating the two
modalities so that maximum benefit can be gained from their
combination.
A visual speech recognition system is very similar to a
standard audio speech recognition system. Figure 1 shows
the different stages of the typical recognition process. Before
the recognition process can begin, the speech models must be
constructed. This is usually performed by analyzing a
training set of suitable video examples, so that the model
parameters for the speech units can be estimated. The speech models
are usually hidden Markov models (HMMs) or artificial neural
networks (ANNs). Once the models are constructed, the
classifier can use them to calculate the most probable speech unit
when given some input video.
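As a rough illustration of this training-and-classification loop, the sketch below trains one Gaussian HMM per digit from visual feature sequences and scores an unseen sequence against each model. The hmmlearn API, the diagonal-covariance Gaussian observation model, and the feature layout are assumptions made for illustration, not the system used in the paper.

```python
import numpy as np
from hmmlearn import hmm

def train_digit_models(training_data, n_states=5):
    """training_data: dict mapping digit label -> list of (T_i, D) feature arrays."""
    models = {}
    for digit, sequences in training_data.items():
        X = np.vstack(sequences)                    # stack all frames of this digit
        lengths = [len(seq) for seq in sequences]   # per-utterance frame counts
        model = hmm.GaussianHMM(n_components=n_states,
                                covariance_type="diag", n_iter=20)
        model.fit(X, lengths)                       # estimate the HMM parameters
        models[digit] = model
    return models

def classify(models, features):
    """Return the digit whose HMM assigns the highest log-likelihood."""
    return max(models, key=lambda d: models[d].score(features))
```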
Visual features will usually be extracted from the video
frames using a process similar to that shown in Figure 2.
Depending on the content of the video (i.e., whether it contains
more than one speaker’s face), it may be necessary to start
with a face detection stage which returns the most likely
location of the speaker’s face in the video frame. The consecutive
stages of face localization and mouth localization provide a
cropped image of the speaker’s mouth.
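A minimal sketch of this localization-and-cropping stage, assuming OpenCV's stock Haar cascade for frontal faces and a single visible speaker (the cascade file, the lower-third mouth heuristic, and the function names are assumptions, not the paper's method), might look like:

```python
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_mouth(frame_gray):
    """Locate the most likely face in a grayscale frame and return a crop of
    its lower third, a crude stand-in for the mouth localization stage."""
    faces = face_cascade.detectMultiScale(frame_gray,
                                          scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    # keep the largest detection as the speaker's face
    x, y, w, h = max(faces, key=lambda r: r[2] * r[3])
    # assume the mouth lies in the lower third of the face bounding box
    return frame_gray[y + 2 * h // 3: y + h, x: x + w]
```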
The lip parameterization stage may be geometric based
or image transform based. Petajan’s original system [3] is an
example of geometric-based (...truncated)