A vocal response time system for use with sentence verification tasks

Behavior Research Methods, Mar 1996

A software system for the reliable detection of vocal response onset is described. The system was designed specifically for the measurement of vocal response times to speech stimuli presented aurally in a sound field in the presence of some background noise. The response time extraction method described here is robust to masking noise and extraneous sounds that may be included in the subject's recorded response. In addition, the response words do not have to be limited to a small set, because the system is able to detect the onset of any speech sound, including low-energy fricatives. The method may be implemented with any computer sound system, because it relies only on the sound conversion clock for timing accuracy and extracts the response by postprocessing the signal after acquisition. The response time extraction technique as currently implemented does not recognize subjects' responses but could be incorporated into an automatic speech recognition system.


CHRIS JAMES
University of Western Ontario, London, Ontario, Canada

I would like to thank D. G. Jamieson and T. Schneider at the Hearing Health Care Research Unit, University of Western Ontario, for their advice. I am also grateful to D. Boles and the anonymous reviewers, and to the Ontario Ministry of Health and the Natural Sciences and Engineering Research Council of Canada for financial support. Address: Otolaryngology, University of Melbourne, 32 Gisbourne St., East Melbourne 3002, Victoria, Australia (jamesc@mail.medoto.unimelb).

... such methods become unreliable and may fail completely. Figure 1 (fine dotted line) shows the amplitude envelope for the response five (solid line) recorded in some background noise. The energy envelope rose rapidly at around 400 msec, whereas word production started at around 300 msec, using the zero-crossing rate as a guide (long dashes).
The zero-crossing rate is just a count of the zero crossings in the signal over a time frame. This rate is simple to extract and effectively indicates the frequency characteristics of the signal. An energy detector would not have been triggered at the beginning of this word. In addition, low-energy thresholds that might improve the timing accuracy in this case would likely have been exceeded prematurely by background noise. There are other ways of characterizing the signal, such as autocorrelation coefficients, but these take longer to compute than the zero-crossing rate.

Figure 1. Characterization of a signal in terms of amplitude and zero-crossing rate. The signal in this case is the response word five, taken from a recording made in the course of an experiment. The classification of the signal as signal or silence is also shown.

Automatic speech recognition systems may be used to both time and record responses. Boles (1988) described how to use the IntroVoice I system with Apple-Psych to collect and time vocal responses. According to Boles (personal communication, 1994), the IntroVoice I system uses the output of a bank of filters as a trigger signal. Some spectral weighting is applied to allow lower trigger thresholds and perhaps improved noise rejection over a simple energy detector. The response times produced by this system coincide with the end of a period of energy exceeding a threshold, and thus they are indicative of the end of the response, not the onset. In addition, a pause of at least 160 msec is required between single utterances. Boles's use of a tone complex of fixed duration to measure the accuracy of the system does not reflect the fact that one may want to recognize speech that does not begin (and/or end) with voicing (i.e., just yes and no). In addition, the length of responses will vary, and this will introduce errors into the response latencies relative to the beginning of production. The system's rejection of background noise and extraneous sounds is also not specified.

The IntroVoice I system is certainly useful for measuring vocal response times when there is no possible interference between the stimulus or conditions and the response. It would be appropriate for use with visual stimuli, and where results are collapsed across different responses so that word lengths may be ignored. In summary, the errors caused by missing weak speech energy at the beginning of the utterance can range into hundreds of milliseconds (as in Figure 1). This is insufficiently accurate for the study of listening effort with sentence verification tasks (SVTs).

REQUIREMENTS

There were various requirements of the system for the present research purposes. It should be (1) automated, reliable, and user friendly; (2) able to cope with a wide range of responses made with free-field stimuli and with interference present (e.g., masking noise); (3) able to integrate with Experiment Control System (ECoS) and Computer Speech Research Environment (CSRE) software (Jamieson, Ramji, Kheirallah, & Nearey, 1992); (4) able to allow response time measurements relative to different parts of the stimulus for interstimulus comparison; and (5) able to produce and record signals with a variety of acquisition platforms.

Here, the focus is on the properties of the response signals and a technique for detecting the onset of the response. The aim is to satisfy the research requirements and to overcome the apparent problems with systems that use only energy detection to obtain response times. The method outlined here is a modified form of that specified by Rabiner and Sambur (1975; see also Rabiner & Schafer, 1978) to detect the end points of isolated utterances for automatic speech recognition. With this technique, both the energy and zero-crossing rate of the signal are used to define the end points of an utterance.
This technique is simpler to use than more advanced techniques, such as those specified by Lamel, Rabiner, Rosenberg, and Wilpon (1981); Wilpon and Rabiner (1987); Savoji (1989); and more recently by Junqua, Mak, and Reaves (1994). Some modifications were made to improve performance under the conditions used in our SVT studies at the Hearing Health Care Research Unit. The resulting algorithm was simple to implement, and the parameter values can be simply related to performance characteristics in terms of precision, speed, and accuracy.

SIGNAL DETECTION

A recording from the microphone can contain several signals: the subject's response, the stimulus (at a relatively reduced amplitude), and some background noise, including any mask signals and extraneous sounds. The combined response signal is shown in Figure 2 (the stimulus signal is to the left and the subject's response is to the right).

Figure 2. Signal and silence detection using the fragment filter method. The sample includes the stimulus signal (left side) and the subject's response (right side).

The first aim in identifying the onset of the vocal response is to identify what constitutes no signal or silence. The recording is segmented into equal-sized, overlapping frames of 10 msec. The mean amplitude and zero-crossing rate of these frames are calculated. The background silence of the recording is characterized by looking for a region, for example, 10 to 20 frames long, of lowest mean amplitude. The mean and standard deviation of amplitude and the mean zero-crossing rate of this region are recorded. These values are used to generate a criterion of what is a signal frame and what is a background frame. The signal detection results for the response five in noise are shown in Figure 1 (lower part).
One can see how the criterion has picked the response signal starting at an increase in zero crossings. In addition, there are two extraneous regions of signal on either side of the response. Reducing the severity of the criterion will increase sensitivity but also increase the number of these spurious regions. However, these spurious points can be removed by a fragment-filter algorithm, as outlined below.

The automatic detection of the signal in background noise can be likened to a human signal detection task in that we can define a two-dimensional state table where we have two null hypotheses: A portion of the recording was classified (1) as signal when it was signal and (2) as noise when it was noise. The alternative hypotheses are that noise was mistaken for signal and signal was mistaken for noise. We can increase or decrease the overall number of signal classifications by adjusting the criterion for classification of the analysis frames. To reduce the rate of false classifications, we can initially specify a minimum duration for an on-off signal period, say, 50 msec. This is similar to an integrator, except that we maintain the actual onset point (not some point where the summed energy is enough to throw the switch). However, if a frame of signal is mistaken for a frame of noise, we may lose the very beginning of the onset of the signal. This occurs for weak fricatives and plosives, especially in background noise.

Here we apply a method of associating short fragments of signal with a portion of signal of sufficient length (fragment filter). We first search for a portion of signal of sufficient length, and then search backward and forward to see whether there are any fragments of signal close enough to be considered part of a signal (say, within 40 msec). We repeat this process until there are no fragments within reach. Thus, short, isolated fragments of signal are ignored without throwing away fragments at the onset of a signal.
Some balancing of the silence rejection time (maximum gap between signal fragments) and signal rejection time (minimum duration of a signal fragment, say, 15 msec) is required to optimize the error rate. The response time is taken from the beginning of the detected response. A further criterion is also imposed: The response signal is the signal with a region of highest mean amplitude; this rejects the stimulus signal if presented in a sound field. The results of this procedure are shown in Figure 2 for a sample containing the stimulus and response recorded free field. We can apply this defragmentation procedure more than once, so that initially, small pieces of signal with small gaps between them can be collected together. These composite fragments of signal and silence may in turn be combined to form larger gaps and larger pieces.

IMPLEMENTATION

An outline of the response time extraction procedure is given in the Appendix. This is expressed in pseudo C programming language. Those processes specific to the method presented here are given in detail. For more information on the calculation of zero-crossing rates and energy for speech signals, see Rabiner and Schafer (1978).

Each frame of response data is assumed to have been prerecorded and either read from an audio data file or collected in real time. For each frame of data, the zero-crossing rate and average amplitude are calculated. The lowest mean amplitude across n contiguous frames is obtained via a search. The size of n should be tuned so that a period of background noise n times the frame length is available in a majority of recordings. The larger the value of n, the better the characterization of the background noise. The average zero-crossing rate is also calculated for these low-energy frames. These parameters are used to characterize background noise or silence. The frame with the largest amplitude is also recorded.
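The frame features and the noise-floor search described above can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the author's code: the function names frame_mean_amp, frame_mean_zeros, and quietest_run are hypothetical, a sample value of zero is treated as positive when counting crossings, and the quietest region is found with a simple sliding sum over per-frame amplitudes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Mean absolute amplitude per sample in one frame of 16-bit PCM. */
double frame_mean_amp(const short *x, size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += abs((int)x[i]);          /* widen first so -32768 is safe */
    return sum / (double)n;
}

/* Zero crossings per sample: sign changes between neighbouring samples.
   ASSUMPTION: a sample equal to zero counts as positive. */
double frame_mean_zeros(const short *x, size_t n)
{
    assert(n > 1);
    size_t crossings = 0;
    for (size_t i = 1; i < n; ++i)
        if ((x[i - 1] >= 0) != (x[i] >= 0))
            ++crossings;
    return (double)crossings / (double)n;
}

/* Index of the first frame of the run of `run` contiguous frames with the
   lowest summed (hence lowest mean) amplitude; `run` plays the role of n
   in the text. Ties keep the earliest run. */
size_t quietest_run(const double *amp, size_t frames, size_t run)
{
    assert(run > 0 && run <= frames);
    double sum = 0.0;
    for (size_t i = 0; i < run; ++i)
        sum += amp[i];
    double best = sum;
    size_t best_start = 0;
    for (size_t i = run; i < frames; ++i) {
        sum += amp[i] - amp[i - run];   /* slide the window by one frame */
        if (sum < best) {
            best = sum;
            best_start = i - run + 1;
        }
    }
    return best_start;
}
```

A caller would compute both features for every frame, pass the amplitude array to quietest_run, and then take the mean and standard deviation of the zero-crossing rates over the returned run as the noise model.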
The basic criterion for classifying a frame as containing silence or signal is generated on the basis of the noise characterization parameters. The actual thresholds of the criterion may be varied to obtain the best results for a particular implementation. After some experimentation, I chose to use a criterion similar to that used by Rabiner and Sambur (1975).

As each frame is classified, contiguous portions of signal and silence may be marked by start points and end points, as shown in Figure 1 (lower dashed frames). These points are then passed to the fragment filter algorithm. The minimum silence and minimum signal durations are specified. New start and end points are produced so that these minima are incorporated. These new points can then be passed again to the fragment filter algorithm with new, larger, minimum durations. This process can be repeated several times and serves to concatenate short pieces of signal that would otherwise be classified as silence by the fragment filter. I found that values of 50 and 25 msec for minimum signal duration and minimum silence duration, respectively, gave good results with single-word responses using SVTs in noise. These periods could be increased or decreased depending on the application. For example, if sentences are to be isolated, the minima would be increased. If words in a continuous discourse are to be isolated, the minima would be reduced.

There is a tradeoff between speed and accuracy. The longer the analysis frames, the better the characterization. The greater the overlap of frames, the greater the accuracy. Longer frames and greater overlap will increase the time required to obtain the response time. The start of the section of signal containing the greatest mean amplitude is used as the total response time (i.e., from the beginning of the presentation of the stimulus).
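The threshold arithmetic given in the Appendix can be sketched as below. The amplitude and zero-crossing thresholds follow the Appendix listing (the lower of 4 times the noise amplitude and the noise amplitude plus 3 percent of the range; the noise zero-crossing mean plus 2 standard deviations). The per-frame decision rule is an assumption, since the text does not state it: here a frame counts as signal if either threshold is exceeded. The names Criterion, make_criterion, and frame_is_signal are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double amp;    /* amplitude criterion */
    double zeros;  /* zero-crossing-rate criterion */
} Criterion;

/* Build the criterion from the noise model:
   low_amp        lowest mean amplitude over the quiet run of frames
   hi_amp         largest frame amplitude in the recording
   noise_zc_mean  mean zero-crossing rate over the quiet run
   noise_zc_sd    standard deviation of that rate */
Criterion make_criterion(double low_amp, double hi_amp,
                         double noise_zc_mean, double noise_zc_sd)
{
    assert(hi_amp >= low_amp);
    if (low_amp < 0.001)
        low_amp = 0.001;                 /* arbitrary floor, as in the Appendix */
    double four_times = 4.0 * low_amp;
    double three_pct = 0.03 * (hi_amp - low_amp) + low_amp;
    Criterion c;
    c.amp = (three_pct < four_times) ? three_pct : four_times;
    c.zeros = noise_zc_mean + 2.0 * noise_zc_sd;
    return c;
}

/* ASSUMPTION: a frame is signal when either feature exceeds its threshold,
   so that a weak fricative with low amplitude but a high zero-crossing
   rate is still caught. The source does not give the exact rule. */
int frame_is_signal(Criterion c, double amp, double zeros)
{
    return amp > c.amp || zeros > c.zeros;
}
```

With a noise amplitude of 100 and a peak of 10000, for example, the 3 percent rule gives 397 and the 4 times rule gives 400, so the amplitude criterion is about 397.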
The vocal-response time measurement method as outlined above was incorporated into a system for obtaining responses to sentences presented in various levels of noise. An IBM-AT-compatible computer was used to control presentation of stimuli and recording of responses via a Tucker Davis Technologies (Gainesville, FL) AP2 interface board and DD1 signal acquisition module. Recording of the response signal was started synchronously with presentation of the stimulus and continued for up to 6 sec. At this stage, the responses are marked by the experimenter. After each subject has completed the task, the recorded responses are processed to produce a label file indicating all regions of signal and the most probable response signal. These response label files are compared with stimulus label files to obtain latencies relative to various parts of the stimuli so that responses to different stimuli can be compared.

[Table; column heading: Mean error/msec.] Note: The response times indicated include the stimulus. The differences between the actual response times and those extracted by the fragment-filter method described herein (ΔT_FF) and by Rabiner and Sambur's (1975) method (ΔT_RS) are listed.

EVALUATION

Dummy runs of experiments were used to acquire real data so that the accuracy of the extraction process could be optimized and made robust. Some of these trials were run in quiet and others with a degree of background noise present. The response labels produced by the system for several hundred response signals, including digits and yes and no, seem accurate to the degree of frame overlap (5 msec in this case). There were a few cases produced in quiet where some 60-Hz pickup was evident, and some responses were produced in fairly low noise (> 30 dB S/N) with very weak frication, which proved to be the most problematic for silence detection.
Rabiner and Sambur (1975) recommended high-pass filtering of signals to eliminate the former problem, but I chose to study the performance of the system without filtering as a worst case. Table 1 compares the performance of the system described here (fragment filter, or FF) with that specified by Rabiner and Sambur (1975) (RS) for the problematic cases outlined above. Table 2 does the same for cases recorded with signal-to-noise ratios down to 20 dB. The Rabiner and Sambur method is used for comparison because it might be considered a standard technique for word isolation in automatic speech recognition. It is certainly superior to any technique based only on simple energy detection.

The actual verbal response times (RT) in Tables 1 and 2 were measured using the CSRE editor. One can play any portion of the recorded response at the beginning of the response. The best measurement of the onset may be made by drawing the end of the playback portion back until no phonation can be heard. The time of this position is the total RT and is indicated by the program. Unfortunately, the human auditory system is still the best detector of speech sounds in noise.

The important aspect of the result for the FF method employed here is the relative frequency of errors of large magnitude. The RS algorithm for our purposes is too sensitive to outlying portions of signal that are classified as signal. The RS algorithm searches backward (and forward if searching for an end) from a portion of sufficient amplitude for frames with zero-crossing rates meeting the criterion. If three such frames are found within a certain time, then the start of the earliest of these frames is marked as the start point. In fact, the RS algorithm is not suitable for measuring response latencies between the end of stimulus and onset of response of less than 250 msec, since it searches this far back from the onset of significant energy in the response word. The FF method as implemented here does this up to 40 msec.
Both methods could be modified to make use of the characteristics of likely response words. For example, yes and no do not start with a long portion of frication, whereas four and five do.

[Table; column heading: Mean error/msec.] Note: These cases were taken randomly from a set of about 700 responses. The response time indicated is the total response time including stimulus. The errors ΔT_FF and ΔT_RS from two extraction methods, fragment-filter (FF) and Rabiner and Sambur (1975) (RS), are compared.

CONCLUSION

A system has been described to measure vocal-response times to speech stimuli. Much attention has been given to the structure of the response signals so that the onset of the response is very accurately detected even in the presence of some background or spurious noise. This system could easily be adapted for use with visual or haptic stimuli where vocal responses are to be elicited. The method is currently implemented on an IBM-compatible PC and integrated with ECoS and CSRE running on Tucker Davis Technologies hardware. A range of computer audio systems could be used. In addition, speech recognition could be added using proprietary or custom software (since the responses are already isolated). The method of detection used here can also be applied to accurately identify portions of silence in a signal before performing other forms of processing. The system appears reliable and accurate in practice and is currently being used to assess SVT materials for use in measures of listener effort with hearing aids.

REFERENCES

ALGARABEL, S., SANMARTIN, J., & AHUIR, F. (1989). A voice-activated key for the Apple Macintosh computer. Behavior Research Methods, Instruments, & Computers, 21, 67-72.
BOLES, D. B. (1988). Voice recognition with the Apple-Psych system. Behavior Research Methods, Instruments, & Computers, 20, 158-163.
HAWLEY, K. J., & IZATT, E. J. (1992).
An inexpensive sound activated

APPENDIX
Vocal Response Time Measurement

Variable naming scheme:
x        a data buffer defined as short int *x.
dX etc.  double-precision floating point (double).
fX etc.  single-precision floating point (float).
lX etc.  long integers (unsigned long or signed long).
where X are hopefully self-explanatory names.

C Code Fragment for Silence/Signal Detection
by Chris James

This code assumes that the digitally recorded response signal is stored in a data file in a 16-bit 2's complement PCM format, sampling rate fSampleRate.

    /* Prototypes and enums etc */
    // returns average number of crossings per sample in a frame
    double FrameMeanZeros(short *audiodata, long lFrameSize);
    // returns average absolute amplitude per sample in a frame
    double FrameMeanAmp(short *audiodata, long lFrameSize);
    // returns arithmetic mean of n double precision numbers
    double dMean(double *data, long n);
    // returns standard deviation of n double precision numbers
    double dStanDev(double *data, long n);
    // are we looking for the beginning of silence (FindOff) or signal (FindOn)
    enum Find {FindOff, FindOn};
    // beginning of fragment

    /* OBTAINING THE ZERO-CROSSINGS AND MEAN AMPLITUDE FOR EACH FRAME */
    // loop for lFrames: this is the number of frames
    // in your entire response file (remember to take into account
    // frame overlap).
    for (l = 0; l < lFrames; ++l) {
        // read in a frame of data to x
        read(hFile, (char *)x, (lFrameSize)*sizeof(short));
        // calculate zero-crossings and mean amplitude
        dZeros[l] = FrameMeanZeros(x, lFrameSize)/fFrameSize;
        dAmp[l] = FrameMeanAmp(x, lFrameSize);
        // what is the current position of the middle of the frame
        // in samples
        lCurPos = (l*(lFrameSize-lOverlap) + lFrameSize/2);
        // when we have sufficient frames look at average
        // across lAvgOver frames.
        if (l > lAvgOver) {  // lAvgOver is mentioned as n in article text
            // calculate mean of mean amplitude
            dAvgAmp = dMean((dAmp+l-lAvgOver), lAvgOver);
            // is the new mean lower than the old lowest
            if (dAvgAmp < dLowAmp) {
                dLowAmp = dAvgAmp;  // store low mean
                // get mean zero-crossings across lAvgOver frames
                dAvgZer = dMean((dZeros+l-lAvgOver), lAvgOver);
                // and standard deviation
                dSdZer = dStanDev((dZeros+l-lAvgOver), lAvgOver);
            }
        }
        // rewind the file by the overlap in samples
        lseek(hFile, (-lOverlap)*sizeof(short), SEEK_CUR);
    }
    // re-store the lowest mean amplitude for the sample
    // this is the model for the background noise
    dAvgAmp = dLowAmp;

    /* OBTAINING THE NOISE REJECTION CRITERION */
    // Now choose the threshold parameters as specified by Rabiner
    if (dAvgAmp < 0.001)  // stop divide by zero next line! arbitrary min
        dAvgAmp = 0.001;
    // add 2 SDs of the zero-crossing rate for the noise, this is
    // the zero-crossing rate criterion.
    dAvgZer += 2*dSdZer;
    dAvgTwe += 2*dSdTwe;
    // multiply the lowest mean amplitude by 4.0
    dLowAmp = 4.0 * dAvgAmp;
    // 3 per cent of the difference between the largest mean amplitude
    // and the lowest plus the lowest
    dAvgAmp = 0.03 * (dHiAmp - dAvgAmp) + dAvgAmp;
    // Rabiner's condition on choice of lowest amplitude
    // whichever is the lower is the amplitude criterion
    if (dAvgAmp > dLowAmp) dAvgAmp = dLowAmp;

    // the actual time in samples
    // (divide by fSampleRate to obtain time)
    ...
    // is the frame silence?
    ...
        // are we looking for the start of silence?
        if (FindFlag == FindOff) {
            // store the current position
            lOffPos[lCnt] = lCurPos;
            // now we want to find signal
            ...
        }
    // no it's signal, are we looking for signal?
    ...
            // store the current position
            lOnPos[lCnt] = lCurPos;
            // now we want to find silence
            ...
            // increment the block counter
            ...

    /* FRAGMENT FILTERING */

Now we have the beginnings (lOnPos) and ends (lOffPos) of blocks of signal stored in arrays. We now pass these arrays to the fragment filter function called "Gobble". We also provide storage for the "filtered" on-off positions, the minimum signal (lMinOn) and silence (lMinOff) durations, and the number of points we put in. The function returns the number of whole on-off blocks found.

    // We may wish to re-filter with larger minimum durations
    for (i = 0; i < iMaxRepeat; ++i) {
        ...
        lOnPos[j] = lNewOn[j];
        ...
        lNewCnt = Gobble(lOffPos, lOnPos, lNewMinOff[i], lNewMinOn[i],
                         lNewCnt, lNewOff, lNewOn);
    }

    /* INTERPRETATION OF OUTPUT */

The arrays lNewOn and lNewOff should be stored. The position lHiAmpPos should lie between a pair of points lNewOn[n] and lNewOff[n+1]. These points denote the beginning and end of the response signal. The positions may be converted to times by dividing by the sampling rate of the digitized response signal.

    /* THE FRAGMENT FILTER FUNCTION */

The working of this function is not immediately obvious! This is the briefest implementation of the algorithm I could think of.

    long Gobble(long *lOff, long *lOn, long lMinOffGap, long lMinOnGap,
                long lPoints, long *lDigOff, long *lDigOn)
    {
        long i=1, j=1, k=0, l=0;
        while (i <= lPoints) {
            // store the current "off"!
            lDigOff[l] = lOff[k];
            // find the next on period larger than the minimum:
            // a period of signal starts at lOn[x] and ends lOff[x+1]!
            // i.e. ignore small bits of signal for the time being
            while ((lOff[i]-lOn[i-1]) < lMinOnGap && i < lPoints) ++i;
            j = i-1;
            // search back for a significant off
            // i.e. if there is less than the minimum off time
            // between bits of signal "join" them up
            while ((lOn[j] - lOff[j]) < lMinOffGap && j > ...) --j;  // i.e. minus-minus
            k = i;
            // ditto search forward
            while ((lOn[k]-lOff[k]) < lMinOffGap && k < lPoints) ++k;
            ...
        }
        return l;  // the number of blocks (pairs of points)
    }
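Because parts of the Gobble listing did not survive in this copy, the fragment-filter idea from the SIGNAL DETECTION section can be illustrated with a separate, self-contained sketch. This is not the author's Gobble function: the Segment type, the function name fragment_filter, and the single-pass structure are assumptions made for illustration. A block of at least min_signal is taken as an anchor, neighbouring fragments separated by gaps of at most max_gap are joined to it backward and forward, and short fragments that no anchor reaches are dropped.

```c
#include <assert.h>
#include <stddef.h>

/* One run of signal: starts at `on`, ends at `off` (any consistent time
   unit, e.g. samples or msec). Input segments must be sorted by time and
   must not overlap. */
typedef struct {
    long on;
    long off;
} Segment;

/* Hypothetical fragment filter. Writes merged blocks to `out` (which must
   have room for n entries) and returns the number of blocks written. */
size_t fragment_filter(const Segment *in, size_t n,
                       long min_signal, long max_gap, Segment *out)
{
    size_t count = 0;   /* blocks written so far */
    size_t start = 0;   /* first input segment not yet used by a block */
    size_t i = 0;
    while (i < n) {
        assert(in[i].on <= in[i].off);
        if (in[i].off - in[i].on < min_signal) {
            ++i;        /* too short to anchor a block; may be joined later */
            continue;
        }
        long on = in[i].on;
        long off = in[i].off;
        /* search backward: join earlier fragments that are close enough,
           which keeps weak onsets such as fricatives */
        size_t j = i;
        while (j > start && on - in[j - 1].off <= max_gap) {
            --j;
            on = in[j].on;
        }
        /* search forward in the same way */
        size_t k = i + 1;
        while (k < n && in[k].on - off <= max_gap) {
            off = in[k].off;
            ++k;
        }
        out[count].on = on;
        out[count].off = off;
        ++count;
        i = k;
        start = k;
    }
    return count;
}
```

Calling the function again on its own output with larger minima corresponds to the repeated filtering described in the text. The response block would then be the one containing the frame of highest mean amplitude, and its onset divided by the sampling rate gives the total response time.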


Chris James. A vocal response time system for use with sentence verification tasks, Behavior Research Methods, 1996, 67-75, DOI: 10.3758/BF03203638