Fig. 4. This figure shows the long-term spectral amplitudes of the speech signals exhibited in Fig. 2, one for each of the five speakers. These spectra were computed over the extent of the utterance and were then smoothed by fitting the amplitude spectrum with a 20th-order LPC model. Notice that there exist significant differences between the spectra, despite the fact that the speakers all spoke the same utterance.
[Fig. 5 graphic: five long-term amplitude spectra, panels (a)-(e), plotted as relative amplitude (80-110 dB) versus frequency (kHz).]
Fig. 5. This figure shows the long-term spectral amplitudes of the speech signals exhibited in Fig. 3, all for one speaker saying the same utterance under five different conditions. Notice that the spectra for (a) and (b), which were collected at different sessions but otherwise under the same conditions, are quite similar, except perhaps above 3 kHz. The other long-term spectra demonstrate large differences attributable to the various conditions of (c) carbon button microphone, (d) soft (−20 dB) speech, and (e) loud (+20 dB) speech.
likelihood of the long-term average feature vector given the various speaker hypotheses. Most work has used this approach; it appears to yield the best performance today and is therefore the most popular [31], [34]. The second approach, which is intuitively appealing, is to search for specific phonetic events in the incoming speech signal and then to compare the features of the detected phonetic events with those of the matching phonetic events of the reference speakers. The problem with this approach is that errors in detecting phonetic events tend to corrupt the speaker recognition process. Even using speaker-specific phonetic references, the reliability of phonetic detection is currently inadequate to support good speaker recognition performance [28]. A modification of this approach which has provided reasonably good recognition performance is to place a tight threshold on the detection of these reference phonetic events and then to determine the speaker from the frequency with which each speaker's phonetic events are detected [21], [35]. Another, more recent recognition approach avoids the problems of phonetic detection while still benefiting from short-term phonetic information about the speech signal [33]. In this approach, the short-term feature vectors are characterized statistically for the reference speakers, and the likelihood of the input speech is then computed from this statistical model. Such an approach has been applied with some success to difficult speaker recognition problems involving noise and distortion [38].
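The statistical approach just described can be sketched as follows. This is a minimal illustration, not the method of [33]: it assumes each reference speaker's short-term feature vectors are summarized by a single diagonal-covariance Gaussian, and the input speech is scored by its average per-frame log-likelihood under each speaker's model. The function names and the Gaussian form are illustrative assumptions.

```python
import numpy as np

def enroll(frames):
    """Estimate a diagonal-covariance Gaussian from a speaker's training
    frames (shape: n_frames x n_features). Illustrative assumption: a
    single Gaussian stands in for a fuller statistical characterization."""
    mu = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-6      # variance floor for stability
    return mu, var

def log_likelihood(frames, model):
    """Average per-frame Gaussian log-likelihood of the input frames."""
    mu, var = model
    ll = -0.5 * (np.log(2 * np.pi * var) + (frames - mu) ** 2 / var)
    return ll.sum(axis=1).mean()

def identify(frames, models):
    """Return the reference speaker whose model best explains the input."""
    return max(models, key=lambda spk: log_likelihood(frames, models[spk]))
```

Because the score is averaged over frames, no phonetic events need to be detected; every frame of the input contributes to the decision, which is what makes this approach robust to detection errors.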
A critically important aspect of the development of a free-text speaker recognition system is the database used for its development and evaluation. In the earliest studies, subjects read from prepared texts in an environment relatively free of noise. This caused a degree of skepticism about the significance of such results in an operational environment. An important study which addressed this problem was performed by Markel and Davis [30]. In this effort, a linguistically unconstrained database of extemporaneous speech was collected from eleven men and six women over a period of three months. This database was then used to develop a highly successful speaker recognition technology. The features used in this system were based upon the voice pitch and the reflection coefficients of an LPC-10 model. The recognition features were the mean and the standard deviation of these eleven parameters. Using a statistically orthogonal linear transformation of these features, performance of 2-percent identification error (and 4-percent verification error) was achieved for 40-s segments of input speech. This result is excellent for unconstrained extemporaneous speech. Of course, poorer results were obtained for shorter speech segment durations, and recognition error exceeded 20 percent for segment durations below 4 s.
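The long-term-statistics features described above can be sketched as follows. This is a hypothetical illustration of the general idea, not the system of [30]: each utterance is reduced to the mean and standard deviation of its 11 frame-level parameters (pitch plus 10 reflection coefficients), giving a 22-dimensional vector, and vectors are compared in a statistically decorrelated (whitened) space as a stand-in for the orthogonal linear transformation mentioned in the text. All function names and the whitening details are assumptions.

```python
import numpy as np

def long_term_features(params):
    """params: (n_frames, 11) array of per-frame pitch + reflection
    coefficients. Returns the 22-dim vector of per-parameter means
    and standard deviations."""
    return np.concatenate([params.mean(axis=0), params.std(axis=0)])

def whitening_transform(pool):
    """Estimate a decorrelating (whitening) transform from a pool of
    long-term feature vectors, one per enrollment utterance."""
    mu = pool.mean(axis=0)
    cov = np.cov(pool, rowvar=False) + 1e-6 * np.eye(pool.shape[1])
    evals, evecs = np.linalg.eigh(cov)
    return mu, evecs / np.sqrt(evals)    # columns scaled to unit variance

def identify(probe, references, mu, W):
    """Nearest reference speaker in the whitened feature space."""
    z = (probe - mu) @ W
    return min(references,
               key=lambda spk: np.linalg.norm((references[spk] - mu) @ W - z))
```

Distance in the whitened space is a Mahalanobis distance with respect to the pooled covariance, so parameters that vary widely across utterances of the same speaker are automatically down-weighted.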
Although the Markel database stands out as a significant step toward the simulation of realistic applications, it does not exhibit many of the difficulties of operational data. For example, microphone degradation, acoustic noise, and channel distortion were not addressed. A recent effort at Bolt Beranek and Newman Inc. has calibrated the difficulty of free-text speaker recognition on an operational database collected over a radio channel [37], [38]. This database is corrupted by noise (with an average S/N ratio of 19 dB) and was collected during emotionally stressful, task-oriented activity. Further, the segment durations used for recognition are short. On this database, the best performance achieved over a variety of spectral transformations was 30-percent error for 2-s segment durations on a population of nine male speakers. When cross-channel recognition was attempted (enrolling on data from one radio channel, then attempting recognition on another), the error rate rose further, to about 50 percent. Attempts at channel equalization were only modestly effective, improving the recognition error rate by at most 5 percent. This is a sobering demonstration of the difficulty of the operational recognition task.
In speech tasks, performance is determined by the quality of the speech database on which a system is evaluated, and excellent performance is often quite easy to achieve if the speech data are carefully controlled. As we have seen, recognition error can vary by an order of magnitude, depending on the difficulty of the database. Because of this, it is impossible to make serious comparisons of different recognition approaches unless they are evaluated on the same database. There are several possible solutions to this difficulty. One is simply to use a single standard database for the evaluation of all recognition techniques. This will not often be satisfactory, because the diversity of applications demands consideration of many divergent database conditions. A more viable solution is to calibrate algorithm performance against human listener performance on the same task. Human listeners are generally regarded as good speaker recognizers who maintain relatively good performance under most degrading conditions. This calibration technique therefore offers a means of benchmarking the performance of any speaker recognition algorithm and of comparing the performance of systems in different application scenarios. Such a technique has been useful in a variety of applications [18], [39].