Speaker Recognition - Identifying People by their Voices
George R. Doddington, Member, IEEE




Fig. 4. This figure shows the long-term spectral amplitudes of the speech signals exhibited in Fig. 2, one for each of the five speakers. These spectra were computed over the extent of the utterance and were then smoothed by fitting the amplitude spectrum with a 20th-order LPC model. Notice that there exist significant differences between the spectra, despite the fact that the speakers all spoke the same utterance.

[Figure: long-term spectra plots; panel labels (b) and (d) recoverable; horizontal axis: frequency (kHz); vertical-axis ticks from 80 to 110.]

Fig. 5. This figure shows the long-term spectral amplitudes of the speech signals exhibited in Fig. 3, all for one speaker saying the same utterance under five different conditions. Notice that the spectra for (a) and (b), which were collected at different sessions but otherwise under the same conditions, are quite similar, except perhaps above 3 kHz. The other long-term spectra demonstrate large differences attributable to the various conditions of (c) carbon button microphone, (d) soft (-20-dB) speech, and (e) loud (+20-dB) speech.



likelihood of the long-term average feature vector given the various speaker hypotheses. Most work has used this approach, which appears to yield the best performance today and is therefore the most popular [31], [34]. The second approach, which is intuitively appealing, is to search for specific phonetic events in the incoming speech signal and then to compare the speech features of the detected phonetic events with the features of the matching phonetic events of the reference speakers. The problem with this approach is that errors in detecting phonetic events tend to corrupt the speaker recognition process. Even using speaker-specific phonetic references, the reliability of phonetic detection is currently inadequate to support good speaker recognition performance [28]. A modification of this approach which has provided reasonably good recognition performance is to place a tight threshold on the detection of these reference phonetic events and then to determine the speaker based upon the frequency of detecting a speaker's phonetic events [21], [35]. Another more recent recognition approach avoids the problems of phonetic detection while benefiting from short-term phonetic information about the speech signal [33]. In this approach, the short-term feature vectors are characterized statistically for the reference speakers, and the likelihood of the input speech is then computed based upon this statistical model. Such an approach has been applied to difficult speaker recognition problems involving noise and distortion with some success [38].
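The last approach above (statistical characterization of short-term feature vectors, followed by a likelihood computation on the input speech) can be illustrated with a minimal sketch. The specific statistical model is an assumption made here for illustration: one diagonal-covariance Gaussian per reference speaker, with frames treated as independent. The text does not state which model the cited work used, and the function names and toy data are hypothetical.

```python
import math

def fit_diag_gaussian(frames):
    """Estimate per-dimension mean and variance from a list of feature vectors."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [
        # Floor the variance so that the log-density stays finite.
        max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-6)
        for d in range(dim)
    ]
    return mean, var

def log_likelihood(frames, model):
    """Sum of per-frame log densities (assumption: frames are independent)."""
    mean, var = model
    total = 0.0
    for f in frames:
        for x, m, v in zip(f, mean, var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total

def identify(frames, models):
    """Return the speaker label whose model gives the highest log-likelihood."""
    return max(models, key=lambda name: log_likelihood(frames, models[name]))

# Toy enrollment data for two hypothetical speakers (2-dimensional features).
enroll = {
    "speaker_a": [[1.0, 0.0], [1.2, 0.1], [0.8, -0.1], [1.1, 0.2]],
    "speaker_b": [[-1.0, 2.0], [-0.9, 2.2], [-1.1, 1.8], [-1.2, 2.1]],
}
models = {name: fit_diag_gaussian(frames) for name, frames in enroll.items()}
test_frames = [[0.9, 0.05], [1.05, -0.05]]
best = identify(test_frames, models)
```

Because the score is a sum over frames, no phonetic event has to be detected: every frame contributes evidence, which is the property the text attributes to this approach.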


A critically important aspect of the development of a free-text speaker recognition system is the database used for its development and evaluation. In the earliest studies, subjects read from prepared texts in an environment relatively free of noise. This caused a degree of skepticism about the significance of such results in an operational environment. An important study which addressed this problem was performed by Markel and Davis [30]. In this effort, a linguistically unconstrained database of extemporaneous speech was collected from eleven men and six women over a period of three months. This database was then used to develop a highly successful speaker recognition technology. The features used in this system were based upon the voice pitch and the reflection coefficients of an LPC-10 model. The recognition features were the mean and the standard deviation of these eleven parameters. Using a statistically orthogonal linear transformation of these features, performance of 2-percent identification error (and 4-percent verification error) was achieved for 40-s segments of input speech. This result is excellent for unconstrained extemporaneous speech. Of course, poorer results were obtained for shorter speech segment durations, and recognition error exceeded 20 percent for segment durations below 4 s.
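The feature construction described above (the mean and standard deviation of eleven per-frame parameters, computed over a segment) can be sketched as follows. This is an illustration under stated assumptions, not the Markel and Davis system: the per-feature scaling is a simple diagonal stand-in for the statistically orthogonal linear transformation mentioned in the text, and the frame generator, scale values, and speaker labels are hypothetical toy data rather than real pitch or LPC measurements.

```python
import math

def long_term_stats(frames):
    """Concatenate per-parameter means and standard deviations over a segment."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds

def scaled_distance(x, ref, scale):
    """Squared distance with each feature divided by an assumed spread."""
    return sum(((a - b) / s) ** 2 for a, b, s in zip(x, ref, scale))

def identify(segment, references, scale):
    """Pick the reference speaker nearest to the segment's long-term statistics."""
    feats = long_term_stats(segment)
    return min(references, key=lambda name: scaled_distance(feats, references[name], scale))

def toy_frames(pitch_hz, tilt, n=50):
    """Deterministic toy frames [pitch, k1..k10]; not real speech data."""
    frames = []
    for t in range(n):
        wobble = math.sin(0.3 * t)
        ks = [tilt / (i + 1) + 0.01 * wobble for i in range(10)]
        frames.append([pitch_hz + 5.0 * wobble] + ks)
    return frames

references = {
    "speaker_a": long_term_stats(toy_frames(110.0, 0.5)),
    "speaker_b": long_term_stats(toy_frames(180.0, -0.3)),
}
# Assumed spreads: 11 for the means, then 11 for the standard deviations.
scale = [10.0] + [0.1] * 10 + [2.0] + [0.01] * 10
segment = toy_frames(115.0, 0.45, n=40)
decision = identify(segment, references, scale)
```

Eleven parameters yield a 22-dimensional long-term feature vector per segment. Shorter segments give noisier estimates of these statistics, which is consistent with the error growth at short durations reported above.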
Although the Markel database stands out as a significant step forward in simulation of realistic applications, it does not demonstrate many of the difficulties of operational data. For example, microphone degradation, acoustic noise, and channel distortion were not addressed. A recent effort at Bolt Beranek and Newman Inc. has calibrated the difficulty of free-text speaker recognition on an operational database collected over a radio channel [37], [38]. This database is corrupted by noise (with an average S/N ratio of 19 dB) and emotionally stressful task-oriented activity. Further, the segment durations used for recognition are short. On this database, the best performance achieved over a variety of transformations of the spectrum yielded 30-percent error for segment durations of 2 s on a population of nine male speakers. When cross-channel recognition was attempted (enrolling on data from one radio channel, then attempting recognition on another channel) the error rate rose further to about 50 percent. Attempts at channel equalization were effective only to a modest degree, with up to 5-percent improvement in recognition error rate. This is a sobering demonstration of the difficulty of the operational recognition task.
Performance in speech tasks is determined by the quality of the speech database evaluated, and excellent performance is often quite easy to achieve if the speech data are carefully controlled. As we have seen, recognition error can vary by an order of magnitude, depending on the difficulty of the database. Because of this, it is impossible to make serious comparisons of different recognition approaches unless they are evaluated on the same database. There are several possible solutions to this difficulty. One is simply to use a single standard database for evaluation of all recognition techniques. This will not often be satisfactory because the diversity of applications will demand consideration of many divergent database conditions. A more viable solution would be to calibrate algorithm performance against human listener performance on the same task. Human listeners are generally regarded as good speaker recognizers who maintain relatively good performance under most degrading conditions. This calibration technique, therefore, offers a means of benchmarking the performance of any speaker recognition algorithm and a means of comparing the performance of systems in different application scenarios. Such a technique has been useful in a variety of applications [18], [39].
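The calibration idea can be made concrete with a small sketch: score the algorithm and a panel of human listeners on the same trials, then report the algorithm's error relative to the listeners' error. The ratio used here is one possible calibration measure chosen for illustration; the text does not prescribe a formula, and the trial labels below are invented toy data.

```python
def error_rate(decisions, truth):
    """Fraction of trials where the decision differs from the true speaker."""
    if len(decisions) != len(truth) or not truth:
        raise ValueError("decisions and truth must be non-empty and equal in length")
    wrong = sum(1 for d, t in zip(decisions, truth) if d != t)
    return wrong / len(truth)

def relative_to_human(algorithm_error, human_error):
    """Ratio above 1 means the algorithm errs more often than the listeners."""
    if human_error <= 0.0:
        raise ValueError("human error must be positive to form a ratio")
    return algorithm_error / human_error

# Hypothetical identification trials scored against the same ground truth.
truth     = ["a", "b", "a", "c", "b", "c", "a", "b", "c", "a"]
algorithm = ["a", "b", "c", "c", "a", "c", "a", "b", "b", "b"]
listeners = ["a", "b", "a", "c", "b", "c", "a", "b", "b", "a"]

algo_err = error_rate(algorithm, truth)
human_err = error_rate(listeners, truth)
ratio = relative_to_human(algo_err, human_err)
```

Because both error rates come from the same trials, the ratio partly factors out database difficulty, which is what would allow systems evaluated on different databases to be compared.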

