Speaker Recognition: Identifying People by Their Voices
George R. Doddington, Member, IEEE




Text-Independent Recognition Technology

During the past few years, text-independent (or "free-text") speaker recognition has become an increasingly popular area of research, with a broad spectrum of potential applications. The definition of the free-text speaker recognition task is highly variable, ranging from an acoustically clean and prescribed task to environments where not only is the speech linguistically unconstrained but the acoustic environment is also extremely adverse. Possible applications include forensic use, automatic sorting and classification of intelligence data, and passive security applications through monitoring of voice circuits. In general, applications for free-text speaker recognition have limited control of the conditions which influence system performance. Indeed, the definition of the task as "free-text" connotes a lack of complete control. (It may be assumed that a fixed text would be used if feasible, because better performance is possible if the text is known and calibrated beforehand.) This lack of control leads to corruption of the speech signal and consequently to degraded recognition performance. Corruption of the speech signal occurs in a number of ways, including distortions in the communication channel, additive acoustical noise, and, probably most importantly, increased variability in the speech signal itself. (The speech signal may be expected to vary greatly under operational conditions in which the speaker may be absorbed in a task or involved in an emotionally charged situation.) Thus the free-text recognition task typically confers upon the researcher multiple problems: the input speech is unconstrained, the speaker is uncooperative, and the environmental parameters are uncontrolled.
Research into speaker characteristics and free-text recognition algorithms seems, in a sense, more appealing than fixed-text speaker recognition, because the emphasis is on the search for features and characteristics unique to the individual rather than on artifactual differences that co-vary with particular phonetic environments. Nonetheless, performance of free-text speaker recognition has never approached that achievable within a controlled fixed-text task definition. Perhaps as a result of this, interest in and research on free-text recognition has historically lagged behind fixed-text work. During the last five years, however, research in free-text speaker recognition has matured greatly and interest in the free-text task is now quite high, judging by the relative amount of work in the area [33]-[35]. Focus has shifted from highly controlled databases and laboratory experiments to the processing of actual operational data. One consequence of this realism is that the level of recognition performance achieved lately has, unfortunately, deteriorated, with speaker recognition error rates not infrequently in excess of 20 percent [41].
One of the key issues in developing a text-independent speaker recognition system is to identify appropriate features and measures which will support good recognition performance. Use of the long-term average spectrum as a feature vector was discovered to have potential for free-text recognition during initial exploratory studies of fixed-text recognition using spectral pattern matching techniques [4]. In the Pruzansky study [4], speaker recognition error rate was found to remain undegraded (at 11 percent) even after averaging spectral amplitudes over all frames of speech data into a single reference spectral amplitude vector for each talker. To illustrate this feature vector, the long-term amplitude spectra for the different-speaker utterances shown in Fig. 2 are displayed in Fig. 4, and the long-term spectra for the same-speaker utterances of Fig. 3 are displayed in Fig. 5. Unfortunately, the long-term spectrum is not a good, stable feature vector to use for speaker recognition. The long-term spectrum is obviously sensitive to changes in the spectral response of any interposed communications channel. More important, we have seen that the long-term spectrum is not particularly stable across variations in the speaker's speech effort level. A number of increasingly sophisticated approaches have been developed to overcome some of the more fundamental limitations of a simple Euclidean distance measure on a simple spectral amplitude vector [26], [31], [34]. These approaches typically attempt to stabilize and statistically characterize features which represent the speech spectrum. These features include statistically orthogonal spectral vector combinations, cepstral coefficients, and a variety of LPC-based parameters. Surprisingly, the primary measure of choice remains the spectral amplitude vector, and very little effort has been devoted to the development of other measures such as pitch, formant frequencies, or statistical time functions. One reason for selecting the spectral amplitude vector is that it has typically produced performance superior to other features such as voice-pitch frequency [25].
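The long-term average spectrum and the simple Euclidean distance measure discussed above can be illustrated with a minimal sketch. It is not the procedure of any cited study. It assumes that each frame is already a spectral amplitude vector in decibels (for example, filter-bank outputs), so that a fixed channel response adds a constant offset to each band. The function names, talker labels, and all numeric values are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def long_term_average(frames: Sequence[Sequence[float]]) -> Vector:
    """Average per-frame spectral amplitude vectors into a single vector."""
    if not frames:
        raise ValueError("need at least one frame")
    n_bands = len(frames[0])
    if any(len(frame) != n_bands for frame in frames):
        raise ValueError("all frames must have the same number of bands")
    return [sum(frame[k] for frame in frames) / len(frames)
            for k in range(n_bands)]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(frames: Sequence[Sequence[float]],
             references: Dict[str, Vector]) -> str:
    """Return the talker whose reference vector is nearest to the
    long-term average of the given frames."""
    average = long_term_average(frames)
    return min(references, key=lambda name: euclidean(average, references[name]))


# Hypothetical reference vectors (dB per band), one per enrolled talker.
references = {
    "talker_a": [1.0, 2.0, 3.0],
    "talker_b": [3.0, 2.0, 1.0],
}

# Hypothetical test frames from talker_a over a clean channel.
test_frames = [[1.2, 2.1, 2.9], [0.8, 1.9, 3.1]]
print(identify(test_frames, references))    # talker_a

# The same frames after a hypothetical channel with a fixed spectral tilt.
# In dB the channel response adds to every frame, so it shifts the average.
channel_tilt = [3.0, 0.0, -3.0]
tilted_frames = [[value + gain for value, gain in zip(frame, channel_tilt)]
                 for frame in test_frames]
print(identify(tilted_frames, references))  # talker_b
```

In this toy example the tilt alone moves the averaged vector closer to the wrong reference, which is the channel sensitivity noted in the text. Averaging over frames does nothing to remove a distortion that is itself constant across frames.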
Another key issue in free-text speaker recognition is the general strategy used to make a recognition decision. There have been two distinct approaches to this problem. First is the use of long-term averages. That is, certain features of the speech signal are computed for each incoming frame and are then averaged over a complete segment of speech. A recognition decision is made by computing the statistical

