Speaker Recognition: Identifying People by Their Voices

George R. Doddington, Member, IEEE




Fixed-Text Verification Technology

Of all the forms of automatic speaker recognition technology, the one with the greatest potential for practical application is fixed-text speaker verification. Speaker verification has the potential to add security and convenience to home door locks, automobile ignition switches, automatic teller machines, and bank-by-phone facilities. Fixed-text speaker verification is the form of speaker recognition used for security applications in which a person desires privileged entry or access to some protected resource. This privilege is granted upon verification of the person's identity, which is performed by comparing his voice characteristics with those of a valid user. The speaker in these applications is thus cooperative, which helps immeasurably in achieving good speaker recognition performance. First, he is willing to proffer his identity to the system, which reduces the identification process (who is he?) to a verification process (is he truly who he claims to be?). Second, he is willing to say whatever is requested of him. Finally, he is willing (although perhaps not completely able) to say the requested speech token consistently during each verification attempt.
Because of the high degree of control which can usually be exercised over the speech signal conditions, performance in fixed-text verification is typically much better than for other speaker recognition tasks where the degree of control over the recognition environment is limited. In fact, performance of fixed-text verification has in many cases reached the point where practical application of the technology is being considered [23]. One of these applications, to control physical entry, will be described in detail in the next section.
Perhaps the most compelling applications of speaker verification, however, involve the verification of voices transmitted over telephone lines, where corruption of the speech data by microphone and channel characteristics remains a difficult problem. Although limited success has been achieved, speaker verification over the telephone still offers a challenge to those interested in the development of practical voice verification technology. One notable study of voice verification over the telephone examined the performance of the Bell Laboratories voice verification system for more than 100 men and women in a realistic operational simulation over an extended period of 5 months [24]. This system, which used as speaker discrimination features only the speech energy and voice pitch as a function of time, exhibited a user rejection rate and impostor acceptance rate of about 10 percent initially for new users, with the error rate declining to about half this value for experienced users and fully adapted speaker templates. Another interesting result of this study was the histogram of population statistics versus performance: about half of the users of the system experienced less than 5-percent error, while a small but significant fraction of the population exhibited over 20-percent user rejection or impostor acceptance.
The most typical strategy for performing fixed-text speaker verification is to create a reference file of speech parameters (as a function of time) for each user and then, during verification, to compare the speech parameters of the unknown speaker with the reference parameters at equivalent points in time. That is, the input speech from the unknown speaker is time aligned with the reference speech for the proffered identity, and the distance between corresponding frames is computed and averaged over the utterance. This general technique has served as a basis for almost all approaches to fixed-text speaker verification. An exception to this general strategy is the approach that computes time-averaged statistics of speech frame parameters and bases the verification decision on the estimated likelihood of the reference speaker given these statistics. This latter approach is very much like that for free-text verification, except that the verification utterance is fixed and therefore the time-averaged statistics are far more stable than if the speaker's utterances were unspecified.
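To make the template-matching strategy concrete, the following sketch time aligns an input utterance against a stored reference template with a simple dynamic-time-warping recursion and thresholds the average per-frame Euclidean distance. It is an illustrative sketch only, not any of the systems discussed here; the feature representation, the alignment constraints, and the threshold value are all assumptions.

import numpy as np

def dtw_average_distance(ref: np.ndarray, inp: np.ndarray) -> float:
    """Align input frames to reference frames by dynamic time warping and
    return the average per-frame Euclidean distance along the best path.
    Both arguments are (n_frames, n_features) arrays."""
    R, I = len(ref), len(inp)
    cost = np.full((R + 1, I + 1), np.inf)
    cost[0, 0] = 0.0
    for r in range(1, R + 1):
        for i in range(1, I + 1):
            d = np.linalg.norm(ref[r - 1] - inp[i - 1])
            cost[r, i] = d + min(cost[r - 1, i],      # advance reference only
                                 cost[r, i - 1],      # advance input only
                                 cost[r - 1, i - 1])  # advance both
    return cost[R, I] / max(R, I)

def verify(ref: np.ndarray, inp: np.ndarray, threshold: float = 1.0) -> bool:
    """Accept the claimed identity when the aligned distance is small enough
    (the threshold here is arbitrary and would be tuned per system)."""
    return dtw_average_distance(ref, inp) < threshold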
These two approaches, namely, comparing dynamic speech features at equivalent points in time and comparing average speech feature statistics, have been compared and found to yield similar performance in at least one study [32]. In the Furui study, the speech features included the log area ratios of a 12th-order LPC model and the voice pitch frequency, and good verification performance was achieved for a set of nine men speaking two short Japanese words using either the time-averaged statistical features or the dynamic features. Less than 1-percent error was achieved on input utterances spoken 10 months after enrollment. Error rate increased to more than 3 percent on utterances spoken 5 years after enrollment, however, thus indicating the desirability of periodically updating the speakers' reference data. This good performance was also critically dependent upon careful equalization of the average speech spectrum. Spectral equalization was performed by filtering the input signal with a two-zero critically damped filter adjusted so as to flatten the average input spectrum. This spectral equalization was shown by Furui to reduce the error rate by as much as a factor of two.
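As a rough illustration of the spectral-equalization idea only (this is not Furui's two-zero critically damped filter, whose details are not reproduced here), the sketch below applies adaptive first-order pre-emphasis: the filter coefficient is estimated from the utterance's own autocorrelation so that the long-term spectral tilt is approximately removed before feature analysis.

import numpy as np

def flatten_spectral_tilt(x: np.ndarray) -> np.ndarray:
    """Inverse filter x with 1 - a*z^(-1), where a = R(1)/R(0) is estimated
    from the whole utterance (x is assumed to be a 1-D sample array)."""
    x = x - np.mean(x)                  # remove any DC offset first
    r0 = np.dot(x, x)
    r1 = np.dot(x[1:], x[:-1])
    a = r1 / r0 if r0 > 0 else 0.0      # optimal first-order predictor coefficient
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - a * x[:-1]          # subtract the predicted (tilted) component
    return y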
A concern in the operational use of speaker verification is the susceptibility of the verification algorithms to mimicking. Unfortunately, few formal studies have been conducted on this issue. In one study performed at Bell Laboratories [15], four professional mimics were selected from a much larger set of candidates. These four, who sounded best in the prescreening trials, were then coached intensively and attempted to mimic enrolled users under favorable conditions. The mimics did a rather good job on timing and inflection and achieved an impostor acceptance rate an order of magnitude greater (27 percent on the automatic system) than that of casual impostors. Later versions of this system, which stressed spectral features rather than prosodic features, reduced this mimic error rate considerably, however. An interesting note on mimicry is the comparison of human versus machine performance in distinguishing the voices of identical twins. In the Bell Labs database there was a pair of identical twins, and listeners who otherwise did well in discriminating speakers almost universally accepted the twin as his brother [18]. Interestingly, the machine never confused the twins. This suggests that man and machine are using different features or strategies in making verification decisions, despite the fact that their overall performances are comparable.
There are limits to the performance achievable with voice verification, as previously discussed, and these limits in the speaker verification task often relate to the degree of voice consistency that the user can maintain. This is, among other things, a worthy human factors challenge for the system designer. How do you ensure that the user uses the same rate of speech and the same speech effort level for each verification? More importantly, how do you prevent users from contracting ailments, such as the common cold, that affect voice quality? The essence of the challenge is to control the rare statistics rather than the mean. "Typical" speech input may always yield perfect performance, so that the error performance of a system may be determined by the frequency of occurrence of anomalous speech data. Shrewd control of the system users and the use of robust speech features are key factors in establishing a high-performance speaker verification system.

  1. An Operational Speaker Verification System

Texas Instruments currently uses a voice verification system to control physical entry into its main computer center at corporate headquarters in Dallas. A brief description of this system will serve to illustrate some of the human factors problems encountered and the solutions developed to make a successful operational system. This system has been operational 24 hours per day for more than a decade. To use the system, an entrant first opens the door to the entry booth and walks in, identifies himself by entering a user ID on a keypad, and then repeats the verification phrase(s) that the system prompts him to say. If he is verified, the system says "verified, thank you" and unlocks the inside door of the booth so that he may enter the computer center. If he is not verified, the system notifies him by saying "not verified, call for assistance."
The verification algorithm is fairly straightforward, using as a speech feature vector the output of a 14-channel filter bank whose frequencies are spaced uniformly between 300 and 3000 Hz. The verification decision is based upon the cumulative Euclidean distance between the features of the speaker's reference frames and those of the time-aligned input frames. Time alignment is established at the point of best match between input and reference feature vectors using a simplified form of dynamic time warping.
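A minimal sketch of this kind of filter-bank front end is given below; the sample rate, frame length, and hop size are assumptions for illustration, not the operational system's actual parameters. Each frame yields the log energy in 14 uniformly spaced bands between 300 and 3000 Hz, and the resulting frame sequence could then be compared against a reference template with an alignment-and-distance procedure like the one sketched earlier.

import numpy as np

def filterbank_features(signal: np.ndarray, fs: int = 8000,
                        frame_len: int = 256, hop: int = 128,
                        n_bands: int = 14,
                        f_lo: float = 300.0, f_hi: float = 3000.0) -> np.ndarray:
    """Return one log band-energy vector (n_bands values) per frame."""
    edges = np.linspace(f_lo, f_hi, n_bands + 1)   # uniform band edges
    window = np.hamming(frame_len)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        spectrum = np.abs(np.fft.rfft(window * signal[start:start + frame_len])) ** 2
        feats = [np.log(spectrum[(freqs >= lo) & (freqs < hi)].sum() + 1e-10)
                 for lo, hi in zip(edges[:-1], edges[1:])]
        frames.append(feats)
    return np.asarray(frames)                       # shape: (n_frames, n_bands)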
Verification utterances are constructed randomly to avoid the possibility of being able to defeat the system with a tape recording of a valid user. An innocuous four-word fixed phrase structure is used, with one of sixteen word alternatives filling each of the four word positions. The complete set of words is shown in Table 2. An example verification utterance might be "Proud Ben served hard." These utterances are prompted by voice, which is thought to improve verification performance by stabilizing the pronunciation of the user's utterance. (The user will tend to say the phrase in the same way that it is prompted.)
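The random prompt construction can be illustrated as below: one word is drawn independently for each of the four positions in the fixed phrase structure. The word lists here are placeholders for illustration only; they are not the actual sixteen alternatives per position listed in Table 2.

import random

# Hypothetical word lists; the real system uses sixteen alternatives per position.
WORD_ALTERNATIVES = [
    ["Proud", "Strong", "Young", "Calm"],     # position 1
    ["Ben", "Ann", "Joe", "Kim"],             # position 2
    ["served", "swam", "ran", "worked"],      # position 3
    ["hard", "fast", "near", "north"],        # position 4
]

def make_prompt() -> str:
    """Return a randomly constructed four-word verification phrase."""
    return " ".join(random.choice(words) for words in WORD_ALTERNATIVES)

print(make_prompt())   # e.g., "Proud Ben served hard"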
