Speech recognition is based on speech as the research object. Through speech signal processing and pattern recognition, the machine automatically recognizes and understands human spoken language. Speech recognition technology is a high technology that allows machines to transform speech signals into corresponding text or commands through recognition and understanding. Speech recognition is a wide-ranging interdisciplinary subject. It has a very close relationship with acoustics, phonetics, linguistics, information theory, pattern recognition theory, and neurobiology. Speech recognition technology is gradually becoming a key technology in computer information processing technology. The application of speech technology has become a competitive emerging high-tech industry.
1. The basic principle of speech recognition
The speech recognition system is essentially a pattern recognition system, including three basic units such as feature extraction, pattern matching, and reference pattern library. Its basic structure is shown in the following figure:
The unknown voice is converted into an electrical signal by a microphone and then added to the input of the recognition system. First, it is preprocessed, and then a voice model is established according to the characteristics of human voice, the input voice signal is analyzed, and the required features are extracted. Create the template required for speech recognition. In the recognition process, the computer should compare the characteristics of the voice template stored in the computer with the input voice signal according to the model of voice recognition, and according to a certain search and matching strategy, find a series of optimal matches with the input voice template. Then according to the definition of this template, the identification result of the computer can be given by looking up the table. Obviously, this optimal result is directly related to the choice of features, the quality of the speech model, and the accuracy of the template.
2. Development history and current status of speech recognition technology
In 1952, Davis et al. Of AT & TBell Lab developed the first person-specific speech enhancement system capable of ten English digits-Audry system. In 1956, Olson and Belar et al. A system of syllable words that uses the spectral parameters obtained by the band-pass filter bank as speech enhancement features. In 1959, Fry and Denes and others tried to build a phoneme to produce 4 vowels and 9 consonants, and used spectrum analysis and pattern matching to make decisions. This greatly improves the efficiency and accuracy of speech recognition. Since then, computer speech recognition has attracted the attention of researchers from various countries and began to enter into the research of speech recognition. In the 1960s, the Soviet Union ’s MaTIn and others proposed endpoint detection of speech end points, which significantly increased the level of speech recognition; Vintsyuk proposed dynamic programming, which was indispensable in future recognition. The important achievements in the late 1960s and early 1970s were the signal linear predictive coding (LPC) technology and dynamic time warping (DTW) technology, which effectively solved the problem of speech signal feature extraction and unequal length speech matching; at the same time, it proposed Vector Quantization (VQ) and Hidden Markov Model (HMM) theory. The combination of speech recognition technology and speech synthesis technology enables people to get rid of the fetters of the keyboard. Instead, it is an easy-to-use, natural, and humanized input method such as voice input. It is gradually becoming the key technology of human-machine interface in information technology.
3. Voice recognition method
At present, the representative speech recognition methods mainly include dynamic time warping technology (DTW), hidden Markov model (HMM), vector quantization (VQ), artificial neural network (ANN), support vector machine (SVM) and other methods.
Dynamic time warping algorithm (Dynamic TIme Warping, DTW) is a simple and effective method in non-specific person speech recognition. This algorithm is based on the idea of ​​dynamic programming and solves the problem of template matching with different pronunciation lengths. It is a voice recognition technology. An earlier and more commonly used algorithm appeared. When applying DTW algorithm for speech recognition, it is to compare the pre-processed and framed speech test signal with the reference speech template to obtain the similarity between them, according to a certain distance measurement to get the similarity between the two templates And choose the best path.
Hidden Markov Model (HMM) is a statistical model in speech signal processing, evolved from Markov chain, so it is a statistical recognition method based on a parameter model. Because its pattern library is formed by repeated training, the best model parameter with the highest probability of matching with the training output signal is not the pre-stored pattern sample, and the likelihood probability between the speech sequence to be recognized and the HMM parameter is used in its recognition process The best state sequence corresponding to the maximum value is used as the recognition output, so it is an ideal speech recognition model.
Vector quantization (Vector QuanTIzaTIon) is an important signal compression method. Compared with HMM, vector quantization is mainly suitable for speech recognition of small vocabulary and isolated words. The process is to combine several scalar data of speech signal waveforms or characteristic parameters into a vector for overall quantization in multi-dimensional space. The vector space is divided into several small regions. Each small region finds a representative vector. The vector that falls into the small region during quantization is replaced by this representative vector. The design of the vector quantizer is to train a good codebook from a large number of signal samples, find a good distortion measurement definition formula from the actual effect, design the best vector quantization system, and use the least amount of search and calculation to calculate the distortion Achieve the largest possible average signal-to-noise ratio.
In the actual application process, people have also studied a variety of methods to reduce complexity, including memoryless vector quantization, memory vector quantization and fuzzy vector quantization methods.
Artificial neural network (ANN) is a new speech recognition method proposed in the late 1980s. It is essentially an adaptive nonlinear dynamics system, simulating the principle of human neural activity, with adaptability, parallelism, robustness, fault tolerance and learning characteristics, and its powerful classification capabilities and input-output mapping capabilities Very attractive in speech recognition. The method is an engineering model that simulates the thinking mechanism of the human brain. It is just the opposite of HMM. Its classification decision-making ability and description of uncertain information are recognized worldwide, but its ability to describe dynamic time signals is not yet satisfactory, usually MLP classifier can only solve the problem of static pattern classification, and does not involve the processing of time series. Although scholars have proposed many structures with feedback, they are still insufficient to characterize the dynamic characteristics of time series such as speech signals. Because ANN can not describe the time dynamic characteristics of speech signals well, ANN is often combined with traditional recognition methods to use their respective advantages for speech recognition to overcome the shortcomings of HMM and ANN. In recent years, significant progress has been made in the research of recognition algorithms combining neural networks and hidden Markov models. Its recognition rate is close to that of hidden Markov model recognition systems, which further improves the robustness and accuracy of speech recognition.
Support vector machine (Support vector machine) is a new learning machine model that uses statistical theory. It uses Structural Risk Minimization (SRM) to effectively overcome the shortcomings of traditional empirical risk minimization methods. Taking into account the training error and generalization ability, it has many excellent performances in solving small samples, nonlinear and high-dimensional pattern recognition, and has been widely used in the field of pattern recognition.
4. Classification of speech recognition system
Speech recognition systems can be categorized according to restrictions on input speech. If considering the relevance of the speaker and the recognition system, the recognition system can be divided into three categories: (1) specific person speech recognition system. Only consider the recognition of the voice of the person. (2) Non-specific person voice system. Recognized speech has nothing to do with people. Usually, a large number of different people's speech databases are used to learn the recognition system. (3) Multi-person identification system. Usually can recognize the voice of a group of people, or become a specific group of voice recognition system, the system only requires training of the voice of the group of people to be recognized.
If you consider the way of speaking, you can also divide the recognition system into three categories: (1) Isolated word speech recognition system. The isolated word recognition system requires a pause after entering each word. (2) Conjunction speech recognition system. The connection word input system requires that each word be pronounced clearly, and some consonants begin to appear. (3) Continuous speech recognition system. Continuous voice input is a continuous fluent input of natural fluency, and a large number of legato and accent will appear.
If considering the vocabulary size of the recognition system, the recognition system can also be divided into three categories: (1) small vocabulary speech recognition system. A speech recognition system that usually includes dozens of words. (2) Speech recognition system with medium vocabulary. The recognition system usually includes hundreds of words to thousands of words. (3) Large vocabulary speech recognition system. Speech recognition systems that usually include thousands to tens of thousands of words. As the computing power of computers and digital signal processors and the accuracy of recognition systems improve, the classification of recognition systems based on vocabulary size also continues to change. It is currently a medium vocabulary recognition system, and it may be a small vocabulary speech recognition system in the future. These different limitations also determine the difficulty of speech recognition systems.
5. Application of voice recognition
The fields where speech recognition can be applied are roughly divided into five categories:
Office or business system. Typical applications include: filling in data forms, database management and control, keyboard function enhancement, and so on.
Manufacturing: In quality control, voice recognition systems can provide a "hands-free" and "eyes-free" inspection (component inspection) for the manufacturing process.
Telecommunications: A very wide range of applications are feasible in dial-up telephone systems, including the automation of operator assistance services, international and domestic remote e-commerce, voice call distribution, voice dialing, and classified orders.
Medical: The main application in this area is to generate and edit professional medical reports from sound.
Others: Including games and toys controlled and operated by voice, voice recognition system to help the disabled, voice control of some non-critical functions in vehicle driving, such as vehicle traffic control system, audio system.
6. The latest development of speech recognition system
The development of speech recognition technology to today, especially small and medium vocabulary non-specific person speech recognition system recognition accuracy has been greater than 98%, the recognition accuracy of the specific person speech recognition system is higher. These technologies have been able to meet the requirements of common applications. Due to the development of large-scale integrated circuit technology, these complex voice recognition systems can already be made into dedicated chips and mass produced. In Western economically developed countries, a large number of voice recognition products have entered the market and service areas. Some user exchanges, telephones, and mobile phones already include voice recognition dialing functions, voice notepads, voice smart toys, and other products, as well as voice recognition and voice synthesis functions. People can inquire about air ticket, travel, and bank information through the telephone network using voice recognition spoken dialogue system. Survey statistics show that up to 85% of people are satisfied with the performance of the voice recognition information query service system. It can be predicted that in the past 5 years, the application of voice recognition systems will be more extensive, and various voice recognition system products will continue to appear on the market. The role of speech recognition technology in manual mail sorting is also increasingly apparent, with attractive prospects for development. Some postal departments in developed countries have already used this system, and voice recognition technology has gradually become a new technology for mail sorting. It can overcome the shortcomings of manual sorting that rely solely on the sorter's memory, solve the problem of high personnel costs, and improve the efficiency and effectiveness of mail processing. In terms of education, the most direct application of speech recognition technology is to help users better practice language skills.
Another development branch of speech recognition technology is the development of telephone speech recognition technology. Bell Labs is a pioneer in this area. Telephone speech recognition technology will be able to implement telephone inquiry, automatic wiring and some special services such as travel information. After applying the voice inquiry system of voice understanding technology, banks can provide customers with 24-hour telephone banking financial services day and night. As for the securities industry, if a voice recognition system using telephone voice recognition is used, users can directly tell the stock name or code if they want to query the market. After the system confirms the user's request, it will automatically read the latest stock price, which will greatly facilitate the user. . At present, there are a large number of manual services at the 114 number search desk. If voice technology is used, the computer can automatically answer the user's needs and then play back the inquired phone number, thereby saving human resources.
Z18 Bone Conduction Earphone,Outdoor Wireless Sports Headset,Waterproof Bone Conduction Headphones,Outdoor Bone Conducting Headset
Shenzhen Lonfine Innovation Technology Co., Ltd , https://www.lonfinesmart.com