Speech Recognition: Difficult Then, Common Now

Speech recognition is almost as natural as breathing for us, but it has taken computers more than half a century to solve this ‘problem’. Previously, fundamental drawbacks such as poor accuracy, sensitivity to noise, and over-dependence on training for a particular voice meant that speech recognition worked in principle, but not in practice. Accuracy has since improved hugely, often reaching the high nineties in percentage terms, for several reasons: the general increase in the availability of affordable computing power, the advent of the cloud, and the vast numbers of people now using the technology. IBM announced a major milestone in conversational speech recognition when it built a system that achieved a 6.9 percent word error rate. It has continued to push the boundaries since then, reaching an industry record of 5.5 percent. These rates are measured on a very difficult speech recognition task: recorded conversations between humans discussing day-to-day topics like “buying a car.” This recorded corpus, known as the “SWITCHBOARD” corpus, has been used for over two decades to benchmark speech recognition systems.
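The 6.9 and 5.5 percent figures are word error rates: the word-level edit distance (substitutions, insertions, and deletions) between the system's output and a reference transcript, divided by the number of reference words. A minimal sketch of that calculation in Python:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# One wrong word out of six: WER of about 16.7%
print(wer("i want to buy a car", "i want to buy the car"))
```

Real benchmark scoring tools also normalize text and report the substitution/insertion/deletion breakdown, but the core metric is this ratio.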

The 21st century has seen many improvements in this field. In the 2000s, DARPA sponsored two speech recognition programs: Effective Affordable Reusable Speech-to-Text (EARS) in 2002 and Global Autonomous Language Exploitation (GALE). The National Security Agency has made use of a type of speech recognition for keyword spotting since 2006; this technology allows analysts to search through large volumes of recorded conversations and isolate mentions of keywords. Google’s first effort at speech recognition came in 2007 with its first product, “GOOG-411,” a telephone-based directory service. Google voice search is now supported in over 30 languages, and in 2015 Google’s speech recognition reportedly experienced a dramatic performance jump of 49% through new techniques involving deep learning.
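Once audio has been transcribed, keyword spotting of the kind described above reduces to scanning each transcript for a watch-list of terms and recording where they occur. A minimal illustration (the transcripts and keywords here are invented):

```python
def spot_keywords(transcripts: dict, keywords: set) -> list:
    """Return (transcript_id, keyword, word_position) for every keyword hit."""
    hits = []
    watch = {k.lower() for k in keywords}
    for tid, text in transcripts.items():
        for pos, word in enumerate(text.lower().split()):
            token = word.strip(".,!?")  # ignore trailing punctuation
            if token in watch:
                hits.append((tid, token, pos))
    return hits

calls = {
    "call-001": "Requesting frequency change to one two one point five",
    "call-002": "We should discuss buying a car tomorrow",
}
print(spot_keywords(calls, {"frequency", "car"}))
```

Production systems spot keywords directly in the audio, with confidence scores and phonetic matching, but the analyst-facing search works much like this index.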

These advancements in speech recognition technology have diversified its applications. It has been adopted by many healthcare and military organizations:

Health care

Medical documentation

In the health care sector, speech recognition is implemented at the front end or the back end of the medical documentation process. In front-end speech recognition, the provider dictates into a speech-recognition engine, the recognized words are displayed as they are spoken, and the dictator is responsible for editing and signing off on the document. In back-end, or deferred, speech recognition, the provider instead dictates into a digital dictation system; the voice is recognized and a draft document is produced, which is routed along with the original voice file to an editor, who edits and finalizes the draft. Deferred speech recognition is currently the more widely used of the two in the industry.
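The deferred (back-end) workflow described above can be sketched as a small pipeline. The class, function, and file names below are hypothetical, and the recognition step is a stand-in for a real speech-recognition engine:

```python
from dataclasses import dataclass

@dataclass
class DictationJob:
    voice_file: str        # original recording, kept for the editor to check against
    draft: str = ""        # machine-recognized text
    finalized: bool = False

def recognize(audio_path: str) -> str:
    """Stand-in for a real speech-recognition engine."""
    return f"draft transcript of {audio_path}"

def route_to_editor(job: DictationJob) -> DictationJob:
    """Editor reviews the draft against the original audio, then signs off."""
    job.draft = job.draft.replace("draft", "edited")
    job.finalized = True
    return job

job = DictationJob(voice_file="visit_0421.wav")
job.draft = recognize(job.voice_file)   # back-end recognition pass
job = route_to_editor(job)              # draft + audio routed to the editor
print(job.finalized, job.draft)
```

The key design point is that the provider never waits on recognition: dictation, recognition, and editing are decoupled stages, which is why the deferred model scales well for high-volume documentation.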

Therapeutic use

The use of speech recognition software in conjunction with word processors has shown significant benefits, particularly in re-strengthening the short-term memory of patients with brain arteriovenous malformations (AVMs). Further research is needed to determine the cognitive benefits for individuals whose AVMs have been treated using radiologic techniques.


High-performance fighter aircraft

Significant progress in the testing and evaluation of speech recognition in fighter aircraft has taken place in the last decade. Of particular note has been the US program in speech recognition for the Advanced Fighter Technology Integration (AFTI)/F-16 aircraft (F-16 VISTA). In this program, speech recognizers have been operated successfully in fighter aircraft, with applications including setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons-release parameters, and controlling flight displays.

(Photo caption: Air Force Chief of Staff Gen. T. Michael Moseley announced Lightning II as the F-35 name during a Joint Strike Fighter inauguration ceremony on July 7 at Lockheed Martin Aeronautics Co. in Fort Worth, Texas. The F-35 Lightning II is the next-generation strike fighter, featuring an advanced airframe, autonomic logistics, avionics, propulsion systems, stealth, and firepower. U.S. Navy photo/Chief Petty Officer Eric A. Clement)

Speaker-independent systems are also being developed and are under test for the F-35 Lightning II (JSF). This system has produced word-accuracy scores in excess of 98%.

Training air traffic controllers

Training for air traffic controllers (ATC) represents an excellent application for speech recognition systems. Many current ATC training systems require a person to act as a “pseudo-pilot,” engaging in a voice dialog with the trainee controller that simulates the dialog the controller would conduct with pilots in a real ATC situation. Speech recognition techniques can eliminate the need for a person to act as pseudo-pilot, reducing training and support personnel. The USAF, USMC, US Army, US Navy, and FAA, as well as a number of international ATC training organizations, currently use ATC simulators with speech recognition from a variety of vendors.
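At its core, a software pseudo-pilot maps each recognized controller instruction to a simulated pilot readback. A minimal sketch, with an invented command table and phraseology (real ATC simulators use far richer grammars):

```python
# Hypothetical instruction -> readback prefixes; purely illustrative.
READBACKS = {
    "climb and maintain": "climbing to",
    "turn left heading": "left heading",
    "contact tower on": "over to tower,",
}

def pseudo_pilot(recognized_text: str, callsign: str = "N123AB") -> str:
    """Generate a simulated pilot readback for a recognized controller instruction."""
    text = recognized_text.lower()
    for instruction, ack in READBACKS.items():
        if text.startswith(instruction):
            detail = text[len(instruction):].strip()
            return f"{ack} {detail}, {callsign}"
    # Unrecognized instruction: ask the trainee to repeat, as a pilot would
    return f"say again, {callsign}"

print(pseudo_pilot("climb and maintain flight level two four zero"))
```

The speech recognizer supplies `recognized_text`; everything downstream is ordinary dialog logic, which is why replacing the human pseudo-pilot is such a natural fit for the technology.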

