Speech recognition and Speech translation technologies driving AI Chatbots and intelligent agents for global businesses, disaster relief and military operations

Speech translation is the process by which conversational spoken phrases are instantly translated and spoken aloud in a second language. A speech translation system typically integrates three software technologies: automatic speech recognition (ASR), machine translation (MT) and voice synthesis, or text-to-speech (TTS). Fast, accurate speech-to-speech translation can globalize the free exchange of information and ideas, allow tourists to communicate easily when traveling in foreign countries, and support global businesses that are increasingly expected to operate 24/7, responding to customers in different markets in near real time.
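The three-stage arrangement described above can be sketched in a few lines of Python. This is only an illustration of how the stages chain together: the functions recognize, translate and synthesize are hypothetical stand-ins built on toy lookup tables, not the API of any real ASR, MT or TTS engine.

```python
# Toy cascade: ASR -> MT -> TTS. All three stages are hypothetical stand-ins.

def recognize(audio: bytes) -> str:
    """ASR stage: audio in, source-language text out (toy lookup table)."""
    fake_transcripts = {b"\x01\x02": "where is the station"}
    return fake_transcripts.get(audio, "")

def translate(text: str) -> str:
    """MT stage: source text in, target text out (toy phrase table)."""
    phrase_table = {"where is the station": "où est la gare"}
    # Fall back to the untranslated text when the phrase is unknown.
    return phrase_table.get(text, text)

def synthesize(text: str) -> bytes:
    """TTS stage: here the 'audio' is simply the UTF-8 bytes of the text."""
    return text.encode("utf-8")

def speech_to_speech(audio: bytes) -> bytes:
    """Chain the three stages; the output of each feeds the next."""
    return synthesize(translate(recognize(audio)))
```

Because each stage consumes the previous stage's output, a mistake made by the recognizer is passed on to the translator and then spoken aloud, which is why the accuracy of every stage matters.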


“The best way we communicate is by talking to each other. It is best if we can recognize each other’s speech and language style. Then, we can teach a machine how to recognize our speech, so you can have a soldier talk to a machine, and that machine operates on his behalf, instead of using a lot of buttons, a lot of typing and a lot of joystick control. Just imagine if you could simply tell the machine what operation you want it to perform, and the machine can go and do that,” said U.S. Army Brigadier General Anthony Potts, RDECOM’s deputy commanding general.


Speech-to-speech translation is also essential for effective situational awareness in military operations in foreign lands, as well as in humanitarian assistance and disaster relief efforts that require immediate and close coordination with local communities.


Automatic Speech Recognition (ASR)

Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, is a capability that enables a program to process human speech into a written format. While it’s commonly confused with voice recognition, speech recognition focuses on the translation of speech from a verbal format to a text one whereas voice recognition just seeks to identify an individual user’s voice.


Speech recognition is the ability of a machine or program to identify words and phrases in spoken language and convert them to a machine-readable format. Rudimentary speech recognition software has a limited vocabulary of words and phrases, and it may only identify these if they are spoken very clearly. More sophisticated software has the ability to accept natural speech.



While speech technology had a limited vocabulary in its early days, it is used in a wide range of industries today, such as automotive, technology, and healthcare. Speech recognizers improve driver safety by enabling voice-activated navigation systems and search capabilities in car radios. Doctors and nurses use dictation applications to capture and log patient diagnoses and treatment notes. Today, speech recognition is applied in fields ranging from finance, human resources and marketing to crime and even public transportation, with the goal of cutting business costs, simplifying outdated processes, improving user experience (UX), and increasing overall efficiency.


Virtual agents are increasingly integrated into our daily lives, particularly on our mobile devices. We access them with voice commands through our smartphones, such as Google Assistant or Apple’s Siri, for tasks such as voice search, or through our speakers, via Amazon’s Alexa or Microsoft’s Cortana, to play music. Voice search uses speech recognition to search the web: instead of typing, users speak their query. Studies show that more than half of all smartphone users now engage with voice technology on their device. Virtual agents will only continue to integrate into the everyday products that we use, fueling the “Internet of Things” movement.


Adoption of speech recognition has continued to accelerate in recent years due to advances in deep learning and big data. Research shows that this market is expected to be worth USD 24.9 billion by 2025.


As technology advances in the areas of cloud computing, data science, and machine learning, speech recognition technology will only improve and change business models in increasingly competitive markets.



Apps like Google Maps use voice commands to interact with drivers on a daily basis, and Amazon Alexa has become a way of life for many Americans, especially now that “nearly one in five U.S. adults today have access to a smart speaker,” according to research from Voicebot.ai. That number isn’t expected to slow down either.


Smart speakers are becoming more commonplace in households around the globe. Early products such as Alexa and Siri were merely novel and entertaining at first; it was fun to ask Alexa silly questions just to see how “she” would respond. But voice technologies have drastically improved, and new hardware like Google Home and Apple HomePod is now moving into the marketplace, with other tech giants rushing to release their own smart speaker technology and integrations to keep pace with consumer demand.


Advances in AI have led to more accurate natural language processing (NLP), which allowed for the rise of chatbots. While early versions of these programs were unnatural and often inaccurate, some modern ones are indistinguishable from human users. This development presents a profitable investment for businesses, but perhaps an even more enticing opportunity for the military. You might have seen a few headlines about Russian chatbots trying to influence people online. In the hands of hostile powers, these AI systems stand as a threat to U.S. defenses, but they may also be the solution.


In the business world, chatbots usually serve as virtual assistants. People ask them questions as they would with another person, and the bot responds with relevant info in natural language. Just as they can help civilians find helpful information, they can provide soldiers with potentially life-saving info.


“In times of crisis, companies are typically left scrambling to respond to changing directives and shifting consumer demand, making streamlined and clear communication critical,” said Alex George, President and CTO of Astute. “Instead of customer service teams becoming overwhelmed with consumer inquiries about everything from company policies, hours and product availability, refunds and more, our AI-powered platform steps in to communicate quickly and consistently with customers and employees on all channels during these times of constant change.”


AI chatbots are also being adopted by the military. The British Ministry of Defence recently commissioned a chatbot to help soldiers access information in the field. Even with a wealth of data, humans under pressure could find it challenging to make sense of their situation and create a plan of action. With the help of this AI, soldiers could get sensible analysis quickly, helping them carry out the mission.



Speech recognition tools still don’t do very well in noisy, crowded or echo-laden places, and they perform worse with poor hardware such as low-quality microphones or with people talking from far away. They can also struggle when people speak quickly or quietly, or have an accent. It is also sometimes hard for computers to understand children and elderly speakers, said Allison Linn of the Microsoft News Center staff. Microsoft is trying to address that problem with technology such as Microsoft Project Oxford’s Custom Recognition Intelligent Services, a forthcoming tool that lets developers build products that deal with those kinds of challenges.


“You want to have voice systems perform well in noisy environments. These include the conditions where the voice intended to be recognized are mixed with other speakers’ voices, such as when playing Xbox or Kinect games with voice control.” “Noise robustness, especially the robustness against other speakers’ voice, in speech recognition component of the system is very important,” said Dang. Human listeners can use attention to focus on the intended speaker, but so far computer systems cannot easily simulate this ability.


“Normal people, when they think about speech recognition, they want the whole thing,” said Hsiao-Wuen Hon, who is managing director of Microsoft Research Asia and also a renowned speech researcher. “They want recognition, they want understanding and they want an action to be taken.” Hon said that means you not only have to solve speech recognition, but you also need to solve natural language, text-to-speech, action planning and execution. Hon refers to this as a system that is “AI complete.”

Speech recognition technology

Speech recognizers are made up of a few components, such as the speech input, feature extraction, feature vectors, a decoder, and a word output. The decoder leverages acoustic models, a pronunciation dictionary, and language models to determine the appropriate output.
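To make the decoder's role concrete, here is a minimal sketch, assuming a toy setup in which the acoustic model has already scored candidate phoneme strings for two audio segments. All of the names, phoneme strings and probabilities below are invented for illustration; a production decoder searches a far larger space with beam search rather than brute-force enumeration.

```python
import itertools
import math

# Hypothetical acoustic model output: per segment, log-probs of phoneme strings.
ACOUSTIC = [
    {"R AY T": math.log(0.9), "R AY D": math.log(0.1)},
    {"HH IY R": math.log(1.0)},
]
# Hypothetical pronunciation dictionary: phoneme string -> candidate words.
LEXICON = {
    "R AY T": ["right", "write"],
    "R AY D": ["ride"],
    "HH IY R": ["here", "hear"],
}
# Hypothetical bigram language model; "<s>" marks the sentence start.
BIGRAM = {
    ("<s>", "right"): math.log(0.4),
    ("<s>", "write"): math.log(0.3),
    ("<s>", "ride"): math.log(0.3),
    ("right", "here"): math.log(0.6),
    ("write", "here"): math.log(0.3),
    ("ride", "here"): math.log(0.3),
}
UNSEEN = math.log(1e-3)  # floor score for word pairs the model never saw

def decode(acoustic, lexicon, bigram):
    """Return the word sequence with the best acoustic + language model score."""
    best_words, best_score = None, -math.inf
    # Enumerate every combination of phoneme strings across the segments.
    for phones in itertools.product(*[seg.items() for seg in acoustic]):
        am_score = sum(log_p for _, log_p in phones)
        # Expand each phoneme string into its dictionary words (homophones).
        for words in itertools.product(*[lexicon[p] for p, _ in phones]):
            lm_score, prev = 0.0, "<s>"
            for word in words:
                lm_score += bigram.get((prev, word), UNSEEN)
                prev = word
            if am_score + lm_score > best_score:
                best_words, best_score = list(words), am_score + lm_score
    return best_words, best_score
```

In this toy case the acoustic model cannot tell "right" from "write", since both share a pronunciation; it is the language model that tips the decision.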


More recently, real-time speech recognition has been moving into the mainstream, enabled by three developments: advanced machine learning algorithms such as deep learning; very large numbers of examples with which to train these algorithms to identify sounds; and large amounts of computing power, available not only in personal computers and mobile devices but also through cloud computing. Successful real-time speech translation requires very sophisticated artificial intelligence, going well beyond speech and text recognition, because we compose our words differently in speech and in text.


Primarily leveraged for deep learning algorithms, neural networks process training data by mimicking the interconnectivity of the human brain through layers of nodes. Each node is made up of inputs, weights, a bias (or threshold) and an output. If that output value exceeds a given threshold, it “fires” or activates the node, passing data to the next layer in the network. Neural networks learn this mapping function through supervised learning, adjusting based on the loss function through the process of gradient descent. While neural networks tend to be more accurate and can accept more data, this comes at a performance efficiency cost as they tend to be slower to train compared to traditional language models.
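The two ideas in this paragraph, a node that fires when its weighted input crosses a threshold and learning by gradient descent on a loss function, can each be shown in a few lines. The sketch below is deliberately tiny: a single threshold node, and a one-weight linear model trained on mean squared error. The function names, learning rate and data are assumptions chosen for illustration, not part of any speech system.

```python
def step_neuron(inputs, weights, bias, threshold=0.0):
    """Weighted sum of inputs plus bias; returns 1 ("fires") above threshold."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > threshold else 0

def train_linear(samples, lr=0.1, epochs=500):
    """Fit y ~ w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        # Gradients of the loss (1/n) * sum((w*x + b - y)^2) w.r.t. w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
        # Step against the gradient to reduce the loss.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Trained on points drawn from y = 2x + 1, such as (0, 1), (1, 3), (2, 5) and (3, 7), train_linear moves w and b toward 2 and 1. Real speech models repeat the same update over millions of weights, which is where the training cost mentioned above comes from.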


Speech recognition technology is evaluated on its accuracy, usually measured as word error rate (WER), and on its speed. A number of factors can affect word error rate, such as pronunciation, accent, pitch, volume, and background noise. Speech also depends on inflection, tone, body language, slang, idiom, mispronunciation, regional dialect and colloquialism.
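Word error rate is conventionally computed as the number of word substitutions, deletions and insertions needed to turn the system's output into the reference transcript, divided by the number of words in the reference. A minimal sketch using word-level edit distance follows; it assumes simple whitespace tokenization and does no text normalization (casing, punctuation), which real evaluations usually apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed part of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hyp
                curr[j - 1] + 1,     # insertion: extra word in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For the reference "the cat sat on the mat" and the hypothesis "the cat sit on mat", there is one substitution and one deletion across six reference words, so the function returns 2/6, or about 33 percent.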


Reaching human parity – meaning an error rate on par with that of two humans speaking – has long been the goal of speech recognition systems. To reach the level of accuracy a human interpreter achieves, these highly intelligent machines must not only convert each word into the target language but also analyze entire phrases and infer their meaning before offering up a translation.


Microsoft made a major breakthrough in speech recognition in 2016, creating a technology that recognizes the words in a conversation as well as a person does. A team of researchers and engineers in Microsoft Artificial Intelligence and Research reported a speech recognition system that makes the same or fewer errors than professional transcriptionists. The researchers reported a word error rate (WER) of 5.9 percent, down from the 6.3 percent WER the team had reported earlier. The 5.9 percent error rate is about equal to that of people who were asked to transcribe the same conversation, and it was the lowest ever recorded against the industry-standard Switchboard speech recognition task. “We’ve reached human parity,” said Microsoft’s chief speech scientist Xuedong Huang in a statement. “This is an historic achievement.” The technology uses neural language models that group similar words together, allowing for efficient generalization.


Microsoft researcher Geoffrey Zweig attributed the accomplishment to the systematic use of the latest neural network technology in all aspects of the system. Deep neural networks use large amounts of data – called training sets – to teach computer systems to recognize patterns from inputs such as images or sounds. To reach the human parity milestone, the team used the Microsoft Cognitive Toolkit (CNTK), a homegrown system for deep learning that the research team has made available on GitHub via an open source license.


Huang said CNTK’s ability to quickly process deep learning algorithms across multiple computers running a specialized chip called a graphics processing unit vastly improved the speed at which they were able to do their research and, ultimately, reach human parity. Moving forward, Zweig said the researchers are working on ways to make sure that speech recognition works well in more real-life settings. That includes places where there is a lot of background noise, such as at a party or while driving on the highway. They’ll also focus on better ways to help the technology assign names to individual speakers when multiple people are talking, and on making sure that it works well with a wide variety of voices, regardless of age, accent or ability.


In the longer term, researchers will focus on ways to teach computers not just to transcribe the acoustic signals that come out of people’s mouths, but instead to understand the words they are saying. That would give the technology the ability to answer questions or take action based on what they are told. “The next frontier is to move from recognition to understanding,” Zweig said.


According to Gartner, “70% of white-collar workers will interact with conversational platforms on a daily basis by 2022.” In short, voice recognition and the artificial intelligence behind it are only going to get more sophisticated. As the design and tech industries move toward total inclusivity, intentional AI is becoming imperative to serve a wider range of demographics and meet the demand for positive user experiences. To become more inclusive, technologists and scientists have begun improving AI to recognize a diverse range of accents and dialects. Research published in the Harvard Business Review found that voice recognition “still has significant gender and racial biases,” underscoring the need for improvement to serve diverse populations without discrimination.


Even with holes in the technology, the industry is saturated with companies experimenting with integrating AI into their products and services through digital voice assistants. One of the industries most affected by the technology is entertainment, with augmented reality games exploding onto the scene (hello, Pokémon Go). Virtual reality and biofeedback in voice-controlled video games are becoming more popular as well.


In addition to changes in the technology itself, the advertising industry will have to respond and adapt. With voice interfaces, it will be increasingly difficult to earn money from visual ads, prompting a revenue shift away from advertising toward subscription models. Social media platforms like Snapchat and TikTok are already using voice in their advertisements. TikTok in particular, an app built and run on AI to give a truly personalized user experience, only strengthens advertisers’ power to reach users. The art of storytelling will continue to be a force in branding in the new age of AI-powered speech recognition technology, enabling platforms to grow in size, power, and purchasing authority.


Voice technology company DAISYS B.V. (Leiden, the Netherlands) announced in December 2021 what it described as a worldwide breakthrough in the development of human-sounding voices by means of artificial intelligence. The innovation, which narrates written text in a natural way, generates new, realistic-sounding voices that do not yet exist. Speech properties like speed and pitch can be adjusted in real time, allowing the voice to be customised.


Barnier Geerling, CEO of DAISYS, explains: ‘This technology makes it easier and faster to apply speech-steered technology. The market potential is enormous, think of audio-visual media using voice-overs, or ‘talking’ cars, robots, or appliances. For manufacturers this means the possibility to integrate realistic speech in their products becomes much easier and more efficient.’


‘We’ve made several important adjustments to the existing basic technology. In addition, we had to cleverly ‘train’ our models, using the right balance of speech data from different speakers. Because of this we’ve managed to generate new, naturally sounding voices that can be real-time adjusted by means of gender, pitch, power and speed,’ explains Dr. ir. Joost Broekens, Chief Technology Officer at DAISYS.


Microsoft’s Skype Translator

Skype Translator is a speech-to-speech translation application developed by Skype, which has operated as a division of Microsoft since 2011. Skype Translator Preview has been publicly available since December 15, 2014. Skype Translator is available as a standalone app and, as of October 2015, is integrated into the Skype for Windows desktop app.


Microsoft released a new version of the Microsoft Translator API that adds real-time speech-to-speech (and speech-to-text) translation capabilities to the existing text translation API. Powered by Microsoft’s artificial intelligence technologies, this capability had by then been available to millions of Skype users for over a year, and to iOS and Android users of the Microsoft Translator apps since late 2015. According to Microsoft, Microsoft Translator was the first end-to-end speech translation solution on the market optimized for real-life conversations (as opposed to simple human-to-machine commands). Previously, speech translation solutions had to be cobbled together from a number of different APIs (speech recognition, translation, and speech synthesis) that were not optimized for conversational speech or designed to work with each other. With the new API, end users and businesses alike can remove language barriers by integrating speech translation into their familiar apps and services.


Skype promises nothing short of a universal translator for desktops and mobile devices, one that would open up endless possibilities for people around the world to connect, communicate and collaborate without being hindered by geography and language. The lofty goal is to make it possible for any human on Earth to communicate with any other human on Earth, with no linguistic divide. “Skype has always been about breaking down barriers,” says Gurdeep Pall, Skype’s corporate vice president. “We think with Skype Translator we’ll be able to fill a gap that’s existed for a long time, really since the beginning of human communication.”


Skype Translator isn’t perfect. It still gets hung up on idioms it doesn’t understand, on uncommon turns of phrase, and on the fact that most of us speak our mother tongues with a certain degree of disregard for proper pronunciation, sentence structure, or diction. Lee and his colleagues at Skype aren’t bothered by this. They’re more interested to see how the system evolves with tens of thousands of users not only testing its limitations but teaching it new aspects of speech and human interaction that Microsoft Research (MSR) hasn’t yet considered.


Skype Translator relies on machine learning, so it should improve over time the more it is used. Microsoft was also able to improve speech recognition performance by 30% by complementing existing Hidden Markov Models with deep neural networks. According to Microsoft, the technology is very promising, and the company hopes that in a few years there will be systems that can completely break down language barriers.
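One common way to combine the two techniques is a hybrid design: the Hidden Markov Model keeps track of how states follow one another over time, while a neural network supplies the per-frame scores for each state. The sketch below illustrates that division of labour with Viterbi decoding. It is not Microsoft's implementation; the two states, the probabilities and the emit_scores table (a stand-in for neural network outputs) are all invented, and a real hybrid system would also rescale the network's outputs by state priors, which is skipped here.

```python
import math

def viterbi(states, start_p, trans_p, emit_scores):
    """Most likely state path through an HMM.

    emit_scores[t][s] is the score of state s at frame t; in a hybrid
    system these values would come from a neural network.
    Returns (log-probability of the best path, the path itself).
    """
    # best[s] = (log-prob of the best path ending in state s, that path)
    best = {
        s: (math.log(start_p[s]) + math.log(emit_scores[0][s]), [s])
        for s in states
    }
    for frame in emit_scores[1:]:
        nxt = {}
        for s in states:
            # Pick the predecessor state that gives the best score into s.
            score, path = max(
                ((best[p][0] + math.log(trans_p[p][s]), best[p][1]) for p in states),
                key=lambda item: item[0],
            )
            nxt[s] = (score + math.log(frame[s]), path + [s])
        best = nxt
    return max(best.values(), key=lambda item: item[0])

# Hypothetical two-state example: silence followed by speech.
STATES = ["sil", "speech"]
START = {"sil": 0.8, "speech": 0.2}
TRANS = {
    "sil": {"sil": 0.7, "speech": 0.3},
    "speech": {"sil": 0.2, "speech": 0.8},
}
FRAMES = [
    {"sil": 0.9, "speech": 0.1},
    {"sil": 0.2, "speech": 0.8},
    {"sil": 0.1, "speech": 0.9},
]
```

Calling viterbi(STATES, START, TRANS, FRAMES) on this toy input selects the path sil, speech, speech. Improving the quality of the per-frame scores, which is the neural network's contribution, changes which path wins without altering the HMM machinery around it.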


Microsoft has announced a technological breakthrough: a new way for people to have a more natural conversation with an AI-powered chatbot. The breakthrough happened in China, where Microsoft’s XiaoIce chatbot can now operate in “full duplex voice sense,” listening to a user and digesting the information while responding at the same time, which makes the exchange feel more natural. Microsoft says the advance also means the chatbot is able to predict what the person talking will say next. Li Zhou, Microsoft’s engineering lead for XiaoIce, says this is a skill set natural to us humans, but not yet common in chatbots.


Microsoft has released an update for its cross-platform Skype app that brings some important changes to translated conversations. Version 8.54 of the app introduces a new built-in Translation feature, which replaces the existing Skype Translator bot that will soon be retired (via Neowin). To use this native Translation feature with one of your Skype contacts, right-click or tap and hold on the contact and select View profile, then click or tap Send translation request to enable the Translated Conversation. Be aware that your contact will also need to be on version 8.54 of the app to accept the translation request. Skype currently supports 11 languages for Translated Conversations, including English, Chinese, French, German, Italian, Spanish, and Japanese.

Google Translate

Google Translate is fast becoming a universal language translation device; a recent update added seamless conversation and foreign text translation. It now supports 103 languages and has millions of users. The new version of the app can act as a real-time translator between two people speaking different languages. The two individuals can speak in their native languages while having a seamless conversation, with their smartphone or tablet acting as interpreter.


In addition to the speech translation functionality, the Google Translate app also has a Word Lens function. This allows users to point their device’s camera at some foreign text, such as a street sign, and get an instant translation on-screen. Word Lens can currently translate between English and French, German, Italian, Portuguese, Russian and Spanish. Google says it is “working to expand to more languages.”


For machine translation, Google is using a form of deep neural network called an LSTM, short for long short-term memory. An LSTM can retain information in both the short and the long term, kind of like your own memory. That allows it to learn in more complex ways. Google says that with certain languages, its new system, dubbed Google Neural Machine Translation, or GNMT, reduces errors by 60 percent. “The key thing about neural network models is that they are able to generalize better from the data,” says Microsoft researcher Arul Menezes. “With the previous model, no matter how much data we threw at them, they failed to make basic generalizations. At some point, more data was just not making them any better.”
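How an LSTM holds on to information can be seen in a single step of the standard LSTM cell, reduced here to scalar values for readability. This is a sketch of the textbook cell equations, not of GNMT itself; the params layout and the gate names are assumptions made for the example.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    params maps a gate name to (input weight, recurrent weight, bias).
    Returns the new hidden state h and the new cell state c.
    """
    def gate(name, activation):
        w_x, w_h, b = params[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)       # how much of the old memory to keep
    i = gate("input", sigmoid)        # how much new information to admit
    g = gate("candidate", math.tanh)  # the new information itself
    o = gate("output", sigmoid)       # how much of the memory to expose
    c = f * c_prev + i * g            # long-term cell state
    h = o * math.tanh(c)              # short-term hidden state
    return h, c
```

When the forget gate is close to 1 and the input gate is close to 0, the cell state c passes through the step almost unchanged, which is the mechanism that lets the network carry information across many words of a sentence.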


Google introduced Translatotron in 2019. Translatotron performs direct speech-to-speech translation with a sequence-to-sequence model. Unlike traditional systems, this model does not rely on an intermediate text representation. That offers advantages such as faster inference and the avoidance of compounding errors between recognition and translation. It also makes it straightforward to retain the original speaker’s voice and improves the handling of words that need not be translated.


That said, despite Translatotron’s ability to produce natural-sounding, high-fidelity speech translations, the model underperformed compared to strong baseline cascade speech-to-speech translation systems. To remedy this, Google released Translatotron 2 in July 2021. The new version, which applies a new method of transferring the source speaker’s voice to the translated speech, outperforms the original Translatotron by a large margin in translation quality and predicted speech naturalness. It has also improved the robustness of the output speech by cutting down on babbling and long pauses.


In March 2020, Google Translate’s new transcription feature, first demoed that January, was released for Android users as part of an update to the artificial intelligence-powered mobile app. The feature lets you record spoken words in one language and transform them into translated text on your phone, in real time and without any delay for processing. At launch, Google said the feature would reach all users within a week. The starting languages were English, French, German, Hindi, Portuguese, Russian, Spanish, and Thai, meaning you can listen to any one of those languages spoken aloud and translate it into any one of the others.


This will work live for speeches, lectures, and other spoken word events and from pre-recorded audio, too. That means you could theoretically hold your phone up to computer speakers and play a recording in one language and have it translated into text in another without you having to input the words manually.


Prior to this feature, you could have used Google Translate’s voice option for turning a spoken word, phrase, or sentence from one language into another, including in both text and verbal form. But a Google spokesperson says that part of the app “wasn’t well suited to listen to a longer translated discussion at a conference, a classroom lecture or a video of a lecture, a story from a grandparent, etc.”


To start, this feature will require an internet connection, as Google’s software has to communicate with its Tensor Processing Units (TPUs), a custom type of AI-focused processing chip for use in cloud servers, to perform the transcription live. In fact, a Google spokesperson says the feature works by combining the existing Live Transcribe feature built into the Recorder app on Pixel phones, which normally works offline, with the power of its TPUs in the cloud, thereby creating real-time translated transcription — so long as you have that internet connection to facilitate the link.

Military Applications

Being able to converse with people who don’t speak English is essential for the Army, since every day, Soldiers are partnering with militaries in dozens of countries around the world. A number of speech-translator devices are available commercially, and Soldiers have been using them. However, speech translators are seldom completely accurate and problems can arise in places where the population converses in a dialect, a form of a language that is specific to that region, according to Dr. Steve LaRocca, a team leader at the Multilingual Computing and Analysis Branch at the Army Research Laboratory.


The Field Assistance in Science and Technology (FAST) team recorded 1,664 lines of speech from 20 Nigerien soldiers during the Military Intelligence Basic Officer Course – Africa in Niamey, Niger, Oct. 26-30. The FAST team’s objective is to record speech samples from each of Africa’s five regions during the next year to capture the different dialects. The Army’s Rapid Equipping Force is purchasing 50 translators, which should arrive in January 2016, for use by U.S. Army Africa (USARAF).


The team used the SQ.410 Translation System, a handheld, rugged, two-way language translation system from a commercial vendor, VoxTec. The device is programmed with nine languages and does not require a cell network or Internet service to operate. When a Soldier speaks in English, the device repeats what it recognizes and displays it on the screen. The system then provides written and spoken translations in the other language. It can also record conversations. An important aspect of the research is to collect data for improving the system’s ability to recognize the many French accents and dialects in Africa, said Dr. Stephen LaRocca, a computer scientist and team chief of the Multilingual Computing Branch at the Research, Development and Engineering Command (RDECOM).


Improving the ability of American service members to communicate in foreign languages, particularly in French dialects, is becoming critical in Africa, said Maj. Eddie Strimel, the Field Assistance in Science and Technology (FAST) advisor assigned to U.S. Army Africa (USARAF). “I am impressed with the FAST team’s language-translation program. It is a force-multiplier in improving our efforts in military-to-military exercises,” said Brig. Gen. Kenneth H. Moore Jr., U.S. Army Africa deputy commanding general. “Language understanding can hinder or enhance operations when multi-national forces operate together, particularly on a continent where more than 2,100 languages are spoken.”


Identifying Potential Threats

The military has always been on the cutting edge of technology. Just as it uses advanced materials and equipment in its physical assets, its software is often beyond the civilian level. The government has used AI for national defense since long before anyone was talking about Russian bots.

Government chatbots have been searching for and identifying threats for more than ten years. AI bots have been engaging with suspected terrorists since as early as 2003, and similar technology is likely still in use. These systems can talk to multiple suspects at once, recognizing threats and gaining intel on how to stop them through online conversation. With today’s advanced AI, these chatbots are only becoming more effective. Working like undercover agents, they can infiltrate chatrooms and other online discussions to find and gather data on hostile forces. Just as businesses gather information online to understand their customers fully, the military can use chatbots to get a well-rounded picture of security threats.





About Rajesh Uppal
