The first of three blogs about speech technology, AI and voicebots. In conversation with Dr. Arjan van Hessen, who has worked in the field of Human Language Technology since '91 at Lernout & Hauspie, the Universities of Twente and Utrecht, and Telecats.
What is speech technology?
Speech technology is the collection of techniques that have “something” to do with processing the human voice. Of course, this immediately brings to mind flagship technologies such as automatic speech recognition and speech synthesis, but there is much more. Think of determining the emotion contained in an utterance, or of combining what was said with how it was said for the early detection of brain disorders. The processing of the speech signal itself also falls under the heading of speech technology. Examples are removing noise or hum, making speech sound clearer, and speeding up or slowing down speech without the change being noticeable.
Because Automatic Speech Recognition (ASR) in particular has become so much better through the use of Deep Neural Networks (DNNs), interest is shifting to the next step: no longer what is said, but what is meant. To do this, we need, among other things, knowledge of the world surrounding such an utterance (i.e. the context). However, we mainly work with the results of ASR and other textual sources, which makes this, strictly speaking, language rather than speech technology. Due to the increasing importance of additional language technology in processing speech data, we now speak of language and speech technology.
What is the level of maturity of speech technology in 2020?
Since 2010, the use of DNNs in the different parts of ASR has become common practice. The result was quite impressive: everywhere they were used, performance increased significantly. One of the consequences of these working ASR systems was the introduction of speech recognition on your own mobile phone, iPad, laptop and regular computer. Services such as Siri, Google Assistant and Alexa started with English and rapidly expanded to other (popular) languages. With the arrival of such advanced services, “resistance” from the general public seemed to disappear. Moreover, since the same technology was of course used in most call centres, the public was not “disappointed” with the overall results of speech recognition.
At this moment, we can say that we recognise over 90% of what is said correctly, provided that…
- The audio quality is optimal
- There is no background noise
- The speakers speak with little or no accent, in a normal way, and do not mix languages.
Where do things still go wrong?
Certainly with telephony services, we often see that the spoken response does not quite meet these three requirements. People make phone calls in the street (with a lot of background noise), speak without knowing exactly what they want to know or do, and do not always produce anything resembling accent-free speech.
On the other hand, many people ask the same kinds of questions, which allows us, with a reasonable data set, to correct the errors made. Looking at the results of most telephony services, we can cautiously state that we handle more than 90% of the recorded calls correctly.
How can and do we use speech technology in customer service?
An important field of current ASR applications is in conversations between customers and companies and/or organisations. The main part of these conversations takes place via the telephone, although there is a clear increase in the use of other devices such as tablets or laptops. Modern phones use all sorts of algorithms to make the quality of their calls as good as possible. Nevertheless, telephone speech is usually not as good as speech recorded at a desk, and neither, as a result, are the ASR results. But despite this slightly lower speech quality, ASR and Text To Speech (TTS) are being used more and more in customer calls.
A customer call usually involves three things: who calls, when, and for what. The first step (who is calling) can be solved reasonably easily by asking for an ID linked to the caller: think of a zipcode and house number, a customer number, or another ID that yields a unique result. The second item is not a problem at all, because every system logs and stores the date and time of the call. The third item is trickier. Of course, you can simply ask why someone is calling a certain telephone number, but interpreting the answer given is less easy. There are two different methods for this.
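The three items above can be sketched as a simple call record plus an ID lookup. This is a minimal illustration, not Telecats' actual system: the customer table, field names and IDs are all invented, and a real system would query the organisation's customer database.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical customer table; a real system would query a database.
CUSTOMERS = {("1234AB", "12"): "C-0071"}

@dataclass
class CallRecord:
    started_at: datetime               # "when": logged automatically by the platform
    customer_id: Optional[str] = None  # "who": resolved from an ID the caller gives
    reason: Optional[str] = None       # "what": the tricky part, filled in later

def identify(zipcode: str, house_number: str) -> Optional[str]:
    """Resolve the caller from a zipcode and house number."""
    return CUSTOMERS.get((zipcode.replace(" ", "").upper(), house_number))
```

A call then starts as `CallRecord(started_at=datetime.now(), customer_id=identify("1234 ab", "12"))`, with only the `reason` field left open for one of the two methods below.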
The first is to have human experts handle each call, store each answer given, and train a Machine Learning application with those answers. In other words, you have the ASR results, you have the human-given “label”, and you use that combination to train an ML algorithm to correctly label a new (unseen) conversation. This works well but is fairly time-consuming, as each new customer requires a large number of calls to be reviewed by human experts.
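As a rough sketch of this first method, the snippet below trains a tiny naive-Bayes-style classifier on (transcript, label) pairs and labels an unseen transcript. The training examples, the labels and the choice of naive Bayes are illustrative assumptions; a production system would use far more labelled calls and a proper ML library.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: (ASR transcript, human-given label) pairs."""
    word_counts = defaultdict(Counter)  # per-label word frequencies
    label_counts = Counter()            # label priors
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(model, transcript):
    """Label a new (unseen) transcript with the most probable label."""
    word_counts, label_counts = model
    total_calls = sum(label_counts.values())
    vocab = {w for counts in word_counts.values() for w in counts}
    best, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = math.log(count / total_calls)  # log prior
        n_words = sum(word_counts[label].values())
        for w in transcript.lower().split():
            # add-one smoothing so unseen words do not zero out the score
            score += math.log((word_counts[label][w] + 1) / (n_words + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

# Invented example data standing in for expert-labelled calls.
calls = [
    ("i want to buy two chairs", "sales"),
    ("i would like to order a red table", "sales"),
    ("when will my order arrive", "delivery"),
    ("my package has not been delivered yet", "delivery"),
]
model = train(calls)
```

With this toy model, `classify(model, "i want to order a chair")` comes out as `"sales"`; the point is only that the expensive part is producing the labelled `calls` list, exactly as described above.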
A second solution is to “understand” the recognition results and then do something meaningful with them. For example, consider a sentence such as “I would like to, eh, get two red chairs from you, eh, buy them”. When understanding the sentence, something like “action: buy; purpose: chairs; details: colour=red; number: 2” might come out of it. If this could be done, new applications could be created much faster, e.g. for a bookseller, a furniture shop or a car dealer: instead of having to label large amounts of speech over and over again, it would be enough to adjust the general application. But… this is being worked on intensively within (academic) research institutes and, for the time being, it does not work as well as the first method.
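A toy illustration of this second approach, using hand-written keyword rules: the vocabularies and frame keys below are invented for the example, and real systems use trained natural-language-understanding models rather than lookup tables.

```python
import re

# Invented vocabularies for the illustration.
ACTIONS = {"buy": "buy", "order": "buy", "return": "return"}
COLOURS = {"red", "blue", "green", "black"}
NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4}

def understand(utterance):
    """Map a recognised sentence to a structured frame, ignoring fillers like 'eh'."""
    words = re.findall(r"[a-z]+", utterance.lower())
    frame = {}
    for i, w in enumerate(words):
        if w in ACTIONS:
            frame["action"] = ACTIONS[w]
        elif w in NUMBERS:
            frame["number"] = NUMBERS[w]
        elif w in COLOURS:
            frame["colour"] = w
            if i + 1 < len(words):  # crude heuristic: the noun follows its colour
                frame["purpose"] = words[i + 1]
    return frame
```

Run on the example sentence above, this yields action: buy; purpose: chairs; colour: red; number: 2. Adapting such an application to a bookseller or car dealer would then mean swapping the vocabularies rather than relabelling thousands of calls.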
What are the benefits for customers?
It is tempting to assume that people always prefer human contact in their telephone conversations with companies and organisations. But that is not always the case. If we look at the history of automating customer calls, at first there was nothing. You called, explained who you were and why you were calling, and then you got an answer or were connected to another number, where everything started all over again. On the one hand, this was annoying, because you often had to tell the same story again. On the other hand, it could go nice and fast when the operator immediately understood what you wanted to know and could give the right answer.
With the advent of telephones with push buttons came the possibility of making a selection before talking to a human being. Think of dialogues like “press 1 if you want to know about sales, press 2 if you want to know about the delivery time, press 3 if…”. This allowed companies to conduct their conversations more efficiently, but it was not always a pleasure for customers, especially if you did not understand what exactly you were being asked.
With the advent of simple forms of speech recognition in the mid-1990s, more became possible. Through cleverly constructed software, people could give the desired answers: slot by slot, or everything in one go. A nice example was reserving a room in a hotel. You had to answer when, for how long, for how many people, plus special requests such as whether or not you were bringing a dog, wanted a ground-floor room, and more. When you said “next Wednesday for three nights”, the first two slots were filled immediately, and you could continue with the remaining slots.
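That slot-by-slot-or-all-at-once behaviour can be sketched as follows. The slot names, prompts and regular expressions are invented for the example; a real dialogue system would sit behind an ASR engine and use much richer grammars.

```python
import re

SLOTS = ["arrival", "nights", "guests"]
PROMPTS = {
    "arrival": "When would you like to arrive?",
    "nights": "For how many nights?",
    "guests": "For how many people?",
}

def extract(utterance):
    """Fill whatever slots appear in one utterance; one answer may fill several."""
    text = utterance.lower()
    found = {}
    m = re.search(r"next \w+day", text)
    if m:
        found["arrival"] = m.group(0)
    m = re.search(r"(\w+) nights?", text)
    if m:
        found["nights"] = m.group(1)
    m = re.search(r"(\w+) (?:people|persons|guests)", text)
    if m:
        found["guests"] = m.group(1)
    return found

def next_prompt(filled):
    """Ask only for slots the caller has not yet provided."""
    for slot in SLOTS:
        if slot not in filled:
            return PROMPTS[slot]
    return "Thank you, your reservation is complete."
```

After “next Wednesday for three nights”, the arrival and nights slots are both filled in one go, so the dialogue continues directly with “For how many people?”.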
But the developments within ASR continued and, especially after the arrival of DNNs around 2010, recognition quickly became much better. Nowadays, speech is “just” recognised, and the slots are extracted from the recognised response. So here, too, we see the move towards using both speech and language technology (Human Language Technology).
The great advantage of this increasing automation of customer conversations is that, eventually, people can always have a conversation with the company or organisation at times that suit them. Moreover, they are helped faster because there are no, or hardly any, queues. But for companies, too, there is a big advantage in using automated dialogues: they can more easily handle a sudden increase in calls, they get much more detailed management information, and it is a lot cheaper in the long run. Of course, this does not mean that humans can be taken out of the loop completely. ASR makes mistakes, people do not always know exactly the right question to ask, and there will always be cases that were not foreseen when the automated system was created. But that is not a bad thing, because people are much better suited to handling these cases than machines.
What will be the future of speech technology?
Developments in speech technology focus on three main areas: recognition and, increasingly, understanding of what someone says or means; speech synthesis, in which the naturalness of computer-generated speech can hardly be distinguished from that of a human; and the use of emotion (both in recognition and in computer-generated speech). Although ASR results are getting better and better, it is certainly not the case that all recognition problems have been solved. Speakers with a heavy accent, people who did not learn the language as their mother tongue, recordings made in noisy environments, and unclear phrasing are all factors that make the recognition of what was said (and meant) less than optimal. To some extent, we know what we can do about it: better acoustic models, broader language models and better noise reduction all help to improve recognition. But those unclearly formulated questions are something else: they require us to work together, especially with language technologists, to try to find the question behind the question. We will therefore see increased cooperation between these two originally separate areas of research, and a clear focus on what someone means to say.
Dr. Arjan van Hessen holds a master's degree in Geophysics and a PhD in Phonetics, and has been working in the field of Human Language Technology since '91. His main interest is in applying the various HLT techniques in both the academic/research world and the real world. Through his work as director of user involvement at CLARIAH and at Telecats, he is in the middle of the world of public-private collaboration.