The correct processing of both written and spoken language is becoming more and more important. Information is increasingly transferred through “natural” language instead of predefined choices such as the push of a phone button or the recognition of a single spoken word. This is possible because available technology is capable of converting the sometimes vague and inconsistent messages that come with written or spoken input into a clear “command”. But how does this conversion work? And what problems come with using “natural” input?
The first step, of course, is processing the input. With a text message we can assume that the message is “as intended”, but with a spoken message we must also consider errors made by the speech-to-text engine. Either way, we end up with a written representation of the customer’s question, and we must work with it.
Textual messages can be processed with several families of algorithms designed to handle free, unstructured language (and speech). The oldest group is Natural Language Processing (NLP); two newer groups that grew out of it are Natural Language Understanding (NLU) and Natural Language Generation (NLG). NLU, and partly NLG, can be seen as grown-up “daughters” of NLP, which itself has become more of a catch-all term.
Anyway, NLP is concerned with “processing” free text into a standardised structure. NLU is the next step and deals with “interpretation” of that text to give it meaning. And finally, NLG is the process of generating natural-sounding language and then, if needed, the corresponding speech.
The history of machine translation dates to the 17th century, when philosophers such as Leibniz and Descartes proposed codes that would link words between languages. All these proposals remained theoretical, however, and did not lead to the development of a real machine. It was not until the mid-1930s that the first patents for machine translation were filed.
In 1950, Alan Turing published his famous article “Computing Machinery and Intelligence”, in which he proposed what is now called the Turing test as a criterion for intelligence. This criterion depends on the ability of a computer program to impersonate a human being in a real-time written conversation with a human judge. The judge must decide whether “the other side” is a real human being or a computer algorithm.
NLU & NLG
With the advent of fast computers, the internet, and large amounts of data on the one hand, and the rapidly increasing market demand for fast and reliable language-processing algorithms on the other, NLP began a steady and increasingly rapid march. Nowadays we could probably no longer do without NLP, NLU, and NLG. Some examples? Well, think of automatic translation on your smartphone, determining the intent of a conversation, summarising a text, or providing personal information in a telephone conversation.
In all cases, we see a similar pattern. It started somewhere in the 1990s with mainly rule-based systems, but with the advent of good and fast software, it shifted towards the massive use of intelligent algorithms, often based on Artificial Intelligence. It might be a good idea to give some examples of the three Natural Language processes.
Suppose you have a telephone conversation in which a caller wants to know something about applying for a permit for a shed. It may go like this:
“Good morning, this is Pauline. I applied to build a shed three weeks ago and now I want to know how things are going.”
The first steps are typical NLP steps, in which the unstructured text is converted into a form that a computer can “understand”. The NLP steps that can be taken are:
- Tokenisation: the detection of each individual word. In our example, this means 24 tokens.
- Stemming: reducing a verb or noun form to its base form (e.g. “applied” -> “apply” or “things” -> “thing”).
- Lemmatization: determining, via a dictionary (or database), the base form of a word. Take the word “better”: the dictionary tells you that its lemma is “good”.
- Part-of-Speech tagging: this determines the class of each word. For example, is it a verb, a noun, or a preposition?
- Named Entity Recognition: this determines whether an entity is associated with a word. In our example sentence, this is “Pauline”; the entity here is a personal name.
Of course, many more tools are available for additional purposes, but these can be considered the main NLP components.
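The steps above can be sketched in a few lines of Python. This is a toy illustration, not a production pipeline (real systems use libraries such as spaCy or NLTK); the suffix rules, the mini-dictionaries, and the capital-letter NER heuristic are all assumptions made purely for the example.

```python
import re

SENTENCE = ("Good morning, this is Pauline. I applied to build a shed "
            "three weeks ago and now I want to know how things are going.")

def tokenise(text):
    # Tokenisation: detect each individual word (punctuation is dropped).
    return re.findall(r"[A-Za-z']+", text)

def stem(token):
    # Crude stemming: strip a few common suffixes to reach a base form.
    if token.endswith("ied"):
        return token[:-3] + "y"            # "applied" -> "apply"
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 3:
            return token[: -len(suffix)]   # "things" -> "thing"
    return token

# A tiny lemma dictionary; real lemmatisers use full lexical databases.
LEMMAS = {"better": "good", "went": "go", "applied": "apply"}

def lemmatise(token):
    return LEMMAS.get(token.lower(), token.lower())

# A tiny POS lookup; real taggers use sentence context, not a table.
POS_TAGS = {"applied": "VERB", "shed": "NOUN", "three": "NUM", "pauline": "NOUN"}

def pos_tag(token):
    return POS_TAGS.get(token.lower(), "UNKNOWN")

def named_entities(tokens):
    # Toy NER: any capitalised, non-initial token except "I" is a
    # candidate personal name.
    return [t for i, t in enumerate(tokens)
            if i > 0 and t[0].isupper() and t != "I"]

tokens = tokenise(SENTENCE)
print(len(tokens))             # 24 tokens, as in the text
print(stem("applied"))         # apply
print(lemmatise("better"))     # good
print(named_entities(tokens))  # ['Pauline']
```

Each function deliberately does only one of the four steps, which mirrors how real pipelines chain independent components.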
Understanding the message is sometimes a very difficult (or sometimes even impossible) task. In addition to just the plain written (or spoken) “whish”, we add a lot of additional information in our messages that refers to general knowledge, things that have just happened or wishes that are clear to the speaker but not to the recipient. We use the tone in our voice to emphasis or de-emphasis certain aspects and use humour to give our message a certain twist. In a human-to-human conversation, we use various non-verbal components to determine whether the other understands exactly what we mean. And in case of a non-understanding, we may change our language to make us understandable.
At present, it is still quite difficult to use suitable algorithms to correctly identify the non-verbal parts of a conversation. However, this will have to be done if we want to be able to truly approximate human-to-human conversations. Some examples of difficulties in NLU are:
- Lexical ambiguity – how to treat the word “current”?
- Alice is swimming against the current.
- The current version of the report is in the folder.
In sentence 1, “current” is a noun (the thing Alice swims against), while in sentence 2 it is an adjective that says something about the noun “version”.
- “He lifted the beetle with the green cap” − Did he use the green cap to lift the beetle, or did he lift a beetle that had a green cap?
- Mary went to Jeanne. She said, “I am tired” − Exactly who is tired?
- Mary went to Jeanne and said: “I’m tired.”
- Mary went to Jeanne, who said: “I’m tired.”
- How do you pronounce this phrase: “Yes, and that is what you are going to do.”
- Skeptical pronunciation: I do not believe that you will/can do it.
- Confirming pronunciation: I believe that you will do it.
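An early rule-based system might tackle the “current” example with a hand-written heuristic like the sketch below. The noun list and the rule itself are assumptions invented for this illustration; modern NLU models learn such distinctions from context instead of from hand-written rules.

```python
# Toy disambiguation: treat "current" as an adjective when it directly
# precedes a known noun, and as a noun otherwise.
KNOWN_NOUNS = {"version", "report", "folder", "edition", "situation"}

def tag_current(sentence):
    tokens = [t.strip(".,").lower() for t in sentence.split()]
    tags = []
    for i, tok in enumerate(tokens):
        if tok == "current":
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            tags.append("ADJ" if nxt in KNOWN_NOUNS else "NOUN")
    return tags

print(tag_current("Alice is swimming against the current."))               # ['NOUN']
print(tag_current("The current version of the report is in the folder."))  # ['ADJ']
```

The obvious weakness is that the rule fails as soon as an unlisted noun follows “current”, which is exactly why statistical and neural taggers replaced hand-written rules.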
With NLG, you create a good-sounding and understandable sentence in a language of choice, based on a set of input parameters. For NLG too, a whole series of different algorithms is available. For example, you can create an elaborate and polite phrase or, on the contrary, a short and to-the-point one. Moreover, you can tailor the sentence to people who are familiar (experts) or unfamiliar with the process.
Although NLG began in the pre-AI era, today’s tools are mostly AI-based. These algorithms work fast, use relevant data and are relatively easy and quick to adapt to a changing situation.
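In the pre-AI, rule-based spirit, NLG can be as simple as filling templates chosen by a style parameter. The function name, the templates, and the “polite”/“short” styles below are invented for the example; real NLG systems, rule-based or neural, are far richer.

```python
def status_reply(name, item, status, style="polite"):
    # Template-based NLG: the same facts rendered in different registers.
    if style == "polite":
        return (f"Good morning {name}, thank you for calling. "
                f"Your {item} application is currently {status}.")
    if style == "short":
        return f"{name}: {item} application {status}."
    raise ValueError(f"unknown style: {style}")

print(status_reply("Pauline", "shed permit", "under review"))
print(status_reply("Pauline", "shed permit", "under review", style="short"))
```

Swapping the template set is also how such a system would address experts versus newcomers: the input parameters stay the same, only the surface realisation changes.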
Where are the different NLP, NLU, and NLG algorithms used today? Well, in a great many modern speech and text applications: call routing, Q&A applications, dictation apps, topic-detection apps, and speech or text summarisation tools.
The first question a telephony system asks is often meant to find out who is calling. The next question can be “what are you actually calling about?”. And this is where it gets tricky, because people do not answer with a grammatically correct sentence that can immediately be interpreted correctly. Suppose the answer is “yes, I, um, I’m actually calling to, um, to find out something about the, er, the situation for me now. I mean do I get that red chair or not?”. The intent is probably “status update on the caller’s order”, but how do you extract that intent from the sentence as spoken?
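One common, simple approach is to strip the filler words and score the remaining tokens against a keyword set per intent. The filler list, intent names, and keywords below are assumptions chosen for this illustration; production systems typically use trained intent classifiers instead of keyword overlap.

```python
import re

FILLERS = {"um", "er", "uh", "yes", "i", "mean", "actually"}
INTENTS = {
    "order_status": {"find", "out", "situation", "status", "get", "order"},
    "new_order": {"buy", "purchase", "want", "order"},
}

def detect_intent(utterance):
    # 1. Lowercase, tokenise, and drop filler tokens.
    tokens = [t for t in re.findall(r"[a-z']+", utterance.lower())
              if t not in FILLERS]
    # 2. Score each intent by keyword overlap; the highest score wins.
    scores = {name: len(set(tokens) & kws) for name, kws in INTENTS.items()}
    return max(scores, key=scores.get)

caller = ("yes, I, um, I'm actually calling to, um, to find out something "
          "about the, er, the situation for me now. I mean do I get that "
          "red chair or not?")
print(detect_intent(caller))  # order_status
```

Note that the disfluencies (“um”, “er”, repetitions) simply disappear in step 1, which is why even this crude approach often lands on the right intent.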
Although NLU and NLG are both undergoing rapid development and there are more and more tools for understanding what someone means, we are not there yet. An important and partly unexplored part of understanding lies in how something is said. For example, you can say “yes” but make it clear, by the way you say it, that you mean “no”. Currently we focus on what is said and therefore sometimes get the wrong answer. Emotion detection is something many companies are working on, but it is quite difficult because the characteristics of the speaker play an important role: a calm and civilised older lady will speak differently from an excited young person. But how do you determine what kind of person is calling?
We will probably write more about this soon, but for now, enjoy reading!
To go further: