When humans talk to computers

To say that computers are overtaking humans in many ways is nowadays an understatement. Computers have far greater computing and storage capacities than humans and solve problems that were previously unsolvable. However, the reverse is also true: some tasks remain a mystery for computers. One task in particular characterizes this phenomenon: language comprehension. Although humans acquire this skill early in life, it is very difficult to make a computer understand the meaning of words. Worse, it is even more difficult to teach it sentence construction, syntax, and especially semantics. Solving this problem, which long remained out of reach, would have applications in many fields. Probably the most obvious are translation, automatic text correction, and word auto-completion or word suggestion, which are now widely used on mobile phones. Some might also dream of computers as writers, novelists, or journalists. So how do you teach a computer to understand languages?

The key to being understood: math

To make human language understandable to a computer, it must be translated into a form the computer can process. One of the main paradigms is to create a mathematical representation of the text. The most classic is the so-called vector representation: each sentence, document, or word is represented by an array of numbers, i.e. a vector, which summarises its content and places it in space. The granularity of this representation, i.e. the amount of information it contains, varies according to the complexity of the chosen representation.

Statistics join the dance

Historically, one of the first statistical representations is the bag of words. The principle is as follows: a sentence is represented by a vector made up of 0s and 1s. This vector has the same size as the vocabulary, that is, the set of words that the sentence could potentially contain. Each word of the vocabulary is associated with one position in the vector. If the word is present in the sentence, the number at this position is 1; if it is absent, the number is 0. Thus, the bag-of-words mechanism represents a sentence by the presence or absence of each word in the vocabulary.

Figure 1: Example of a bag of words.
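The mechanism can be sketched in a few lines of Python. This is only a minimal illustration: the function name `bag_of_words`, the six-word vocabulary, and the two sentences are assumptions made up for the example, and splitting on whitespace is a deliberately naive way of cutting a sentence into words.

```python
def bag_of_words(sentence, vocabulary):
    """Return a binary vector: 1 if the vocabulary word occurs in the sentence, else 0."""
    # Naive tokenization (assumption): lowercase, then split on whitespace.
    tokens = set(sentence.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]


# Hypothetical vocabulary and sentences, chosen only for illustration.
vocabulary = ["the", "cat", "dog", "eats", "sleeps", "fish"]

vector_a = bag_of_words("The cat eats the fish", vocabulary)
vector_b = bag_of_words("The fish eats the cat", vocabulary)

print(vector_a)  # [1, 1, 0, 1, 0, 1]
print(vector_b)  # [1, 1, 0, 1, 0, 1]
```

The two sentences mean different things, yet they receive exactly the same vector: only the presence or absence of each vocabulary word is recorded.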

This representation is simplistic and has several limitations. Word order is not taken into account at all. Nor is it possible to tell whether a word appears more than once in the same sentence. Furthermore, a word that appears in the sentence but not in the vocabulary cannot be represented. Finally, each word in the sentence has the same importance in the representation. This is certainly the greatest of these limitations: in a sentence, some words, such as nouns and verbs, clearly convey more information than others, such as determiners.

To overcome this problem, more complex statistical representations have been proposed. The most common is called TF-IDF (Term Frequency-Inverse Document Frequency). It is a weighting method used to assess the importance of a term in a document. The weight is calculated from the frequency of the term in the document, compared with its frequency across the entire corpus, that is to say the set of documents used as a representative sample of the language to be represented. To make the notions of document and corpus concrete, here is an example: a document is La Fontaine's poem "Le Corbeau et le Renard" (The Crow and the Fox), while the corpus is the whole of his work. Thus, a word that is very frequent in a document, but also very frequent in the corpus, carries little information: it is probably a function word such as "the" or "it". On the other hand, a word that is very frequent in the document but rare in the corpus will likely carry a lot of information and help identify the subject of the document. The TF-IDF method thus compares the frequency of a word in the document with its frequency in the corpus, and gives it more or less weight accordingly.
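To make the weighting concrete, here is a small Python sketch. TF-IDF exists in several variants; the one below is an assumption chosen for simplicity: term frequency is the share of the document's words that are the term, and inverse document frequency is the logarithm of the number of documents divided by the number of documents containing the term. The function name `tf_idf` and the three-sentence corpus are invented for the example.

```python
import math


def tf_idf(term, document, corpus):
    """Weight of `term` in `document`, relative to the documents of `corpus`."""
    tokens = document.lower().split()
    # Term frequency: share of the document's words that are `term`.
    tf = tokens.count(term) / len(tokens)
    # Document frequency: number of corpus documents that contain `term`.
    containing = sum(1 for doc in corpus if term in doc.lower().split())
    # Inverse document frequency (unsmoothed variant, an assumption of this sketch).
    idf = math.log(len(corpus) / containing) if containing else 0.0
    return tf * idf


# Hypothetical miniature corpus: each string stands for one document.
corpus = [
    "the crow holds the cheese",
    "the fox flatters the crow",
    "the ant works all summer",
]
document = corpus[0]

print(tf_idf("the", document, corpus))     # 0.0, since "the" is in every document
print(tf_idf("crow", document, corpus))    # small weight: appears in 2 of 3 documents
print(tf_idf("cheese", document, corpus))  # largest weight: appears in this document only
```

Although "the" is the most frequent word in the document, its weight is zero because it appears everywhere in the corpus, while "cheese", which is specific to this document, receives the highest weight.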

This method is very effective and is still used today in text mining. However, it also has limitations, notably that it takes no account of the meaning of a term. The representation is purely statistical, based solely on how often the term occurs, so it carries no information about what the term means.

Not yet creative but constantly evolving

As in many areas of computer science, the representation of language is being transformed by artificial intelligence. The first model to revolutionize this field is known as Word2vec, short for "word to vector". Its principle is as follows: an artificial intelligence model, more precisely a neural network, is trained to represent a word according to the way it is used in a training corpus.

In practice, suppose that we want to train a Word2vec model to represent the words of the French language. The process is as follows: we provide it with a large number of texts in French, such as all the articles in the French Wikipedia. The assumption the model relies on is that two words used in similar contexts have similar meanings. The ultimate goal of the model is to produce, for each word, a vector that captures the meaning of that word. Thus, two words with close meanings will be associated with vectors whose values are close.
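The underlying assumption, that words used in similar contexts have similar meanings, can be illustrated without any neural network. The sketch below is not Word2vec: it is a much simpler count-based stand-in that describes each word by the words found near it, then compares those descriptions with cosine similarity. The function names, the window size of 2, and the five toy sentences are all assumptions made up for the illustration.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """For each word, count the words appearing within `window` positions of it."""
    contexts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    contexts[word][tokens[j]] += 1
    return contexts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical training sentences, standing in for a large corpus.
sentences = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "the car needs fuel",
]
vectors = context_vectors(sentences)

print(cosine(vectors["cat"], vectors["dog"]))   # high: shared contexts
print(cosine(vectors["cat"], vectors["fuel"]))  # 0.0: no shared context
```

Because "cat" and "dog" are surrounded by the same words ("the", "drinks", "chases"), their vectors end up close, whereas "cat" and "fuel" share no context at all. Word2vec learns dense vectors with a neural network rather than counting, but it exploits the same signal.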

To illustrate this concept, a well-trained model should produce word representations such that king - male + female ≈ queen.

The above operation is performed on the vectors representing each of these words.
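The arithmetic itself is easy to reproduce on toy data. In the sketch below, the two-dimensional vectors are written by hand for the sake of the example and are not the output of a trained Word2vec model; real models learn vectors with hundreds of dimensions whose axes have no such tidy interpretation. The helper names `cosine` and `analogy` are likewise assumptions.

```python
import math

# Hand-made toy vectors (NOT learned): axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":   [1.0, 1.0],
    "queen":  [1.0, -1.0],
    "male":   [0.0, 1.0],
    "female": [0.0, -1.0],
    "apple":  [-1.0, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # The three input words are excluded from the candidates.
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(target, vectors[word]))


print(analogy("king", "male", "female", vectors))  # queen
```

With learned vectors the result of the arithmetic rarely lands exactly on a word, which is why the closest remaining word is returned rather than an exact match.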

Figure 2: Illustration of the working principle of the Word2vec model. (Source: kawine.github.io/)

Thus, the Word2vec model produces, for each word, a representation learned by studying the contexts in which the word is used. Each word is therefore associated with a single vector, which is supposed to capture all the meanings the word can take. In practice, this is not exactly the case. This static representation is necessarily biased by how frequently each meaning of the word occurs. For example, “wind” is more often used as a noun for the physical phenomenon than as a verb, as in “wind the clock”. The model, having encountered one meaning more often than the other, will tend to produce a vector that mostly reflects the more frequent meaning. This limitation can be overcome by dynamic representations of words or documents.

Dynamic word representations, which represent the current state of the art, are revolutionizing many aspects of language processing, such as the understanding of queries by search engines, online translation, sentence scaling, or spelling correction. These advances allow computers to grasp some of the richness of human language, sometimes even exceeding human performance on certain tasks. While creativity is not yet one of their capabilities, the revolution is clearly underway in many aspects of communication.

Sources:
1. JONES, Karen Sparck. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation, 1972.
2. DEVLIN, Jacob, CHANG, Ming-Wei, LEE, Kenton, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
3. MIKOLOV, Tomas, SUTSKEVER, Ilya, CHEN, Kai, et al. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 2013.
4. https://kawine.github.io/blog/nlp/2019/06/21/word-analogies.html