In our ongoing blog series “Unravelling the AI mystery,” Digitate continues to explore advances in AI and our experiences in turning AI and GenAI theory into practice. The blogs are intended to enlighten you as well as provide perspective into how Digitate solutions are built.
Please enjoy the blogs below, written by different members of our top-notch team of data scientists and Digitate solution providers:
2. Prompt Engineering – Enabling Large Language Models to Communicate With Humans
3. What are Large Language Models? Use Cases & Applications
4. Harnessing the power of word embeddings
Word2vec: What Are Word Embeddings? A Complete Guide
Humans have a very intuitive way of working with languages. Tasks such as understanding similar texts, translating a text, completing a text, and summarizing a text come naturally to humans, who have an inherent understanding of language semantics. But when it comes to computers, passing on this intuition is an uphill task! Sure, computers can assess how structurally similar two strings are: when you type “Backstret Boys,” a computer might correct you to “Backstreet Boys.” But how do you make them understand the semantics of words?
- How do you make a computer infer that king relates to queen in the same way that man relates to woman?
- How do you make a computer infer that in a conversation about technology companies, the term Apple refers to the company and not the fruit?
- How do you make a computer infer that if someone is searching for football legends and has searched Ronaldo, they might (should!) also be interested in Messi?
- How do you make a computer recommend “GoodFellas” or “The Irishman” when someone has browsed for “The Godfather”?
How do you accomplish the mammoth task of bridging this gap between humans and computers and give machines the capacity to interpret language? The answer to these questions lies in the concept of “word embeddings,” which this tutorial explains. Read on!
What are word embeddings?
Word embeddings are a way of representing words in a numerical format. In simple terms, they are a mathematical representation of the meaning of words. Word embeddings represent words as vectors of real-valued numbers, where each dimension in the vector corresponds to a particular feature or aspect of the word’s meaning. For example, one dimension might represent the word’s gender, while another might represent its tense. The values in the vector indicate the strength of the association between the word and the feature. In a way, word embeddings act like a dictionary for a computer: just as we use a dictionary to look up the meaning of a word, a computer can use word embeddings to look up the numerical vector representation of a word. Let’s look at how these word embeddings are calculated!
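To make this concrete, here is a minimal Python sketch. The vectors below are hand-picked, purely hypothetical values; real embeddings are learned from data and typically have hundreds of dimensions. Comparing the vectors with cosine similarity is what lets a program judge that two words are related.

```python
import numpy as np

# Hand-picked toy vectors, purely for illustration; real embeddings are learned
# from data and typically have hundreds of dimensions.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.7, 0.9, 0.0]),
    "apple": np.array([0.0, 0.1, 0.2, 0.9]),
}

def cosine_similarity(u, v):
    """Cosine similarity: close to 1.0 for related words, close to 0.0 for unrelated ones."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # high
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # low
```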
Why are word embeddings needed?
Word embeddings are crucial for training deep learning and machine learning models designed for NLP tasks that involve human language, such as sentiment classification, word analogy, and speech recognition. Traditional language models could only identify individual words and viewed them as isolated entities, so they could not capture syntactic and semantic relationships between different words. For example, words like “dog” and “cat” were assigned unique, unrelated identifiers because they were seen as two separate entities rather than as members of the same category of beings with shared attributes: animals. Word embeddings address this limitation, and the neural models built on them (RNNs, LSTMs, ELMo) surpass traditional models as a result. They place words that appear in similar contexts close to one another in a multi-dimensional space, thus creating vector representations of words.
Pre-trained word embedding models like FastText facilitate vector search. They are the building blocks of various natural language processing models that perform tasks such as sentiment analysis, text classification, and machine translation. By capturing semantic relationships between words, word embeddings make NLP models more accurate and efficient at their intended tasks. Unlike traditional approaches that require manual feature engineering, word embedding methods offer a more effective pathway for AI models to recognize language patterns and deliver better outcomes for users.
Understanding word2vec
word2vec is short for “word to vector.” It is a widely used vector-space approach that iterates over a text corpus to learn word embeddings. It is based on the distributional hypothesis and was developed by Tomas Mikolov, a Czech computer scientist, and his team at Google in 2013. While this approach has many implementations, we will present a simplified explanation of the core concept.
The intuition behind the approach
Let’s think of how humans approach understanding a new word. Our natural approach is to make sense of a given word based on its context. For example, say you come across the word mojo. You don’t know what it means or how to use it, and you don’t have access to any dictionary! But you see everybody around you using this word in various conversations, such as:
- The team has lost its mojo!
- We need to get our mojos working again!
- Game of Thrones lost its mojo in the final season!
- It took me a long time to get my mojo back!
You also see other similar words used in a similar context, such as:
- The team has lost its power!
- We need to get our magic working again!
- Game of Thrones lost its charm in the final season!
- It took me a long time to get my energy back!
And then you connect the dots to make a mental map of mojo with charm, energy, or magic!
In the above example, we looked for words with similar meanings. You can imagine a similar exercise for understanding relationships (man is to woman as king is to queen), tenses (run => ran), and other aspects of a language.
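Trained embeddings tend to support this kind of analogical reasoning through simple vector arithmetic. Below is a toy Python sketch with hand-crafted 2-D vectors, chosen purely so the arithmetic works out exactly; with real embeddings the result is approximate rather than exact.

```python
import numpy as np

# Hand-crafted 2-D vectors for illustration only; the dimensions roughly mean
# "royalty" and "femaleness".
vectors = {
    "man":   np.array([0.0, 0.0]),
    "woman": np.array([0.0, 1.0]),
    "king":  np.array([1.0, 0.0]),
    "queen": np.array([1.0, 1.0]),
}

# The classic analogy: king - man + woman should land near queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]

# Find the closest word by Euclidean distance.
closest = min(vectors, key=lambda w: np.linalg.norm(vectors[w] - target))
print(closest)  # queen
```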
The word2vec approach
The word2vec model works in a similar manner. It creates a mapping of frequent words and the contexts in which these words are used, and then uses a shallow neural network to capture the word similarities in the form of a vector of numbers.
Let’s understand word2vec embeddings with an example.
Step 1: The first thing we need is text. Consider the following sentences stating facts about the fictional Dothraki royal family.
Step 2: For each word in these sentences, we identify some words before and after it and capture them as the context words. The context window size is one of the hyperparameters that can be configured while creating the embeddings. For example, if we consider the context window as two words, then for each word, we look for two words before and after it to form its context words. At this step, some preprocessing is also done to exclude stop words such as the, is, are, of, etc. Applying this to our example:
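As a sketch of this step, the snippet below builds (focus, context) pairs with a window of two after removing a small stop-word list. The sentences are stand-ins for the Dothraki example (the exact sentences here are illustrative, not taken from the original figure).

```python
import re

# Stand-in sentences about the Dothraki royals, used only for illustration.
corpus = [
    "Drogo is the king of the Dothraki",
    "Daenerys is the queen of the Dothraki",
    "Drogo is married to Daenerys",
]

STOP_WORDS = {"the", "is", "are", "of", "to", "a"}
WINDOW = 2  # context window: up to two words on each side of the focus word

def focus_context_pairs(sentence):
    tokens = [t for t in re.findall(r"[a-z]+", sentence.lower()) if t not in STOP_WORDS]
    pairs = []
    for i, focus in enumerate(tokens):
        for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1)):
            if j != i:
                pairs.append((focus, tokens[j]))
    return pairs

for sentence in corpus:
    print(focus_context_pairs(sentence))
# e.g. [('drogo', 'king'), ('drogo', 'dothraki'), ('king', 'drogo'), ...]
```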
Step 3: At this stage, we have a vector of focus words and another vector of context words. Consider that we have n focus words. Thus, both focus and context word vectors are of length n.
Step 4: Next, we want to use a neural network to map the focus words to the context words. However, neural networks do not understand strings. Hence, we need to convert these strings into vectors of numbers. Let’s take a short detour to understand how to convert strings to numbers using one hot encoding.
One hot encoding converts a word into a binary vector containing a single one and zeros everywhere else. Let’s try this out with a simple example. Consider a language dictionary with only three words: Red, Amber, and Green. Each word is represented as a vector of three numbers.
The basic idea is to represent each word as a vector of zeros and ones, where the length of the vector is equal to the size of the vocabulary, i.e., the number of unique words in the text data.
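A minimal sketch of one hot encoding for this three-word vocabulary:

```python
# One hot encoding for a three-word vocabulary: Red, Amber, and Green.
vocab = ["red", "amber", "green"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)          # a zero for every word in the vocabulary
    vec[word_to_index[word]] = 1    # a single one at the word's own position
    return vec

print(one_hot("red"))    # [1, 0, 0]
print(one_hot("amber"))  # [0, 1, 0]
print(one_hot("green"))  # [0, 0, 1]
```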
Step 5: Now, we train a neural network. This neural network has one input layer, one output layer, and one hidden layer. The input layer consists of the focus words, and the output layer consists of the context words. The intermediate hidden layer is the one responsible for creating embeddings! The number of nodes in the hidden layer is configurable. For ease of visualization, I set it to 2. If I have two nodes in the hidden layer, then each input node will have two weights connecting to the hidden layer. The weights on these edges are the word embeddings.
Without going into the details of the neural network, consider that its task is to take the n 18-dimensional one-hot focus vectors as input and the n 18-dimensional context vectors as output, and to learn an intermediate representation that best maps input to output! Since we have set the dimension of the hidden layer to two, the weights between the input layer and the hidden layer form an 18*2 matrix. In other words, for each word, we obtain an embedding vector of size two.
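Here is a minimal numpy sketch of such a network, under the assumptions of the example above: a vocabulary of 18 words, a hidden layer of size two, and (focus, context) pairs produced in Step 2 (the pair indices below are placeholders). Indexing a row of the input weight matrix is equivalent to multiplying by a one-hot vector.

```python
import numpy as np

rng = np.random.default_rng(0)

V, H = 18, 2   # vocabulary size from the example, and a hidden (embedding) size of 2
W1 = rng.normal(scale=0.1, size=(V, H))   # input -> hidden weights: one 2-D embedding per word
W2 = rng.normal(scale=0.1, size=(H, V))   # hidden -> output weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def train_step(focus_idx, context_idx, lr=0.05):
    """One gradient step on a single (focus, context) pair."""
    h = W1[focus_idx]            # hidden activation = the focus word's embedding row
    y = softmax(h @ W2)          # predicted probability of each word being the context word
    err = y.copy()
    err[context_idx] -= 1.0      # gradient of the cross-entropy loss w.r.t. the output logits
    grad_W1 = W2 @ err           # gradient for the focus word's embedding (uses pre-update W2)
    W2[:] -= lr * np.outer(h, err)
    W1[focus_idx] -= lr * grad_W1

# (focus, context) pairs from Step 2, expressed as word indices; placeholder values here.
pairs = [(0, 3), (3, 0), (1, 3), (3, 1)]
for _ in range(500):
    for f, c in pairs:
        train_step(f, c)

embeddings = W1   # the 18*2 weight matrix: one two-number embedding per word
```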
The beauty of these embedding vectors is that they try to capture the context around the words such that words with similar contexts have a smaller distance between them. This is also reflected in a scatter plot where the words with similar contexts get placed closer to one another!
Step 6: We extract the 18*2 matrix that represents the embeddings for the 18 words. We plot this matrix to see how the affinities between words are captured by the neural network.
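A plotting sketch along these lines (the random matrix is only a placeholder so the snippet runs on its own; in practice you would pass in the W1 matrix and the 18 actual words from the previous step):

```python
import numpy as np
import matplotlib.pyplot as plt

# `embeddings` should be the 18*2 weight matrix (W1) learned above, and `vocab`
# the 18 words; random placeholders are used here only so the snippet runs on its own.
embeddings = np.random.default_rng(0).normal(size=(18, 2))
vocab = [f"word_{i}" for i in range(18)]

plt.figure(figsize=(6, 6))
plt.scatter(embeddings[:, 0], embeddings[:, 1])
for i, word in enumerate(vocab):
    plt.annotate(word, embeddings[i])
plt.title("2-D embeddings: words with similar contexts land close together")
plt.show()
```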
To summarize, the approach includes the following steps (a library-based sketch of the full pipeline follows the list):
- Read the text.
- Pre-process the text.
- Create a mapping of focus and context vectors.
- Create their one hot encodings.
- Train the neural network.
- Extract the weights from the hidden (embedding) layer and use them as the word embedding vectors.
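If you would rather not build the network by hand, libraries implement this whole pipeline. The sketch below uses gensim’s Word2Vec (parameter names follow gensim 4.x); the sentences are made-up stand-ins, and vector_size=2 is chosen only so the result stays plottable like the toy example above.

```python
from gensim.models import Word2Vec

# Toy, made-up sentences (already lower-cased and tokenized); real corpora are far larger.
sentences = [
    ["drogo", "leads", "the", "dothraki"],
    ["daenerys", "rules", "beside", "drogo"],
    ["the", "dothraki", "follow", "their", "khal"],
]

# window=2 mirrors the context size used above; vector_size=2 keeps the result plottable.
# sg=1 selects the Skip-gram variant, sg=0 would select CBOW (gensim 4.x parameter names).
model = Word2Vec(sentences, vector_size=2, window=2, min_count=1, sg=1, epochs=200, seed=1)

print(model.wv["drogo"])               # the learned 2-D vector for "drogo"
print(model.wv.most_similar("drogo"))  # nearest neighbours by cosine similarity
```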
We have presented a simplified approach to explain word embeddings and the distributed representation of words. The word2vec algorithms have some more nuances. There are two popular techniques: CBOW (Continuous Bag of Words) and Skip-gram. The CBOW model predicts the target word from the surrounding context words. The Skip-gram model takes the exact opposite approach and predicts the context words given a target word. In practice, Skip-gram is usually trained with negative sampling, where the model learns to distinguish true context words from randomly sampled “negative” words.
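To make the difference concrete, the hypothetical helper below (not part of any library) lists the training examples that each variant derives from one tokenized sentence with a window of two.

```python
# A hypothetical helper showing the training examples that the CBOW and Skip-gram
# variants derive from one tokenized sentence with a window of 2.
tokens = ["drogo", "leads", "dothraki", "horde"]
WINDOW = 2

def training_examples(tokens, window):
    cbow, skipgram = [], []
    for i, target in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        cbow.append((context, target))                 # CBOW: context words -> target word
        skipgram.extend((target, c) for c in context)  # Skip-gram: target word -> each context word
    return cbow, skipgram

cbow, skipgram = training_examples(tokens, WINDOW)
print(cbow[0])       # (['leads', 'dothraki'], 'drogo')
print(skipgram[:2])  # [('drogo', 'leads'), ('drogo', 'dothraki')]
```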
What are the limitations of Word2Vec?
As much as Word2Vec simplifies the vectorization process, it has its challenges, such as:
Has difficulty handling unknown words
This is a significant drawback of Word2Vec as it cannot deal with unknown or out-of-vocabulary words. When an unfamiliar word is introduced, Word2Vec cannot recognize it, so it assigns a random vector for it that may not make any sense, resulting in subpar performance. This limitation is particularly problematic in noisy dataset environments like Twitter, where many words appear infrequently in a large text corpus.
Doesn’t have shared representations at sub-word levels
Word2Vec lacks the ability to provide shared representations at sub-word levels and ends up treating each word as an independent vector. This can be challenging for morphologically complex languages like German, Turkish, or Arabic, where many words are morphologically similar and thus nuanced linguistic relationships cannot be captured.
Difficult to scale to new languages
Using Word2Vec for a new language involves training a new embedding matrix. Because parameters cannot be shared across languages, applying a single model to cross-lingual use cases is difficult: each new language needs its own embedding matrix, so a shared model is not effective given each language’s unique linguistic context.
Harnessing the power of word embeddings
Word embeddings provide the power of understanding contextual similarities in words. This can be used in a variety of ways. Below are some examples:
Search Engines: Word embeddings can improve the matches in a search engine. E.g., if you search for “soccer,” the search engine also gives you results for “football,” as they are two different names for the same game.
Language Translation: Word embeddings are crucial for language translation. Two or more words with the same meaning in different languages would have similar vectors, which makes it easier for a computer to translate from one language to another. E.g., “engineer” in English translates to “ingeniero” or “ingeniera” in Spanish. The word embeddings for all three words would be similar, and hence the machine would be able to translate text with better accuracy.
Chatbots: We have seen an increasing use of chatbots across different applications for different purposes. The users of a chatbot can write a query in any form and can use different words to convey the same thing. For example, for a taxi booking application’s chatbot, a user can either say, “Book me a cab” or “Please reserve a taxi for me.” Both sentences convey the same thing. Using word embeddings, a chatbot can understand that they are the same and act on it accordingly.
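As a rough sketch of how that comparison could work, the snippet below averages word vectors into sentence vectors and compares them with cosine similarity. The vectors are hand-picked, hypothetical values in which synonyms already sit close together; a real chatbot would use learned embeddings and typically a more sophisticated sentence encoder.

```python
import numpy as np

# Hypothetical embeddings in which synonyms ("book"/"reserve", "cab"/"taxi") sit close together.
embeddings = {
    "book":    np.array([0.9, 0.1, 0.0]),
    "reserve": np.array([0.8, 0.2, 0.1]),
    "cab":     np.array([0.1, 0.9, 0.0]),
    "taxi":    np.array([0.2, 0.8, 0.1]),
    "me":      np.array([0.0, 0.0, 1.0]),
    "a":       np.array([0.0, 0.0, 0.5]),
    "please":  np.array([0.1, 0.0, 0.6]),
    "for":     np.array([0.0, 0.1, 0.5]),
}

def sentence_vector(sentence):
    """Represent a sentence as the average of its word vectors (a simple, common baseline)."""
    words = [w for w in sentence.lower().split() if w in embeddings]
    return np.mean([embeddings[w] for w in words], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = sentence_vector("Book me a cab")
b = sentence_vector("Please reserve a taxi for me")
print(cosine(a, b))  # close to 1.0: the chatbot can treat the two requests as the same intent
```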
Conclusion
This blog chapter discussed the concept of word embeddings, which help a machine understand the semantics of text. Word embeddings are a good analogy to how humans understand language: humans first understand the meaning of the words in a text and then work out what the text entails. Similarly, a computer understands the meaning of words by using word embeddings. This blog chapter also explored how to generate word embeddings using one of the most popular techniques, word2vec. Word embeddings are the stepping stone for current advancements in Natural Language Processing (NLP) like BERT and ChatGPT.