Build Custom Word2Vec Model

🧠 Embedding Space Learning
“An embedding is a relatively low-dimensional space into which you can translate high-dimensional vectors. Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models.” – Machine Learning Crash Course with TensorFlow APIs
This notebook is based on the official Word2Vec tutorial from TensorFlow.
Recreating Word2Vec
What is Word2Vec?
Word2Vec [Mikolov, Tomas, et al. 2013a and Mikolov, Tomas, et al. 2013b] is a popular natural language processing technique that is used to create high-quality vector representations of words from large datasets of text. It is a neural network based model that is capable of capturing the semantic and syntactic meaning of words, and it has been widely used in various downstream NLP tasks such as text classification, sentiment analysis, and machine translation. Word2Vec has revolutionized the field of NLP by providing a more efficient and effective way to analyze and understand natural language text. In this document, we will provide a comprehensive overview of Word2Vec, its architecture, and recreate Word2Vec for our custom dataset.
Most Common Types of Methods for Word2Vec
There are two main types of methods used to create Word2Vec models:
Continuous Bag of Words (CBOW): In this method, the model predicts the target word based on the context words that surround it. The context words are used as input to the model, and the output is the probability distribution of the target word given the context words.
Skip-gram: In this method, the model predicts the context words given a target word. The target word is used as input to the model, and the output is the probability distribution of the context words given the target word.
Both methods use a neural network architecture with one hidden layer to learn the vector representations of the words. The size of the hidden layer determines the dimensionality of the word vectors, and typically ranges from a few hundred to a few thousand. The Word2Vec models are trained on large corpora of text data using stochastic gradient descent, and the resulting word vectors are used in various NLP applications.
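To make the two directions of prediction concrete, here is a tiny illustrative sketch (plain Python, helper names of our own, not library code) of how a single context window turns into training pairs under each method; the generated pairs later in this notebook follow the Skip-gram form.

# Illustrative sketch: one window of size 2 around the centre word "fox".
window = ["quick", "brown", "fox", "jumps", "over"]
target = window[2]
context = [w for i, w in enumerate(window) if i != 2]

# CBOW: the surrounding context words jointly predict the target word.
cbow_pair = (context, target)
# (['quick', 'brown', 'jumps', 'over'], 'fox')

# Skip-gram: the target word predicts each context word separately.
skip_gram_pairs = [(target, w) for w in context]
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]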
🥷 For our use case in this Assignment, the downstream tasks we are interested in are Knowledge Graphs, Topic Modeling, and Entity-Relationship extraction. The Skip-gram approach is well suited to these tasks because, underneath, it tries to predict the context of a given word, where the context can be taken as the neighboring words of that word. This will be very useful for establishing strong relationships between different words.
Skip-gram
Skip-gram is a natural language processing technique used to create vector representations of words. As mentioned earlier, it is a type of Word2Vec model that learns to predict the context words given a target word. The basic idea behind Skip-gram is to use the target word as input to a neural network and predict the probability distribution of the context words that are likely to appear with the target word in a sentence.
The Skip-gram model takes a corpus of text as input, and creates a vocabulary of all the unique words in the corpus. Each word is represented by a vector of a fixed dimensionality (e.g., 100, 200, or 300). The Skip-gram model then trains a neural network on this vocabulary using a sliding window approach.
In this approach, a window of fixed size (e.g., 5) is moved across the text corpus, and for each target word in the window, the model is trained to predict the surrounding context words. This process is repeated for all target words in the corpus.
During training, the model adjusts the vector representations of each word in the vocabulary based on the prediction errors. After training, the word vectors are used to represent the semantic and syntactic meaning of words, and can be used in various downstream NLP tasks such as sentiment analysis, text classification, and machine translation.
Here are a few examples of Skip-grams:
Consider the sentence “The quick brown fox jumps over the lazy dog”. Using a window size of 2, the Skip-gram model would generate training pairs like (quick, The), (quick, brown), (brown, quick), (brown, fox), (fox, brown), and so on. The model learns to predict the context words (e.g., The, brown, fox) given a target word (e.g., quick).
Let’s say we are training a Skip-gram model on a corpus of movie reviews. The model might learn that the word “awesome” tends to appear in the context of positive sentiment words like “great”, “fantastic”, and “amazing”, while it is less likely to appear in the context of negative sentiment words like “bad”, “terrible”, and “awful”. This information can then be used to perform sentiment analysis on new movie reviews.
Suppose we want to train a Skip-gram model to represent the semantic relationships between different animals. The model might learn that the vector representations of “dog” and “cat” are similar, while the vectors of “dog” and “snake” are dissimilar. This information can then be used to perform tasks such as animal classification or identification. This example is very close to our use case in this Assignment 🥷.
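As a quick illustration of the sliding-window pair generation described above, here is a plain-Python sketch (our own helper, not library code) that produces such pairs for the example sentence:

def skip_gram_pairs(tokens, window_size=2):
    # Pair every target word with each word that falls inside its window.
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(skip_gram_pairs("The quick brown fox jumps over the lazy dog".split()))
# [('The', 'quick'), ('The', 'brown'), ('quick', 'The'), ('quick', 'brown'), ('quick', 'fox'), ...]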
The training objective of the Skip-gram model can be represented by the following negative log-likelihood function
$$-\frac{1}{T}\sum_{t=1}^{T}\sum_{-c\le j\le c, j\ne 0}\log P(w_{t+j}\mid w_t)$$
where
- $T$ is the total number of words in the corpus,
- $c$ is the size of the context window,
- $w_t$ is the target word at position $t$,
- $w_{t+j}$ is the context word $j$ positions away from the target word,
- $P(w_{t+j}\mid w_t)$ is the probability of the context word given the target word.
The Skip-gram model is trained to minimize this negative log-likelihood (equivalently, to maximize the average log probability of the context words) by adjusting the vector representations of the words in the corpus. The conditional probability is modeled with a softmax over the vocabulary:
$$P(w_{t+j}\mid w_t)=\frac{\exp(\mathbf{v}_{w_{t+j}}\cdot\mathbf{v}_{w_t})}{\sum_{i=1}^{V}\exp(\mathbf{v}_{w_i}\cdot\mathbf{v}_{w_t})}$$
where
- $\mathbf{v}_{w_{t+j}}$ is the vector representation of the context word $w_{t+j}$,
- $\mathbf{v}_{w_t}$ is the vector representation of the target word $w_t$,
- $V$ is the size of the vocabulary.
The dot product of the two vectors measures the similarity between the target word and the context word, and the softmax function normalizes the probabilities of all the context words in the vocabulary. The Skip-gram model learns to maximize the probability of the context words that are likely to appear with the target word in the corpus. 🥷
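As a small numerical sketch of this softmax, here is a NumPy toy example with made-up vectors and, for brevity, a single embedding matrix (the model trained later in this notebook keeps separate target and context embeddings):

import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 4                               # toy vocabulary size and embedding dimension
embeddings = rng.normal(size=(V, d))      # random stand-in for learned word vectors

def context_probabilities(target_index, embeddings):
    # Dot product of the target vector with every word vector, normalised
    # with a softmax over the whole vocabulary.
    scores = embeddings @ embeddings[target_index]
    scores -= scores.max()                # for numerical stability
    return np.exp(scores) / np.exp(scores).sum()

probs = context_probabilities(target_index=3, embeddings=embeddings)
print(probs.sum())                        # sums to 1 -- a valid distribution over the vocabulary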
Computing the denominator of this formulation involves performing a full softmax over the entire vocabulary, which is often large ($10^5$ to $10^7$ terms).
The noise contrastive estimation (NCE) loss function provides a useful alternative to the full softmax for learning word embeddings.
The objective of NCE loss is to distinguish context words from negative samples drawn from a noise distribution. This negative sampling can simplify the NCE loss for a target word by posing it as a classification problem between the context word and a certain number of negative samples. This provides an efficient approximation of the full softmax over the vocabulary in a skip-gram model.
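For reference, the negative sampling objective introduced in Mikolov, Tomas, et al. 2013b replaces the log-softmax term for a single (target, context) pair with
$$\log\sigma(\mathbf{v}'_{w_{t+j}}\cdot\mathbf{v}_{w_t})+\sum_{i=1}^{k}\mathbb{E}_{w_i\sim P_n(w)}\left[\log\sigma(-\mathbf{v}'_{w_i}\cdot\mathbf{v}_{w_t})\right]$$
where $\sigma$ is the sigmoid function, $k$ is the number of negative samples, $P_n(w)$ is the noise distribution, and $\mathbf{v}'$ denotes the context (output) embedding of a word.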
A negative sample is defined as a (target word, context word) pair such that the context word does not appear in the window-size neighborhood of the target word. Let’s say for the sentence “The quick brown fox jumps over the lazy dog” we want to train a Skip-gram model with a context window of size 2. Given the target word “fox”, one negative sample could be the word “apple”, i.e. (fox, apple). We can draw this negative sample from a noise distribution that assigns low probabilities to words that are unlikely to appear in the context of the target word. In this case, “apple” is not likely to appear in the context of “fox”, so it serves as a suitable negative sample. Another example of a negative sample could be (fox, dog): since “dog” does not appear within the context window of “fox” in this sentence, it can also be used as a negative sample. However, it is important to note that the number of negative samples chosen for the Skip-gram model depends on the size of the corpus and the context window, and a larger number of negative samples can result in a more stable and accurate model.
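Here is a small NumPy sketch (toy counts and helper names of our own) of drawing negatives from such a noise distribution; Mikolov, Tomas, et al. 2013b report that the unigram distribution raised to the 3/4 power works well in practice:

import numpy as np

rng = np.random.default_rng(42)
vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog", "apple"]
counts = np.array([50, 5, 4, 6, 3, 7, 2, 5, 1], dtype=float)   # made-up corpus counts

noise_dist = counts ** 0.75
noise_dist /= noise_dist.sum()            # normalise into a probability distribution

def draw_negatives(target, context, num_ns=4):
    # Re-draw until the true context word is not among the sampled negatives.
    while True:
        negatives = rng.choice(vocab, size=num_ns, replace=False, p=noise_dist)
        if context not in negatives:
            return [(target, w) for w in negatives]

print(draw_negatives("fox", "brown"))     # e.g. [('fox', 'the'), ('fox', 'apple'), ...]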
Generating Skip-grams using Tensorflow
Generation of Skip-grams involves the following steps:
- Vectorize every sentence, encoding it as a list of word indices:
  - Convert the sentence into tokens.
  - Create a vocabulary to save the mappings from tokens to integer indices.
  - Use the vocabulary to vectorize every sentence in the dataset.
- Use tf.keras.preprocessing.sequence.skipgrams to create skipgrams:
  - This function transforms a sequence of word indexes (a list of integers) into tuples of words of the form:
    - (word, word in the same window), with label 1 (positive samples).
    - (word, random word from the vocabulary), with label 0 (negative samples).
  - Provide a word sequence (sentence), encoded as a list of word indices (integers), as input.
  - Provide the vocabulary size and window size as input.
import tqdm
import numpy as np
import pandas as pd
import tensorflow as tf
from collections import defaultdict
from utils import styled_print
sentence = "The quick brown fox jumps over the lazy dog"
def create_vocabulary(sentence):
tokens = list(sentence.lower().split())
vocabulary = defaultdict(int)
vocabulary['<pad>'] = 0
index = 1
for i, token in enumerate(tokens):
if token not in vocabulary:
vocabulary[token] = index
index += 1
inverse_vocabulary = {index: token for token, index in vocabulary.items()}
return tokens, vocabulary, inverse_vocabulary
def vectorize_sentence(sentence, vocabulary):
tokens = list(sentence.lower().split())
sentence = [vocabulary[word] for word in tokens]
return sentence
def print_skipgrams(skip_grams, labels, inverse_vocabulary, num_samples=5):
index = 0
if num_samples is None:
num_samples = len(skip_grams)
for target, context in skip_grams[:num_samples]:
styled_print(f"({target}, {context}): ({inverse_vocabulary[target]}, {inverse_vocabulary[context]}) : Label {labels[index]}")
index+=1
def create_skip_gram(sentence, window_size=2, sampling_table=None, only_positive_skip_grams=True):
tokens, vocabulary, inverse_vocabulary = create_vocabulary(sentence)
styled_print(f"Found {len(tokens)} Tokes: {tokens}", header=False)
styled_print(f"Vocabulary: {dict(vocabulary)}", header=False)
word_sequence = vectorize_sentence(sentence, vocabulary)
styled_print(f"Word Sequence: {word_sequence}", header=False)
if only_positive_skip_grams:
negative_samples = 0
else:
negative_samples = 1
skip_grams, labels = tf.keras.preprocessing.sequence.skipgrams(
word_sequence,
vocabulary_size=len(vocabulary),
window_size=window_size,
sampling_table=sampling_table,
negative_samples=negative_samples)
styled_print(f"Found Total {len(skip_grams)} skip grams")
return skip_grams, labels
Generating Positive Skipgrams
styled_print("Creating Skipgrams using Tensorflow", header=True)
styled_print(f"Some Samples of Positive Skip Grams Only", header=True)
skip_grams, labels = create_skip_gram(sentence, window_size=2, only_positive_skip_grams=True)
tokens, vocabulary, inverse_vocabulary = create_vocabulary(sentence)
print_skipgrams(skip_grams, labels, inverse_vocabulary, 5)
styled_print(f"Some Samples of Positive and Negative Skip Grams", header=True)
skip_grams, labels = create_skip_gram(sentence, window_size=2, only_positive_skip_grams=False)
tokens, vocabulary, inverse_vocabulary = create_vocabulary(sentence)
print_skipgrams(skip_grams, labels, inverse_vocabulary, 5)
styled_print(f"Some Samples of Positive and Negative Skip Grams with Window Size of 3", header=True)
skip_grams, labels = create_skip_gram(sentence, window_size=3, only_positive_skip_grams=False)
tokens, vocabulary, inverse_vocabulary = create_vocabulary(sentence)
print_skipgrams(skip_grams, labels, inverse_vocabulary, 5)
› Creating Skipgrams using Tensorflow
› Some Samples of Positive Skip Grams Only
› Found 9 Tokens: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
› Vocabulary: {'<pad>': 0, 'the': 1, 'quick': 2, 'brown': 3, 'fox': 4, 'jumps': 5, 'over': 6, 'lazy': 7, 'dog': 8}
› Word Sequence: [1, 2, 3, 4, 5, 6, 1, 7, 8]
› Found Total 30 skip grams
› (3, 2): (brown, quick) : Label 1
› (7, 6): (lazy, over) : Label 1
› (5, 6): (jumps, over) : Label 1
› (3, 4): (brown, fox) : Label 1
› (3, 5): (brown, jumps) : Label 1
› Some Samples of Positive and Negative Skip Grams
› Found 9 Tokens: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
› Vocabulary: {'<pad>': 0, 'the': 1, 'quick': 2, 'brown': 3, 'fox': 4, 'jumps': 5, 'over': 6, 'lazy': 7, 'dog': 8}
› Word Sequence: [1, 2, 3, 4, 5, 6, 1, 7, 8]
› Found Total 60 skip grams
› (2, 3): (quick, brown) : Label 0
› (8, 4): (dog, fox) : Label 0
› (3, 2): (brown, quick) : Label 1
› (5, 6): (jumps, over) : Label 1
› (1, 3): (the, brown) : Label 1
› Some Samples of Positive and Negative Skip Grams with Window Size of 3
› Found 9 Tokens: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
› Vocabulary: {'<pad>': 0, 'the': 1, 'quick': 2, 'brown': 3, 'fox': 4, 'jumps': 5, 'over': 6, 'lazy': 7, 'dog': 8}
› Word Sequence: [1, 2, 3, 4, 5, 6, 1, 7, 8]
› Found Total 84 skip grams
› (5, 7): (jumps, lazy) : Label 1
› (6, 4): (over, fox) : Label 0
› (3, 2): (brown, quick) : Label 1
› (4, 5): (fox, jumps) : Label 0
› (2, 6): (quick, over) : Label 0
Sampling Table
When dealing with large datasets, the vocabulary size tends to be bigger, with more frequently occurring words such as stopwords. However, using training examples from such commonly occurring words does not offer much useful information for the model to learn from. To address this, Mikolov, Tomas, et al. 2013a and Mikolov, Tomas, et al. 2013b have suggested that subsampling frequent words can improve the quality of word embeddings. A sampling table can be used to encode the probabilities of sampling any token in the training data. The tf.keras.preprocessing.sequence.skipgrams
function can accept a sampling table as input, and the tf.keras.preprocessing.sequence.make_sampling_table
function can generate a word-frequency rank based probabilistic sampling table that can be passed to the tf.keras.preprocessing.sequence.skipgrams
function. One can inspect the sampling probabilities for a vocabulary size of 10 as follows where sampling_table[i]
denotes the probability of sampling the i-th most common word in a dataset.
sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(size=10)
styled_print(sampling_table)
› [0.00315225 0.00315225 0.00547597 0.00741556 0.00912817 0.01068435
0.01212381 0.01347162 0.01474487 0.0159558 ]
Here we can see that the most frequent words have the lowest probability of being sampled. Let’s create a sampling table for our vocabulary and then create skip grams based on it.
tokens, vocabulary, inverse_vocabulary = create_vocabulary(sentence)
sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(len(vocabulary), sampling_factor=0.01)
styled_print(sampling_table)
› [0.09968283 0.09968283 0.17316546 0.23450073 0.288658 0.33786866
0.38338842 0.42601017 0.46627369]
Here we are setting sampling_factor=0.01, while the default value is sampling_factor=1e-5. The default value is better suited to a large vocabulary; as we have a small vocabulary, we need a slightly larger value.
styled_print("Creating Skipgrams using Tensorflow", header=True)
styled_print(f"Some Samples of Positive Skip Grams Only", header=True)
skip_grams, labels = create_skip_gram(sentence, window_size=2, sampling_table=sampling_table, only_positive_skip_grams=True)
print_skipgrams(skip_grams, labels, inverse_vocabulary, 5)
styled_print(f"Some Samples of Positive and Negative Skip Grams", header=True)
skip_grams, labels = create_skip_gram(sentence, window_size=2, sampling_table=sampling_table, only_positive_skip_grams=False)
print_skipgrams(skip_grams, labels, inverse_vocabulary, 5)
styled_print(f"Some Samples of Positive and Negative Skip Grams with Window Size of 3", header=True)
skip_grams, labels = create_skip_gram(sentence, window_size=3, sampling_table=sampling_table, only_positive_skip_grams=False)
print_skipgrams(skip_grams, labels, inverse_vocabulary, 5)
› Creating Skipgrams using Tensorflow
› Some Samples of Positive Skip Grams Only
› Found 9 Tokens: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
› Vocabulary: {'<pad>': 0, 'the': 1, 'quick': 2, 'brown': 3, 'fox': 4, 'jumps': 5, 'over': 6, 'lazy': 7, 'dog': 8}
› Word Sequence: [1, 2, 3, 4, 5, 6, 1, 7, 8]
› Found Total 4 skip grams
› (6, 5): (over, jumps) : Label 1
› (6, 7): (over, lazy) : Label 1
› (6, 1): (over, the) : Label 1
› (6, 4): (over, fox) : Label 1
› Some Samples of Positive and Negative Skip Grams
› Found 9 Tokens: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
› Vocabulary: {'<pad>': 0, 'the': 1, 'quick': 2, 'brown': 3, 'fox': 4, 'jumps': 5, 'over': 6, 'lazy': 7, 'dog': 8}
› Word Sequence: [1, 2, 3, 4, 5, 6, 1, 7, 8]
› Found Total 28 skip grams
› (4, 1): (fox, the) : Label 0
› (3, 2): (brown, quick) : Label 1
› (6, 4): (over, fox) : Label 1
› (3, 4): (brown, fox) : Label 0
› (4, 3): (fox, brown) : Label 1
› Some Samples of Positive and Negative Skip Grams with Window Size of 3
› Found 9 Tokens: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
› Vocabulary: {'<pad>': 0, 'the': 1, 'quick': 2, 'brown': 3, 'fox': 4, 'jumps': 5, 'over': 6, 'lazy': 7, 'dog': 8}
› Word Sequence: [1, 2, 3, 4, 5, 6, 1, 7, 8]
› Found Total 34 skip grams
› (8, 1): (dog, the) : Label 1
› (4, 6): (fox, over) : Label 1
› (4, 2): (fox, quick) : Label 0
› (2, 5): (quick, jumps) : Label 0
› (8, 1): (dog, the) : Label 0
Here we should focus on the total number of skipgrams found. We can see that with the sampling_table argument we get fewer skipgrams, because it assigns a low probability to selecting the most frequent words, i.e. “the” in our example.
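For reference, the subsampling heuristic proposed in Mikolov, Tomas, et al. 2013b discards a training word $w_i$ with probability
$$P(\text{discard } w_i)=1-\sqrt{\frac{t}{f(w_i)}}$$
where $f(w_i)$ is the word’s relative frequency in the corpus and $t$ is a small threshold (around $10^{-5}$ in the paper). tf.keras.preprocessing.sequence.make_sampling_table approximates word frequencies with a rank-based Zipf assumption rather than actual corpus counts, which is why it only needs the vocabulary size and a sampling_factor.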
Negative Sampling
Setting the only_positive_skip_grams argument to False above creates as many negative samples as positive skip grams in our data. That is a good feature if we would like to create a balanced dataset, but here we are interested in creating more negative samples for each positive sample, as this extends our dataset and is useful for the noise contrastive estimation (NCE) loss function. In this part we create $N$ negative samples for a given target word. This will be an important step in our data pipeline for Word2Vec model training. For this purpose we will use the tf.random.log_uniform_candidate_sampler function to sample num_ns words from the vocabulary.
def get_negative_sampling_candidates(context, num_ns, vocab_size, seed):
context_class = tf.reshape(tf.constant(context, dtype="int64"), (1, 1))
negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
true_classes=context_class,
num_true=1,
num_sampled=num_ns,
unique=True,
range_max=vocab_size,
seed=seed,
name="negative_sampling"
)
return negative_sampling_candidates
tokens, vocabulary, inverse_vocabulary = create_vocabulary(sentence)
skip_grams, labels = create_skip_gram(sentence, window_size=2, only_positive_skip_grams=True)
sample_target, sample_context = skip_grams[0]
styled_print(f"Let's sample negative candidates for {(sample_target, sample_context)} - {(inverse_vocabulary[sample_target], inverse_vocabulary[sample_context])} pair", header=True)
negative_sampling_candidates = get_negative_sampling_candidates(sample_context, 5, len(vocabulary), 1)
styled_print(f"Fetched {negative_sampling_candidates} indexes for negatives words")
styled_print([inverse_vocabulary[index.numpy()] for index in negative_sampling_candidates])
› Found 9 Tokens: ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
› Vocabulary: {'<pad>': 0, 'the': 1, 'quick': 2, 'brown': 3, 'fox': 4, 'jumps': 5, 'over': 6, 'lazy': 7, 'dog': 8}
› Word Sequence: [1, 2, 3, 4, 5, 6, 1, 7, 8]
› Found Total 30 skip grams
› Let's sample negative candidates for (2, 3) - ('quick', 'brown') pair
› Fetched [7 1 3 4 0] indexes for negatives words
› ['lazy', 'the', 'brown', 'fox', '<pad>']
As we can see, the negative_sampling_candidates sometimes also include our positive context class, while we might expect it to be explicitly excluded. This type of behavior is explained in this document and this comment. It is not intuitive, but the underlying idea is that even though in this particular example a given (target, context) pair is part of a positive skipgram, the same pair could be part of a negative skipgram in some other data.
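If we did want to guarantee that the positive context class is excluded, one simple workaround (our own sketch, reusing the helpers defined above, not something the original tutorial does) is to re-draw with a different seed until the context index is absent:

def get_negative_sampling_candidates_excluding(context, num_ns, vocab_size, seed):
    # Keep resampling with a new seed until the true context index is not drawn.
    candidates = get_negative_sampling_candidates(context, num_ns, vocab_size, seed)
    while int(context) in candidates.numpy():
        seed += 1
        candidates = get_negative_sampling_candidates(context, num_ns, vocab_size, seed)
    return candidates

candidates = get_negative_sampling_candidates_excluding(sample_context, 5, len(vocabulary), 1)
styled_print([inverse_vocabulary[index.numpy()] for index in candidates])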
Create tf.data Datapipeline
book_csv_file = "../data/clean_csvs/book-clean-paragraphs.csv"
book_df = pd.read_csv(book_csv_file)
book_df.head()
| | id | paragraphs |
|---|---|---|
| 0 | 101 | maesters citadel keep history westeros used ae... |
| 1 | 102 | either ac conquest bc conquest |
| 2 | 103 | true scholar know dating far precise aegon tar... |
| 3 | 104 | even start date matter misconception many assu... |
| 4 | 106 | battle war conquest fought thus seen aegon ’ a... |
book_df = book_df.drop(["id"], axis=1)
book_df.head()
| | paragraphs |
|---|---|
| 0 | maesters citadel keep history westeros used ae... |
| 1 | either ac conquest bc conquest |
| 2 | true scholar know dating far precise aegon tar... |
| 3 | even start date matter misconception many assu... |
| 4 | battle war conquest fought thus seen aegon ’ a... |
tf.data pipelines can be confusing to understand at first. In this notebook we take a step-by-step approach to explain and understand each stage of our data pipeline.
datagen = tf.data.Dataset.from_tensor_slices(dict(book_df))
styled_print(f"Checking first five Sample from the tf.data Datapipeline", header=True)
for feature_batch in datagen.take(5):
for key, value in feature_batch.items():
styled_print("{!r:15s}: {}".format(key, value))
› Checking first five Sample from the tf.data Datapipeline
› 'paragraphs' : b'maesters citadel keep history westeros used aegon \xe2\x80\x99 conquest touchstone past three hundred year birth death battle event dated'
› 'paragraphs' : b'either ac conquest bc conquest'
› 'paragraphs' : b'true scholar know dating far precise aegon targaryen \xe2\x80\x99 conquest seven kingdom take place single day two year passed aegon \xe2\x80\x99 landing oldtown coronation\xe2\x80\xa6and even conquest remained incomplete since dorne remained unsubdued sporadic attempt bring dornishmen realm continued king aegon \xe2\x80\x99 reign well reign son making impossible fix precise end date war conquest'
› 'paragraphs' : b'even start date matter misconception many assume wrongly reign king aegon targaryen began day landed mouth blackwater rush beneath three hill city king \xe2\x80\x99 landing would eventually stand day aegon \xe2\x80\x99 landing celebrated king descendant conqueror actually dated start reign day crowned anointed starry sept oldtown high septon faith coronation took place two year aegon \xe2\x80\x99 landing well three major'
› 'paragraphs' : b'battle war conquest fought thus seen aegon \xe2\x80\x99 actual conquering took place 2\xe2\x80\x931 bc conquest'
def only_sentence(sample):
return sample["paragraphs"]
datagen = datagen.map(only_sentence)
styled_print(
f"Checking first five Sample from the tf.data Datapipeline", header=True)
for feature_batch in datagen.take(5):
styled_print(feature_batch)
› Checking first five Sample from the tf.data Datapipeline
› b'maesters citadel keep history westeros used aegon \xe2\x80\x99 conquest touchstone past three hundred year birth death battle event dated'
› b'either ac conquest bc conquest'
› b'true scholar know dating far precise aegon targaryen \xe2\x80\x99 conquest seven kingdom take place single day two year passed aegon \xe2\x80\x99 landing oldtown coronation\xe2\x80\xa6and even conquest remained incomplete since dorne remained unsubdued sporadic attempt bring dornishmen realm continued king aegon \xe2\x80\x99 reign well reign son making impossible fix precise end date war conquest'
› b'even start date matter misconception many assume wrongly reign king aegon targaryen began day landed mouth blackwater rush beneath three hill city king \xe2\x80\x99 landing would eventually stand day aegon \xe2\x80\x99 landing celebrated king descendant conqueror actually dated start reign day crowned anointed starry sept oldtown high septon faith coronation took place two year aegon \xe2\x80\x99 landing well three major'
› b'battle war conquest fought thus seen aegon \xe2\x80\x99 actual conquering took place 2\xe2\x80\x931 bc conquest'
Before we actually create batches of data for model training, we need to create skipgrams. Previously we created a custom function to vectorize the sentence. In this section, as we have a bigger dataset, instead of using that custom function we are using the tf.keras.layers.TextVectorization layer to vectorize our dataset and create its vocabulary.
vectorize_layer = tf.keras.layers.TextVectorization(
max_tokens=None,
standardize='strip_punctuation',
split='whitespace',
ngrams=None,
output_mode='int',
output_sequence_length=None,
pad_to_max_tokens=False
)
vectorize_layer.adapt(datagen.batch(1024))
styled_print("Create Vocabulary and Vectorizer", header=True)
styled_print(f"The size of our Vocabulary is {vectorize_layer.vocabulary_size()}")
styled_print(f"First 10 tokens from our vocabulary are {vectorize_layer.get_vocabulary()[:20]}")
› Create Vocabulary and Vectorizer
› The size of our Vocabulary is 10891
› First 20 tokens from our vocabulary are ['', '[UNK]', '’', 'king', 'lord', '“', '”', 'queen', 'would', 'ser', 'aegon', 'prince', 'one', 'dragon', 'men', 'year', 'son', 'said', 'hand', 'lady']
Here the get_vocabulary() method returns the tokens sorted in descending order of frequency.
text_vector_datagen = datagen.batch(1024).prefetch(
tf.data.AUTOTUNE).map(vectorize_layer).unbatch()
styled_print(
f"Checking first Sample from the tf.data Datapipeline", header=True)
for feature_batch in text_vector_datagen.take(2):
styled_print(f"The Length of Sequence is {len(feature_batch)}")
styled_print(feature_batch)
› Checking first Sample from the tf.data Datapipeline
› The Length of Sequence is 127
› [ 371 398 79 345 137 685 10 2 364 7217 524 32 101 15
225 76 122 820 6450 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0]
› The Length of Sequence is 127
› [ 822 82 364 4307 364 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0]
We can make two main observations here:
- The length of each vectorized sentence is 127. This is the same as the longest sentence in our dataset.
- Each vectorized sequence is padded with 0s at the end to make sure that every sentence has the same length after vectorization. This is very useful for neural network training as it gives us a fixed input dimension (see the optional sketch below for capping this length explicitly).
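TextVectorization also accepts an explicit output_sequence_length if a shorter fixed length is preferred. A minimal sketch, with an example value of our own choosing:

# Optional: pad/truncate every vectorized sentence to an explicit fixed length
# instead of the longest sentence seen during vectorization.
capped_vectorize_layer = tf.keras.layers.TextVectorization(
    standardize='strip_punctuation',
    split='whitespace',
    output_mode='int',
    output_sequence_length=64)   # example value; every output has exactly 64 tokens
capped_vectorize_layer.adapt(datagen.batch(1024))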
styled_print(
f"Check Mapping of Some Vectorized Indexes with Tokens (Words)", header=True)
for feature_batch in text_vector_datagen.take(1):
for token in feature_batch[:30]:
styled_print(f"{token} --> {vectorize_layer.get_vocabulary()[token]}")
› Check Mapping of Some Vectorized Indexes with Tokens (Words)
› 371 --> maesters
› 398 --> citadel
› 79 --> keep
› 345 --> history
› 137 --> westeros
› 685 --> used
› 10 --> aegon
› 2 --> ’
› 364 --> conquest
› 7217 --> touchstone
› 524 --> past
› 32 --> three
› 101 --> hundred
› 15 --> year
› 225 --> birth
› 76 --> death
› 122 --> battle
› 820 --> event
› 6450 --> dated
› 0 -->
› 0 -->
› 0 -->
› 0 -->
› 0 -->
› 0 -->
› 0 -->
› 0 -->
› 0 -->
› 0 -->
› 0 -->
Let’s wrap everything we have discussed so far in one function.
def get_preprocessing_datapipeline(df, batch_size=512):
datagen = tf.data.Dataset.from_tensor_slices(dict(df))
datagen = datagen.map(only_sentence, num_parallel_calls=tf.data.AUTOTUNE)
vectorize_layer = tf.keras.layers.TextVectorization(
max_tokens=None,
standardize='strip_punctuation',
split='whitespace',
ngrams=None,
output_mode='int',
output_sequence_length=None,
pad_to_max_tokens=False
)
vectorize_layer.adapt(datagen.batch(batch_size))
text_vector_datagen = datagen.batch(1024).prefetch(
tf.data.AUTOTUNE).map(vectorize_layer, num_parallel_calls=tf.data.AUTOTUNE).unbatch()
return datagen, text_vector_datagen, vectorize_layer
_, text_vector_datagen, vectorize_layer = get_preprocessing_datapipeline(
book_df)
sequences = list(text_vector_datagen.as_numpy_iterator())
styled_print(
f"Checking first few Sequences", header=True)
for sequence in sequences[:1]:
styled_print(f"The Length of Sequence is {len(sequence)}")
styled_print(sequence)
› Checking first few Sequences
› The Length of Sequence is 127
› [ 371 398 79 345 137 685 10 2 364 7217 524 32 101 15
225 76 122 820 6450 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0]
Next we create a function which takes these sequences as input and creates skipgrams as training pairs.
def create_skip_grams(sequence, vocabulary_size, window_size=2, sampling_table=None, only_positive_skip_grams=True):
if only_positive_skip_grams:
negative_samples = 0
else:
negative_samples = 1
skip_grams, labels = tf.keras.preprocessing.sequence.skipgrams(
sequence,
vocabulary_size=vocabulary_size,
window_size=window_size,
sampling_table=sampling_table,
negative_samples=negative_samples)
return skip_grams, labels
def get_negative_sampling_candidates(context, num_ns, vocabulary_size, seed):
context_class = tf.reshape(tf.constant(context, dtype="int64"), (1, 1))
negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
true_classes=context_class,
num_true=1,
num_sampled=num_ns,
unique=True,
range_max=vocabulary_size,
seed=seed,
name="negative_sampling"
)
return negative_sampling_candidates
def create_training_pairs(sequences, window_size, num_ns, vocabulary_size, seed=1):
targets, contexts, labels = [], [], []
sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(
size=vocabulary_size)
for sequence in tqdm.tqdm(sequences):
skip_grams, _ = create_skip_grams(
sequence,
vocabulary_size,
window_size,
sampling_table,
only_positive_skip_grams=True
)
for target_word, context_word in skip_grams:
negative_sampling_candidates = get_negative_sampling_candidates(
context_word, num_ns, vocabulary_size, seed
)
# Build context and label vectors (for one target word)
context = tf.concat(
[tf.constant([context_word], dtype="int64"), negative_sampling_candidates], 0)
label = tf.constant([1] + [0]*num_ns, dtype="int64")
# Append each element from the training example to global lists.
targets.append(target_word)
contexts.append(context)
labels.append(label)
return targets, contexts, labels
targets, contexts, labels = create_training_pairs(
sequences=sequences,
window_size=2,
num_ns=4,
vocabulary_size=vectorize_layer.vocabulary_size(),
seed=1)
targets = np.array(targets)
contexts = np.array(contexts)
labels = np.array(labels)
print('\n')
print(f"targets.shape: {targets.shape}")
print(f"contexts.shape: {contexts.shape}")
print(f"labels.shape: {labels.shape}")
100%|██████████| 3129/3129 [00:34<00:00, 89.75it/s]
targets.shape: (119736,)
contexts.shape: (119736, 5)
labels.shape: (119736, 5)
BATCH_SIZE = 1024
BUFFER_SIZE = 10000
dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print(dataset)
<BatchDataset element_spec=((TensorSpec(shape=(1024,), dtype=tf.int64, name=None), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None)), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None))>
dataset = dataset.cache().prefetch(buffer_size=tf.data.AUTOTUNE)
print(dataset)
<PrefetchDataset element_spec=((TensorSpec(shape=(1024,), dtype=tf.int64, name=None), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None)), TensorSpec(shape=(1024, 5), dtype=tf.int64, name=None))>
class Word2Vec(tf.keras.Model):
def __init__(self, vocab_size, embedding_dim, num_ns):
super(Word2Vec, self).__init__()
self.target_embedding = tf.keras.layers.Embedding(vocab_size,
embedding_dim,
input_length=1,
name="w2v_embedding")
self.context_embedding = tf.keras.layers.Embedding(vocab_size,
embedding_dim,
input_length=num_ns+1)
def call(self, pair):
target, context = pair
# target: (batch, dummy?) # The dummy axis doesn't exist in TF2.7+
# # context: (batch, context)
if len(target.shape) == 2:
target = tf.squeeze(target, axis=1)
# target: (batch,)
word_emb = self.target_embedding(target)
# word_emb: (batch, embed)
context_emb = self.context_embedding(context)
# context_emb: (batch, context, embed)
dots = tf.einsum('be,bce->bc', word_emb, context_emb)
# dots: (batch, context)
return dots
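As a quick sanity check on the shapes flowing through call (toy sizes of our own, not the real vocabulary), a forward pass on a dummy batch should return one logit per candidate context word:

toy_model = Word2Vec(vocab_size=10, embedding_dim=8, num_ns=4)
toy_targets = tf.constant([1, 2])                     # (batch,)
toy_contexts = tf.constant([[3, 4, 5, 6, 7],
                            [1, 2, 3, 4, 5]])         # (batch, num_ns + 1)
print(toy_model((toy_targets, toy_contexts)).shape)   # (2, 5): positive + 4 negatives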
embedding_dim = 512
word2vec = Word2Vec(vectorize_layer.vocabulary_size(), embedding_dim, 4)
word2vec.compile(optimizer='adam',
loss=tf.keras.losses.CategoricalCrossentropy(
from_logits=True),
metrics=['accuracy'])
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")
word2vec.fit(dataset, epochs=20, callbacks=[tensorboard_callback])
Epoch 1/20
116/116 [==============================] - 11s 89ms/step - loss: 1.6034 - accuracy: 0.2405
Epoch 2/20
116/116 [==============================] - 10s 89ms/step - loss: 1.4765 - accuracy: 0.8175
Epoch 3/20
116/116 [==============================] - 10s 90ms/step - loss: 1.2702 - accuracy: 0.8141
Epoch 4/20
116/116 [==============================] - 11s 92ms/step - loss: 1.0210 - accuracy: 0.8710
Epoch 5/20
116/116 [==============================] - 11s 91ms/step - loss: 0.7571 - accuracy: 0.9185
Epoch 6/20
116/116 [==============================] - 11s 93ms/step - loss: 0.5334 - accuracy: 0.9465
Epoch 7/20
116/116 [==============================] - 11s 92ms/step - loss: 0.3733 - accuracy: 0.9658
Epoch 8/20
116/116 [==============================] - 11s 91ms/step - loss: 0.2676 - accuracy: 0.9788
Epoch 9/20
116/116 [==============================] - 11s 91ms/step - loss: 0.1986 - accuracy: 0.9863
Epoch 10/20
116/116 [==============================] - 11s 92ms/step - loss: 0.1526 - accuracy: 0.9905
Epoch 11/20
116/116 [==============================] - 10s 91ms/step - loss: 0.1209 - accuracy: 0.9927
Epoch 12/20
116/116 [==============================] - 11s 91ms/step - loss: 0.0984 - accuracy: 0.9939
Epoch 13/20
116/116 [==============================] - 11s 91ms/step - loss: 0.0819 - accuracy: 0.9947
Epoch 14/20
116/116 [==============================] - 11s 91ms/step - loss: 0.0695 - accuracy: 0.9954
Epoch 15/20
116/116 [==============================] - 11s 91ms/step - loss: 0.0599 - accuracy: 0.9958
Epoch 16/20
116/116 [==============================] - 10s 90ms/step - loss: 0.0524 - accuracy: 0.9961
Epoch 17/20
116/116 [==============================] - 10s 90ms/step - loss: 0.0465 - accuracy: 0.9964
Epoch 18/20
116/116 [==============================] - 11s 91ms/step - loss: 0.0416 - accuracy: 0.9965
Epoch 19/20
116/116 [==============================] - 11s 93ms/step - loss: 0.0377 - accuracy: 0.9966
Epoch 20/20
116/116 [==============================] - 11s 92ms/step - loss: 0.0344 - accuracy: 0.9967
<keras.callbacks.History at 0x7fc510f5abe0>
print(word2vec.summary())
Model: "word2_vec"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
w2v_embedding (Embedding) multiple 5576192
embedding (Embedding) multiple 5576192
=================================================================
Total params: 11,152,384
Trainable params: 11,152,384
Non-trainable params: 0
_________________________________________________________________
None
weights = word2vec.get_layer('w2v_embedding').get_weights()[0]
vocab = vectorize_layer.get_vocabulary()
import io
out_v = io.open('vectors.tsv', 'w', encoding='utf-8')
out_m = io.open('metadata.tsv', 'w', encoding='utf-8')
for index, word in enumerate(vocab):
if index == 0:
continue # skip 0, it's padding.
vec = weights[index]
out_v.write('\t'.join([str(x) for x in vec]) + "\n")
out_m.write(word + "\n")
out_v.close()
out_m.close()
word2vec.save('model')
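To spot-check that the learned space behaves as described earlier (semantically related words ending up close together), a small cosine-similarity lookup over the exported weights could look like this (a sketch of our own; the query word is just an example and must exist in the vocabulary):

def most_similar(query, weights, vocab, top_k=5):
    # Cosine similarity between the query word's vector and every other word vector.
    index = vocab.index(query)
    normed = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    scores = normed @ normed[index]
    best = np.argsort(-scores)[1:top_k + 1]           # skip the query word itself
    return [(vocab[i], float(scores[i])) for i in best]

print(most_similar("aegon", weights, vocab))          # example query from our corpus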