There are several word embeddings openly available that were trained on very large amounts of data taken from sources such as Wikipedia and Google News.
The most commonly used pre-trained embeddings, Word2Vec and GloVe, are easily available on the Internet.
What is Word Embedding?
Word embeddings are representations of the words present in a document: in a deep network, each word is represented by a dense vector.
By contrast, naive representations such as one-hot encodings are sparse, because a huge vocabulary means thousands of dimensions are needed to represent each word.
There are algorithms for generating these dense vectors, one per word.
The values of each word's vector representation are learned from the text data with the help of these algorithms.
The discussion of algorithms in detail is out of scope of this blog.
The algorithms used to generate Word2Vec embeddings are:
- Skip Gram Model
- Continuous Bag of Words
Suppose we have several documents containing 2000 distinct words in total, which is quite small: the vocabulary size is 2000, so a one-hot encoding would represent each word by a 2000-dimensional vector.
For more details on word embeddings, you can refer to Stanford's CS 224N course.
Link to the YouTube tutorials: https://www.youtube.com/watch?v=8rXD5-xhemo&list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z
Keras Embedding Layer
Keras has an Embedding layer which is commonly used for neural networks on text data.
In machine learning it is important to encode words as numbers, so each word is mapped to a unique integer.
We first need to prepare a vocabulary; based on that vocabulary, and using the Tokenizer API of Keras, we can prepare our input data for the classification/prediction task.
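As a minimal sketch of that Tokenizer-based workflow (a hedged alternative to the one_hot encoding used later; the Tokenizer assigns frequency-ranked integer ids rather than hashed ones):

```python
# Building a vocabulary and encoding documents with the Keras Tokenizer API.
from keras.preprocessing.text import Tokenizer

document = ['Nice Clothes!', 'Very good shop for clothes', 'Amazing clothes',
            'Clothes are good', 'Superb!', 'Very bad', 'Poor quality',
            'not good', 'clothes fitting bad', 'Shop not good']

tokenizer = Tokenizer()
tokenizer.fit_on_texts(document)                  # builds the vocabulary
encoded = tokenizer.texts_to_sequences(document)  # words -> integer ids

# ids start at 1, so the vocabulary size for an Embedding layer is +1
vocab_size = len(tokenizer.word_index) + 1        # 14 unique words + 1
```

The most frequent word ("clothes" in this toy corpus) gets id 1, the next most frequent gets id 2, and so on.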
The Embedding layer is initialized with random weights and then learns the word vectors from the text during training.
The generated embeddings can be saved and used for further modelling purposes.
The Embedding layer has three main arguments:
input_dim: the size of the vocabulary in the text data.
output_dim: the dimensionality of the dense vectors in which words will be embedded.
input_length: the length of the input sequences. Since documents usually differ in length, we pad them all to a fixed length.
Let’s see how to create embeddings for our small dataset.
Building a model from scratch using the Embedding layer
## Importing all the modules from Keras
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
## Creating a document for making a word vectors using Embedding Layer
document = ['Nice Clothes!', 'Very good shop for clothes',
'Amazing clothes', 'Clothes are good', 'Superb!', 'Very bad', 'Poor quality', 'not good', 'clothes fitting bad', 'Shop not good']
## Defining labels for the corresponding documents for sentiment analysis (1 = positive, 0 = negative)
labels = [1,1,1,1,1,0,0,0,0,0]
## Let’s keep the vocab size 40
vocab_size = 40
## Encoding the documents as integers; this could also be done
## with the Tokenizer API
encoded_documents = [one_hot(d, vocab_size) for d in document]
## Output of the encoded documents
[[32, 31], [27, 1, 32, 25, 31], [12, 31], [31, 1, 1], [38], [27, 32], [14, 39], [16, 1], [31, 25, 32], [32, 16, 1]]
## Padding the sequences to make all documents of equal length
maxlength = 5
padded_documents = pad_sequences(encoded_documents, maxlen = maxlength, padding ='post')
## Output of the documents padded with zeros
[[32 31 0 0 0]
[27 1 32 25 31]
[12 31 0 0 0]
[31 1 1 0 0]
[38 0 0 0 0]
[27 32 0 0 0]
[14 39 0 0 0]
[16 1 0 0 0]
[31 25 32 0 0]
[32 16 1 0 0]]
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=maxlength))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.summary()
model.fit(padded_documents, labels, verbose=0, epochs=50)
loss, accuracy = model.evaluate(padded_documents, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))
## Output of model.summary()
Layer (type)                 Output Shape              Param #
=================================================================
embedding_2 (Embedding)      (None, 5, 10)             400
_________________________________________________________________
flatten_2 (Flatten)          (None, 50)                0
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 51
=================================================================
Total params: 451
Trainable params: 451
Non-trainable params: 0
So, the above code trains a word embedding on a small dataset.
In this blog we have seen how to generate word embeddings using the Embedding layer from Keras.
We also touched on the basics of Word2Vec and the Skip-Gram and CBOW algorithms used to generate it.
Furthermore, with this Embedding layer we can do a lot more, such as saving the generated embeddings and reusing them for future modelling purposes.
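As a hedged sketch of that saving step, assuming the same sizes as the example above (a 40-word vocabulary and 10-dimensional vectors), the Embedding layer's weight matrix can be pulled out with get_weights and persisted with NumPy:

```python
import numpy as np
from keras.layers import Embedding

# Hypothetical sizes matching the example: vocab of 40, 10-d word vectors.
layer = Embedding(input_dim=40, output_dim=10)
_ = layer(np.array([[1, 2, 3]]))           # calling the layer creates its weights

embedding_matrix = layer.get_weights()[0]  # shape: (40, 10), one row per word id
np.save('embedding_weights.npy', embedding_matrix)

# Later: reload the matrix and look up a word's vector by its integer id.
weights = np.load('embedding_weights.npy')
word_vector = weights[32]                  # vector for the word encoded as 32
```

In a trained model, the same matrix is available via `model.layers[0].get_weights()[0]`, so the vectors learned during training can be reused in future models.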