
What is Word Embedding?


Introduction

Several word embeddings are already openly available; they are trained on very large amounts of data taken from sources such as Wikipedia and Google News.

The most commonly used pretrained embeddings are Word2Vec and GloVe, both of which are easily available on the Internet.
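For instance, pretrained GloVe vectors can be loaded into a plain Python dictionary and looked up by word. The sketch below is only illustrative: it assumes the file glove.6B.100d.txt has already been downloaded from the Stanford NLP site, and the word being looked up is an arbitrary choice.

import numpy as np

## Load pretrained GloVe vectors (assumed local file) into a word -> vector dictionary
embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        embeddings_index[word] = np.asarray(values[1:], dtype='float32')

print(len(embeddings_index))             ## number of words covered by the pretrained vectors
print(embeddings_index['clothes'][:5])   ## first few dimensions of one 100-dimensional word vector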

What is Word Embedding?

A word embedding is a representation of each word present in a document: in deep networks, each word is represented by a dense vector.

Without embeddings, word representations are usually sparse, because the vocabulary is huge and thousands of dimensions are needed to represent each word.

There are several algorithms for generating these vectors that represent each word.

The position of a word inside the vector space, and the values of its representation, are learned from the text data with the help of these algorithms.

A detailed discussion of these algorithms is out of the scope of this blog.

The algorithms used for generating Word2Vec embeddings are listed below (a small sketch follows the list):

  • Skip Gram Model
  • Continuous Bag of Words
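As a rough illustration only, both algorithms can be selected with a single flag when training Word2Vec with the gensim library (gensim is not used elsewhere in this blog, and the tiny corpus and parameters below are made up for demonstration):

from gensim.models import Word2Vec

## Toy corpus: each document is a list of tokens
sentences = [['nice', 'clothes'],
             ['very', 'good', 'shop', 'for', 'clothes'],
             ['clothes', 'are', 'good']]

## sg=1 trains the Skip Gram model, sg=0 (the default) trains Continuous Bag of Words (gensim 4.x API)
skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram_model.wv['clothes'].shape)   ## (50,) -- a dense vector for the word 'clothes'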

Suppose we have several documents whose combined vocabulary contains only 2,000 words, which is quite small; with a one-hot representation, the vocab size is 2,000 and each word is represented by a 2,000-dimensional vector.
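To make the contrast with a dense embedding concrete, here is a minimal sketch; the word index is arbitrary and random numbers stand in for the learned embedding weights:

import numpy as np

## Sparse one-hot representation: a 2000-word vocabulary means a 2000-dimensional vector per word
vocab_size = 2000
word_index = 57                        ## hypothetical index of one word in the vocabulary
one_hot_vector = np.zeros(vocab_size)
one_hot_vector[word_index] = 1         ## only a single non-zero entry

## Dense embedding: the same word mapped to a much smaller vector (here 100 dimensions)
embedding_matrix = np.random.rand(vocab_size, 100)   ## stands in for learned weights
dense_vector = embedding_matrix[word_index]

print(one_hot_vector.shape, dense_vector.shape)      ## (2000,) vs (100,)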

For more details on word embeddings, you can refer to the Stanford CS 224N course.

Link to the YouTube lectures: https://www.youtube.com/watch?v=8rXD5-xhemo&list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z

Keras Embedding Layer

Keras has an Embedding layer which is commonly used for neural networks on text data.

As in most machine learning solutions for text, words must be encoded as integers, so each word is mapped to a unique integer.

We first need to prepare a vocabulary; based on that vocabulary, we can use the Tokenizer API of Keras to prepare our input data for a classification or prediction task.
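A minimal sketch of that Tokenizer route is shown below; the sample documents and the padded length of 5 are only for illustration, and the worked example later in this blog uses the simpler one_hot helper instead.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

docs = ['Nice Clothes!', 'Very good shop for clothes', 'Clothes are good']

## Build the vocabulary from the documents
tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)
print(tokenizer.word_index)              ## word -> integer mapping

## Convert each document to a sequence of integers and pad to a fixed length
sequences = tokenizer.texts_to_sequences(docs)
padded = pad_sequences(sequences, maxlen=5, padding='post')
print(padded)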

The Embedding layer is initialized with random weights and then learns the word vectors from the text during training.

The generated embeddings can be saved and used for further modelling purposes.

The Embedding layer has 3 main arguments:

input_dim: The size of the vocabulary in the text data.

output_dim: The size of the dense vectors in which the words will be embedded.

input_length: The length of the input sequences. Since documents usually differ in length, we need to pad them to a common length.
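As a quick illustration of how the three arguments fit together (the numbers simply mirror the example that follows):

from keras.models import Sequential
from keras.layers.embeddings import Embedding

## input_dim = vocabulary size, output_dim = word vector size, input_length = padded sequence length
model = Sequential()
model.add(Embedding(input_dim=40, output_dim=10, input_length=5))
model.summary()   ## the layer's output shape is (None, 5, 10): one 10-dimensional vector per position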

Let's see how to create embeddings for our small dataset:

Building model from scratch using Embedding layer

## Importing all the modules from Keras
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding

## Creating the documents for making word vectors using the Embedding layer
document = ['Nice Clothes!', 'Very good shop for clothes', 'Amazing clothes',
            'Clothes are good', 'Superb!', 'Very bad', 'Poor quality',
            'not good', 'clothes fitting bad', 'Shop not good']

## Defining the labels for the corresponding documents for sentiment analysis
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

## Let's keep the vocab size 40
vocab_size = 40

## Encoding the documents with the vocabulary; we could do it using the Tokenizer API as well
encoded_documents = [one_hot(d, vocab_size) for d in document]
print(encoded_documents)

## Output of the encoded documents
[[32, 31], [27, 1, 32, 25, 31], [12, 31], [31, 1, 1], [38], [27, 32], [14, 39], [16, 1], [31, 25, 32], [32, 16, 1]]

## Padding the sequences to make all documents of equal length
maxlength = 5
padded_documents = pad_sequences(encoded_documents, maxlen=maxlength, padding='post')
print(padded_documents)

## Output of the padded documents (zero-padded at the end)
[[32 31  0  0  0]
 [27  1 32 25 31]
 [12 31  0  0  0]
 [31  1  1  0  0]
 [38  0  0  0  0]
 [27 32  0  0  0]
 [14 39  0  0  0]
 [16  1  0  0  0]
 [31 25 32  0  0]
 [32 16  1  0  0]]

## Defining, compiling and training the model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=maxlength))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.summary()
model.fit(padded_documents, labels, verbose=0, epochs=50)
loss, accuracy = model.evaluate(padded_documents, labels, verbose=0)
print('Accuracy %f' % (accuracy*100))

## Output of fitting the model
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_2 (Embedding)      (None, 5, 10)             400
_________________________________________________________________
flatten_2 (Flatten)          (None, 50)                0
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 51
=================================================================
Total params: 451
Trainable params: 451
Non-trainable params: 0
_________________________________________________________________
Accuracy 89.999998

So, the above code trains a word embedding on a small dataset.
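For example, the learned vectors can be pulled out of the trained model and written to disk for reuse; the filename below is only a placeholder.

import numpy as np

## The learned word vectors are the weights of the Embedding layer (the first layer of the model)
embedding_weights = model.layers[0].get_weights()[0]
print(embedding_weights.shape)           ## (40, 10): one 10-dimensional vector per vocabulary index

## Save the embedding matrix for reuse in future models (placeholder filename)
np.save('clothes_embeddings.npy', embedding_weights)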

Conclusion:

In this blog, we have seen how to generate word embeddings using the Embedding layer from Keras.

We have also touched on the basics of Word2Vec and how it is generated using the Skip-Gram and CBOW algorithms.

Furthermore, with this Embedding layer we can do much more, such as saving the generated embeddings and reusing them for future modelling purposes.

