What is Word Embedding and Why is the Keras Embedding Layer Important in Machine Learning?


Introduction

Several pre-trained word embeddings are already openly available, trained on very large amounts of data taken from sources such as Wikipedia and Google. The most commonly used ones are Word2Vec and GloVe, both of which are easy to find on the Internet.

What is Word Embedding?

Word embeddings are representations of the words present in a document. In deep networks, each word is represented by a dense vector.

Without embeddings, word representations are usually sparse, because the vocabulary is huge and each word needs a vector with as many dimensions as there are words, often thousands. There are algorithms that instead learn a compact vector to represent each word.

The position of a word in the embedding space and the values of its vector are learned from text data with the help of these algorithms. A detailed discussion of the algorithms is out of the scope of this blog.

The algorithms used to generate Word2Vec embeddings are:

  • Skip Gram Model
  • Continuous Bag of Words
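
As a rough illustration (this is not part of the original example), the sketch below shows how the two algorithms can be selected when training Word2Vec with the gensim library, assuming gensim version 4 or later; the sentences and parameter values are just placeholders.

## Sketch: training Word2Vec with gensim (assumes gensim >= 4.0)
## sg=1 selects Skip Gram, sg=0 selects Continuous Bag of Words
from gensim.models import Word2Vec

sentences = [['nice', 'clothes'],
             ['very', 'good', 'shop', 'for', 'clothes'],
             ['poor', 'quality']]

skipgram_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow_model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram_model.wv['clothes'])           ## a 50-dimensional dense vector
print(cbow_model.wv.most_similar('clothes'))  ## nearest words in the embedding space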

Suppose the documents we are working with contain only 2,000 unique words, which is quite small, i.e. the vocab size is 2,000; a sparse representation would then need a 2,000-dimensional vector for each word. For more details on word embeddings you can refer to the Stanford CS 224N course.

Link for the YT tutorials: https://www.youtube.com/watch?v=8rXD5-xhemo&list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z
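
To make the dimensionality point above concrete, here is a small illustrative sketch (the numbers and names are hypothetical): with a vocabulary of 2,000 words, a sparse one-hot vector has 2,000 entries with a single 1 in it, while a dense embedding might use only 50 values.

## Illustrative sketch only: sparse one-hot vs. dense embedding for a 2,000-word vocabulary
import numpy as np

vocab_size = 2000        ## hypothetical vocabulary size from the example above
embedding_dim = 50       ## a much smaller, typical embedding size
word_index = 1234        ## index of some word in the vocabulary

one_hot_vector = np.zeros(vocab_size)    ## sparse: 2,000 values, all but one are zero
one_hot_vector[word_index] = 1.0

embedding_matrix = np.random.rand(vocab_size, embedding_dim)  ## stands in for learned weights
dense_vector = embedding_matrix[word_index]                   ## dense: only 50 values

print(one_hot_vector.shape, dense_vector.shape)               ## (2000,) (50,)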

Want to know more about Word Embedding?

Confused about how to encode words as integers, or about how to use the Tokenizer API of Keras? The rest of this article gives a clear picture of encoding words as integers and of the three arguments of the Embedding layer, and it also shows how to create embeddings for a small dataset.

Keras Embedding Layer

Keras has an Embedding layer that is commonly used in neural networks for text data. The layer requires its input words to be encoded as integers, so each word is first mapped to a unique integer.

We first need to prepare a vocabulary; based on that vocabulary, the Tokenizer API of Keras can prepare our input data for classification or prediction tasks. The Embedding layer is initialized with random weights and then learns the word vectors from the text during training.
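
As a quick example of the Tokenizer API mentioned above, here is a minimal sketch with placeholder sentences, using the same keras.preprocessing module as the code further below:

## Sketch: building a vocabulary and encoding text as integers with the Tokenizer API
from keras.preprocessing.text import Tokenizer

texts = ['Nice Clothes!', 'Very good shop for clothes', 'Clothes are good']

tokenizer = Tokenizer()                       ## optionally pass num_words to cap the vocabulary
tokenizer.fit_on_texts(texts)                 ## learns the word -> integer mapping

print(tokenizer.word_index)                   ## e.g. {'clothes': 1, 'good': 2, ...}
print(tokenizer.texts_to_sequences(texts))    ## each sentence becomes a list of integers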

The generated embedding can be saved and used for further modelling purposes.

The Embedding layer has 3 arguments (a short sketch follows this list):

  • input_dim: the size of the vocabulary in the text data.
  • output_dim: the size of the dense vectors in which the words will be embedded.
  • input_length: the length of the input sequences. Since documents are usually not all of the same length, we pad them to a common length.
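
Here is a minimal sketch of how these three arguments are passed to the layer (the values are placeholders matching the example further below):

## Sketch: the three arguments of the Embedding layer
from keras.layers import Embedding

embedding_layer = Embedding(input_dim=40,     ## input_dim: size of the vocabulary
                            output_dim=10,    ## output_dim: size of each word vector
                            input_length=5)   ## input_length: length of each padded sequence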

Let’s see how to create embeddings for our small dataset:

Building a model from scratch using the Embedding layer

## Importing all the modules from Keras (plus NumPy for the labels)
import numpy as np
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding

## Creating the documents for making word vectors using the Embedding layer
document = ['Nice Clothes!',
            'Very good shop for clothes',
            'Amazing clothes',
            'Clothes are good',
            'Superb!',
            'Very bad',
            'Poor quality',
            'not good',
            'clothes fitting bad',
            'Shop not good']

## Defining the labels for the corresponding documents (sentiment analysis)
labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

## Let's keep the vocab size 40
vocab_size = 40

## Encoding the documents with the vocabulary; we can do it using the Tokenizer API as well
encoded_documents = [one_hot(d, vocab_size) for d in document]
print(encoded_documents)

## Output of the encoded documents
## [[32, 31], [27, 1, 32, 25, 31], [12, 31], [31, 1, 1], [38],
##  [27, 32], [14, 39], [16, 1], [31, 25, 32], [32, 16, 1]]

## Padding the sequences to make all documents of equal length
maxlength = 5
padded_documents = pad_sequences(encoded_documents, maxlen=maxlength, padding='post')
print(padded_documents)

## Output of the padded documents (filled with zeros)
## [[32 31  0  0  0]
##  [27  1 32 25 31]
##  [12 31  0  0  0]
##  [31  1  1  0  0]
##  [38  0  0  0  0]
##  [27 32  0  0  0]
##  [14 39  0  0  0]
##  [16  1  0  0  0]
##  [31 25 32  0  0]
##  [32 16  1  0  0]]

## Defining, compiling, training and evaluating the model
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=maxlength))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.summary()
model.fit(padded_documents, labels, verbose=0, epochs=50)
loss, accuracy = model.evaluate(padded_documents, labels, verbose=0)
print('Accuracy %f' % (accuracy*100))

## Output of fitting the model
## _________________________________________________________________
## Layer (type)                 Output Shape              Param #
## =================================================================
## embedding_2 (Embedding)      (None, 5, 10)             400
## _________________________________________________________________
## flatten_2 (Flatten)          (None, 50)                0
## _________________________________________________________________
## dense_2 (Dense)              (None, 1)                 51
## =================================================================
## Total params: 451
## Trainable params: 451
## Non-trainable params: 0
## _________________________________________________________________
## Accuracy 89.999998

So, the above code helps in training a word embedding for a small dataset.
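
Since the post mentions saving the generated embeddings for later use, here is one hedged way to do it with the model trained above: the Embedding layer's weights form a (vocab_size, output_dim) matrix that can be pulled out and saved with NumPy (the file name below is just an example).

## Sketch: extracting and saving the learned embedding matrix from the model above
import numpy as np

embedding_weights = model.layers[0].get_weights()[0]    ## shape: (vocab_size, 10) = (40, 10)
print(embedding_weights.shape)

np.save('clothes_review_embeddings.npy', embedding_weights)   ## hypothetical file name

## Later, the matrix can be reloaded and used, e.g. to initialise another Embedding layer
loaded_weights = np.load('clothes_review_embeddings.npy')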

Conclusion:

In this blog we have seen how to generate word embeddings using the Embedding layer from Keras. We also discussed the basics of Word2Vec and how it is generated using the Skip Gram and CBOW algorithms.

Furthermore, with this Embedding layer we can do much more, such as saving the generated embeddings and reusing them for future modelling purposes.

