Several word embeddings are already openly available, trained on very large amounts of data from sources such as Wikipedia and Google News.
What is Word Embedding?
Word embeddings are representations of the words present in a document. In deep networks, each word is represented by a dense vector.
By contrast, naive representations such as one-hot encodings are sparse, because the vocabulary is huge: thousands of dimensions may be needed to represent each word. There are algorithms for learning the dense vector that represents each word.
Both the position of a word in the vector space and the values of its representation are learned from text data by these algorithms. A detailed discussion of the algorithms is out of scope for this blog.
The algorithms used to generate Word2Vec embeddings are:
- Skip Gram Model
- Continuous Bag of Words
Suppose we have a collection of documents that together contain 2000 distinct words, which is quite small: the vocabulary size is 2000, so a one-hot representation would need a 2000-dimensional vector for each word. For more details on word embeddings you can refer to the Stanford CS 224N course.
Link to the YouTube tutorials: https://www.youtube.com/watch?v=8rXD5-xhemo&list=PLoROMvodv4rOhcuXMZkNm7j3fVwBBY42z
Keras Embedding Layer
Keras has an Embedding layer which is commonly used for neural networks on text data. The layer requires the input to be integer-encoded, so each word is mapped to a unique integer.
We first need to prepare a vocabulary; based on that vocabulary, the Tokenizer API of Keras lets us prepare our input data for classification or prediction tasks. The Embedding layer is initialized with random weights and then learns the word vectors from the training text.
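A minimal sketch of this preprocessing step with the Keras Tokenizer API (the documents below are illustrative):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

docs = ["well done", "good work", "great effort", "poor work"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(docs)                    # builds the vocabulary
sequences = tokenizer.texts_to_sequences(docs)  # words -> unique integers
padded = pad_sequences(sequences, maxlen=4, padding="post")

print(tokenizer.word_index)  # word -> integer mapping, most frequent first
print(padded.shape)          # (4, 4): 4 documents, each padded to length 4
```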
The generated embeddings can be saved and reused for further modelling purposes.
The Embedding layer takes three main arguments:
- input_dim: the size of the vocabulary in the text data.
- output_dim: the size of the vector space in which the words will be embedded.
- input_length: the length of the input sequences. Since documents usually differ in length, we pad them to a common length.
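To make these arguments concrete, here is a small sketch (the sizes are illustrative, not from this post): a vocabulary of 2000 words, 64-dimensional vectors, and padded sequences of length 10.

```python
import numpy as np
from tensorflow.keras.layers import Embedding

# input_dim: 2000-word vocabulary; output_dim: 64-dimensional word vectors
layer = Embedding(input_dim=2000, output_dim=64)

# A batch of 32 padded integer sequences, each of length 10 (the input length)
batch = np.random.randint(0, 2000, size=(32, 10))
vectors = layer(batch)

print(vectors.shape)  # (32, 10, 64): one 64-dim vector per word position
```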
Let’s see how to create embeddings for our small dataset:
Building model from scratch using Embedding layer
Importing all the modules from Keras:
So, the above code helps in training a word embedding for a small dataset.
In this blog we have seen how to generate word embeddings using the Embedding layer from Keras. I have also discussed the basics of Word2Vec and how it is generated using the Skip-Gram and CBOW algorithms.
Furthermore, with this Embedding layer we can do a lot more, such as saving the generated embeddings and reusing them for future modelling purposes.