
In Machine Learning, we want to convert words into vectors so that our models can apply mathematical operations to text.

There are various techniques for converting words into vectors. We will look at some of them in this article.

Below are a few techniques that can convert words to vectors:

  1. Bag of Words (BOW)

  2. TF-IDF Vectorizer

  3. Word2Vec

Input Text

Let's say we have 4 sentences that we want to convert to vectors:

> This headphone is amazing

> Noise Cancelling in this headphone is really awesome

> This headphone is not good

> This headphone is TrulyWireless and works as described in the description

Bag of Words (BoW)

This is the simplest way to convert text to a vector. The basic idea of this technique is to store all unique words in a list, and for each sentence create a vector whose length equals the number of unique words. This produces a sparse vector, where most of the values are zeros.

From our above example, our Bag of unique words contains:

{This, headphone, is, amazing, Noise, Cancelling, in, really, awesome, not, good, TrulyWireless, and, works, as, described, the, description}

Here we have 18 unique words in the bag, so we will represent each sentence as a vector of length 18.

Representation of sentence 1 -> This headphone is amazing -> [1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0]

Similarly, the other sentences can be represented as:

Noise Cancelling in this headphone is really awesome -> [1,1,1,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0]

This headphone is not good -> [1,1,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0]

The same is repeated for the remaining sentence. If you want to try this yourself, a library sketch is shown below.
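As a minimal sketch, scikit-learn's CountVectorizer builds the same kind of sparse count vectors. Note that it lowercases the text and orders the vocabulary alphabetically, so the column order will not match the hand-built bag above exactly.

```python
# A minimal Bag of Words sketch using scikit-learn's CountVectorizer.
# CountVectorizer lowercases the text and sorts the vocabulary alphabetically,
# so the columns differ in order from the hand-built bag in the article.
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "This headphone is amazing",
    "Noise Cancelling in this headphone is really awesome",
    "This headphone is not good",
    "This headphone is TrulyWireless and works as described in the description",
]

vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(sentences)   # sparse matrix of shape (4, 18)

print(vectorizer.get_feature_names_out())          # the 18 unique words
print(bow_matrix.toarray())                        # one 18-dimensional count vector per sentence
```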

TF-IDF Vectorizer

Bag of Words simply counts each word without considering the importance of the word. In TF-IDF, we take the frequency and rarity of a word into consideration.

TF (Term Frequency):

How often a word occurs in a document (here, a sentence). If a word occurs multiple times in the document, its TF will be high.

IDF (Inverse Document Frequency):

How rare a word is across the corpus. If a word is rare, its IDF will be high.

By combining TF and IDF, we give importance to words that are rare in the corpus and occur frequently in the current sentence.

From our above example, our unique words are:

{This, headphone, is, amazing, Noise, Cancelling, in, really, awesome, not, good, TrulyWireless, and, works, as, described, the, description}

For computing the TF-IDF of a word in Sentence 1, consider the word 'This':

```
TF('This')     = 1/4
IDF('This')    = log(4/4) = log(1) = 0
TF-IDF('This') = TF('This') * IDF('This') = (1/4) * 0 = 0
```

Here the word 'This' occurs in every sentence of the corpus, so it is not an important word, and hence its TF-IDF is 0.

So, the TF-IDF representation of Sentence 1 -> This headphone is amazing -> is:

[TF-IDF('This'),TF-IDF('headphone'),TF-IDF('is'),TF-IDF('amazing'),0,0,0,0,0,0,0,0,0,0,0,0,0,0]
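As a minimal sketch, scikit-learn's TfidfVectorizer computes these vectors for us. Note that scikit-learn uses a smoothed IDF formula (ln((1 + n) / (1 + df)) + 1) and L2-normalizes each row, so the numbers differ slightly from the hand calculation above, but frequent words still get lower weights.

```python
# A minimal TF-IDF sketch using scikit-learn's TfidfVectorizer.
# scikit-learn's smoothed IDF and L2 row normalization make the values
# differ slightly from the textbook formula used in the article.
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "This headphone is amazing",
    "Noise Cancelling in this headphone is really awesome",
    "This headphone is not good",
    "This headphone is TrulyWireless and works as described in the description",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
print(tfidf_matrix.toarray().round(2))   # one TF-IDF vector per sentence
```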

Word2Vec

Word2Vec considers the semantic meaning of words and generates a dense vector for each word. It learns from the context in which words appear and captures the relations between them, so the distance between vectors of similar words is small. For example, the difference between the vectors for King and Queen is similar to the difference between the vectors for Man and Woman. Word2Vec learns these vectors from a large document corpus.
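As a minimal sketch (assuming gensim 4.x), the API looks like the snippet below. Four sentences are far too few to learn meaningful embeddings; in practice you train on a large corpus or load pretrained vectors.

```python
# A minimal Word2Vec sketch using gensim (4.x API).
# This toy corpus is only to show the API; real embeddings need a large corpus.
from gensim.models import Word2Vec

sentences = [
    "This headphone is amazing".lower().split(),
    "Noise Cancelling in this headphone is really awesome".lower().split(),
    "This headphone is not good".lower().split(),
    "This headphone is TrulyWireless and works as described in the description".lower().split(),
]

# min_count=1 because the corpus is tiny; sg=1 selects the skip-gram model
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=100)

print(model.wv["headphone"])                        # 50-dimensional dense vector
print(model.wv.similarity("amazing", "awesome"))    # cosine similarity between two words
```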

Conclusion:

Using the above techniques, we can convert words into vectors, on which we can apply transformations that are useful for machine learning.

I will try to write a separate blog post covering each of these techniques in more detail.
