A Beginner's Guide to BOW for AI Projects


Hi, I hope you are doing well. Today we will discuss a very commonly used and easy-to-understand word embedding technique called BOW. In this blog we will first start off with a basic introduction to word embeddings, the techniques we can use for creating embeddings, and a detailed explanation of BOW. So without any further delay, let's get started.

What are word embeddings

Word embeddings are simply numerical representations of textual words, where each word is represented as a vector.

Types of word embeddings

There are basically 2 different approaches we can use for creating word embeddings, these are 👇🏻

  1. Frequency based word embeddings

  2. Prediction based word embeddings

💡
In this blog our focus will be on one of the most commonly used frequency based word embedding techniques, called bag of words, but before that let us first understand what frequency based word embeddings actually are.

Frequency based word embeddings

As the name suggests, these are techniques that use the frequency of words in the vocabulary to create embeddings. Such techniques include

  1. Bag of words (BOW)

  2. Term frequency inverse document frequency (TF-IDF)

Prerequisites for BOW implementation

Before implementing any of the mentioned techniques, there are some processes which must be performed ⬇️

  1. Tokenization : If we are dealing with textual data stored in the form of long paragraphs, then we must perform tokenization on those long paragraphs to break them down into smaller pieces of text, known as tokens.

  2. Lowering the sentences : After applying tokenization, lowering the sentences is also an important process to perform, because even we humans know that both 'hut' and 'Hut' mean the same thing, but when we convert the words into their numerical representation, both of these words will be considered different; to prevent this we lower the sentences.

  3. Stop word removal : The process of stop word removal is also very important, in order to remove less important words which don't contribute much to the meaning of the sentence. Some of these words are: the, he, she, is, are, etc.

  4. Stemming or Lemmatization : Stemming or Lemmatization is the last process, which you can perform to reduce inflected words to their base form, so that the machine learning algorithm can easily process and learn similar words (a short sketch of all four steps follows the note below).

💡
All the above mentioned processes are general preprocessing steps which are recommended; you should always adapt the processing to the data you are dealing with.
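
Here is a minimal sketch of these four steps using NLTK (an assumption on my part; any NLP library with tokenization, stop words and lemmatization would work just as well):

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK data
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

text = "Both boy and girl are good listeners and good speakers."

# 1. Tokenization: break the text into individual tokens
tokens = word_tokenize(text)

# 2. Lowering: make 'Hut' and 'hut' identical
tokens = [t.lower() for t in tokens]

# 3. Stop word removal: drop low-information words like 'the', 'is', 'are'
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t.isalpha() and t not in stop_words]

# 4. Lemmatization: reduce inflected words to their base form
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(t) for t in tokens]

print(tokens)  # ['boy', 'girl', 'good', 'listener', 'good', 'speaker']

Note that NLTK's stop word list also removes words like 'both', so a real pipeline may clean more aggressively than the hand-cleaned example further below.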

Bag of Words

BOW, which is short for Bag of Words, is a Natural Language Processing technique that is used for creating word embeddings by utilizing the vocabulary of the corpus. To better understand the working of Bag of Words, let us take 3 documents as our reference ⬇️

  • Doc 1 : He is a good boy

  • Doc 2 : She is a good girl

  • Doc 3 : Both boy and girl are good listeners and good speakers.

Here in this case, after lowering the sentences, removing the stop words and applying stemming/lemmatization, we will have the above 3 documents in the following form ⬇️

  • Doc 1 : he good boy

  • Doc 2 : she good girl

  • Doc 3 : both boy girl good listeners good speakers

Now after cleaning the documents, the next step is building a vocabulary out of this corpus of 3 documents. In case you are not aware of what a vocabulary is, don't worry: it is nothing more than the collection of all the unique words. Once the vocabulary is built, the algorithm will simply assign a numerical value to each word present in the document, and this numerical value will represent the frequency count of the word.

💡
Vocabulary : [ he, she, good, boy, girl, both, listeners, speakers ]. After BOW [ he good boy -> 1 0 1 1 0 0 0 0 ]
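
To make this concrete, here is a small from-scratch sketch (my own illustration; the vocabulary order is fixed to match the example above, while libraries like scikit-learn sort it alphabetically):

docs = [
    "he good boy",
    "she good girl",
    "both boy girl good listeners good speakers",
]

# The vocabulary is every unique word in the corpus; the order is
# arbitrary, so here we fix it to match the example above
vocabulary = ["he", "she", "good", "boy", "girl", "both", "listeners", "speakers"]

# Normal BOW: each document becomes a vector of word counts over the vocabulary
for doc in docs:
    words = doc.split()
    vector = [words.count(word) for word in vocabulary]
    print(doc, "->", vector)

# he good boy -> [1, 0, 1, 1, 0, 0, 0, 0]
# she good girl -> [0, 1, 1, 0, 1, 0, 0, 0]
# both boy girl good listeners good speakers -> [0, 0, 2, 1, 1, 1, 1, 1]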

Types of Bag Of Words

Now that you are aware of how the BOW technique works, let me give you some more information about BOW. There are 2 different ways in which we can implement the Bag of Words technique.

  1. Normal Bag Of Words : In this type, the numerical value assigned to a word in the document represents the number of times it occurs in that document.

  2. Binary Bag Of Words : In this type, only 2 numerical values are assigned to the words in a document, and these 2 values are 0 and 1, where 1 represents the presence of the word and 0 represents the absence of the word in the document.

💡
In the previous example we saw the normal bag of words; down below is a sketch of the binary bag of words.
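
Continuing the from-scratch sketch from above (again my own illustration):

docs = [
    "he good boy",
    "she good girl",
    "both boy girl good listeners good speakers",
]
vocabulary = ["he", "she", "good", "boy", "girl", "both", "listeners", "speakers"]

# Binary BOW: mark 1 if the word is present in the document, 0 otherwise
for doc in docs:
    words = set(doc.split())
    vector = [1 if word in words else 0 for word in vocabulary]
    print(doc, "->", vector)

# Note how 'good' in Doc 3 is now 1 instead of 2:
# both boy girl good listeners good speakers -> [0, 0, 1, 1, 1, 1, 1, 1]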

Advantages of Bag Of Words

  1. It is very simple to understand and implement

  2. It makes sure that the dimensionality of the vectors remains the same (every document maps to a vector of vocabulary length)

Drawbacks of Bag Of Words

  1. It doesn't consider out-of-vocabulary words, which could have provided useful information

  2. It creates high dimensional sparse vectors when the vocabulary is very large.

  3. It doesn't capture semantic information and also ignores the ordering of words in the document (see the sketch below).
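
To make drawback 3 concrete, here is a quick sketch using scikit-learn's CountVectorizer (also used in the implementation below): two documents with opposite word order get identical BOW vectors.

from sklearn.feature_extraction.text import CountVectorizer

# Word order is lost: these two documents get identical BOW vectors
docs = ["the boy is good", "good is the boy"]
print(CountVectorizer().fit_transform(docs).toarray())
# [[1 1 1 1]
#  [1 1 1 1]]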

Practical implementation of Bag Of Words Using Python

from sklearn.feature_extraction.text import CountVectorizer

# Create a list of documents
documents = ["This is a bag of words example.", "This is another example."]

# Create a CountVectorizer object
vectorizer = CountVectorizer()

# Fit the CountVectorizer to the documents to build the vocabulary
vectorizer.fit(documents)

# Print the learned vocabulary (sorted alphabetically)
print(vectorizer.get_feature_names_out())

# Transform the documents into count vectors
bow_representations = vectorizer.transform(documents)

# Print the bow representations
print(bow_representations.toarray())
💡
The vocabulary is ['another' 'bag' 'example' 'is' 'of' 'this' 'words'] (CountVectorizer's default tokenizer drops single-character tokens such as 'a'), so the output will be [[0 1 1 1 1 1 1], [1 0 1 1 0 1 0]].
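
As a follow-up, and a concrete look at drawback 1, we can reuse the fitted vectorizer from the snippet above on an unseen document; the out-of-vocabulary words are silently ignored:

# Reusing the fitted vectorizer from above: the words 'an', 'unseen'
# and 'text' are not in the vocabulary, so they are simply dropped
new_doc = ["This is an unseen text example."]
print(vectorizer.transform(new_doc).toarray())
# [[0 0 1 1 0 1 0]]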

Short note

I hope you now have a good understanding of what word embeddings are, the various techniques we can use, how BOW works, and the types of BOW. So if you liked this blog or have any suggestions, kindly like it or leave a comment below; it would mean a lot to me.

💡
Also, I would love to connect with you, so here are my Twitter and LinkedIn.
