
Ranking resumes for a given job description using Natural Language Processing — A Toy project


Vishwanath Beena

While learning Natural Language Processing (NLP) concepts, I thought it would be good to build a mini project that could be used in a real scenario.

Around this time, my manager discussed an idea with me: a service the talent acquisition team could use to filter resumes against a job description before passing them to the technical team for further processing.

I thought it was a good idea and started working on it. The basic concept is this: when we upload a job description and a bunch of resumes to the tool, it should rank the resumes in descending order of how closely they match the job description.

Pre-processing:

I used the following pre-processing techniques.

  1. Removed stop words, using the stop-word list from nltk.corpus
  2. Used WordNetLemmatizer to lemmatize the words

Let's briefly discuss these two techniques.

Stop words are words that occur very frequently in a language. While processing a document with NLP, we often remove stop words because keeping them inflates the size of the document; removing them lets us work with a smaller set of informative words.

A sample of stop words: {'a', 'is', 'am', 'the', 'in'}

Lemmatization is the process of reducing the words in a document to their root form (the lemma). For example, it maps all the different verb forms of a word to the same root, so we are left with a canonical set of words when comparing two documents.

For example, eating, ate, and eaten all correspond to the same root word, eat. If we apply lemmatization to our document, all such words are converted to their root form.
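The exact snippet isn't shown here, but a minimal sketch of this pre-processing step, assuming NLTK's English stop-word list and the WordNet lemmatizer, could look like this:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK data.
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lowercase, tokenize, drop stop words, and lemmatize what remains.
    tokens = word_tokenize(text.lower())
    return ' '.join(lemmatizer.lemmatize(token) for token in tokens
                    if token.isalnum() and token not in stop_words)

print(preprocess("He was eating the apples"))
# -> "eating apple"  (WordNet lemmatizes nouns by default; pass pos='v'
#     to reduce verb forms such as 'eating' -> 'eat')
```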

After pre-processing, we need to convert the documents into vector form. There are several vectorization methods available in the Python scikit-learn library.

We will briefly discuss two methods.

  1. Bag of Words (BoW)
  2. TF-IDF vectorizer

BoW:

As the name suggests, it is a bag of all the words in a document. If we have more than one document, the bag contains all the words from all the documents.
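As an illustrative sketch, assume two sample sentences (the originals are not shown here) and fit scikit-learn's CountVectorizer on them:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assumed sample documents that reproduce the output discussed below.
text1 = "India is a beautiful country"
text2 = "India is the largest democracy in the world"

vectorizer = CountVectorizer()
vectorizer.fit([text1, text2])
print(vectorizer.get_feature_names_out())
# ['beautiful' 'country' 'democracy' 'in' 'india' 'is' 'largest' 'the' 'world']
# Note: the default tokenizer drops single-character tokens such as 'a'.
```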

The output of the above code is ['beautiful', 'country', 'democracy', 'in', 'india', 'is', 'largest', 'the', 'world'].

All the unique words are present in the list. This is called the Bag of Words. Each word is referenced by its index: the index of the word 'beautiful' is 0, for 'country' it is 1, and so on. This BoW contains 9 words.

Now, using the above BoW, let's vectorize a sample text and see what happens.
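Assuming a sample text3 along these lines (any sentence with the same word counts gives the same result), we transform it with the vectorizer fitted above:

```python
# Continuing from the previous snippet (vectorizer fitted on text1 and text2).
text3 = "My country India is a beautiful country"   # assumed sample text
print(vectorizer.transform([text3]).toarray())
# [[1 2 0 0 1 1 0 0 0]]
```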

The above code emits [[1 2 0 0 1 1 0 0 0]] as output. Let's understand what it means. First, the length of this vector is 9, which comes from the length of the BoW. Each number is the count of how many times the word at that index in the BoW occurs in this particular text.

The 1 at index 0 indicates that the word at index 0 in the BoW (beautiful) occurs once in this text; the 2 at index 1 indicates that the word at index 1 (country) occurs twice, and so on.

This raises another question: what if our text contains a word that is not in the original BoW? Our text3 contains such a word: 'My' is not present in the original BoW, so it is simply ignored while vectorizing text3. This is why the vector size is always the size of the BoW.

There is a modification of this original BoW called the Binary Bag of Words. Instead of counting the number of occurrences of each word from the BoW, we only record whether a word exists: 1 if it does, 0 if it does not.
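In scikit-learn this corresponds to the binary=True flag; a sketch using the same assumed sample texts:

```python
# Continuing with the same sample texts: binary=True records only
# whether each word occurs, not how many times.
binary_vectorizer = CountVectorizer(binary=True)
binary_vectorizer.fit([text1, text2])
print(binary_vectorizer.transform([text3]).toarray())
# [[1 1 0 0 1 1 0 0 0]]
```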

If we construct a binary BoW vector for text3, we get [[1 1 0 0 1 1 0 0 0]] as output. Even though 'country' appears twice in text3, it is represented by a 1, because in the binary BoW we are only interested in whether a word exists.

TF-IDF:

TF-IDF is another approach we can use to convert text into vector form.

TF means Term Frequency, which can be expressed using the formula below.

TF(word_i) = (number of times word_i appears in the document) / (total number of words in the document)

IDF means Inverse Document Frequency, which measures how important a word is: a word that occurs in all the documents is given less importance. It can be expressed using the formula below.

IDF(word_i) = log(total number of documents / number of documents containing word_i)

Here, log is the natural logarithm (base e). The formula shows that if a word is present in all the documents, the denominator equals the numerator, so log 1 = 0 and the word is given less importance.

Once we have TF and IDF, we multiply them to get the TF-IDF score.
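A sketch using scikit-learn's TfidfVectorizer on the same assumed sample texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Continuing with the same sample texts as in the BoW example.
tfidf = TfidfVectorizer()
tfidf.fit([text1, text2])
print(tfidf.transform([text3]).toarray())
# [[0.4078241  0.81564821 0.  0.  0.29017021 0.29017021 0.  0.  0. ]]
# 'india' and 'is' occur in both fitted documents, so they get a lower
# IDF than 'beautiful' and 'country'. Note that scikit-learn uses a
# smoothed variant, idf = ln((1 + n) / (1 + df)) + 1, and L2-normalizes
# each row, so the values differ slightly from the textbook formula.
```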

When we apply the TF-IDF vectorizer to text3 using the code above, we get [[0.4078241 0.81564821 0. 0. 0.29017021 0.29017021 0. 0. 0.]] as output.

For my case study, I used the TF-IDF vectorizer to convert the Word documents into vectors.

Cosine similarity:

Once we have the vectors, we need a measure to compare their similarity. Cosine similarity is one such measure.

It measures the cosine of the angle between the two vectors. As we know, cos(90°) = 0 and cos(0°) = 1.

This means that if two vectors are similar, the angle between them is very small; if they are dissimilar, the angle is larger. If two vectors are 90 degrees apart, they are orthogonal and their cosine value is 0.

Cosine similarity takes the cosine of the angle between the vectors and converts it into a similarity score between 0 and 100 percent.
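A sketch of the score computation with scikit-learn's cosine_similarity, reusing the TF-IDF vectorizer fitted above on the assumed sample texts:

```python
from sklearn.metrics.pairwise import cosine_similarity

# Continuing from the TF-IDF snippet: compare text1 with text3.
similarity = cosine_similarity(tfidf.transform([text1]),
                               tfidf.transform([text3]))[0][0]
print(f"Match: {similarity * 100:.2f}%")
# -> Match: 94.28% for these two sample sentences
```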

Project Details:

I used the methods discussed above in my mini project: the TF-IDF vectorizer to convert the text documents into vector form, and cosine similarity as the comparison function.

To read Word documents, I used the textract Python library.

To make the whole project interactive for the user, I used Flask.

I build the BoW from the job description document and compare every individual resume against it, because I want resumes to contain words from the job description. I went this way assuming the job description will list the required technologies and requirements, while any given resume may or may not cover all of them.

With the other approach, building the BoW from all the resumes, the BoW would be much larger and would contain words that are not even relevant to the job description. Every resume would then have a very low cosine similarity when compared with the job description.
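Putting the pieces together, here is a minimal sketch of the ranking logic (the file names and helper function are illustrative, not the actual repository code):

```python
import textract
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def read_doc(path):
    # textract extracts text from .docx (and many other formats) as bytes.
    return textract.process(path).decode('utf-8')

def rank_resumes(jd_path, resume_paths):
    # Fit the vectorizer on the job description alone, so the vocabulary
    # contains only the words the job description actually uses.
    # (In the real project, the NLTK pre-processing runs on the text first.)
    vectorizer = TfidfVectorizer(stop_words='english')
    jd_vec = vectorizer.fit_transform([read_doc(jd_path)])

    scores = []
    for path in resume_paths:
        resume_vec = vectorizer.transform([read_doc(path)])
        match = cosine_similarity(jd_vec, resume_vec)[0][0] * 100
        scores.append((path, round(match, 2)))

    # Highest percentage match first.
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Illustrative file names:
print(rank_resumes('job_description.docx', ['resume1.docx', 'resume2.docx']))
```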

Once I combined all the pieces together, I got this.

Uploading job_description and resumes

After clicking on submit, we get the following screen with the resumes ranked.

Resumes ranked according to the percentage match

Conclusion:

I know this approach has some limitations. Just using TF-IDF or BoW to match a job description against resumes is often not practical: a resume might describe a technology in different words than the job description uses, and those will not be treated as the same here, for example J2EE vs. Java Enterprise Edition, or 3+ years of experience vs. Three plus years of experience, etc.

But this is a toy example, written just to demonstrate the power of machine learning and NLP and their practical applications.

I think this can certainly be improved by considering state-of-the-art word embedding techniques. However, when I tried doc2vec for this particular scenario, I found it underperformed; the TF-IDF model gave better results.
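For reference, a minimal sketch of how doc2vec could be tried here with gensim (the corpus and parameters are illustrative, not the settings from my experiment):

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Illustrative toy corpus.
jd_text = "python developer with flask and nlp experience"
resumes = ["worked on flask web apps and nlp pipelines in python",
           "java developer with spring boot experience"]

# Tag each resume and train a small Doc2Vec model.
corpus = [TaggedDocument(words=text.split(), tags=[i])
          for i, text in enumerate(resumes)]
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Infer a vector for the job description and score each resume against it.
jd_vec = model.infer_vector(jd_text.split())
for i, text in enumerate(resumes):
    print(i, round(cosine(jd_vec, model.infer_vector(text.split())), 3))
```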

I am always interested in receiving feedback and ways to improve this article and the project.

Code can be accessed from my GitHub repo.

Source: https://chatbotslife.com/ranking-resumes-for-a-given-job-description-using-natural-language-processing-a-toy-project-1f49d3156b44?source=rss—-a49517e4c30b—4
