
Categorizing Job Orders with a Naive Bayes Classifier

Meridian Staffing has about 12,000 job orders from 2010 to present and each is assigned zero or more categories such as “Application Developer”, “Project Manager”, “Network Engineer”, etc.  We regularly extract this job information from our Applicant Tracking System (Bullhorn) and load it into our Posse Analytics server for data analysis and reporting.

Unfortunately, nearly 50% of these jobs are either not categorized or categorized as “Other Area(s)”.  As MSS moves towards being a data-driven organization, categorization will inform activities like capacity and candidate pipeline planning.  As such, having good, clean data becomes more and more important, and we need to mitigate this issue.

Naturally, the first line of attack is to address the source of the data.  But while we may fix import processes and train people to correctly assign categories when entering new jobs, this is unlikely to fix the problem entirely.  Additionally, we may want a job to be in multiple categories (e.g. a NOC may need a network engineer with systems programming skills, making both “Network Engineer” and a development-oriented category applicable).  This is not a common practice today, even though the ability to assign multiple categories is there.

To fully address this issue, I am adding an automated categorization step to the job extract process.  The goal is to use the title of the job order to populate an “Automated Categories” property on each job with the categories predicted by this process.

The Approach

I’m borrowing a statistical machine-learning technique used to solve the document classification and sentiment analysis problems.  In document classification, we try to classify a document based on the words it contains.  For example, given a set of documents that may be in French, English, or German, classify each document by its language.  In sentiment analysis, snippets of text (such as tweets from Twitter) are classified as having a “positive”, “negative”, or “neutral” sentiment.

Our problem is to categorize a job order based on its title (or potentially other information in the job order), so it is essentially the same problem.

The Naive Bayes classifier is a simple but effective approach to these problems.  It is a statistical approach that uses existing data to learn a model of the text and the categories, then uses Bayes theorem to predict categories when new data is presented.

Bayes theorem, in terms of this application, can be summed up like this: for any job, the probability of the category given the title (the posterior) is the probability of the title given the category (the likelihood) multiplied by the probability of the category (the prior), divided by the probability of the title (the evidence).


If you want more information about the Naive Bayes classifier, see the Wikipedia page.  There are links to the formulas from there.  I don’t intend to go into the details of Bayesian inference here, but here’s the gist of it.

We want to calculate the probability of a job being in a category given its title.

P(category | title) = P(title | category) × P(category) / P(title)

So, I have to calculate the right-hand side.  This is easy because we already have the data; we just have to count:

P(category = “Application Developer”), using the number of jobs in the “Application Developer” category.

P(title = “Software Engineer”), using the number of jobs with the title “Software Engineer”.

P(title = “Software Engineer” | category = “Application Developer”), using the number of jobs in the “Application Developer” category with the title “Software Engineer”.
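These counts translate directly into code.  Here is a minimal sketch in plain JavaScript, with a toy jobs array standing in for the real data (the `title` and `categories` field names are assumptions about the job schema):

```javascript
// Toy stand-in for the real job data.
const jobs = [
  { title: 'Software Engineer', categories: ['Application Developer'] },
  { title: 'Software Engineer', categories: ['Application Developer'] },
  { title: 'Project Manager',   categories: ['Project Manager'] },
  { title: 'Network Engineer',  categories: ['Network'] },
];

// P(category = "Application Developer"): jobs in the category / all jobs
const inCategory = jobs.filter(j => j.categories.includes('Application Developer'));
const pCategory = inCategory.length / jobs.length; // 2 / 4

// P(title = "Software Engineer"): jobs with the title / all jobs
const pTitle =
  jobs.filter(j => j.title === 'Software Engineer').length / jobs.length; // 2 / 4

// P(title | category): jobs in the category with the title / jobs in the category
const pTitleGivenCategory =
  inCategory.filter(j => j.title === 'Software Engineer').length /
  inCategory.length; // 2 / 2

// Bayes theorem: P(category | title)
const posterior = (pTitleGivenCategory * pCategory) / pTitle; // 1.0
```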

The Methodology

The first thing I needed to do was build a prototype to test this idea.  Because the Posse Analytics server is built with JavaScript on NodeJS, it made sense to implement this in JavaScript as well.  It turns out the language is pretty decent for dealing with data.  I used the lodash utility library to help work with the large collection of data, and the async library for flow control.

I’ve tokenized the titles and run the counts on each individual word.  My hypothesis is that this bag-of-words approach may be more robust than matching whole titles.  Bayes theorem then lets us turn these statistics around and get what we need.  At the end of the day, we get a probability distribution across all categories.
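The tokenization itself can be as simple as lower-casing and splitting on non-word characters.  A sketch (the real extract may normalize further):

```javascript
// Split a job title into a bag of lower-cased word tokens.
function tokenize(title) {
  return title
    .toLowerCase()
    .split(/[^a-z0-9+#]+/) // keep characters common in tech titles, e.g. "c#"
    .filter(word => word.length > 0);
}

tokenize('Software Systems Engineer'); // ['software', 'systems', 'engineer']
```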

Let’s say that we have a job order titled “Software Systems Engineer” that isn’t categorized at all.  After this process, we would see the “Automated Categories” property populated with a probability distribution across categories like this (zeros omitted):

‘project manager’: 0.000171727
‘quality assurance’: 0.02245666
‘other areas’: 0.536501943
‘business analyst’: 0.015962375
‘systems administrator’: 0.330674367
‘application developer’: 0.938109813
‘clerical’: 0.010166159
‘database developer’: 0.897096803
‘network’: 0.826128332
‘architect’: 0.005096363
‘mainframe programmer’: 0.928861332
‘data analyst’: 0.017928504
‘engineering’: 0.955837254

Alternatively, we could select which categories to keep using a decision strategy such as a probability threshold, top N, or some other rule, and drop the rest.
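Either strategy is only a few lines over the distribution.  A sketch, assuming the distribution is a plain object mapping category to probability (the numbers are taken from the example above):

```javascript
// Keep only categories whose probability clears a threshold.
function byThreshold(distribution, threshold) {
  return Object.entries(distribution)
    .filter(([, p]) => p >= threshold)
    .map(([category]) => category);
}

// Keep the N most probable categories.
function topN(distribution, n) {
  return Object.entries(distribution)
    .sort(([, a], [, b]) => b - a)
    .slice(0, n)
    .map(([category]) => category);
}

const scores = {
  'engineering': 0.955837254,
  'application developer': 0.938109813,
  'network': 0.826128332,
  'clerical': 0.010166159,
};

byThreshold(scores, 0.9); // ['engineering', 'application developer']
topN(scores, 1);          // ['engineering']
```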

If you’re only interested in where this will fit in the overall data pipeline, skip down to the Implementation section.  The next section focuses on the specifics of collecting the data and calculating the probability distribution.

Collect the data and train the model

To start out, I retrieve all of the job orders from our MongoDB and process them into a statistical model.  I need a count of how many times each word in a job’s title shows up for a given category.  In the result of this first counting step, you can see that the words “Technical” and “Writer” show up in a large portion of the jobs categorized as “Technical Writer”: the probability P( word = “writer” | category = “Technical Writer” ) = 20 / 24 ≈ 0.83.
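The counts for one category look roughly like this.  The 20 and 24 come from the figures above; the other numbers are made up for illustration:

```javascript
// Counts for one category: how many jobs it has, and how often each
// title word appears among them (illustrative values except 20 and 24).
const technicalWriter = {
  jobCount: 24,
  words: { technical: 19, writer: 20, senior: 3, editor: 2 },
};

// P(word = "writer" | category = "Technical Writer")
const p = technicalWriter.words.writer / technicalWriter.jobCount; // 20 / 24 ≈ 0.83
```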

To do this, my train function is built from a pair of nested reduces.

Don’t let the nested reduces throw you.  For every job (the outer reduce), I go through each of its categories and reduce the Array of words in the title to an object of counts keyed by word.  The outer reduce turns the Array of jobs into an object keyed by category, where each entry holds the word counts for that category and the count of jobs in that category.
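A minimal sketch of that structure, using plain Array.reduce in place of lodash (the `title` and `categories` field names are assumptions about the job schema):

```javascript
// jobs -> { category: { jobCount, words: { word: count } } }
function train(jobs) {
  return jobs.reduce((model, job) => { // outer reduce over jobs
    const words = job.title.toLowerCase().split(/\s+/).filter(Boolean);
    for (const category of job.categories) {
      const stats =
        model[category] || (model[category] = { jobCount: 0, words: {} });
      stats.jobCount += 1;
      // inner reduce: title words -> counts keyed by word
      stats.words = words.reduce((counts, word) => {
        counts[word] = (counts[word] || 0) + 1;
        return counts;
      }, stats.words);
    }
    return model;
  }, {});
}
```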

I finish off getting all the stats needed by summing up along the categories.
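That summing step can look like this, assuming a model keyed by category where each entry holds a `jobCount` and a `words` map:

```javascript
// Collapse per-category stats into corpus-wide totals.
function totals(model) {
  return Object.values(model).reduce(
    (acc, { jobCount, words }) => {
      acc.totalJobs += jobCount;
      for (const [word, count] of Object.entries(words)) {
        acc.wordCounts[word] = (acc.wordCounts[word] || 0) + count;
      }
      return acc;
    },
    { totalJobs: 0, wordCounts: {} }
  );
}
```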

Calculate the probability distribution

This model is passed into a score function along with the original job data.  For each job, I go through each possible category and calculate the probability that the job is in the category given its title.  Because multiplying many small probabilities quickly underflows floating-point arithmetic, we convert the probabilities to log space.
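Here is a sketch of that scoring step in log space.  The model shape and the add-one smoothing are assumptions; smoothing keeps a single unseen word from driving a category’s score to negative infinity.  The result is an unnormalized log score per category, which can be ranked directly or exponentiated:

```javascript
// Score one title against every category using log probabilities.
function score(model, totalJobs, title) {
  const words = title.toLowerCase().split(/\s+/).filter(Boolean);
  const result = {};
  for (const [category, { jobCount, words: counts }] of Object.entries(model)) {
    let logP = Math.log(jobCount / totalJobs); // log P(category)
    for (const word of words) {
      // log P(word | category), with add-one smoothing for unseen words
      logP += Math.log(((counts[word] || 0) + 1) / (jobCount + 2));
    }
    result[category] = logP;
  }
  return result;
}
```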

Mostly Correct Data Assumption

Some assumptions were made about this data.  While we know the data to be dirty, we assume that most of the categorization is correct, so the correct data should overpower the incorrect data and lead to good classification.

Implementation

Handling new data

There is another process that extracts job data from our Applicant Tracking System.  The score function can be used in that process to augment the data before loading into Posse Analytics.

Handling old data

The existing prototype code can easily write back the augmented job data to the Posse Analytics server.

Updating the model

As time goes by, new data is loaded into the system.  This new data can be used to retrain the model.  The process for updating existing job orders can then be used to re-score according to the new model.

Appendix

The following is a list of the categories in Bullhorn.  I’ve grouped them by my own ideas about their similarity.  There are some obvious duplications (“Other Areas” is the same as “Other Area(s)”) and some arguable similarities (“Application Developer” is similar to “Software Engineer”).

Trainer
Human Resources
Recruiter
Finance and Accounting
Sales
Mortgage
Clerical
Customer Service
Technical Writer
Desktop Support
Information Security
Engineering
Network
Business Analyst
Data Analyst
Systems Analyst
Architect
Database Administrator
Database Developer
Mainframe Programmer
Software Engineer
Application Developer
Systems Administrator
Quality Assurance
IT Director
Project Manager
Product Manager
Other Area(s)
Other Areas

Resources

https://en.wikipedia.org/wiki/Log_probability

http://www.kdnuggets.com/2015/01/text-analysis-101-document-classification.html

https://www.burakkanber.com/blog/machine-learning-naive-bayes-1/