Mining Jobmine - Fall 2010 - Part 2

This is the second part of a multi-part series. Read the first part here.

JobMine is the tool we use at the University of Waterloo for our co-op program. In this article, I use n-grams to pull more detailed data out of the job titles.

##n-grams##

An n-gram is a sequence of n consecutive words. n-grams are quite simple, but they can provide a lot of insight into a collection of text. Let’s take tweets from Twitter as an example, and suppose we want to classify each tweet as positive or negative. If a tweet was ‘I don’t like customer service’, a tokenizer would produce the list ‘I, don’t, like, customer, service’. A classifier that looks at single tokens would ignore the fact that ‘don’t’ and ‘like’ are connected and only have the correct meaning when they are considered together. It may see the ‘like’ token and classify the tweet as positive when it is clearly negative. To fix this, the features (and subsequently the classifier) must take into account the positions of words relative to each other. A simple way to do this is with n-grams.
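
To make this concrete, here is a minimal Python sketch of how n-grams can be built from a list of tokens. The `ngrams` helper and the whitespace tokenizer are my own assumptions for illustration; they are not necessarily how the data in this series was produced.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # A sequence shorter than n yields an empty list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Naive whitespace tokenizer (an assumption, used only for illustration).
tokens = "I don't like customer service".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

# With bigrams, "don't" and "like" stay together as one feature.
for gram in bigrams:
    print(" ".join(gram))
```

With unigrams, ‘like’ appears on its own. With bigrams, the classifier can see ‘don’t like’ as a single feature.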

In the first article, I used 1-grams (unigrams) from the job titles.

The chart is reproduced here:

1-gram Frequency in Titles (Top 25)

That is all very interesting, but it does not give me enough detail. I know most jobs in the categories I’m looking for involve software, but how many are listed as “software tester” and how many as “software developer”?

##Bigrams##

So I ran through the titles and collected the top 25 bigrams…

2-gram Frequency in Titles (Top 25)

As you can see, the highest-frequency bigram is “software developer”. Bigrams clearly provide a lot more information than single words. I also find the distribution very interesting; it looks similar to the one in the earlier article.
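
The counting behind a chart like this can be sketched in a few lines of Python. This is only a rough sketch under my own assumptions: the `top_ngrams` helper is hypothetical, titles are lowercased and split on whitespace, and the sample titles are made up rather than taken from JobMine.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(titles, n, k):
    """Count n-grams across all titles and return the k most frequent."""
    counts = Counter()
    for title in titles:
        # Lowercasing and whitespace splitting are assumptions.
        tokens = title.lower().split()
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)


# Hypothetical titles for illustration, not real JobMine data.
titles = [
    "Software Developer",
    "Junior Software Developer",
    "Software Tester",
    "Web Software Developer",
]

top = top_ngrams(titles, 2, 25)
for gram, count in top:
    print(" ".join(gram), count)
```

n-grams are built per title, so a bigram never spans the end of one title and the start of the next.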

If I had more data, I’d really like to show the charts for the different job levels (JobMine classifies jobs as Junior, Intermediate or Senior).

##Trigrams##

Bigrams were definitely more descriptive, so how about trigrams?

3-gram Frequency in Titles (Top 50)

Hmm, not really that interesting. Since titles are short, few trigrams recur across titles.
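
The arithmetic explains why: a title with m words contains at most m - n + 1 n-grams, and none at all when m is smaller than n. Here is a tiny sketch of that, using a hypothetical `ngram_count` helper and made-up titles.

```python
def ngram_count(num_tokens, n):
    """Number of n-grams in a sequence of num_tokens tokens."""
    return max(0, num_tokens - n + 1)


# Made-up titles for illustration, not real JobMine data.
for title in ["Software Developer", "Junior Software Developer"]:
    m = len(title.split())
    # Counts of 1-grams, 2-grams and 3-grams in this title.
    print(title, [ngram_count(m, n) for n in (1, 2, 3)])
```

A two-word title contributes no trigrams at all, and a three-word title contributes exactly one, so there are few chances for a trigram to repeat.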

##Next Time##

So we’ve looked at the titles, but what about the job descriptions themselves?

The next article will go into the meat of the job descriptions. Read it here.