I’ve had an interesting month experimenting with Predictive Analytics using a customer’s anonymised recruitment data.
First, I used SAP Predictive Analytics. It’s a great product, but I found that I felt ‘too distant’ from the analytics process. While the end result is explained in statistical terms, I wanted more control over how the result was generated. I have experience of regression and statistical modelling from my time studying Actuarial Science (maths of death and money), so I wanted to be more involved in selecting the methodology and to be able to tweak the settings.
I settled on Microsoft Azure Machine Learning. I’ve been using Azure extensively for databases and virtual machines (this site is hosted on Azure), so I was keen to extend my experience of the products available. Being a cloud product, you only pay for what you use, so it’s far more economical for intermittent use than buying on-premises software.
I looked at existing and past employees to build a model to predict how long an employee would remain. The business had a high turnover of staff, so excluding ‘certain leavers’ would reduce training costs significantly.
The initial model used Linear Regression to predict how many months candidates would stay. For the non-statisticians, Linear Regression just means that a weighting is applied to each of the inputs and the weighted inputs are added up. This results in a nice transparent model, so you can see how the predictions are calculated. The problem was that the results were very poor.
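For anyone who wants to see the ‘weighting applied to each input’ idea in code, here is a minimal sketch in Python using NumPy. It is not the Azure Machine Learning experiment and it does not use the customer’s data: the two inputs (age and smoker), the ‘true’ weights and the numbers are all invented purely for illustration.

```python
import numpy as np

# Made-up data for illustration only (NOT the customer's data).
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(18, 60, n)                  # hypothetical input
smoker = rng.integers(0, 2, n).astype(float)  # hypothetical input, 0 or 1

# Invented relationship used to generate the fake outcome:
# months stayed = 4 + 0.5 * age - 3 * smoker + noise
months = 4.0 + 0.5 * age - 3.0 * smoker + rng.normal(0.0, 1.0, n)

# Least-squares fit: the column of ones gives the intercept.
X = np.column_stack([np.ones(n), age, smoker])
coef, *_ = np.linalg.lstsq(X, months, rcond=None)
intercept, w_age, w_smoker = coef


def predict_months(age_value, smoker_value):
    """Prediction is just the intercept plus each input times its weight."""
    return intercept + w_age * age_value + w_smoker * smoker_value


print(f"intercept={intercept:.2f}, age weight={w_age:.2f}, smoker weight={w_smoker:.2f}")
print(f"predicted months for a 30-year-old non-smoker: {predict_months(30, 0):.1f}")
```

The point is the transparency: the fitted weights can be printed and inspected, so anyone can see exactly how much each input pushes the prediction up or down.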
The second model used a Neural Network. This is a ‘black box’ with no outputs to help you understand how its predictions are reached. It’s a classic ‘computer says no (or yes)’ model.
Considering the quality of the data (many missing values), the results were remarkably good. While the predictions of who would stay more than 1 year were mediocre, by tweaking the model it was possible to exclude about 20% of candidates who would almost certainly leave within 12 months.
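One way to picture the ‘tweaking’ is as choosing a score cut-off so that only the near-certain leavers are flagged. The sketch below is an assumption about how that could be done, not a record of the settings used in Azure: it supposes the model gives each candidate a score for leaving within 12 months, and the function name, scores and outcomes are all hypothetical toy values.

```python
def flag_certain_leavers(scores, left_early, min_precision=0.95):
    """Find the lowest score cut-off whose flagged group is at least
    `min_precision` early leavers.

    scores     -- hypothetical model scores, higher = more likely to leave
    left_early -- 1 if the person actually left within 12 months, else 0
    Returns (cut_off, fraction_of_candidates_flagged), or None if no
    cut-off reaches the required precision.
    """
    best = None
    # Walk the cut-off downwards so the last qualifying one is the lowest.
    for cut_off in sorted(set(scores), reverse=True):
        flagged = [left for s, left in zip(scores, left_early) if s >= cut_off]
        precision = sum(flagged) / len(flagged)
        if precision >= min_precision:
            best = (cut_off, len(flagged) / len(scores))
    return best


# Toy numbers, invented for illustration only.
scores = [0.95, 0.9, 0.85, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
left_early = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

result = flag_certain_leavers(scores, left_early, min_precision=1.0)
print(result)
```

Raising the required precision flags fewer people but with more certainty; lowering it flags more people and starts to catch candidates who would actually have stayed.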
This all looks great until you look at the inputs: age at application, sex, religion, ethnicity, nationality, marital status, smoker and recruitment source.
Recruitment source and smoker are no problem, and I’m not sure about marital status, but the rest are toxic!
Before criticising my choice of inputs, please bear in mind that this was all the information that was populated in my customer’s data. Don’t forget that the use of a Neural Network hides the decision process so you can’t see what the biases are.
The customer isn’t going live with this (it was purely for testing), but I’d welcome feedback and opinion. Does this model:
a) Improve the recruitment process, removing cultural bias by using a ‘black box’ to predict candidates’ futures?
b) Break the law by making recruitment decisions based upon race/age/religion/nationality?