Recidivism Prediction Pitfalls: An explanation through Collaborative Filtering

Shelly Shmurack
5 min read · Aug 13, 2021

According to one of the case studies, the existing model's overall precision is between 61% and 70%, with two main issues in its predictions:

  1. False positives for African-American defendants:
    The defendant was predicted as a medium/high risk score but did not re-offend.
  2. False negatives for Caucasian defendants:
    The defendant was predicted as a low risk score but actually did re-offend.

These two error patterns caused the model's results to present roughly twice as high a risk for black defendants as for white defendants.

The official algorithms behind the ‘COMPAS’ solution are trade secrets, but there are publications online that examine classifiers and linear regression models and arrive at similar results.

Do it like Netflix

Netflix is really good at it! Why can’t we?

For the sake of this article, let's assume we can solve the recidivism prediction problem the way a Netflix recommendation engine does: recommending to the judge, based on past offenders who are similar to the defendant, what the defendant's risk score of re-offending is.

How will something like that work?

In the original ProPublica examination, this dataset was used to measure the model's results. Assuming we learn from this dataset the most important factors in predicting a risk score, let's say defendant similarity can be calculated from:

  • Offense severity
  • Crime violence level
  • Past convictions
  • Past jail behavior score
  • Time passed between convictions
  • Association with criminal environment/activity
  • A final factor: whether they actually re-offended, which serves as tagged data for predicting the final risk score.

Each defendant will receive a normalized score of 1–10 for each factor.
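To make that concrete, here is a rough sketch, in Python, of how raw factor values might be min-max normalized onto a shared 1–10 scale. The column names echo the factors above, but every value (and every defendant name) is made up for illustration:

```python
import pandas as pd

# Hypothetical raw factor values for a few defendants; the column names mirror
# the factors listed above, but all numbers are invented for this sketch.
raw = pd.DataFrame({
    "offense_severity":          [3, 8, 5],
    "crime_violence_level":      [1, 9, 4],
    "past_convictions":          [0, 6, 2],
    "past_jail_behavior":        [7, 2, 5],
    "years_between_convictions": [10, 1, 4],
    "criminal_association":      [2, 8, 3],
}, index=["Person1", "Person2", "Person3"])

# Min-max normalize each factor onto a shared 1-10 scale so the factors
# are comparable across defendants.
normalized = 1 + 9 * (raw - raw.min()) / (raw.max() - raw.min())
print(normalized.round(1))
```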

Thus, just as a Netflix recommendation engine knows how to suggest a movie based on my similarity to my neighbors, we can use collaborative filtering (in this case, the Oboe AutoML solution, based on this post) to calculate a defendant's risk score of re-offending.

Snapshot from Jupyter notebook of the dataset sample
Let’s assume we are trying to predict Person1’s Risk score using the similarity data to other defendants.
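As a minimal sketch of the idea, and not the actual Oboe pipeline, plain user-user collaborative filtering could look roughly like this: compute Person1's cosine similarity to the other defendants over the normalized factors, then take a similarity-weighted average of the neighbors' known risk scores. All names and numbers below are illustrative:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Normalized 1-10 factor vectors for defendants whose risk scores are known
# (Person2-Person4), plus Person1, whose score we want to predict.
neighbors = np.array([
    [8.0, 9.0, 6.5, 2.0, 1.0, 7.5],   # Person2
    [5.0, 4.0, 3.0, 5.5, 4.0, 3.5],   # Person3
    [2.0, 1.0, 1.0, 8.0, 9.0, 2.0],   # Person4
])
neighbor_risk = np.array([9.0, 5.0, 2.0])            # known risk scores (1-10)
person1 = np.array([[6.0, 5.0, 4.0, 4.0, 3.0, 5.0]])  # factors for Person1

# Cosine similarity between Person1 and each neighbor.
sim = cosine_similarity(person1, neighbors)[0]

# Predicted risk score: a similarity-weighted average of the neighbors' scores.
predicted_risk = np.dot(sim, neighbor_risk) / sim.sum()
print(f"Predicted risk score for Person1: {predicted_risk:.1f}")
```

A defendant whose factor profile closely resembles high-risk neighbors ends up with a high predicted score, and vice versa.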

Sounds Great, right?

This way, we can learn similarities automatically, needing only minimal domain knowledge to normalize the factors. And incorporating a feedback matrix lets us retrain the model and improve our similarity predictions in the long run.

Screenshot from Jupyter notebook: predicting Person1's risk score based on its similarity to the other defendants
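As a rough, hypothetical sketch of that feedback loop: once a defendant's actual outcome becomes known, it can be appended back to the dataset so the next retraining round learns from it (the values below are made up):

```python
import pandas as pd

# Known defendants with their predicted risk scores and, once available,
# whether they actually re-offended (1) or not (0). Values are illustrative.
history = pd.DataFrame(
    {"risk_score": [9.0, 5.0, 2.0], "re_offended": [1, 1, 0]},
    index=["Person2", "Person3", "Person4"],
)

# Person1's real outcome arrives later and is appended as feedback,
# so future similarity-based predictions can be retrained against it.
feedback = pd.DataFrame({"risk_score": [5.6], "re_offended": [0]}, index=["Person1"])
history = pd.concat([history, feedback])
print(history)
```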

So, what could go wrong?

Collaborative filtering has its pitfalls.

Cold start issue

Let's assume a defendant is pending trial with little to no criminal record. This gives them higher data sparsity than other defendants, making the system less reliable when predicting their similarity to defendants with many more data points.

Is having no criminal record not a factor to be recognized in recidivism? It could be. But collaborative filtering usually ignores the missing data and predicts based only on what is available.
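A small illustrative sketch of what that means in practice: if similarity is computed only over the factors both defendants actually have, a sparse record leaves very little evidence for the prediction to rest on (all values are made up):

```python
import numpy as np

# Person1 is a near first-time defendant: most factors are missing (NaN).
# One common collaborative-filtering workaround is to compute similarity
# only over the factors both records actually have.
person1  = np.array([6.0, np.nan, np.nan, np.nan, np.nan, 5.0])
neighbor = np.array([8.0, 9.0, 6.5, 2.0, 1.0, 7.5])

mask = ~np.isnan(person1) & ~np.isnan(neighbor)
a, b = person1[mask], neighbor[mask]
similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# With only two overlapping factors, the similarity estimate rests on very
# little evidence, so the downstream risk prediction is far less reliable.
print(f"Similarity over {mask.sum()} shared factors: {similarity:.2f}")
```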

Scalability

Every day, every minute, thousands of courtrooms across the US conduct hearings that determine defendants' short- and long-term futures.

This means that the data grows every moment, and more and more data points are added to the model.

Storing and computing over this ever-growing dataset requires significant, ever-increasing resources.

Grey Sheep/Black Sheep

Grey sheep refers to users who do not fit precisely into any other group of people. For Netflix, this means we might fail to find them the best movie to watch; but if we try to predict their likelihood of re-offending, this becomes a much more significant challenge with tremendous implications.

For black sheep, we might not be able to predict at all. How would it look to a courtroom judge to learn that a defendant doesn't fit any group? This might mean nothing about the person's actual risk of re-offending, but it might not serve the defendant well in the courtroom.

Diversity and long tail

Collaborative filtering has an advantage in supporting diversity by identifying similarities to multiple groups.

But just as recommender systems can be biased toward positive feedback, in the use case at hand this could create a bias toward specific factors that could, by proxy, be related to race and socioeconomic status.

Data quality

The “shilling attacks” concept refers to users manipulating ratings in the original recommendation systems: if they want to “help” their favorite brands, they can give positive feedback to those they support while giving negative feedback to the competition. This requires identifying such issues and having a way to handle them.

Our system requires normalizing offense information and criminal record data. Here, the issues start with making sure this data is collected equally for all defendants, and that it does not enter the system already biased, even before normalization. This, by itself, requires identifying problematic cases and having a way to handle them.

Cultural data biases

The law, and society, are constantly evolving and adapting. Regulations and laws change every so often, and risk factors can shift as well. If you use past data to predict risk factors, you may end up immortalizing past biases, just as critics of the ‘COMPAS’ algorithm say.

In Conclusion

Would this be a good solution to predict risk factors based on identifying similarities between defendants? Maybe.

But this alone will not save us from biases in our models.
It is up to us to make sure we address every model's issues when solving a product need.
We can ensure we are doing more good than harm with our ML solutions to real-world problems by paying the right attention to the pitfalls and data risks, and by addressing them at every stage of the way: measuring the results, testing for biases, and adapting our models accordingly.

Only then are we, as data and product professionals, truly building and optimizing for humans.


Shelly Shmurack

Data PM | Product Management Podcaster, Speaker & Mentor