Here’s the problem I want to address:
It’s not trivial to compare a very diverse set of Machine Learning models and identify where each model stands out and/or where it can be improved.
Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL) are the focal point of a vast number of articles and books being written by researchers and practitioners. In many instances, a common denominator is the claim that great AI algorithms are expected to be fast, accurate, and able to deliver novel insights. And, as if working with those algorithms wasn't hard enough already, a more recent trend expects those well-tuned models to also deliver ethics and transparency.
Data science teams certainly face a lot of pressure these days. Can they succeed?
Sure they can! It's already common practice to use a range of methods to evaluate a model's performance. Some approaches include working with a confusion matrix and the rates derived from it (e.g. accuracy, precision, recall/sensitivity, specificity, F1-score), or using feature engineering, feature selection, and cross-validation to tune classification/prediction models.
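To make those rates concrete, here is a minimal sketch (using scikit-learn, with made-up labels purely for illustration) of how the confusion-matrix values translate into the metrics listed above:

```python
# A minimal sketch of deriving confusion-matrix rates for one binary model.
# The labels below are invented purely for illustration.
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # labels predicted by the model

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("accuracy:   ", accuracy_score(y_true, y_pred))
print("precision:  ", precision_score(y_true, y_pred))
print("recall:     ", recall_score(y_true, y_pred))   # a.k.a. sensitivity
print("specificity:", tn / (tn + fp))
print("F1-score:   ", f1_score(y_true, y_pred))
```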
But to everyone's despair, the number of variables and scenarios to analyze can escalate and spiral out of control very quickly. So what can help data scientists evaluate the best mix of input features, training processes, and model hyperparameters in order to deliver best-in-class model outputs?
I want to propose the following mix: Automation + Artificial Intelligence + Benchmarking
In this article I'm proposing a new strategy for teams that have to manage a complex suite of ML models as part of their data science initiatives. I hope to shed some light on the subject of model selection and optimization through the use of insights discovered with comparative performance analysis, sometimes referred to simply as 'benchmarking'.
Many data science teams work with hundreds (sometimes thousands) of different ML models, and they need to constantly monitor and optimize them. Automated comparative performance analysis is a novel AI-based technology that can be used to monitor model performance and discover insights like the one below, reported in plain English, just as a human analyst would write them.
Model_023 has the lowest precision rate (61%) among the 27 models with at least 97% recall rate and that are recurrent neural networks and use sigmoid activation function. That 61% compares to an average of 95.1% across those 27 models.
⇒ What are the challenges in the ML model evaluation process?
When it comes to working with Machine Learning, there are plenty of well-known challenges. In my opinion, the most obvious one is that the current state of AI still hasn't produced a single universal model that can solve every problem you throw at it. For that reason, data scientists are constantly testing problems against different models in a quest for better results and more optimized processes. And for the use cases where individual models just don't seem adequate, custom strategies are used to combine multiple models in the expectation that the combined effort will deliver better performance and results. Although that usually works, the complexity of combining models (and leveraging the best that each one has to offer) only adds more layers to the original problem of deciding which models should be used, and when.
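As one illustration of what "combining models" can look like in practice, here is a minimal sketch of a soft-voting ensemble built with scikit-learn; the dataset and the three base models are placeholders chosen only for the example:

```python
# A minimal sketch of one way to combine models: a soft-voting ensemble.
# The dataset and the base estimators are stand-ins for illustration only.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ensemble = VotingClassifier(
    estimators=[
        ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="soft",  # average the predicted probabilities of the three models
)
ensemble.fit(X_train, y_train)
print("ensemble accuracy:", ensemble.score(X_test, y_test))
```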
And let's not forget about the hyperparameters (for example, the 'k' in k-nearest neighbors, or the learning rate in neural networks). They are the configuration settings that define how ML algorithms operate – they are used to optimize and control a model's behavior and performance. In most cases, those controls are important for finding the right balance between variance and bias, but because tuning them is still something of a mystical process, there's always room for debate: performance results can change depending on the datasets and the choice of configuration values.
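A cross-validated grid search is one common (if brute-force) way to take some of the mysticism out of that tuning. The sketch below tunes the 'k' of a k-nearest-neighbors classifier; the dataset and the candidate values of k are just stand-ins:

```python
# A minimal sketch of tuning the 'k' hyperparameter of a k-nearest-neighbors
# classifier with a cross-validated grid search. The dataset is a stand-in.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9, 11]},  # candidate values of k
    cv=5,                                             # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)
print("best k:", search.best_params_["n_neighbors"])
print("best cross-validated accuracy:", round(search.best_score_, 3))
```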
Although companies can hire the best and brightest teams of data scientists to manually come up with plenty of interesting insights, time is usually limited, and leaders want to ensure their budgets are well spent and their projects pay off efficiently. There's a cost to deleting and replacing models that have already been deployed to users, and it's often expensive to re-train models from scratch at scale. So it's not uncommon to see decisions that opt for models being chain-linked, boosted, or simply maintained and improved over time. Understanding which application use cases can leverage existing models, and figuring out where performance can be improved, aligns well with the long-term goal of not having to throw models away simply because they were never nurtured.
Regardless of their industry, most data science professionals routinely compare different models to identify which ones are being properly trained and optimized, and which ones can deliver the best classifications, predictions, translations, transcriptions, and so on. Occasionally, it's also important to identify which models are likely to perform well with specific target audience demographics, output types, etc.
It is critical to evaluate different ML algorithms and compare their performance consistently. And although accuracy and performance will vary depending on a number of factors, it is important to find out how the different models compare among themselves so the best ones can be selected for the most appropriate tasks. However, comparing a huge set of models, each one associated with many metadata variables, is a daunting task. The search space is simply too large for most humans to cover, and it's something far better handled with automation.
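To get a rough feel for how fast that search space grows, here is a back-of-the-envelope sketch under assumed numbers (500 models, 20 metadata attributes, 4 interesting value buckets per attribute, and peer groups defined by up to 3 attributes at a time):

```python
# Rough back-of-the-envelope sizing of the benchmarking search space.
# Every count below is an assumption chosen only to illustrate the growth.
from math import comb

n_models = 500            # models to benchmark
n_attributes = 20         # metadata columns per model
buckets_per_attribute = 4 # value ranges worth filtering on per column

# Peer groups defined by filtering on any combination of up to 3 attributes:
peer_groups = sum(comb(n_attributes, k) * buckets_per_attribute**k
                  for k in range(1, 4))
print("candidate peer groups:", peer_groups)              # ~76,000
print("model-vs-peer comparisons:", n_models * peer_groups)  # ~38 million
```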
So, what if we could use automated Artificial Intelligence algorithms to improve the selection and optimization of ML models?
Discovering that a model that used to perform consistently well is now performing poorly under certain specific conditions is a valuable insight to have. By automating the task of comparative performance analysis, teams can discover novel insights that lead to optimization gains and a better understanding of those assets.
⇒ How to benchmark Machine Learning models?
To visualize the task of comparing models, let’s picture a large data table, where each row represents an individual model and the columns contain all available model metadata, which could include some of the following:
- algorithm/model type, and all of its hyperparameters/configurations
- information about the model input features
- information about the target classes/groups/outputs
- metrics related to model training, and expected/actual performance outcomes
- information about the computing hardware/data center/network architecture running the models
- business strategy and user behavior associated with the models
- etc.

[Figure: sample model metadata]
The metadata columns can include numbers, text, booleans, and even sets/lists of values. In addition, they can represent not only current data but previous data as well, which allows insights based on temporal changes to be reported. The actual discovery of insights is done by looking at each and every model and comparing it with every relevant peer group.
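As a minimal sketch of what such a table could look like in code, here is a tiny pandas version; every column name and value is invented for illustration, including the "previous month" column that would support temporal insights:

```python
# A tiny, invented model-metadata table. Real tables would be far wider and
# hold hundreds or thousands of rows, one per model.
import pandas as pd

models = pd.DataFrame([
    {"model_id": "Model_021", "algorithm": "random_forest", "uses_augmentation": True,
     "input_features": ["age", "income", "region"], "precision": 0.93,
     "precision_prev_month": 0.91, "train_seconds": 249, "server_cluster": "US-East-92"},
    {"model_id": "Model_023", "algorithm": "rnn", "uses_augmentation": False,
     "input_features": ["clickstream"], "precision": 0.61,
     "precision_prev_month": 0.88, "train_seconds": 1830, "server_cluster": "US-East-92"},
])
print(models.dtypes)  # a mix of numbers, text, booleans, and list-valued columns
print(models)
```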
Comparing every model against every peer group is certainly a very complex task, but AI technology that can look at every possible combination of model scenarios already exists. For each model, it can identify what is working and what isn't, along with the peer groups that go with each finding. This technology can also point out whether an insight has a positive, negative, neutral, or ambivalent impact. By automating the discovery and writing of comparative performance insights, it automates the reasoning task of benchmarking: going from abundant data on traits, behaviors, outcomes, and feedback to detailed insights written in English that can be read, shared, and used to persuade and motivate.
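I can only speculate about how any particular engine implements this internally, but the simplified sketch below captures the basic reasoning: filter a peer group by shared metadata, rank one metric inside it, and phrase the outlier as an English sentence. The tiny table and the thresholds are invented for illustration:

```python
# A simplified sketch of the benchmarking reasoning: pick a peer group by
# shared metadata, rank a metric inside it, and phrase the outlier in English.
# The table below is a tiny invented stand-in for a real model-metadata table.
import pandas as pd

models = pd.DataFrame([
    {"model_id": "Model_023", "algorithm": "rnn", "activation": "sigmoid", "recall": 0.97, "precision": 0.61},
    {"model_id": "Model_173", "algorithm": "rnn", "activation": "sigmoid", "recall": 0.98, "precision": 0.96},
    {"model_id": "Model_095", "algorithm": "cnn", "activation": "relu",    "recall": 0.91, "precision": 0.89},
])

def precision_insight(models, algorithm, activation, min_recall):
    # Peer group: every model sharing the given metadata traits.
    peers = models[(models["algorithm"] == algorithm) &
                   (models["activation"] == activation) &
                   (models["recall"] >= min_recall)]
    if len(peers) < 2:
        return None  # nothing to compare against
    worst = peers.loc[peers["precision"].idxmin()]
    return (f"{worst['model_id']} has the lowest precision rate "
            f"({worst['precision']:.0%}) among the {len(peers)} models that are "
            f"{algorithm} models, use the {activation} activation function, and "
            f"have at least {min_recall:.0%} recall. That compares to an average "
            f"of {peers['precision'].mean():.1%} across those {len(peers)} models.")

print(precision_insight(models, algorithm="rnn", activation="sigmoid", min_recall=0.97))
```

A real engine would, of course, generate and rank the candidate peer groups automatically rather than taking them as arguments, and would do so across every metric, not just precision.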
By leveraging an automated comparative performance analysis engine to benchmark ML models, I believe that data science teams will easily get answers to questions like…
- Where can each model be improved?
- Where does each model stand out?
- How does each model compare across all other models?
- How does each model compare to a given custom peer group?
- How does each model compare to its most similar peers?
- What model behavior (or outcome) shows changing/trending signs?
- What models are best-in-class in specific dimensions?
- How do models rank when examined from a holistic point of view?
Knowing what makes each individual model noteworthy across all the different models available, and tracking how that changes over time, is an interesting use case in itself: it efficiently points out where each model stands out – that is, where each model can be most effectively applied. The insights also provide a better understanding of which models are falling behind user or business expectations. So, in addition to assisting with model management and optimization, the insights can give new strategic direction and visibility to senior-level management who want to keep an eye on the work being done by their teams.

Tracking all the metadata and insights associated with the models can also help identify early signs of trouble (e.g. overfitting, poor precision/recall rates, lengthy (re)training times, increased latency, GPU underperformance, unsatisfactory end-user feedback, etc.). And depending on the benchmarking findings, there's room to improve only certain components or parameters (e.g. the choice of ANN activation function, or the choice of GPU architecture) and possibly see substantial gains at minimal cost and with a reduced amount of required changes.
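As a small illustration of the "trending signs" angle, the sketch below compares two hypothetical metadata snapshots and flags models whose precision moved by more than an assumed threshold; all model names and values are invented:

```python
# A minimal sketch of flagging models whose tracked metric drifted between
# two metadata snapshots. The values and the 5-point threshold are assumptions.
import pandas as pd

last_month = pd.Series({"Model_021": 0.94, "Model_049": 0.93, "Model_173": 0.95})
this_month = pd.Series({"Model_021": 0.95, "Model_049": 0.78, "Model_173": 0.94})

drift = this_month - last_month
flagged = drift[drift.abs() > 0.05]  # flag changes bigger than 5 points
for model_id, change in flagged.items():
    print(f"{model_id}: precision changed by {change:+.0%} since the last snapshot")
```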
Here are some other illustrative examples of what these insights can look like.
- Model_021 has the 4th-biggest drop in training time (-19.0%) among the 250 models that are trained with k-fold cross-validation with k greater than 10. That -19.0% compares to an average of +3.4% across those 250 models. It also represents a drop from 307 seconds to 249 seconds.
- Of the 48 models that have at least an 88% total accuracy, Model_095 is one of just 3 that are convolutional neural networks. Incidentally, all 3 are used by the YXZ application and have 2,000-or-more distinct output classes.
- Model_076 in server cluster US-East-92 has the 2nd-lowest learning rate (0.015%) among the 141 models that use the SGD optimizer in Keras. That 0.015% compares to an average of 0.21% across those 141 models.
- Model_049 has the 2nd-lowest precision rate (61%) of the 4,423 models that augment training data and have 30-or-fewer input features. That 61% compares to an average of 94% across those 4,423 models.
- Model_173 has the lowest number of epochs (23,000) among the 29 models that are recurrent neural networks and have a sigmoid activation function. That 23,000 compares to an average of 65,000 across those 29 models.
- Of the 243 models that beat the average total accuracy and are used for image classification, Model_45 is one of just 3 that don't meet the criteria for classifying objects of output class C2.
Take a special look at the first example above. Because most of us today know exactly how much it costs to run operations in the cloud, that insight could have gone one step further and reported how much money the company is now likely saving because of that drop in training time. Talk about ROI validation!
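Just to illustrate the idea (the hourly rate and the retraining frequency below are assumptions, not real prices), converting that 307-to-249-second drop into a yearly figure could be as simple as:

```python
# A rough sketch of translating a training-time drop into cloud savings.
# The hourly rate and retraining frequency are assumptions for illustration.
seconds_saved_per_run = 307 - 249   # from the Model_021 insight above
runs_per_year = 365                 # assume the model retrains daily
hourly_rate = 3.00                  # assumed GPU instance price, USD/hour

yearly_savings = seconds_saved_per_run * runs_per_year / 3600 * hourly_rate
print(f"Estimated yearly savings for Model_021 alone: ${yearly_savings:.2f}")
```

Multiply that by hundreds of models and frequent retraining, and the savings start to matter.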
If you’re intrigued and want to see what real comparative performance insights look like for real-world data, you can look into this comparative performance analysis of public healthcare datasets.
⇒ What about ML models having to deliver ethics and transparency?
From an ethical perspective, if the input data fed into a model is biased and doesn't fully represent the population supposedly being analyzed, the output will be biased as well. So it's important to have checks and balances in place to ensure the model's training data is properly balanced and contains the right amount of variance, without discriminating against any specific population segments. The importance of the decisions that will be driven by the model has to be matched by the amount of effort put into ensuring that the model's outputs won't derail the organization's business and reputation.

But not all is lost if the model doesn't offer a 100% representation of all the data, as not every model needs to be trained with every type of input in mind. If a model is trained on only a reduced set of input data, then as long as the application knows that only entities of those specific types should go through that model, things should be fine – and sometimes that's even desirable. Depending on the data available to compare the models, the insights might be able to point out which models seem to be doing better in the 'ethics' department than others.
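One simple example of such a check, sketched below, is comparing how each population segment is represented in the training data against a reference population; the segment names, counts, and the 5-point tolerance are all invented for illustration:

```python
# A minimal sketch of a representation check: compare each segment's share of
# the training data with its share of a reference population. All values and
# the 5-point tolerance below are invented for illustration.
import pandas as pd

training_counts = pd.Series({"segment_A": 7200, "segment_B": 2100, "segment_C": 700})
population_share = pd.Series({"segment_A": 0.55, "segment_B": 0.30, "segment_C": 0.15})

training_share = training_counts / training_counts.sum()
gap = training_share - population_share
for segment, diff in gap.items():
    status = ("over-represented" if diff > 0.05
              else "under-represented" if diff < -0.05
              else "roughly in line")
    print(f"{segment}: training share {training_share[segment]:.0%} "
          f"vs population {population_share[segment]:.0%} ({status})")
```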
On the subject of what ideal models should look like, Forrester Research has written a great report that focuses on the need for ML models to be FAIR – “The Ethics Of AI: How To Avoid Harmful Bias And Discrimination. Build Machine Learning Models That Are Fundamentally Sound, Assessable, Inclusive, And Reversible”. Definitely worth reading!
And as for the transparency angle, what's needed are tools that allow both the technical and non-technical sides of the organization to monitor a model's metadata, bias, accuracy, and overall performance. Sometimes it might even be desirable to give people outside the organization visibility into that information, so an efficient, user-friendly interface that supports different access levels is certainly important.
⇒ In conclusion…
There's no doubt that selecting the models that are best-in-class and aligned with the goals of the business is a hard task. So, is it really worth benchmarking Machine Learning models? Absolutely! Particularly because I believe doing so can drive the creation of stronger models and boost their performance.
If the subject of ML model optimization interests you, I'd love to hear your thoughts about the value of performance insights like the ones presented in this article. Get in touch!
By André Lessa