Machine learning operations (MLOps) technology and practices enable IT teams to deploy, monitor, manage, and govern machine learning projects in production. Much like DevOps for software, MLOps provides the tools you need to maintain dynamic machine learning-driven applications. The security of your future enterprise depends on the decisions you make today related to these new applications and the code that powers them. So, what are the risks?
Good People, Bad Code – Data scientists are known for building predictive models and not for their coding skills. Taking their handwritten code and putting it straight into production is a recipe for failure and a potential security risk.
Malicious Code – If someone wanted to harm your business, introducing code into your production machine learning applications would be one way to do it. The problem is compounded when your data science team works in a language like Python or R that your IT team doesn’t understand, because your IT team then cannot review the code. Such code could return bogus results, overload servers, or create any number of other issues. Malicious code is most likely to succeed if you have no proactive way to know whether production models and their related artifacts are performing as expected.
Adversarial Inputs – Someone is submitting requests that your machine learning model has never seen before, and it responds in a way that you don’t expect. Suddenly, a process that seemed pretty solid is under attack, and your business is giving out approvals, returns, or something that costs you money or hurts your reputation.
Data Pollution or Poisoning – Models are the product of data and algorithms. If the data used to train those models contains patterns that are unknown to you but favorable to someone outside your business, the resulting model can end up working in that person’s interest rather than yours. In the case of spam filtering, for example, attackers could report a large number of items that are not really spam to your spam detection system. This could dilute the effectiveness of your spam detection model, allowing more spam, or specific spam, to get through.
Denial of Service Attack on ML Endpoints – Not all machine learning platforms are created equal. Many data science teams deploy their own “production” endpoints in front of their models and try to use those to support production business applications. Unfortunately, the servers powering these endpoints were built for experimentation and validation, not for real production use. When they start to see load, they can’t scale, and your business application starts to fail. If attackers find these weak endpoints, they can shut them down or slow them down with fake traffic.
Model Theft – Your business has paid a lot to develop machine learning models, including hiring data scientists, purchasing data science platforms, and building out specific AI infrastructure. AI and machine learning create a competitive advantage for your business. As such, they are particularly desirable assets for theft, most likely from people within your organization who are leaving for a new job. You need to make sure you have tight controls on model access.
MLOps and InfoSec
Machine learning operations technology and practices can mitigate security issues with machine learning models and applications. Here’s how:
Production Coding Practices – Production coding best practices are critical for all software projects, including machine learning models. Your data scientists are not developers. As a first line of defense, pair a data scientist with a production developer when developing production models. Consider training your data science teams on production coding best practices. Having people on your IT team who understand the languages your data science team uses is also a good idea, as is testing your machine learning code. As you move towards production, you should have a set of tests you can run to confirm that your machine learning models perform as you expect. Safeguards for your production machine learning projects are also important, such as keeping the code under version control and being able to roll back when you encounter issues.
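To make the idea of a pre-production test set concrete, here is a minimal sketch of a release check that replays known inputs through a model and reports anything unexpected. Everything in it is an assumption for illustration: `toy_model`, `run_release_checks`, and the "golden" cases are hypothetical names and values, not part of any specific MLOps product.

```python
from typing import Callable, List, Sequence, Tuple


def toy_model(features: Sequence[float]) -> float:
    """Hypothetical stand-in for a trained model; returns a score in [0, 1]."""
    score = 0.2 + 0.1 * sum(features)
    return max(0.0, min(1.0, score))


def run_release_checks(
    predict: Callable[[Sequence[float]], float],
    golden_cases: List[Tuple[Sequence[float], float]],
    tolerance: float = 1e-6,
) -> List[str]:
    """Return human-readable failures; an empty list means every check passed."""
    failures = []
    for features, expected in golden_cases:
        actual = predict(features)
        if not 0.0 <= actual <= 1.0:
            # Sanity check: a score must always be a valid probability.
            failures.append(f"{features}: score {actual} is outside [0, 1]")
        elif abs(actual - expected) > tolerance:
            # Regression check: known inputs must keep producing known outputs.
            failures.append(f"{features}: expected {expected}, got {actual}")
    return failures


# Hypothetical input/output pairs agreed between data science and IT.
GOLDEN_CASES = [([0.0, 0.0], 0.2), ([1.0, 2.0], 0.5), ([10.0, 10.0], 1.0)]

failures = run_release_checks(toy_model, GOLDEN_CASES)
print("release checks passed" if not failures else failures)
```

A check like this could run in the same pipeline that versions the model code, so that a failed check blocks the release and the previous version stays in place.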
Shadow/Warmup for Model Updates – Models should not be turned on in production without extensive testing under production conditions. Model updates should be deployed in a shadow mode on production environments without providing results to your endpoint. The results and service performance should be logged for analysis. This warmup period allows the operator to see that the model is behaving as expected before replacing the production model with the updates.
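A shadow deployment can be sketched in a few lines. The example below is a simplified illustration, assuming models are plain Python callables; `serve_with_shadow`, `agreement_rate`, and the two threshold "models" are hypothetical names invented for this sketch.

```python
from typing import Any, Callable, Dict, List


def serve_with_shadow(
    request: Any,
    production_model: Callable[[Any], Any],
    candidate_model: Callable[[Any], Any],
    shadow_log: List[Dict[str, Any]],
) -> Any:
    """Answer with the production model; run the candidate only for logging."""
    response = production_model(request)
    entry = {"request": request, "production": response,
             "candidate": None, "error": None}
    try:
        entry["candidate"] = candidate_model(request)
    except Exception as exc:
        # A failing candidate must never affect what callers receive.
        entry["error"] = repr(exc)
    shadow_log.append(entry)
    return response


def agreement_rate(shadow_log: List[Dict[str, Any]]) -> float:
    """Share of logged requests where candidate and production agreed."""
    compared = [e for e in shadow_log if e["error"] is None]
    if not compared:
        return 0.0
    matches = sum(1 for e in compared if e["production"] == e["candidate"])
    return matches / len(compared)


# Hypothetical models: approve a request when its value clears a threshold.
current_model = lambda value: value >= 10
updated_model = lambda value: value >= 12

log: List[Dict[str, Any]] = []
responses = [serve_with_shadow(v, current_model, updated_model, log)
             for v in [5, 11, 15, 20]]
print(responses, agreement_rate(log))
```

In a real service the candidate would more likely run asynchronously so that it adds no latency to the production response; the synchronous call here only keeps the sketch short.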
Production Endpoints – Production machine learning requires production endpoints that can scale with production needs. This includes running the endpoints on production servers that leverage technologies like Kubernetes and autoscaling to ensure that the services can scale up as load increases.
Data Drift and Anomaly Detection – Your machine learning models are trained on a profile of data. When a request comes in that does not fit the profile of data you trained on, that could indicate an issue. When the overall pattern of incoming data shifts away from that profile, data drift detection can alert your team to the change. When an individual request is a significant outlier, anomaly detection will alert you.
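The difference between the two checks is easier to see in code. The sketch below assumes a single numeric feature and uses simple z-scores, which is only one of many possible techniques; the `FeatureProfile` class, the thresholds, and the sample values are all hypothetical.

```python
import math
import statistics
from typing import Sequence


class FeatureProfile:
    """Summary of one numeric feature as seen in the training data."""

    def __init__(self, training_values: Sequence[float]) -> None:
        self.mean = statistics.fmean(training_values)
        # Assumes the feature varies in training, so stdev is non-zero.
        self.stdev = statistics.stdev(training_values)

    def is_anomaly(self, value: float, z_threshold: float = 4.0) -> bool:
        """Flag a single request whose value is far outside the profile."""
        return abs(value - self.mean) / self.stdev > z_threshold

    def has_drifted(self, recent_values: Sequence[float],
                    z_threshold: float = 3.0) -> bool:
        """Flag a window of requests whose average has moved away."""
        standard_error = self.stdev / math.sqrt(len(recent_values))
        shift = abs(statistics.fmean(recent_values) - self.mean)
        return shift / standard_error > z_threshold


profile = FeatureProfile([48, 50, 52, 49, 51, 50, 47, 53])

print(profile.is_anomaly(90))                  # one extreme request
print(profile.has_drifted([54, 55, 53, 54]))   # a sustained shift
print(profile.is_anomaly(54))                  # part of that shift, taken alone
```

Note that none of the values in the shifted window is an outlier on its own, yet together they indicate drift. That is why the two checks complement each other rather than replace each other.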
Failover and Fallback – What action should you take when a machine learning-based application starts to misbehave in production? You will need time to debug the issue, and that could involve contacting the data scientists and getting their input. In the meantime, you need to know that your machine learning endpoint is returning something reasonable. A fallback model, or even a fixed value that you know is safe, may be enough. You can also trigger the fallback automatically in your code for known conditions such as timeouts.
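An automatic fallback on timeout or error might look like the following sketch. It assumes the model is a Python callable and uses a thread pool from the standard library to enforce the time limit; `predict_with_fallback`, `FALLBACK_SCORE`, and the sample models are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Any, Callable, Tuple

FALLBACK_SCORE = 0.0  # hypothetical "known safe" value, e.g. do not auto-approve

_executor = ThreadPoolExecutor(max_workers=4)


def predict_with_fallback(
    model: Callable[[Any], float],
    request: Any,
    timeout_seconds: float = 0.5,
    fallback: float = FALLBACK_SCORE,
) -> Tuple[float, str]:
    """Return (score, source) so callers and logs can see when fallback fired."""
    future = _executor.submit(model, request)
    try:
        return future.result(timeout=timeout_seconds), "model"
    except FutureTimeout:
        # cancel() only prevents a call that has not started yet; a call that
        # is already running finishes in the background and is ignored.
        future.cancel()
        return fallback, "fallback:timeout"
    except Exception:
        return fallback, "fallback:error"


def fast_model(request: Any) -> float:
    return 0.9


def slow_model(request: Any) -> float:
    time.sleep(0.3)  # simulates a model that has stopped responding in time
    return 0.9


def broken_model(request: Any) -> float:
    raise RuntimeError("model artifact failed to load")


print(predict_with_fallback(fast_model, {"id": 1}))
print(predict_with_fallback(slow_model, {"id": 2}, timeout_seconds=0.05))
print(predict_with_fallback(broken_model, {"id": 3}))
```

Returning the source alongside the score is a deliberate choice: it lets you count how often the fallback fires, which is itself a useful signal that something is wrong.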
Access Controls and Audit Trails – Controlling access to your production machine learning applications is critical. Only a limited number of trusted people in your organization should be able to put code into production. Even this group should have checks on their work, including a deployment administrator and full audit trails. Full audit trails of human and machine actions will also help you understand what happened when you are troubleshooting production incidents.
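As a rough illustration of how these two controls fit together, the sketch below gates a deployment on a requester and a separate approver, and records every attempt, allowed or not. The user names, the `DEPLOYERS` and `APPROVERS` sets, and the `deploy` function are hypothetical; a real system would rely on your identity provider and durable, tamper-resistant log storage.

```python
from datetime import datetime, timezone
from typing import Dict, List

DEPLOYERS = {"alice", "bob"}   # hypothetical users allowed to request a deploy
APPROVERS = {"carol"}          # hypothetical deployment administrators

audit_log: List[Dict[str, str]] = []


def record(actor: str, action: str, model_version: str, outcome: str) -> None:
    """Append one audit entry; denied attempts are recorded too."""
    audit_log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "model_version": model_version,
        "outcome": outcome,
    })


def deploy(requester: str, approver: str, model_version: str) -> None:
    """Allow a deployment only with an authorized requester and approver."""
    if requester not in DEPLOYERS:
        record(requester, "deploy", model_version, "denied:not_a_deployer")
        raise PermissionError(f"{requester} may not deploy models")
    if approver not in APPROVERS or approver == requester:
        record(requester, "deploy", model_version, "denied:no_valid_approval")
        raise PermissionError("deployment needs a separate, authorized approver")
    record(requester, "deploy", model_version, f"approved_by:{approver}")
    # The actual rollout step would go here.


deploy("alice", "carol", "fraud-model-v2")
try:
    deploy("mallory", "carol", "fraud-model-v3")
except PermissionError as exc:
    print("blocked:", exc)

for entry in audit_log:
    print(entry["actor"], entry["model_version"], entry["outcome"])
```

Because denied attempts are logged as well as successful ones, the same trail serves both troubleshooting and the detection of someone probing for access.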
MLOps is much more than just the ability to deploy models into production environments. Successful machine learning in your organization requires trust in machine learning outputs. That trust, at least in part, will come from how you design your security architecture and manage the information security of your machine learning projects.