My name is Rich and I am an automation addict – there, I’ve said it! For the last decade I have been evangelising and putting automation into action to help provide repeatable and reproducible (more on these two important words later) systems to deliver code and monitor systems.
If you looked at the projects I have delivered you would suggest my main drivers have been to provide low-risk systems that can operate at scale – facilitating both the Devs’ appetite for pace, and the Ops’ appetite for stability. On the surface this is true. However, in reality I have been working on making people happier – saving them from the day-to-day drudgery and freeing up their time to focus on the interesting and challenging work. The recipe is fairly simple. Find a process (one that gets repeated a lot) and codify it where you can – you know the story…
I’ve had a lot of success over the years putting these sorts of systems in place but there has always been an area that has eluded me. My automation-nemesis. Security and Governance process.
This changed in March 2014 after I joined Betfair. I was new to the business and doing my rounds trying to understand the key drivers and blockers for automating their software delivery in a standard way. For context – Betfair run a microservices estate with some 300 components and an Engineering team in the hundreds split across several core products. Naturally part of my investigation crossed paths with the Security team. My expectations could not have been more wrong.
Betfair’s security team wanted automation. They wanted it as much as the Devs and for all the right reasons:
- They wanted to run security testing the same way every time – repeatable. A pass or fail should be repeatable and for tangible reasons. There should be nothing that was a subjective pass one day and a fail the next, nor should a test execute differently from one run to the next.
- Regardless of who was running the testing they wanted a consistent result – reproducible. This would allow them to share testing frameworks across teams and compare results across teams – apples for apples.
- They wanted to provide instant feedback to the delivery teams, the security teams and management on the status of each test. No waiting for reports to run or aggregate, no parallel slow-running manual effort. Just an instant pass/warn/fail.
To this end the team had been working on a tool called the ‘Application Security Risk Calculator’ to replace a form-based manual risk assessment. This clever idea had the concept of inherent and dynamic risk at its core. A given component would carry some inherent risk that would never change. An example might be that it caters for a certain regulatory requirement, or that it handles personally identifiable information. A given delivery of that component would carry some dynamic risk, such as the number of lines of code changed, the time of day or the unit test coverage. Together these two concepts gave an overall risk score to the deployment. What became even more powerful was that by using a consistent test framework they could set a standard across the business for delivery risk. This tool would gather the information and report it into our CMDB where it could be stored and reported on.
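To make the inherent/dynamic split concrete, here is a minimal sketch of how such a score could be put together. It is not the AppSecRiskCalc itself: the class names, the fields and every weight below are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """Facts about a component that do not change between releases."""
    name: str
    handles_pii: bool
    regulated: bool


@dataclass(frozen=True)
class Delivery:
    """Facts about one particular release of a component."""
    lines_changed: int
    unit_test_coverage_pct: int  # 0 to 100
    out_of_hours: bool


def inherent_risk(component: Component) -> int:
    # Hypothetical weights: PII counts for 30 points, regulation for 20.
    pii = 30 if component.handles_pii else 0
    regulation = 20 if component.regulated else 0
    return pii + regulation


def dynamic_risk(delivery: Delivery) -> int:
    # Hypothetical weights: one point per 100 changed lines (capped at 20),
    # one point per 5% of missing unit test coverage, 10 for out-of-hours.
    size = min(delivery.lines_changed // 100, 20)
    coverage_gap = (100 - delivery.unit_test_coverage_pct) // 5
    timing = 10 if delivery.out_of_hours else 0
    return size + coverage_gap + timing


def risk_score(component: Component, delivery: Delivery) -> int:
    # The overall score for a deployment is the sum of both parts.
    return inherent_risk(component) + dynamic_risk(delivery)


payments = Component(name="payments", handles_pii=True, regulated=True)
release = Delivery(lines_changed=500, unit_test_coverage_pct=75, out_of_hours=False)
print(risk_score(payments, release))  # 50 inherent + 10 dynamic = 60
```

Because the same function scores every component, two teams running it on the same inputs get the same number, which is the reproducibility the security team was after.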
For me this was great news – but it needed to go further – it needed to be embedded as a component within the delivery pipelines themselves. This would offer instant feedback but also allow a level of interaction with the delivery pipeline itself. I pitched my thoughts to the team, selling them on the following benefits:
- If the test fails – then we can let the team know through their delivery tools. And let them know asap – much in the same way you want your CI testing to be fast so the committing team does not context switch. Let them know it failed, for what reason and where to look to fix it.
- Control without bureaucracy. By embedding this sort of process into the delivery pipelines themselves you can block a deployment if it fails the delivery risk assessment whilst letting all other changes sail through unimpeded. It gives the level of assurance and control needed whilst still allowing a risk based decision around when and how we do manual assessments (within or outside of the pipeline). It gets better though – you can warn when a risk rating is getting near the limit without blocking the deployment. A delivery manager can not only be informed that they need to take medium-term action but can also look back and see the change in score over previous releases.
- Instead of trying to link change tickets to deliveries to manual risk ratings you can simply export the database showing the risk scores for each and every deployment across the business.
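The pass/warn/fail behaviour pitched above could be sketched along these lines. This is an illustration under assumptions, not Betfair's implementation: the `Verdict` enum, the `assess` and `score_changes` helpers, and the default warning margin of 10 points are all hypothetical.

```python
from enum import Enum
from typing import List


class Verdict(Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"


def assess(score: int, limit: int, warn_margin: int = 10) -> Verdict:
    """Compare a deployment's risk score with the pipeline's allowed limit."""
    if score > limit:
        return Verdict.FAIL  # block the deployment
    if score >= limit - warn_margin:
        return Verdict.WARN  # let it through, but flag that the limit is close
    return Verdict.PASS


def score_changes(history: List[int]) -> List[int]:
    """Change in risk score from each release to the next, oldest first."""
    return [later - earlier for earlier, later in zip(history, history[1:])]


print(assess(40, limit=70).value)   # pass
print(assess(65, limit=70).value)   # warn
print(assess(71, limit=70).value)   # fail
print(score_changes([40, 55, 62]))  # [15, 7]
```

A pipeline step would stop the deployment on `FAIL`, carry on for `WARN` and `PASS`, and store the score with the deployment record so that the history can be exported later.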
There was one final advantage which, to me, made this approach ground-breaking. With the benefit of this sort of tooling – leading to a standard risk framework and a standard control feature – we could go one step further. We could introduce a risk-volume dial. On days where we wish to limit our delivery risk (large sporting events, periods of high change elsewhere in the business, Christmas) we could turn the dial to a low setting – thus reducing the allowable risk for each delivery pipeline. This would stop some of the high inherent risk pipelines regardless of dynamic risk and give less wiggle room for others – exactly what the business does via a heightened change awareness period. When we were in a more normal time we could twist the dial back up to 11 and let changes flow through the system faster than ever – providing a level of assurance without impacting the pace of the delivery teams.
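One simple way to picture the dial is as a single business-wide setting that scales the allowable risk for every pipeline. The sketch below assumes a dial running from 1 to 11 and a hypothetical ceiling of 110 points at the top setting; neither number comes from the real system.

```python
MAX_DIAL = 11
LIMIT_AT_MAX_DIAL = 110  # hypothetical allowable risk score with the dial at 11


def allowable_risk(dial: int) -> int:
    """Scale the allowable risk score linearly with the dial setting."""
    if not 1 <= dial <= MAX_DIAL:
        raise ValueError(f"dial must be between 1 and {MAX_DIAL}")
    return LIMIT_AT_MAX_DIAL * dial // MAX_DIAL


def may_deploy(inherent: int, dynamic: int, dial: int) -> bool:
    """A deployment goes ahead only if its total risk fits under the limit."""
    return inherent + dynamic <= allowable_risk(dial)


# Dial turned down to 3 for a big sporting event: the limit is 30.
print(may_deploy(inherent=50, dynamic=0, dial=3))    # False: inherent risk alone is too high
print(may_deploy(inherent=10, dynamic=15, dial=3))   # True: a small, low-risk change still flows
print(may_deploy(inherent=10, dynamic=25, dial=3))   # False: less wiggle room than usual

# Back to normal, dial at 11: the limit is 110.
print(may_deploy(inherent=50, dynamic=10, dial=11))  # True
```

The first case shows the effect described above: with the dial low, a pipeline with high inherent risk is stopped no matter how small the change.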
And that is exactly what Betfair have done, with the vast majority of our components now using the AppSecRiskCalc in their pipelines. The Dev teams no longer have to fill in forms and the security team no longer need to repeat manual testing and deal with subjective data – pace and stability co-exist. And, as important as ever, we have manufactured happiness as a by-product of automation, with our talented teams now able to focus their time on more challenging activities. Betfair is not alone, however. The benefits of this approach are starting to be appreciated elsewhere. Some of the DevOps household names – such as Chef – are working towards ways to integrate similar ‘compliance at speed’ features into their tooling, allowing almost everyone to benefit from automating their main security and compliance testing with all of the benefits mentioned above.
Richard Haigh is the Global Head of Reliability and Operations at Betfair.