Engineering manager Johnson Cheng on the Engineering Excellence initiative at the BBC
BBC Future Media started talking about Engineering Excellence in 2011. At that time, the BBC had just gone through a re-organisation that put significant emphasis on product management. This was partly a reaction to the proliferation of bespoke websites Future Media is constantly creating, which can end up lacking updates and ongoing support. However, we felt it equally important to put the focus on engineering. Future Media is fundamentally a technology organisation, and in order to compete with the top technology organisations out there, the BBC’s engineering output needs to be of the highest quality possible.
One of the problems we observed was the demarcation between development and operations teams, resulting in a culture where many of the development teams took little interest in how their code performed in the production environment. Another problem was that developers did not feel a shared ownership for how the product worked in the hands of actual users, partly because of the perception that they had no say about engineering quality in a product-management-centred world. (For example, tech debt can sometimes be deemed low priority by product managers, even if it has a significant impact on the overall performance of the application.) To correct both of these problems, we put together a working group of senior engineering leads from across the Future Media team. The purpose of the group was to look at ways of making developers more accountable for engineering quality and more empowered to deliver this high quality.
Initially, the group wanted to follow a 'top-down' approach, by creating metrics and measuring the engineering quality of each team, then making improvement suggestions based on these measurements. However, it quickly became evident that 'engineering quality' is extremely difficult to measure. This is partly because there are no consistent metrics that apply across all teams, but also partly because we didn’t even have a comprehensive list of tech teams, tech leads, and what they were responsible for engineering-wise; we only had a list of products and product managers. Even if we could put together a list, we weren’t convinced that it would really provide an accurate picture of the engineering quality, or enable us to come up with improvement suggestions that would apply to every team.
At about the same time we realised that measuring engineering quality is a real challenge, we also realised that we didn’t want to create yet another box-ticking process that the engineers didn’t really believe in. Therefore, the group felt that the best way forward was to place power in the hands of engineers themselves, rather than dictating to them how they should work. So, we came up with a comprehensive list of tech teams, with each team containing a technical individual (someone able to write and read code) who would function as the lead engineer. From this list we then selected a subset of engineers to be ambassadors, and did some group exercises which resulted in coming up with 'four good things' for engineering excellence:
- Meaningful code reviews
- Developers being accountable for non-functional requirements
- Continuous integration
- Automated acceptance testing
The idea is that these four things are so self-evidently 'good' that we don’t have to spend any time debating them, and can instead dedicate our time to figuring out how to get every engineering team doing them. We made sure that none of the four things are prescriptive, so the engineers are empowered to do them in whatever way works for them.
Meaningful code reviews
This means recording who does each code review and ensuring that only successfully reviewed code is merged to the release branch. The end result should be that a developer can look at every piece of code in a release and find out who wrote it and who reviewed it.
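As a rough illustration of the rule above, the sketch below gates merges on a recorded, approved review and builds an author/reviewer lookup for a release. The names `Change`, `can_merge` and `release_audit` are hypothetical, and the requirement that the reviewer differs from the author is an assumption of this sketch rather than something the initiative prescribed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    """One change proposed for the release branch (hypothetical record)."""
    change_id: str
    author: str
    reviewer: Optional[str] = None  # who reviewed it, if anyone
    approved: bool = False          # whether the review passed


def can_merge(change: Change) -> bool:
    """Allow a merge only when a recorded reviewer approved the change.

    Assumption: the reviewer must be someone other than the author.
    """
    return (
        change.reviewer is not None
        and change.reviewer != change.author
        and change.approved
    )


def release_audit(changes: list[Change]) -> dict[str, tuple[str, str]]:
    """Map each mergeable change to (author, reviewer) for later lookup."""
    return {c.change_id: (c.author, c.reviewer) for c in changes if can_merge(c)}
```

In practice a team would more likely rely on whatever review tooling it already uses; the point is only that the merge decision and the who-wrote/who-reviewed record come from the same data.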
Code reviews are now integrated into the rituals and process of development. People generally do code reviews in one of two ways, depending on the size of the change. If the change is large, developers will do a pairing code review. If it is small and self-contained, they will just send out a diff (the output of a file comparison utility) for review.
Code reviewing is a skill that engineers need to learn, and it’s up to more senior developers to help their junior counterparts carry out better and more meaningful reviews. Better tooling will also help with code review. For example, GitHub is great for code reviews because it’s not a separate tool, so it encourages developers to constantly review and comment on the code, and a culture of looking at code can be built up within the community. We are in the process of getting useful tools like that into the BBC so everyone can benefit from them; currently certain teams are using such tools, but there’s no consistency across teams as yet.
Developers being accountable for non-functional requirements
Before the Engineering Excellence initiative, non-functional requirements (NFRs) were not being discussed enough, and product managers were being unfairly held accountable when there were failures. The new process needed to empower engineers to say when something wasn’t ready. This meant adding NFRs to the ‘definition of done’, so that a product is not considered ready until the team is confident that it can be managed in production by operations and perform well under foreseeable load and circumstances. It also meant making developers accountable for monitoring the ongoing health of the product.
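One way to picture a ‘definition of done’ that includes NFRs is as a checklist that fails until every check passes. The sketch below is a minimal illustration, not a description of any BBC tool: `DoneChecklist` and the check names are hypothetical, and treating an empty NFR list as "not done" is an assumption made here to reflect the idea that NFRs must be part of the definition.

```python
from dataclasses import dataclass, field


@dataclass
class DoneChecklist:
    """Hypothetical 'definition of done' for one piece of work."""
    functional_tests_pass: bool
    # Hypothetical NFR checks; each team would choose its own.
    nfr_checks: dict[str, bool] = field(default_factory=dict)

    def unmet(self) -> list[str]:
        """Return the names of everything still blocking 'done'."""
        items = [] if self.functional_tests_pass else ["functional tests"]
        if not self.nfr_checks:
            # Assumption: work with no NFR checks at all is not done.
            items.append("no NFR checks defined")
        items += [name for name, ok in self.nfr_checks.items() if not ok]
        return items

    def is_done(self) -> bool:
        """Done only when nothing is left unmet."""
        return not self.unmet()
```

Listing the unmet items, rather than returning a bare yes or no, gives an engineer something concrete to point to when saying that a product isn’t ready.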
NFRs have to happen right at the beginning of the process. Technical architecture is key, as is coming up with a set of principles as soon as the very first line of code is designed and written. For example, with the iPlayer business layer (IBL) we spent a lot of time defining our NFRs at the start. We considered what is important about the IBL service, how we were going to build in profiling from the very beginning and how we could make it flexible so we would be able to swap different modules into it.
Being from engineering backgrounds, that is naturally where our bias lies. When we started building the IBL we didn’t drive it from the user experience end - instead, we looked at the problems we were trying to solve, the engineering solution, and proceeded to build from that.
Real continuous integration
This includes ‘smoke tests’ that are automatically run on each commit (and at least daily) in order to confirm that new work has not fundamentally broken the build. Continuous integration must take place in a shared environment - not in the sandbox - so that all developers are taking each other’s latest check-ins into account. Real continuous integration happens on the code trunk.
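To make the idea of a smoke test concrete, here is a minimal sketch of a runner that a continuous integration job could call on each commit to the trunk. The function names and the shape of the checks are hypothetical; real smoke tests would exercise the deployed build in the shared environment rather than in-process callables.

```python
from typing import Callable

# A smoke check is any zero-argument callable returning True on success.
SmokeCheck = Callable[[], bool]


def run_smoke_tests(checks: dict[str, SmokeCheck]) -> list[str]:
    """Run every check and return the names of those that failed."""
    failures = []
    for name, check in checks.items():
        try:
            passed = check()
        except Exception:
            passed = False  # a crashing check counts as a failure
        if not passed:
            failures.append(name)
    return failures


def build_is_healthy(checks: dict[str, SmokeCheck]) -> bool:
    """The build is considered healthy only if no smoke check failed."""
    return not run_smoke_tests(checks)
```

Every check runs even after an earlier one fails, so a single run reports everything that is broken instead of only the first problem.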
The IBL is showing great uptake of continuous integration, with around 85% test coverage for unit tests. It has little manual testing and everything is driven by automation. It also means less division between developer and tester.
Automated acceptance tests
Our responsive web PAL app is automatically deployed on every check-in and it runs all the unit and acceptance tests in the integration environment. Acceptance tests must include all functional tests. It’s the responsibility of the developer to make sure there is appropriate NFR test coverage to ensure the ‘definition of done’ is met.
At present, the unit test coverage is around 70%, which is really good considering that on the previous code it was more like 20%. Also, iPlayer used to be heavily manually tested, so this new level of automation frees teams from having to sit watching videos and clicking buttons.
We asked each tech team in Future Media to audit themselves against the four principles, which took a long time - some teams knew all about these ideas, some needed an explanation and others had a lot of questions. We put the emphasis on teams being truthful with themselves, encouraged teams to talk to each other, put the summary data of who was doing what on to our public Wiki and continued to monitor and reflect change.
Because the steering group couldn’t check everything and because we wanted developers to feel more involved in the process, we asked the teams to peer-review each other to check what they had self-assessed. That way, we had independent reviews but still from within the tech community.
We organised training courses based around the four good things with the BBC Academy, and arranged lunchtime knowledge-sharing sessions using teams that were already employing these principles.
Our summary of teams started to show improvement as we continued to update our data. We started with only about 30% of the teams reporting that they were doing each of the four good things. At the last check, non-functional requirements and code reviews were being reported at around 70-80%, and continuous integration and automated testing were in the 50-60% range. The first two were able to grow a lot faster because they weren’t constrained by tooling.
We hope to launch version one of the IBL on the cloud. This should allow us to get feedback much more quickly, release multiple times a day, run non-stop tests and be able to set up test environments quickly and easily. All of this is important for giving engineers as much flexibility as possible.
I’ve been asked about doing a ‘part two’ to Engineering Excellence; however, I believe it makes more sense to stick with the original four good things and continually ask how we can enable the engineers to do them better. I think these principles remain relevant, and what’s really stopping people doing them well is tool constraints. Companies like Google, Amazon and Netflix invest a lot of time in building the tools necessary and making them a pleasure for their developers to use.
If you look at the landscape of all kinds of industry, whether it’s publishing, broadcasting or gaming, the companies that are taking over are the technical ones. The only real differentiator between these and the more traditional companies is the fact that they can look at things through the lens of technical innovation. As an organisation, putting non-technical teams in charge of managing a product does not make sense. Going forward, companies need to operate like Google or Amazon, who allow their technical people to choose what they want to innovate, then see what product offering emerges out of this.
Recruitment plays a huge part too, because it can easily go viral - hiring one good engineer brings in more good engineers and ultimately helps drive your engineering excellence. That is the simple and best legacy of engineering excellence for BBC Future Media - we ended up with good engineers, and with good engineers we can achieve whatever we want.