DORA metrics

What are the DORA metrics and how do they inform DevOps success?

DORA metrics are four key measurements that help team leaders understand the effectiveness of their DevOps working practices. The DevOps Research and Assessment (DORA) group created the metrics after six years of research into successful DevOps adoption. The best way to gauge the impact DevOps is having on your organization is to collect data. This post focuses on the guiding principles identified by DORA as we explain how the four metrics inform DevOps success.

Deployment Frequency

Deployment frequency measures how often you push new code into your production environment. Since the overriding objective of DevOps is to deliver working code more efficiently, deployment frequency is a great starting point when you’re evaluating success.

You can collect this data by counting how many times new code is deployed over a period of time. Then you can look for opportunities to increase your release rate without sacrificing any of the guard rails that maintain quality standards. Using continuous delivery to deploy code automatically as you merge it is one way to speed up your workflow.
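As a minimal sketch of the counting described above, the following Python snippet averages deployments per day over a reporting window. The function name `deployment_frequency` and the sample timestamps are hypothetical; in practice the timestamps would come from your deployment tool or CI/CD logs.

```python
from datetime import datetime


def deployment_frequency(deploy_times, period_start, period_end):
    """Average number of deployments per day within [period_start, period_end)."""
    days = (period_end - period_start).total_seconds() / 86400
    if days <= 0:
        raise ValueError("period_end must be later than period_start")
    # Only count deployments that fall inside the reporting window.
    in_period = [t for t in deploy_times if period_start <= t < period_end]
    return len(in_period) / days


# Hypothetical data: three deployments inside a one-week window, one outside it.
deploys = [
    datetime(2024, 3, 4, 9, 0),
    datetime(2024, 3, 6, 14, 0),
    datetime(2024, 3, 8, 10, 0),
    datetime(2024, 3, 12, 9, 0),
]
freq = deployment_frequency(deploys, datetime(2024, 3, 4), datetime(2024, 3, 11))
print(f"{freq:.2f} deployments per day")  # 0.43 deployments per day
```

Tracking the same figure over consecutive windows shows whether your release rate is rising or falling.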

The ideal deployment frequency depends on the type of system you are building. While it’s now common for web apps to deliver multiple times a day, this cadence isn’t ideal for game developers creating multi-gigabyte builds.

In some situations, it may be helpful to recognize this distinction by thinking about deployment frequency a little differently. You can treat it as the frequency with which you could have deployed code if you had wanted to cut a new release at a particular time. This can be a more effective way to gauge throughput if true continuous delivery is not feasible for your project.

Change Lead Time

Change lead time is the interval between when a code revision is committed and when that commit enters the production environment. This metric reveals the delays that occur during code review and iteration after a developer completes the initial sprint.

Measuring this value is straightforward. You need to find the time at which the developer committed the change, then the time at which the code was delivered to users. The lead time is the number of hours and minutes between the two values.

For example, consider a simple change to send a security alert email after users log in. The developer completes the work at 11am and commits it to the source repository. At 12pm, a reviewer reads the code and passes it to QA. At 2pm, the QA tester notices a typo in the email copy. The developer commits a fix at 3pm, and QA merges the final change into production at 4pm. The lead time for this change is 5 hours.
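The arithmetic in this example can be sketched in Python. The function name `change_lead_time` and the calendar date are hypothetical; only the 11am commit and 4pm release come from the example. The sketch measures from the first commit of the change, so the 3pm fix does not reset the clock.

```python
from datetime import datetime, timedelta


def change_lead_time(first_commit_at: datetime, deployed_at: datetime) -> timedelta:
    """Time between a change's first commit and its arrival in production."""
    if deployed_at < first_commit_at:
        raise ValueError("deployment cannot precede the commit")
    return deployed_at - first_commit_at


# The date is an assumption for illustration; the times match the example.
committed = datetime(2024, 1, 15, 11, 0)  # developer commits at 11am
deployed = datetime(2024, 1, 15, 16, 0)   # change reaches production at 4pm

lead_time = change_lead_time(committed, deployed)
print(lead_time)  # 5:00:00
```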

Lead time is used to uncover inefficiencies as work moves between stages. While values vary widely by industry and organization, a high average lead time can indicate internal friction and a poorly considered workflow. Extended lead times can also result from a low-quality first iteration of a task that then needs repeated rework.

Some organizations use different measurements for lead time. Many choose the time between a developer starting a feature and the final code entering production. Others look further back and use the time at which a change was requested (by a customer, client, or product manager) as the starting point. These methods can generate information that is more broadly useful within the business, outside engineering teams.

DORA’s interpretation of using commit timestamps has one big advantage: the data is captured automatically by your source control tool, so developers don’t have to manually record a start time for each new task.

Change Failure Rate

The change failure rate is the percentage of production deployments that cause an incident. An incident is any bug or unexpected behavior that causes an outage or disruption for customers and that developers and operators need to spend time resolving.

You can calculate your change failure rate by dividing the number of deployments that caused incidents by the total number of deployments you’ve made. The former value is usually obtained by labeling bug reports in your project management software with the deployment that introduced them.
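As a rough sketch, the rate is the number of deployments linked to at least one incident divided by the total number of deployments, expressed as a percentage. The function name `change_failure_rate` and the deployment identifiers below are hypothetical; the labels stand in for whatever your bug tracker records.

```python
def change_failure_rate(deployments, incident_deployments):
    """Percentage of deployments linked to at least one incident."""
    if not deployments:
        raise ValueError("no deployments in the period")
    # Several incidents may point at the same deployment; count it once.
    failed = set(incident_deployments) & set(deployments)
    return 100 * len(failed) / len(deployments)


# Hypothetical data: eight deployments, three bug reports labeled with
# the deployment that introduced them (two reports blame "d3").
deploys = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]
incident_labels = ["d3", "d3", "d7"]

rate = change_failure_rate(deploys, incident_labels)
print(f"{rate:.1f}%")  # 25.0%
```

Counting each deployment once keeps a single bad release with many bug reports from inflating the figure.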

Accurately attributing incidents to the change that caused them can be tricky, especially if you have a high deployment frequency, but in many cases developers and triage teams can determine the most likely trigger.

Another challenge is agreeing on what counts as a failure: do minor bugs increase your failure rate, or are you only interested in major outages? Both types of problem affect how customers perceive your service, so it can be useful to maintain several values for this metric, each looking at a different class of problem.

You should always aim to drive change failure rates as low as possible. Using automated testing, static analysis and continuous integration can help prevent broken code from ever making it into production.

Time to Restore Service

Unfortunately, failure cannot be completely eliminated. Eventually, you’re going to run into a problem that hurts your customers. The fourth DORA metric, Time to Restore Service, analyzes how effectively you can react to these events. As with change lead time, the measured duration may vary between organizations. Some teams start from the time at which the bug was deployed, others from the first customer report, and some from the time at which the incident response team was paged. Whatever trigger point you adopt, you should use it consistently and measure until the incident is marked as resolved.
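A minimal sketch of this measurement in Python follows. It assumes each incident is recorded as a pair of timestamps: the trigger point your team has chosen and the time the incident was resolved. The function name `mean_time_to_restore` and the sample incidents are hypothetical.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents):
    """Average duration between each incident's trigger point and its resolution."""
    if not incidents:
        raise ValueError("no incidents recorded")
    durations = [resolved - started for started, resolved in incidents]
    # sum() needs a timedelta start value because the items are timedeltas.
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incidents as (trigger point, resolved) pairs.
incidents = [
    (datetime(2024, 2, 1, 10, 0), datetime(2024, 2, 1, 11, 0)),   # 1 hour
    (datetime(2024, 2, 9, 22, 0), datetime(2024, 2, 10, 1, 0)),   # 3 hours
]
print(mean_time_to_restore(incidents))  # 2:00:00
```

Whichever trigger point you choose, the same definition should feed every pair so the averages stay comparable over time.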

A higher average recovery time is a signal that your incident response processes require fine-tuning. An effective response depends on having the right people available to identify the fault, create a patch, and contact affected customers. You can reduce recovery time by developing agreed response procedures, keeping critical information centrally accessible across your organization, and turning on automated monitoring to alert you as problems occur.

Optimizing this metric is often neglected because many teams assume that a major outage will never happen. If your service is generally stable you may have relatively few data points to work with. Running incident response rehearsals using techniques like chaos testing can provide more meaningful data representing your current recovery time.

Conclusion

The four DORA metrics provide DevOps team leaders with data that uncovers improvement opportunities. Regularly measuring and analyzing your deployment frequency, change lead time, change failure rate, and time to restore service helps you understand your performance and make informed decisions about how to improve it. DORA metrics can be calculated manually using information from your project management system. Google Cloud’s Four Keys project can also generate them automatically from commit information. Some ecosystem tools, such as GitLab, are starting to include integrated support too.

The best DevOps implementations will facilitate rapid changes and regular deployments that rarely introduce new defects. Any regressions that occur will be dealt with immediately, minimizing downtime so customers can get the best experience of your service. Tracking DORA trends over time allows you to check whether you’re achieving these ideals, giving you the best chance for DevOps success.
