How to measure GitHub Copilot's impact on productivity

Charlie Ponsonby

Co-founder & CEO, Plandek

How to measure the impact of GitHub Copilot on engineering productivity

Most software delivery teams are considering adopting AI in some form to help engineers accelerate their value delivery and increase delivery effectiveness.

GitHub Copilot is one of the first examples of AI-powered engineering assistance. It styles itself as “your AI pair programmer” and can autocomplete lines of code.

Early adopters report ‘productivity improvements’ of up to 20% using GitHub Copilot, but it is not cheap (at c. $10 per user per month), so how can you build a business case for it?

What methodology (and metrics) should you use to accurately assess the impact of a tool like GitHub Copilot on productivity and value delivery?

GitHub Copilot: first-generation AI assistance for software engineers

Copilot is clearly the start of a long and accelerating journey as AI is applied to many areas of the SDLC. It uses OpenAI Codex to suggest code and entire functions in real time from your editor.

Other AI tools arrive almost daily and can help software engineers in a myriad of time-saving and efficiency-enhancing ways. Here are just a few examples:

Grit.io will help manage technical debt
Mintlify provides automated documentation for developers
Code AI helps translate (some) languages, debug, navigate code and act as a pair programmer
Tools like AdrenalineAI use AI to improve understanding of your codebase

So, at a time when cost control is the order of the day, you may be considering how you can accurately quantify the impact of tools like these and justify the added expense.

A methodology for measuring the impact of a tool like GitHub Copilot

To robustly measure the impact of GitHub Copilot (and similar AI engineering-enhancement tools), the methodology must be:

Quantitative – based on hard measurable data
Holistic – considering all benefits and potential impacts across the end-to-end SDLC (software delivery lifecycle)
Balanced – inclusive of subjective survey data alongside software delivery data

This requires a metrics scorecard that fully captures the benefits and potential costs of GitHub Copilot.

The metrics reflect the SPACE framework for measuring developer productivity, with an emphasis on the key areas GitHub Copilot is likely to impact.

These metrics can be tracked over time for a representative group of GitHub Copilot users to see the ‘before and after’ effect. We suggest that a representative sample would include engineers of different seniority and activity levels – and that the time period for analysis would be at least three sprint cycles (e.g. 6 weeks+).

Key metrics to quantify the impact of GitHub Copilot

An end-to-end software delivery analytics platform like Plandek provides a single pane of glass to measure the real impact of a tool like GitHub Copilot.

It surfaces a range of engineering and software delivery metrics to capture the impact of GitHub Copilot on five key variables that determine ‘productivity’:

Velocity and throughput – measures of team ‘output’
Time to value – time taken to deliver an increment of software
Quality
Dependability – a key benefit if teams more reliably deliver against their plans.
Developer satisfaction – impact on speeding up repetitive/less interesting tasks

These metrics can be tracked over time for a group of GitHub Copilot users versus a control group of non-users.

1. Velocity and throughput metrics

Throughput is a core measure of ‘output’ over time for Scrum and Kanban teams – and can be calculated in tickets, story points, pull requests, builds or value points. A tool like Plandek will easily calculate Throughput per engineer for users and non-users of GitHub Copilot. The difference can be expressed as a percentage increase.
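
As a rough illustration of the comparison described above, the sketch below computes Throughput per engineer for two groups and expresses the difference as a percentage increase. It is a minimal example, not how Plandek calculates the metric: the function names, the engineer labels and the ticket counts are all hypothetical.

```python
from statistics import mean


def throughput_per_engineer(completed: dict[str, int], weeks: int) -> float:
    """Average number of completed items per engineer per week."""
    if weeks <= 0:
        raise ValueError("weeks must be positive")
    if not completed:
        raise ValueError("at least one engineer is required")
    return mean(list(completed.values())) / weeks


def percentage_increase(baseline: float, comparison: float) -> float:
    """Relative change of comparison over baseline, as a percentage."""
    if baseline == 0:
        raise ValueError("baseline throughput must be non-zero")
    return (comparison - baseline) / baseline * 100.0


# Hypothetical ticket counts completed over a six-week window.
non_users = {"eng_a": 30, "eng_b": 24, "eng_c": 36}
copilot_users = {"eng_d": 33, "eng_e": 30, "eng_f": 36}

baseline = throughput_per_engineer(non_users, weeks=6)
with_copilot = throughput_per_engineer(copilot_users, weeks=6)
uplift = percentage_increase(baseline, with_copilot)
print(f"Throughput uplift for Copilot users: {uplift:.1f}%")
```

The same functions work for story points or pull requests, provided both groups are counted in the same unit over the same period.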

Sprint Velocity considers the rate of work achieved within a sprint and how it varies over time. It can be calculated in tickets or story points. Advanced analytics tools like Plandek will also show you the amount of work carried over by Sprint to see an even better underlying measure of delivery.

This would be a key metric when considering the impact of GitHub Copilot.

2. Time to Value

Cycle Time is a core agile software delivery metric which tracks an organisation’s ability to deliver software early and often. It calculates the time taken to deliver an increment of software from dev start to deployment. The shorter the Cycle Time, the shorter the feedback loops, hence the quicker the organisation is going to receive new features and respond to customer needs. This is a vital KPI when assessing technology delivery efficiency.

Code Cycle Time typically accounts for 20-30% of overall Cycle Time. It calculates the average time taken from a pull request (PR) being opened until it is merged or closed. The bulk of this time is usually spent during the approval process.

In theory, GitHub Copilot enables quicker, easier development. Therefore, developers should have greater availability to review each other’s PRs. If code quality is improved, then reviews should result in fewer changes requested and a shorter approval time.
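
Code Cycle Time as described above can be sketched in a few lines. This is a simplified example that assumes you already have opened and merged/closed timestamps for each PR; the function name and the sample timestamps are hypothetical, and PRs that are still open are simply left out of the average.

```python
from datetime import datetime
from typing import Optional


def code_cycle_time_hours(prs: list[tuple[datetime, Optional[datetime]]]) -> float:
    """Mean hours between a PR being opened and being merged or closed.

    PRs that are still open (no end timestamp) are ignored.
    """
    durations = [
        (finished - opened).total_seconds() / 3600
        for opened, finished in prs
        if finished is not None
    ]
    if not durations:
        raise ValueError("no merged or closed pull requests in the sample")
    return sum(durations) / len(durations)


# Hypothetical PR timestamps: (opened, merged or closed).
sample_prs = [
    (datetime(2023, 5, 1, 9, 0), datetime(2023, 5, 2, 9, 0)),    # 24 hours
    (datetime(2023, 5, 2, 10, 0), datetime(2023, 5, 4, 10, 0)),  # 48 hours
    (datetime(2023, 5, 3, 8, 0), datetime(2023, 5, 3, 20, 0)),   # 12 hours
    (datetime(2023, 5, 4, 9, 0), None),                          # still open
]

average_hours = code_cycle_time_hours(sample_prs)
print(f"Average Code Cycle Time: {average_hours:.1f} hours")
```

Running this separately for Copilot users and non-users, over the same period, gives the two figures to compare.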

3. Quality

Escaped Defects is a simple but effective measure of overall software delivery quality. It can be tracked in numerous ways, but most involve tracking defects by criticality/priority.

Any analysis of delivery efficiency pre/post the implementation of GitHub Copilot should include consideration of Escaped Defect rates as it would be a poor trade-off to increase velocity and ‘productivity’ at the expense of quality.

Build Failure Rate identifies the percentage of builds which fail and the overall risk this poses to a team working productively. Notable changes to the failure rate after implementing GitHub Copilot are an indicator that code quality may be impacted.
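
A minimal sketch of the Build Failure Rate comparison follows. It assumes build outcomes are available as a simple list of pass/fail results; the function name and the before/after samples are hypothetical.

```python
def build_failure_rate(build_results: list[bool]) -> float:
    """Percentage of builds that failed; True means the build passed."""
    if not build_results:
        raise ValueError("at least one build result is required")
    failed = sum(1 for passed in build_results if not passed)
    return failed / len(build_results) * 100.0


# Hypothetical build outcomes before and after introducing GitHub Copilot.
builds_before = [True] * 18 + [False] * 2
builds_after = [True] * 17 + [False] * 3

rate_before = build_failure_rate(builds_before)
rate_after = build_failure_rate(builds_after)
change_in_points = rate_after - rate_before
print(f"Build Failure Rate: {rate_before:.1f}% -> {rate_after:.1f}% "
      f"({change_in_points:+.1f} percentage points)")
```

A rise like the one in this made-up sample would be a prompt to look more closely at code quality rather than proof of a problem, since small samples of builds can vary a lot.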

4. Dependability

Sprint Target Completion tracks the percentage of the sprint goals achieved each cycle. ‘Scrum Teams’ and ‘Sprints’ are the basic building blocks of Scrum Agile software delivery. If Scrum Teams consistently deliver their Sprint goals, Agile software delivery becomes relatively dependable, enabling the prediction of delivery outcomes across multiple teams and longer time periods.

Scrum team predictability is, therefore, a critical success criterion in Agile software delivery. If GitHub Copilot can improve the likelihood of a team delivering their tickets faster and with fewer bugs, then this is a major contributor to the overall improvement in effectiveness.
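
One simple way to approximate Sprint Target Completion is shown below. It assumes that sprint goals can be represented by the set of tickets committed at sprint start, which is only a proxy; the function name and ticket IDs are hypothetical. Unplanned tickets completed during the sprint do not count towards the target.

```python
def sprint_target_completion(committed: set[str], done: set[str]) -> float:
    """Percentage of tickets committed at sprint start that were completed."""
    if not committed:
        raise ValueError("the sprint must have at least one committed ticket")
    return len(committed & done) / len(committed) * 100.0


# Hypothetical sprint: four tickets committed, three of them completed,
# plus one unplanned ticket (T-9) that is excluded from the calculation.
committed_tickets = {"T-1", "T-2", "T-3", "T-4"}
completed_tickets = {"T-1", "T-2", "T-3", "T-9"}

completion = sprint_target_completion(committed_tickets, completed_tickets)
print(f"Sprint Target Completion: {completion:.0f}%")
```

Tracking this figure sprint by sprint, before and after adoption, shows whether delivery has become more dependable.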

5. Developer Satisfaction

eNPS tracks employee satisfaction and loyalty within teams and organisations. Anecdotal reports suggest that developers find that GitHub Copilot makes the more tedious aspects of coding less taxing and positively impacts wellbeing. An employee NPS makes this straightforward to validate and quantify.

Although an important factor of productivity measurement, it shouldn’t be viewed in isolation from the other metrics when quantifying overall developer productivity.
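
For reference, eNPS is conventionally calculated from a 0-10 "how likely are you to recommend" question: the percentage of promoters (scores of 9 or 10) minus the percentage of detractors (scores of 0 to 6). The sketch below applies that standard convention; the function name and survey responses are hypothetical.

```python
def enps(scores: list[int]) -> float:
    """Employee NPS: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("at least one survey response is required")
    if any(score < 0 or score > 10 for score in scores):
        raise ValueError("scores must be between 0 and 10")
    promoters = sum(1 for score in scores if score >= 9)
    detractors = sum(1 for score in scores if score <= 6)
    return (promoters - detractors) / len(scores) * 100.0


# Hypothetical survey responses from the same team, before and after adoption.
responses_before = [9, 7, 6, 8, 10, 5, 7, 9, 6, 8]
responses_after = [10, 9, 9, 8, 7, 6, 3, 10, 9, 5]

enps_before = enps(responses_before)
enps_after = enps(responses_after)
print(f"eNPS before: {enps_before:+.0f}, after: {enps_after:+.0f}")
```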

The above are some examples of relevant metrics to consider when analysing the impact of GitHub Copilot on delivery productivity. The key is to take a balanced set of metrics that holistically treats software delivery as a complex process.

Combining the balanced scorecard of metrics to create a business case for GitHub Copilot

Typically, we would combine data from the ‘balanced scorecard’ of metrics discussed above using simple weightings to create an overall Productivity Impact Assessment (PIA) of GitHub Copilot. See the table below:

GitHub Copilot Productivity Impact Assessment – example template

The weighted average productivity improvement calculated in the PIA can then be applied to the estimated cost of the delivery capability (headcount x fully loaded staff costs). This provides a monetary estimate of the productivity improvement based on resource costs. It excludes the (potentially larger) benefits of delivering more value to customers earlier, which are not easily calculated.
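
The calculation described above can be sketched as follows. Every number here is an illustrative assumption rather than measured data: the per-category improvements, the weightings, the headcount and the fully loaded cost are all hypothetical, and the licence price simply uses the c. $10 per user per month mentioned earlier. The function names are also hypothetical.

```python
def weighted_improvement(improvements: dict[str, float],
                         weights: dict[str, float]) -> float:
    """Weighted average of per-category percentage improvements."""
    if improvements.keys() != weights.keys():
        raise ValueError("improvements and weights must cover the same categories")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(improvements[name] * weights[name] for name in improvements)
    return weighted_sum / total_weight


def annual_productivity_value(headcount: int, fully_loaded_cost: float,
                              improvement_pct: float) -> float:
    """Monetary value of the improvement: headcount x cost x improvement."""
    return headcount * fully_loaded_cost * improvement_pct / 100.0


# Hypothetical percentage improvements per scorecard category
# (a negative value means the metric got worse).
improvements = {
    "velocity_and_throughput": 8.0,
    "time_to_value": 6.0,
    "quality": -2.0,
    "dependability": 4.0,
    "developer_satisfaction": 10.0,
}
# Hypothetical weightings; choose your own to reflect your priorities.
weights = {
    "velocity_and_throughput": 0.30,
    "time_to_value": 0.25,
    "quality": 0.20,
    "dependability": 0.15,
    "developer_satisfaction": 0.10,
}

headcount = 50                 # hypothetical number of engineers
fully_loaded_cost = 100_000.0  # hypothetical annual cost per engineer, in dollars
licence_cost = headcount * 10.0 * 12  # c. $10 per user per month

pia = weighted_improvement(improvements, weights)
gross_value = annual_productivity_value(headcount, fully_loaded_cost, pia)
net_value = gross_value - licence_cost
print(f"PIA: {pia:.1f}%  gross: ${gross_value:,.0f}  net of licences: ${net_value:,.0f}")
```

Dividing by the total weight means the weightings do not need to sum exactly to 1, which makes it easier to adjust them when stakeholders disagree on priorities.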

Productivity improvements from using GitHub Copilot – the empirical data

There is a distinct lack of independent data in this regard.

GitHub’s own survey of 2,000 developers showed that 88% of developers claimed ‘to be more productive’ when using the tool, while a task test undertaken by 95 developers found that the group using GitHub Copilot was 55% faster and had a 7% higher rate of completing the task (see below).

GitHub’s own Survey data – the impact of GitHub Copilot on users (2022)

Survey responses measuring dimensions of developer productivity when using GitHub Copilot

Summary of the experiment process and results

Our own analyses, using a PIA as shown above, show improvements of circa 5%. However, this is bound to improve further given how rapidly AI technology is advancing.

About Plandek

Plandek is an intelligent analytics and performance platform to help software delivery teams deliver valuable software faster and more predictably.

Plandek enables technology teams to track and drive their improvement and share understandable KPIs with stakeholders interested in accelerating value creation and improving delivery efficiency.

Plandek works by mining data from delivery teams’ toolsets (such as issue tracking, code repos and CI/CD tools) to provide actionable and intelligent insight across the end-to-end software delivery process.

Plandek is recognised as a top global vendor in the DevOps Value Stream Management space by Gartner and Forrester and is used by private and public organisations globally to optimise their technology delivery and accelerate R&D ROI.

For more information, please visit www.plandek.com