AI Productivity Metrics in Software Engineering: What Leaders Should Measure

Charlie Ponsonby

Co-founder & CEO

AI increases engineering activity. With AI tools in place, your teams are likely to generate more code, open more pull requests, and so on. Research from GitHub suggests developers can complete tasks up to 55% faster using AI tools.

But more activity does not necessarily mean better software delivery.

So, the harder question for engineering leaders is: what should we measure in order to track and verify real delivery gains?

In many organisations, lead time to production does not improve at the same rate as coding activity. Teams can look more productive in isolation, while outcomes at a system level remain unchanged or become less predictable.

That is what makes AI productivity metrics difficult, and why measuring AI in software engineering requires more than tracking usage, output, or time saved.

In this blog:

  • How DORA and flow metrics apply, and where they fall short for AI

  • The metrics that matter for software development productivity

  • Which AI metrics mislead and why they create false confidence

  • How to move from AI adoption to impact using the RACER framework

DORA and Flow Metrics (And Where They Fall Short for AI Developer Productivity)

You’re probably familiar with DORA metrics, flow metrics, or both. Engineering leaders use them widely to understand software delivery performance. They look at the system from slightly different angles, but are often used together in practice.

| Area | DORA metrics | Flow metrics | Where AI creates a gap |
| --- | --- | --- | --- |
| Primary focus | Delivery performance and stability | Movement of work through the system | AI increases activity unevenly across stages |
| Key measures | Lead time, deployment frequency, change failure rate, MTTR | Cycle time, WIP, throughput, queue time | Metrics may improve locally while system constraints remain |
| Strength | Strong benchmark for delivery outcomes and operational performance | Strong visibility into software engineering bottlenecks and workflow efficiency | Neither fully explains why software development performance changes under AI |
| Typical insight | How fast and reliably software is delivered | Where work slows down or accumulates | Harder to see whether AI is improving flow or just increasing input |
| Limitation in AI context | May not reflect early-stage bottlenecks (e.g. review, testing) until later | May show increased throughput without showing impact on quality or predictability | Neither captures how AI shifts capacity, rework, or delivery stability across the system |

These frameworks still matter, and most engineering leaders should keep using them. The issue is not with the metrics. It is with the context.

DORA and flow metrics were built to measure delivery performance in systems that are broadly behaving as expected. 

AI changes that. It tends to accelerate specific parts of the SDLC first, usually code generation, without removing the constraints around review, testing, release, or planning.

That is why you can see movement in the metrics without seeing real improvement in delivery. 
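To make that concrete, here is a minimal Python sketch of how an unchanged end-to-end lead time can hide a bottleneck moving from coding to review. The stage names and hours are invented for illustration and are not taken from any benchmark; a real breakdown would come from your own ticket and PR timestamps.

```python
from dataclasses import dataclass


@dataclass
class StageTimes:
    """Hours one work item spent in each stage (hypothetical stage names)."""
    coding: float
    review: float
    testing: float
    release: float

    def lead_time(self) -> float:
        # End-to-end lead time is simply the sum of the stage times.
        return self.coding + self.review + self.testing + self.release


def stage_shares(item: StageTimes) -> dict[str, float]:
    """Share of total lead time spent in each stage, rounded to 2 places."""
    total = item.lead_time()
    return {name: round(hours / total, 2) for name, hours in vars(item).items()}


# Illustrative numbers only: coding halves, but review absorbs the difference.
before = StageTimes(coding=40, review=16, testing=16, release=8)
after = StageTimes(coding=20, review=36, testing=16, release=8)

print(before.lead_time(), after.lead_time())  # same total in both periods
print(stage_shares(before))
print(stage_shares(after))
```

In this made-up case the headline lead time does not move at all, while review grows from a fifth of the total to nearly half. The stage breakdown is what shows the shift.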

A System-Level View: What Is AI Productivity in Software Engineering?

If you want to understand whether AI is actually improving productivity, you have to look beyond isolated metric movement and judge whether the system itself is getting healthier.

When AI works effectively, it creates more capacity for value delivery, with better flow, without degrading execution.

In practice, that means improving four things at once:

  • Focus: are teams spending more capacity on roadmap work and value delivery?

  • Speed: is work moving through the system more efficiently?

  • Predictability: are teams delivering more consistently?

  • Quality: is throughput increasing without creating more defects and rework?

This is Plandek’s Four Pillars of Productivity.

The pillars give us a broader way to interpret our metrics, including DORA and flow metrics, particularly in AI-enabled SDLCs.

Software engineering productivity is a system outcome. It does not sit inside one tool or one stage of the workflow. It emerges from how work moves from idea to production. If any part of that system cannot absorb increased throughput, the constraint still controls delivery.
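The idea that the constraint controls delivery can be sketched with a deliberately simplified model: treat each stage as having a weekly capacity, and treat system throughput as the smallest of them. The stage names and capacities below are hypothetical, and real pipelines have variability and feedback loops that this ignores.

```python
def system_throughput(stage_capacity: dict[str, float]) -> tuple[str, float]:
    """Return the constraining stage and its capacity (items per week)."""
    constraint = min(stage_capacity, key=stage_capacity.get)
    return constraint, stage_capacity[constraint]


def weekly_queue_growth(upstream: float, downstream: float) -> float:
    """Items piling up per week when upstream output exceeds downstream capacity."""
    return max(0.0, upstream - downstream)


# Hypothetical capacities in items per week.
baseline = {"coding": 10, "review": 8, "testing": 9, "release": 12}
with_ai = {**baseline, "coding": 18}  # assume AI speeds up coding only

print(system_throughput(baseline))
print(system_throughput(with_ai))
print(weekly_queue_growth(with_ai["coding"], with_ai["review"]))
```

Under these assumptions review is the constraint in both cases, so delivered throughput stays at 8 items per week. The extra coding output only lengthens the review queue.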

This is also why AI does not create uniform gains across teams.

Plandek’s 2026 Engineering Productivity Benchmarks showed lower-performing teams cut Lead Time to Value from 62 days to 33 days and doubled their stories-to-bugs ratio. Higher-performing teams improved too, but much more modestly, with Lead Time to Value moving from 22.5 to 20 days.

The point is not whether a team is “low” or “high” performing. The point is how constrained the system already is. Where review, testing, planning, or release are creating drag, AI can remove a meaningful amount of friction. Where delivery is already efficient, the gains are more incremental.
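As a quick worked check on those benchmark figures, the relative reduction in Lead Time to Value can be computed directly from the before and after values quoted above. The function name is ours, introduced only for this illustration.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage reduction from a baseline value to a new value."""
    return (before - after) / before * 100


lower = relative_reduction(62, 33)     # lower-performing teams, in days
higher = relative_reduction(22.5, 20)  # higher-performing teams, in days

print(f"{lower:.1f}% vs {higher:.1f}%")  # prints "46.8% vs 11.1%"
```

That is roughly a 47% reduction for the more constrained teams against about 11% for teams that were already delivering efficiently.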

The Four Pillars of Productivity: AI productivity metrics engineering leaders should track

Once productivity is defined at a system level, there are really two things to understand:

  • whether AI is being used at all

  • whether it is improving delivery

The Four Pillars of Productivity Framework defines what better delivery looks like (focus, speed, predictability, and quality), and organises metrics into four corresponding groups:

Focus – are we working on the right things?
How much engineering capacity is directed toward value delivery, rather than being consumed by support, rework, or maintenance?

  • Value Delivery %

  • Support and Maintenance %

Speed – are we delivering efficiently?
How quickly does work move from idea to production, and how efficiently do teams convert effort into delivered output?

  • Lead Time to Value

  • Cycle Time

  • Throughput Quotient

  • Time to Merge PRs

  • PR Efficiency Quotient

  • Merge Frequency per author (per week)

Predictability – how consistently are we delivering?
How reliably can teams plan and execute work, without excessive volatility or disruption?

  • Sprint Capacity Accuracy

  • Sprint Target Completion

  • Mid-Sprint Scope Change

  • Velocity Volatility

Quality – are we delivering sustainably?
Is increased throughput creating lasting value, or generating defects and rework that consume future capacity?

  • Bug Resolution Time

  • Stories Delivered : Bugs Raised ratio

  • Bugs Resolved : Bugs Raised ratio

Use these metrics for a system-level view of engineering productivity, to show not just how much work is being done, but how effectively that work turns into delivered value.
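As a rough illustration, here is one metric from each pillar computed in Python from simple inputs. The formulas are our own simplified assumptions, not Plandek's definitions: Value Delivery % as the share of completed tickets tagged "value", Time to Merge as a median in hours, and Velocity Volatility as a coefficient of variation. All data is made up.

```python
from datetime import datetime
from statistics import mean, median, pstdev


def value_delivery_pct(ticket_categories: list[str]) -> float:
    """Focus: share of completed tickets tagged 'value' (assumed tagging scheme)."""
    return 100 * ticket_categories.count("value") / len(ticket_categories)


def median_time_to_merge_hours(prs: list[tuple[datetime, datetime]]) -> float:
    """Speed: median hours between a PR being opened and merged."""
    return median((merged - opened).total_seconds() / 3600 for opened, merged in prs)


def velocity_volatility(velocities: list[float]) -> float:
    """Predictability: standard deviation of sprint velocity over its mean."""
    return pstdev(velocities) / mean(velocities)


def stories_to_bugs(stories_delivered: int, bugs_raised: int) -> float:
    """Quality: stories delivered per bug raised (infinite if no bugs)."""
    return float("inf") if bugs_raised == 0 else stories_delivered / bugs_raised


# Hypothetical sample data for one team over one period.
categories = ["value"] * 6 + ["support"] * 3 + ["rework"]
prs = [
    (datetime(2025, 1, 6, 9), datetime(2025, 1, 6, 17)),  # 8 hours
    (datetime(2025, 1, 7, 9), datetime(2025, 1, 8, 9)),   # 24 hours
    (datetime(2025, 1, 8, 9), datetime(2025, 1, 8, 13)),  # 4 hours
]

print(value_delivery_pct(categories))                 # 60.0
print(median_time_to_merge_hours(prs))                # 8.0
print(round(velocity_volatility([20, 30, 40]), 3))
print(stories_to_bugs(30, 10))                        # 3.0
```

The value is in reading the four numbers together over time: a faster time to merge is only good news if the stories-to-bugs ratio and value delivery share hold or improve alongside it.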

AI metrics that often mislead engineering leaders

The easiest AI metrics to collect are often the least useful.

Most of them sit too close to the point of code generation. They tell you that AI is being used, or that activity is increasing. They do not tell you whether the system is delivering better outcomes.

That is why these metrics need to be treated carefully:

  • AI-generated code percentage – shows usage, not whether that code improved delivery

  • PR volume and commit volume – can rise while review slows, quality drops, or lead time stays flat

  • Self-reported time saved – useful as a supporting signal, but too subjective to treat as proof of productivity

  • Tool usage alone – adoption matters, but adoption is not impact

Even established frameworks can be misread in the same way. DORA metrics, for example, may show stable deployment performance while earlier stages such as review or testing are quietly becoming more constrained under higher AI-driven throughput.

The issue is rarely the metric itself. It is whether you are asking it to answer a question it cannot answer.

If the question is whether AI is improving engineering productivity, metrics that only capture activity, usage, or local speed will not get you there.
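One way to guard against that trap is to never read an activity metric without an outcome metric next to it. The Python sketch below compares the change in PR volume with the change in lead time and labels the combination. The 5% tolerance and the labels are arbitrary assumptions for illustration, not recommended thresholds.

```python
def pct_change(before: float, after: float) -> float:
    """Percentage change from a baseline period to a comparison period."""
    return (after - before) / before * 100


def read_signal(pr_volume_change: float, lead_time_change: float,
                tolerance: float = 5.0) -> str:
    """Label a period by pairing an activity change with an outcome change."""
    if lead_time_change <= -tolerance:
        # Lead time fell meaningfully, whatever happened to activity.
        return "delivery improving"
    if pr_volume_change >= tolerance:
        # More PRs, but lead time is flat or worse.
        return "activity up, delivery not improving"
    return "no clear change"


# Hypothetical before and after figures for one team.
prs = pct_change(before=100, after=150)          # PRs per month, +50%
lead_time = pct_change(before=30.0, after=30.5)  # days, roughly +1.7%

print(read_signal(prs, lead_time))  # prints "activity up, delivery not improving"
```

On PR volume alone this team looks 50% more productive. Paired with lead time, the same data says nothing has reached customers any faster.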

How to move from AI adoption to real impact: the RACER framework

Metrics tell you what is happening in your system. They do not tell you where you are in the journey from AI adoption to real impact – measurable improvements in delivery.

This is where a simple adoption and progression model becomes useful.

We use the RACER framework to structure that journey:

  • Rollout – are AI tools being adopted across teams and workflows?

  • Approach – are they being applied to the right parts of the SDLC?

  • Constraints – where is increased throughput meeting system bottlenecks?

  • Engineering Impact – is AI improving focus, speed, predictability, and quality?

  • Results – are those improvements translating into increased value delivery?

In practice, most teams move quickly through rollout, but slow down when impact depends on the surrounding system.

RACER helps make that progression visible. It connects AI adoption to how work behaves across the SDLC, and whether that ultimately leads to better delivery outcomes.

Used alongside the Four Pillars, it provides a practical way to understand not just whether AI is being used, but whether it is improving how the system performs.

Where Plandek fits: connecting AI activity to real delivery outcomes

Plandek’s best-in-class software brings together data from across your SDLC – planning, development, review, testing, and release – so you can see how work actually flows, where it slows down, and how AI is affecting that flow.

Plandek gives you the visibility you need to understand where and how AI impacts your SDLC and, ultimately, value delivery.

In practice, that means you can:

  • see whether faster coding is improving Lead Time to Value, or just increasing review pressure

  • understand how AI is affecting predictability, not just output

  • track whether quality holds as throughput increases

  • measure whether more capacity is actually shifting toward value delivery

Key takeaways

  • AI increases activity, not necessarily delivery – more code and PRs don’t guarantee faster or better outcomes

  • Productivity is a system outcome – it depends on how work flows from idea to production, not just coding speed

  • DORA and flow metrics still matter – but they don’t fully explain AI’s impact without broader context

  • The Four Pillars define real productivity – focus, speed, predictability, and quality must improve together

  • Most AI metrics are misleading in isolation – usage and output metrics rarely reflect delivery performance

  • AI impact depends on system constraints – without addressing bottlenecks, increased activity won’t translate into value

FAQs

What are AI productivity metrics in software engineering?

AI productivity metrics measure whether AI tools improve software delivery, focusing on system outcomes like speed, quality, predictability, and value delivery, not just activity.

How do you measure AI productivity in engineering teams?

Measure AI productivity by tracking changes across the SDLC, including lead time, quality, predictability, and capacity spent on value delivery, alongside adoption.

Why doesn’t higher AI usage always improve delivery?

Because AI often speeds up coding without fixing bottlenecks in review, testing, or release, leading to more work entering the system without improving output.

Are DORA metrics enough to measure AI impact?

DORA metrics are essential but incomplete for AI. They show delivery performance but don’t fully explain how AI affects flow, constraints, or system behaviour.

What are the most important metrics for AI in software engineering?

The most important metrics align to focus, speed, predictability, and quality – such as Lead Time to Value, Cycle Time, defect ratios, and Value Delivery %.

How does AI affect software engineering bottlenecks?

AI typically increases throughput at the coding stage first, which can expose or worsen bottlenecks in review, testing, and release if those constraints are not addressed.

Written by

Charlie Ponsonby is CEO and Co-founder of Plandek, the leading Developer Productivity Insight (DPI) platform that helps software engineering teams drive productivity and transition to AI-led engineering. He writes widely on the opportunities and challenges inherent in the transition to the agentic SDLC. Prior to founding Plandek, Charlie founded Simplydigital, which grew to become the UK's largest broadband and digital services comparison business before being acquired by Europe's largest consumer electronics retailer. He started his career at Accenture and has held senior leadership roles in retail and telco. Charlie holds a degree from the University of Cambridge.
