The engineering team was shipping good software. But each deployment cycle involved a CI pipeline that took between 45 and 70 minutes to complete. Add the time for human PR review — typically two to four hours in elapsed time, even when the actual review took 20 minutes — and the cycle from opened PR to deployed code was often half a day.
The goal was to cut that in half. The actual result was better: a 60% reduction in pipeline time and a meaningful improvement in review quality.
Here’s what we did, what we tried first that didn’t work, and what the actual change was.
What we tried first (and why it didn’t work)
The initial hypothesis was that the pipeline itself was the bottleneck. Forty-five minutes of CI for a moderately sized codebase is long; there must be inefficiencies in the test execution, the build steps, or the parallelisation.
We used AI to analyse the pipeline configuration and recommend optimisations. It identified several valid improvements — redundant dependency installations, sequential test steps that could run in parallel, cache invalidation that was too aggressive.
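To see why sequential steps dominate, the gap between sequential and parallel execution can be sketched as a critical-path calculation over a graph of pipeline steps. The step names, durations, and dependencies below are hypothetical, not this team's actual pipeline:

```python
from functools import lru_cache

# Hypothetical CI steps: name -> (duration in minutes, dependencies).
STEPS = {
    "install_deps": (4, []),
    "build":        (6, ["install_deps"]),
    "unit_tests":   (12, ["build"]),
    "integration":  (15, ["build"]),
    "lint":         (3, ["install_deps"]),
}

def sequential_minutes(steps):
    """Wall-clock time if every step runs one after another."""
    return sum(duration for duration, _ in steps.values())

def parallel_minutes(steps):
    """Wall-clock time with unlimited parallelism: the longest
    dependency chain (critical path) through the step graph."""
    @lru_cache(maxsize=None)
    def finish(step):
        duration, deps = steps[step]
        return duration + max((finish(d) for d in deps), default=0)
    return max(finish(s) for s in steps)

print(sequential_minutes(STEPS))  # 40
print(parallel_minutes(STEPS))    # 25: install_deps -> build -> integration
```

With these illustrative numbers, simply running independent steps concurrently cuts wall-clock time by well under half, which is roughly the shape of the modest gains the pipeline-level optimisations produced.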
Implementing these changes reduced pipeline time by around 15%. Not bad, but not the improvement we were looking for. And it revealed something more important: the pipeline wasn’t the primary bottleneck.
The bottleneck was the elapsed time between PR creation and CI pass — the human review step, and the back-and-forth that followed when issues were caught late.
What actually moved the needle
The bigger opportunity was earlier in the workflow. If issues were caught before the CI run — during the PR review step — fewer CI runs were needed, and each run was more likely to pass on the first attempt.
AI-assisted PR review was the first change. Every PR triggered an automated review pass before a human reviewer was assigned. The AI agent checked the diff against the codebase’s conventions (encoded in the AGENTS.md and established coding standards), flagged potential issues, and annotated the PR with specific concerns.
Human reviewers then reviewed the AI’s annotations alongside the diff. In practice, this meant human reviewers were faster — they could focus on architecture and design questions and trust that stylistic and convention issues had been flagged. Average human review time dropped from 20 minutes to 12 minutes. More importantly, the rate of “approved but then failed CI” PRs dropped significantly, which eliminated a whole class of wasted pipeline runs.
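As a rough illustration of the pre-review pass, a deterministic stand-in for the AI check might walk a unified diff and annotate added lines that break a convention rule. The rules below are hypothetical placeholders for what a real governance file like AGENTS.md would encode; the actual pass used an AI agent, not regexes:

```python
import re

# Hypothetical convention rules: (pattern, annotation message).
RULES = [
    (re.compile(r"\bprint\("), "use the project logger instead of print()"),
    (re.compile(r"except\s*:"), "bare except: catch a specific exception"),
    (re.compile(r".{101,}"), "line exceeds 100 characters"),
]

def annotate_diff(diff_text):
    """Return (new-file line number, message) annotations for added
    lines in a unified diff that violate a convention rule."""
    annotations = []
    line_no = 0
    for line in diff_text.splitlines():
        if line.startswith("@@"):
            # Read the new-file start line from the hunk header.
            line_no = int(re.search(r"\+(\d+)", line).group(1)) - 1
        elif line.startswith("+") and not line.startswith("+++"):
            line_no += 1
            content = line[1:]
            for pattern, message in RULES:
                if pattern.search(content):
                    annotations.append((line_no, message))
        elif not line.startswith("-"):
            line_no += 1  # context line advances the new-file counter
    return annotations
```

Annotations like these land on the PR before a human is assigned, so the reviewer opens the diff with convention-level concerns already marked and can spend their attention on design.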
Test dependency analysis was the second change. A significant portion of pipeline time came from running the full test suite sequentially. The codebase contained natural groupings of tests with no shared state, but identifying them accurately was manual and error-prone.
An AI-generated dependency graph of the test suite — produced by analysing import relationships and shared fixtures — identified groups of tests that could run concurrently without conflicts. The resulting test partitioning reduced the test execution phase from 28 minutes to 11 minutes.
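The partitioning idea can be sketched as connected components over a test-to-shared-fixture graph: tests that touch a common fixture land in the same group, and distinct groups can run concurrently without conflicts. The test names and fixture map below are hypothetical, standing in for the AI-generated analysis of imports and fixtures:

```python
from collections import defaultdict

# Hypothetical map: test module -> shared fixtures/state it touches.
TEST_FIXTURES = {
    "test_auth":    {"db"},
    "test_billing": {"db", "stripe_mock"},
    "test_parser":  {"sample_files"},
    "test_render":  {"sample_files"},
    "test_utils":   set(),
}

def partition_tests(test_fixtures):
    """Group tests so that tests sharing any fixture land in the same
    group (union-find over the test-fixture graph); distinct groups
    can then be dispatched to concurrent CI workers."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for test, fixtures in test_fixtures.items():
        find(test)  # register isolated tests too
        for f in fixtures:
            union(test, ("fixture", f))  # tag fixtures to avoid name clashes

    groups = defaultdict(set)
    for test in test_fixtures:
        groups[find(test)].add(test)
    return sorted(groups.values(), key=lambda g: sorted(g))
```

On this toy input the partitioner yields three independent groups ({test_auth, test_billing}, {test_parser, test_render}, {test_utils}), so the suite's wall-clock time approaches that of the slowest group rather than the sum of all tests.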
Combined, the two changes took average pipeline time from 52 minutes to 21 minutes: a 60% reduction.
What the data showed about review quality
One unexpected finding: PR defect escape rate — issues merged and discovered post-deployment — dropped by roughly a third over the same period.
The AI review didn’t just speed up the process. It caught a consistent category of issues that human reviewers were systematically missing: convention deviations that were technically correct but inconsistent with the codebase’s established patterns, and edge cases that the implementation hadn’t covered.
Senior engineers, freed from reviewing boilerplate correctness, spent more review attention on architecture and design. The quality of that review improved.
What the hard part actually was
The technical implementation was not the hard part. The hard part was defining what “good” PR review looked like in enough detail for an AI to apply it consistently.
The AGENTS.md file for this codebase went through seven revisions before the AI review was producing output that the team trusted. Each revision came from a specific case where the AI missed something important or flagged something incorrectly. Over two months, the governance file became genuinely precise.
That work — encoding engineering standards explicitly rather than implicitly — turned out to be valuable beyond the AI application. The process of writing it down forced the team to resolve disagreements about conventions they hadn’t realised they had.
If your CI/CD pipeline is a consistent source of friction and you want a structured approach to reducing cycle time, let’s talk.
