I’ve spent the last fifteen years drawing on manufacturing principles to solve software problems. The inspiration started early: working with clients in industrial sectors where defect rates were measured in parts per million and process rigour was the culture, not the exception.
One framework that has served me consistently is the Ishikawa diagram — and I rarely see software engineering teams using it.
Where it comes from
Kaoru Ishikawa was a Japanese quality engineer who developed the fishbone diagram, which spread through Japanese industry in the 1960s as part of the same postwar quality movement that produced the Toyota Production System. The diagram gets its name from its shape: a central spine (the problem) with branching causes spreading out to the sides, like the skeleton of a fish.
The framework organises potential causes into six categories, known as the 6Ms: Man, Machine, Method, Material, Measurement, and Mother Nature (environment). Its discipline comes from forcing you to consider all six categories before settling on a cause, countering the natural human tendency to assume the first plausible explanation is the right one.
Manufacturing used this to reduce defect rates on assembly lines. The underlying problem it solves is universal: when something goes wrong in a complex system, identifying the true root cause is much harder than it looks.
Software has the same problem.
The 6Ms adapted for software
The categories translate cleanly. You don’t need to force-fit manufacturing terminology — the logic holds regardless of the labels.
Code / API Logic — bugs, incorrect business logic, regression from a recent change, integration contract violations between services. This is where engineers tend to look first and most narrowly. It’s the obvious category.
Infrastructure / Server — resource constraints, misconfigured environments, deployment failures, container orchestration issues, dependency version mismatches. The class of problems that reproduces in production but not locally.
People / Process — the human factors. A change deployed without review, a runbook that wasn’t followed, a rollback decision made too late, a miscommunication between teams about a shared dependency. Uncomfortable to put in a diagram, but consistently present in post-mortems.
Data / Database — schema issues, data quality problems, unexpected null values, query performance at scale, migration failures. The category that bites hardest in production because it’s often the hardest to reproduce in staging.
Observability / Measurement — this one is often missing entirely. If you can’t measure it, you may think the system is healthy when it isn’t. Gaps in monitoring, incorrect alerting thresholds, logs that don’t capture the right context — these cause incidents of their own and obscure the root cause of others.
External Dependencies / Network — third-party APIs, downstream services, CDN configuration, DNS resolution, certificate expiry. Problems you didn’t cause but still own the impact of.
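In practice it helps to have the six branches written down before an incident review starts, so no category gets skipped. A minimal sketch of that structure in Python, using the adapted category names from this post (the incident text and hypothesis are illustrative, not prescribed):

```python
# The six adapted branches of the fishbone, as listed above.
FISHBONE_CATEGORIES = [
    "Code / API Logic",
    "Infrastructure / Server",
    "People / Process",
    "Data / Database",
    "Observability / Measurement",
    "External Dependencies / Network",
]

def new_fishbone(problem: str) -> dict:
    """Return an empty fishbone: the spine (problem) plus six cause branches."""
    return {"problem": problem, "causes": {c: [] for c in FISHBONE_CATEGORIES}}

# During the review, hypotheses get attached to branches as they come up.
fb = new_fishbone("API returns 500s for a subset of users")
fb["causes"]["Data / Database"].append("backfill missed reactivated accounts")

# Empty branches are the ones the team hasn't considered yet -- the whole
# point of the diagram is that this list should reach zero before you commit
# to a root cause.
unexplored = [c for c, hyps in fb["causes"].items() if not hyps]
print(f"{len(unexplored)} branches not yet considered")  # prints "5 branches not yet considered"
```

Nothing about this needs to be code, of course; a whiteboard does the same job. The value is the checklist: an explicit record of which branches were considered and which were dismissed without discussion.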
The 5 Whys in practice
The Ishikawa diagram tells you where to look. The 5 Whys tells you how deep to go.
Take a real scenario: your API starts returning 500 errors for a subset of users after a Friday evening deployment. (Yes, this is also one of the reasons you do not ship to production on a Friday — but that’s a separate post.)
Why? — The order creation endpoint is throwing a null pointer exception.
Why? — A field the endpoint depends on is returning null from the database query.
Why? — The query was updated in the deployment to use a JOIN that assumes a related record exists, but some older user accounts don’t have that record.
Why? — The related table was added six months ago and only backfilled for accounts created after that date. The backfill script excluded accounts flagged as inactive, some of which were subsequently reactivated.
Why? — The reactivation flow was built by a different team and didn’t update the flag the backfill script used to filter accounts.
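The faulty assumption in the third "why" can be reproduced in miniature. This sketch uses sqlite3 and hypothetical table names (`users`, `user_profiles`) to show how an inner JOIN silently drops accounts that lack the related record, while a LEFT JOIN surfaces the gap explicitly:

```python
import sqlite3

# Hypothetical miniature of the scenario: one backfilled account, one
# reactivated account that the backfill missed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE user_profiles (user_id INTEGER, tier TEXT);
    INSERT INTO users VALUES (1, 'new_account'), (2, 'reactivated_account');
    -- The backfill only covered user 1; user 2 has no profile row.
    INSERT INTO user_profiles VALUES (1, 'standard');
""")

# The deployed query: an inner JOIN assumes every user has a profile.
inner = conn.execute(
    "SELECT u.id, p.tier FROM users u"
    " JOIN user_profiles p ON p.user_id = u.id ORDER BY u.id"
).fetchall()
print(inner)  # [(1, 'standard')] -- user 2 vanishes from the result entirely

# A LEFT JOIN keeps user 2 and makes the missing record explicit as None,
# so the calling code can handle it instead of blowing up downstream.
left = conn.execute(
    "SELECT u.id, p.tier FROM users u"
    " LEFT JOIN user_profiles p ON p.user_id = u.id ORDER BY u.id"
).fetchall()
print(left)  # [(1, 'standard'), (2, None)]
```

The point is not that LEFT JOIN is the fix; it is that the query encoded an undocumented assumption about the data, and the 5 Whys is what surfaces where that assumption came from.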
The surface cause was a null pointer exception. The root cause was a cross-team data assumption that was never explicitly documented or tested. The fix isn’t just updating the query — it’s adding a contract test between the two services and auditing what other backfills may have made the same assumption.
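A contract test of the kind described can be very small. This is a sketch under assumed names (the `reactivate` function, `status` and `profile` fields are all hypothetical): the reactivation flow owned by one team must leave accounts in a state the other team's queries can rely on, and that expectation runs in CI instead of living in someone's head.

```python
def reactivate(account: dict) -> dict:
    """Hypothetical reactivation flow owned by team A."""
    account = dict(account, status="active")
    # The now-explicit contract: every active account must have a profile
    # record, because downstream order queries JOIN on it.
    if account.get("profile") is None:
        account["profile"] = {"tier": "standard"}  # create the missing record
    return account

def test_reactivated_accounts_satisfy_order_service_contract():
    # An account deactivated before the backfill ran: no profile record.
    stale = {"id": 7, "status": "inactive", "profile": None}
    revived = reactivate(stale)
    # The cross-team data assumption, written down and enforced:
    assert revived["status"] == "active"
    assert revived["profile"] is not None

test_reactivated_accounts_satisfy_order_service_contract()
```

In a real codebase this would live in the shared test suite (or a consumer-driven contract tool) rather than as a loose function, but the shape is the same: the assumption that caused the incident becomes an assertion that fails loudly the next time either team changes the flow.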
Without the 5 Whys, you fix the query. The problem recurs in a different form six months later.
Why this matters more in the AI era
The speed of software delivery has increased significantly with AI-assisted development. MVPs that used to take months take weeks. Features that used to take weeks take days.
The speed of failure has increased proportionally.
When you’re shipping faster, the feedback loop from decision to consequence is shorter. That’s mostly good — you learn faster, you iterate faster. But it also means that systemic weaknesses surface faster. A poorly understood data model, an untested integration assumption, a monitoring gap: these used to cause an incident every six months. Now they might cause one every six weeks.
Rigorous root-cause methodology is not a slowdown. It’s the mechanism that lets you sustain high velocity without accumulating a hidden backlog of systemic failures.
How to start using this
You don’t need to formalise a process on day one. Start with the next incident that warrants a post-mortem.
Before the team digs into the fix, draw the fishbone on a whiteboard (or a shared doc). Label the six branches. Spend ten minutes asking: what could have caused this from each category? Then apply the 5 Whys to the two or three most plausible branches.
You’ll find the root cause faster. And more importantly, you’ll find the right root cause — the one that, when fixed, prevents a category of future incidents rather than just this specific one.
Manufacturing learned this decades ago. It’s time software caught up.
If your engineering team is dealing with recurring incidents or wants to build a stronger post-mortem culture, let’s talk.
