How to Diagnose an Engineering Organization with Data

A while back I did a consulting engagement for a LATAM payments fintech — an organizational diagnosis of their engineering area. I’m sharing it because I believe the approach is replicable — it doesn’t depend on the company or the team size. If you lead engineering at a fintech or startup, you’ll probably recognize several of these patterns.

Names, data, and details have been changed. What’s valuable here is the approach: how to analyze the data, what to prioritize, and why.

This is the first of three posts. Here’s the diagnosis. Next up: metrics strategy and a 90-day execution plan.

The scenario

A payments orchestration platform for marketplaces in LATAM. Marketplaces integrate an API to handle split payments, seller payouts, escrow, and compliance (tax withholding, anti-money laundering).

The problems:

  • Client deadlines slipping
  • Too much firefighting from incidents
  • Unpredictable cycle times (how long a ticket takes from start to delivery) across squads
  • Stakeholders asking for more visibility
  • CTO needing metrics and a clear execution system

Four squads:

| Squad | Purpose | Size | Seniority |
|---|---|---|---|
| Payments Core | Payment orchestration + split engine | 4 eng | 1 Sr, 2 Mid, 1 Jr |
| Payouts | Seller disbursements + reconciliation | 2 eng | 1 Sr, 1 Mid |
| Onboarding | Seller KYC + marketplace integration | 3 eng | 2 Sr, 1 Mid |
| Compliance | AML pipeline + tax withholding | 2 eng | 1 Sr, 1 Mid |

Key terms:

  • Flow efficiency: share of total elapsed time spent actively working (as opposed to waiting)
  • Cycle time P75: how long 75% of tickets take to complete
  • CFR (Change Failure Rate): percentage of deploys that cause an incident
  • GMV (Gross Merchandise Volume): total transaction volume processed
  • MTTR (Mean Time to Restore): average time to bring a service back up

Diagnosis summary

| Squad | Status | Main problem | Action |
|---|---|---|---|
| Payouts | Critical | 13% flow efficiency, 23% CFR on Payout Service | Process + stabilize |
| Payments Core | At risk | 27% flow efficiency, 9% CFR on Split Engine | Improve flow |
| Onboarding | Stable | 44% flow efficiency, <7% CFR | Maintain |
| Compliance | Stable | 32% flow efficiency, <5% CFR | Maintain |

Identifying the bottlenecks

Payouts is the main bottleneck.

  • Flow efficiency at 13% — for every hour of code, almost 7 hours waiting on code reviews, dependencies, and blockers.
  • Cycle time P75 of 19.7 days — most work doesn’t fit in a sprint.
  • CFR of 23% on the Seller Payout Service — nearly 1 in 4 deploys breaks something.

Payments Core is the hidden problem.

  • Flow efficiency at 27% and cycle time P75 of 10.2 days — it doesn’t raise alarms because Payouts is worse.
  • But Core moves $9.3M monthly in GMV between the Split Engine and the Payment Gateway API.
  • Any degradation there has a disproportionate impact.

Onboarding works.

  • 44% flow efficiency and controlled cycle times.
  • Doesn’t need immediate intervention.

Compliance is also stable.

  • 32% flow efficiency, low CFR, reasonable cycle times.

Calculating Revenue at Risk

To prioritize with data instead of intuition:

Revenue at Risk = (Deploys/week × CFR) × (GMV/week) × Severity × (MTTR/168)
  • (Deploys/week × CFR) = expected incidents per week
  • (GMV/week) = weekly processed value (monthly GMV / 4)
  • (Severity) = impact weight (High: 1.0, Medium: 0.5, Low: 0.2)
  • (MTTR/168) = average restore time divided by 168 hours in a week

This is a simplified version. In reality there are more variables — recoveries, fallbacks, retries, queues — that can mitigate or amplify the actual impact.

Example: Seller Payout Service

  • Deploys/week: 1.2
  • CFR: 23% → expected incidents = 1.2 × 0.23 = 0.276/week
  • GMV/week: $1.8M / 4 = $450K
  • Severity: High (1.0)
  • MTTR: 6.5 hours → 6.5 / 168 = 0.0387

Revenue at Risk = 0.276 × $450K × 1.0 × 0.0387 ≈ $4.8K/week

It’s not an exact number — it’s a heuristic for prioritization. But it shifts the conversation from “the payouts service has bugs” to “Payouts puts ~$19K/month in revenue at risk.”
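The formula and the worked example above can be sketched in a few lines. The function name and parameter names are mine; the figures are the ones from the Seller Payout Service example.

```python
def revenue_at_risk(deploys_per_week, cfr, gmv_per_week, severity, mttr_hours):
    """Heuristic: expected weekly revenue exposed to deploy-caused incidents."""
    expected_incidents = deploys_per_week * cfr   # incidents per week
    downtime_fraction = mttr_hours / 168          # share of the week spent down
    return expected_incidents * gmv_per_week * severity * downtime_fraction

# Seller Payout Service, using the figures from the example above
rar = revenue_at_risk(deploys_per_week=1.2, cfr=0.23,
                      gmv_per_week=450_000, severity=1.0, mttr_hours=6.5)
print(f"${rar:,.0f}/week")  # roughly $4.8K/week
```

The point is not precision: plugging in your own deploy cadence, CFR, and MTTR per service is enough to rank where the money is actually exposed.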

By service

| Service | Squad | Monthly GMV | Severity | Revenue at Risk/week |
|---|---|---|---|---|
| Seller Payout Service | Payouts | $1.8M | High | ~$4.8K |
| Reconciliation Engine | Payouts | $1.8M | High | ~$2.3K |
| Split Engine | Payments Core | $4.2M | Medium | ~$1.2K |
| Payment Gateway API | Payments Core | $5.1M | Medium | ~$0.6K |
| Marketplace Connector | Onboarding | $1.1M | Low | ~$0.05K |
| AML Pipeline | Compliance | $3.4M | Low | ~$0.08K |
| Tax Withholding Service | Compliance | $0.6M | Low | ~$0.02K |
| KYC Flow | Onboarding | $0.7M | Low | ~$0.01K |

Payouts accumulates ~$7.1K/week in Revenue at Risk — the most critical squad by far. Payments Core adds ~$1.8K/week, but with $9.3M in monthly GMV, any degradation in CFR or MTTR scales fast.
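The per-squad totals fall straight out of the table. A quick aggregation (the data structure is mine; the values are copied from the table, in $K/week):

```python
# (squad, Revenue at Risk in $K/week) per service, from the table above
services = {
    "Seller Payout Service":   ("Payouts", 4.8),
    "Reconciliation Engine":   ("Payouts", 2.3),
    "Split Engine":            ("Payments Core", 1.2),
    "Payment Gateway API":     ("Payments Core", 0.6),
    "Marketplace Connector":   ("Onboarding", 0.05),
    "AML Pipeline":            ("Compliance", 0.08),
    "Tax Withholding Service": ("Compliance", 0.02),
    "KYC Flow":                ("Onboarding", 0.01),
}

by_squad = {}
for squad, rar in services.values():
    by_squad[squad] = by_squad.get(squad, 0) + rar

for squad, total in sorted(by_squad.items(), key=lambda kv: -kv[1]):
    print(f"{squad}: ~${total:.2f}K/week")
```

Ranking squads by this total, rather than eyeballing individual services, is what surfaces Payouts as the clear outlier.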

Engineering Leverage

Leverage = Revenue generated / Engineering cost.

Costs are estimated from team size and market-average salaries for each seniority level. Incremental revenue figures come from the finance team.

| Squad | Weekly cost | Incremental revenue | Leverage |
|---|---|---|---|
| Payments Core | $18k | $67k | 3.7x |
| Onboarding | $14k | $38k | 2.7x |
| Compliance | $10k | $24k | 2.4x |
| Payouts | $10k | $7k | 0.7x |
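The leverage column is just the ratio of the two figures. A minimal check (values from the table above, in $K/week):

```python
# (weekly cost, incremental revenue) in $K, from the leverage table
squads = {
    "Payments Core": (18, 67),
    "Onboarding":    (14, 38),
    "Compliance":    (10, 24),
    "Payouts":       (10, 7),
}

for squad, (cost_k, revenue_k) in squads.items():
    print(f"{squad}: {revenue_k / cost_k:.1f}x")
```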

Leverage isn’t the only metric that matters, but it’s the one that aligns engineering with finance the fastest. When the CFO asks “why do we need more headcount?”, this number answers.

Payouts has leverage below 1 — it costs more than it generates. This isn’t the team’s fault. It’s 2 people spending 57% of their time in reactive mode (26% bugs + 31% client requests). It’s a process and staffing problem.

Misaligned Time Allocation

Payouts spends only 29% on roadmap (should be >50%) and 57% on reactive work. There’s no intake process — everything comes in unfiltered and unprioritized.

What to escalate to the CTO (and how)

  1. Payouts with sub-1 leverage (0.7x) — spending more than it generates, ~$7.1K/week in Revenue at Risk
  2. Seller Payout Service unstable — 23% CFR, High severity, blocks seller disbursements
  3. Payments Core is bad but hidden — $9.3M in monthly GMV with 27% flow efficiency
  4. Client commitments at risk — cycle time P75 of 19.7 days in Payouts is incompatible with SLAs (the service level agreements promised to clients)

First 4 weeks

Step 1: Fix Payouts. Intake process for client requests. Protect at least 50% of time for roadmap. Identify root causes of blockers.

Step 2: Stabilize Seller Payout Service. Review architecture and test coverage. Feature flags for fast rollback. Deploy freeze on Fridays.

Step 3: Investigate Payments Core. Map what’s driving the 27% flow efficiency. Slow reviews? Cross-service dependencies? Unclear ownership?

In parallel: Evaluate headcount for Payouts — what profile, context, and justification.

How to replicate this

If you want to run a similar diagnosis:

  1. Map squads → services → GMV. Without this mapping from engineering work to business outcomes, you can’t measure impact.
  2. Measure flow efficiency and CFR per squad. You don’t need sophisticated tooling — Git history plus your issue tracker is enough. This is where an engineering leader who codes shows real strength: getting into the code, understanding what breaks, how the system works, and how long it takes to ship to production, so you can identify the problems, the blockers, and the flows that actually work.
  3. Calculate Revenue at Risk to prioritize with data. You can start with the simple formula I propose here and refine it over time based on your team, goals, context, and other variables that change along the way.
  4. Present leverage as an investment argument, not as a performance evaluation. This metric works as a guide for where to move: where you can generate more impact or where you need to strengthen the team.
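As a rough sketch of step 2, flow efficiency can be computed from status-change events exported from an issue tracker. The event shape and the "active"/"waiting" states here are assumptions for illustration, not a real tracker API:

```python
from datetime import datetime, timedelta

# Hypothetical status-change events for one ticket.
# "active" = someone is working on it; "waiting" = blocked, in review queue, etc.
events = [
    (datetime(2024, 5, 1, 9, 0),  "active"),
    (datetime(2024, 5, 1, 17, 0), "waiting"),  # sent to code review
    (datetime(2024, 5, 3, 11, 0), "active"),   # review feedback picked up
    (datetime(2024, 5, 3, 15, 0), "done"),
]

def flow_efficiency(events):
    """Active time divided by total elapsed time, from ordered (timestamp, state) pairs."""
    active = timedelta()
    total = events[-1][0] - events[0][0]
    for (start, state), (end, _) in zip(events, events[1:]):
        if state == "active":
            active += end - start
    return active / total  # dividing timedeltas yields a float

print(f"{flow_efficiency(events):.0%}")
```

Aggregating this per ticket and then per squad over a few weeks of history is usually enough to reproduce numbers like the 13% and 27% figures above.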

This is a first approach with limited context. Once you’re inside the organization studying the teams, services, and day-to-day up close, you can go much deeper and refine the decisions made. But as a starting point, this kind of diagnosis already gives you direction.

The diagnosis is about connecting engineering data to business decisions. When you can say “this squad has sub-1 leverage and its services put ~$28K/month at risk,” the conversation with the CTO changes completely.

Next post: how to build the metrics system that sustains this kind of analysis over time.

If you want to dig deeper into the metrics I used as reference, check out the DORA framework.
