How to Diagnose an Engineering Organization with Data
A while back I did a consulting engagement for a LATAM payments fintech: an organizational diagnosis of their engineering area. I'm sharing it because I believe the approach is replicable; it doesn't depend on the company or the team size. If you lead engineering at a fintech or startup, you'll probably recognize several of these patterns.
Names, data, and details have been changed. What's valuable here is the approach: how to analyze the data, what to prioritize, and why.
This is the first of three posts. Here's the diagnosis. Next up: metrics strategy and a 90-day execution plan.
The scenario
A payments orchestration platform for marketplaces in LATAM. Marketplaces integrate an API to handle split payments, seller payouts, escrow, and compliance (tax withholding, anti-money laundering).
The problems:
- Client deadlines slipping
- Too much firefighting from incidents
- Unpredictable cycle times (how long a ticket takes from start to delivery) across squads
- Stakeholders asking for more visibility
- CTO needing metrics and a clear execution system
Four squads:
| Squad | Purpose | Size | Seniority |
|---|---|---|---|
| Payments Core | Payment orchestration + split engine | 4 eng | 1 Sr, 2 Mid, 1 Jr |
| Payouts | Seller disbursements + reconciliation | 2 eng | 1 Sr, 1 Mid |
| Onboarding | Seller KYC + marketplace integration | 3 eng | 2 Sr, 1 Mid |
| Compliance | AML pipeline + tax withholding | 2 eng | 1 Sr, 1 Mid |
Key terms:
- Flow efficiency: active work time as a share of total elapsed time (active work + waiting)
- Cycle time P75: the time within which 75% of tickets are completed
- CFR (Change Failure Rate): percentage of deploys that cause an incident
- GMV (Gross Merchandise Volume): total transaction volume processed
- MTTR (Mean Time to Restore): average time to bring a service back up
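To make these definitions concrete, here is a minimal Python sketch of how the first three could be computed. The `Ticket` record shape, the sample numbers, and the nearest-rank percentile method are assumptions for illustration, not the tooling used in the engagement.

```python
import math
from dataclasses import dataclass


@dataclass
class Ticket:
    # Hypothetical record shape; adapt it to what your issue tracker exports.
    active_hours: float   # time someone was actually working on the ticket
    elapsed_hours: float  # total time from start to delivery (active + waiting)


def flow_efficiency(tickets: list[Ticket]) -> float:
    """Active work time as a share of total elapsed time."""
    total_elapsed = sum(t.elapsed_hours for t in tickets)
    if total_elapsed == 0:
        return 0.0
    return sum(t.active_hours for t in tickets) / total_elapsed


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Smallest value such that at least pct% of the values are <= it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def change_failure_rate(total_deploys: int, failed_deploys: int) -> float:
    """Share of deploys that caused an incident."""
    return failed_deploys / total_deploys if total_deploys else 0.0


# Made-up sample data, only to exercise the functions.
tickets = [Ticket(2, 20), Ticket(5, 30), Ticket(6, 50)]
efficiency = flow_efficiency(tickets)  # 13 active hours out of 100 elapsed
waiting_per_active_hour = (1 - efficiency) / efficiency
cycle_days = [3.0, 5.0, 8.0, 21.0]
p75 = percentile_nearest_rank(cycle_days, 75)
cfr = change_failure_rate(total_deploys=13, failed_deploys=3)

print(f"flow efficiency: {efficiency:.0%}")
print(f"waiting per active hour: {waiting_per_active_hour:.1f}h")
print(f"cycle time P75: {p75} days")
print(f"CFR: {cfr:.0%}")
```

Other percentile conventions (interpolated ones, for example) give slightly different P75 values on small samples, so pick one method and keep it consistent across squads.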
Diagnosis summary
| Squad | Status | Main problem | Action |
|---|---|---|---|
| Payouts | Critical | 13% flow efficiency, 23% CFR on Payout Service | Process + stabilize |
| Payments Core | At risk | 27% flow efficiency, 9% CFR on Split Engine | Improve flow |
| Onboarding | Stable | 44% flow efficiency, <7% CFR | Maintain |
| Compliance | Stable | 32% flow efficiency, <5% CFR | Maintain |
Identifying the bottlenecks
Payouts is the main bottleneck.
- Flow efficiency at 13%: for every hour of coding, almost 7 hours are spent waiting on code reviews, dependencies, and blockers.
- Cycle time P75 of 19.7 days: a quarter of tickets take even longer than that, so much of the work doesn't fit in a sprint.
- CFR of 23% on the Seller Payout Service: nearly 1 in 4 deploys breaks something.
Payments Core is the hidden problem.
- Flow efficiency at 27% and cycle time P75 of 10.2 days: it doesn't raise alarms because Payouts is worse.
- But Core moves $9.3M monthly in GMV between the Split Engine and the Payment Gateway API.
- Any degradation there has a disproportionate impact.
Onboarding works.
- 44% flow efficiency and controlled cycle times.
- Doesn't need immediate intervention.
Compliance is also stable.
- 32% flow efficiency, low CFR, reasonable cycle times.
Calculating Revenue at Risk
To prioritize with data instead of intuition:
Revenue at Risk = (Deploys/week × CFR) × (GMV/week) × Severity × (MTTR/168)
- (Deploys/week × CFR) = expected incidents per week
- (GMV/week) = weekly processed value (monthly GMV / 4)
- (Severity) = impact weight (High: 1.0, Medium: 0.5, Low: 0.2)
- (MTTR/168) = average restore time divided by 168 hours in a week
This is a simplified version. In reality there are more variables (recoveries, fallbacks, retries, queues) that can mitigate or amplify the actual impact.
Example: Seller Payout Service
- Deploys/week: 1.2
- CFR: 23% → expected incidents = 1.2 × 0.23 = 0.276/week
- GMV/week: $1.8M / 4 = $450K
- Severity: High (1.0)
- MTTR: 6.5 hours → 6.5 / 168 = 0.0387
Revenue at Risk = 0.276 × $450K × 1.0 × 0.0387 ≈ $4.8K/week
It's not an exact number; it's a heuristic for prioritization. But it shifts the conversation from "the payouts service has bugs" to "Payouts puts ~$19K/month in revenue at risk."
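The same arithmetic as a small Python function, in case you want to run it over your own services. This is a sketch of the simplified formula only; the function and constant names are mine, and the inputs are the Seller Payout Service figures from the example.

```python
HOURS_PER_WEEK = 168
WEEKS_PER_MONTH = 4  # same approximation as the formula (monthly GMV / 4)
SEVERITY_WEIGHT = {"high": 1.0, "medium": 0.5, "low": 0.2}


def revenue_at_risk_per_week(deploys_per_week, cfr, monthly_gmv, severity, mttr_hours):
    """Simplified heuristic: expected incidents x weekly GMV x severity x downtime share."""
    expected_incidents = deploys_per_week * cfr
    weekly_gmv = monthly_gmv / WEEKS_PER_MONTH
    downtime_share = mttr_hours / HOURS_PER_WEEK
    return expected_incidents * weekly_gmv * SEVERITY_WEIGHT[severity] * downtime_share


seller_payout = revenue_at_risk_per_week(
    deploys_per_week=1.2,
    cfr=0.23,
    monthly_gmv=1_800_000,
    severity="high",
    mttr_hours=6.5,
)

print(f"~${seller_payout / 1000:.1f}K/week")                     # ~$4.8K/week
print(f"~${seller_payout * WEEKS_PER_MONTH / 1000:.0f}K/month")  # ~$19K/month
```

Keeping the severity weights in one dictionary makes it easy to argue about them with finance later without touching the formula.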
By service
| Service | Squad | Monthly GMV | Severity | Revenue at Risk/week |
|---|---|---|---|---|
| Seller Payout Service | Payouts | $1.8M | High | ~$4.8K |
| Reconciliation Engine | Payouts | $1.8M | High | ~$2.3K |
| Split Engine | Payments Core | $4.2M | Medium | ~$1.2K |
| Payment Gateway API | Payments Core | $5.1M | Medium | ~$0.6K |
| Marketplace Connector | Onboarding | $1.1M | Low | ~$0.05K |
| AML Pipeline | Compliance | $3.4M | Low | ~$0.08K |
| Tax Withholding Service | Compliance | $0.6M | Low | ~$0.02K |
| KYC Flow | Onboarding | $0.7M | Low | ~$0.01K |
Payouts accumulates ~$7.1K/week in Revenue at Risk, which makes it the most critical squad by far. Payments Core adds ~$1.8K/week, but with $9.3M in monthly GMV, any degradation in CFR or MTTR scales fast.
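A quick way to reproduce the squad totals from the per-service table, sketched in Python. The weekly figures are copied from the table; the `scaled_risk` what-if is a hypothetical scenario I'm adding to show why degradation "scales fast": the formula is linear in both CFR and MTTR, so their changes multiply.

```python
from collections import defaultdict

# (service, squad, Revenue at Risk in $K/week), copied from the table above.
SERVICES = [
    ("Seller Payout Service", "Payouts", 4.8),
    ("Reconciliation Engine", "Payouts", 2.3),
    ("Split Engine", "Payments Core", 1.2),
    ("Payment Gateway API", "Payments Core", 0.6),
    ("Marketplace Connector", "Onboarding", 0.05),
    ("AML Pipeline", "Compliance", 0.08),
    ("Tax Withholding Service", "Compliance", 0.02),
    ("KYC Flow", "Onboarding", 0.01),
]


def risk_by_squad(services):
    """Sum weekly Revenue at Risk per squad, highest first."""
    totals = defaultdict(float)
    for _service, squad, weekly_k in services:
        totals[squad] += weekly_k
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {squad: round(total, 2) for squad, total in ranked}


def scaled_risk(weekly_k, cfr_factor=1.0, mttr_factor=1.0):
    """What-if: Revenue at Risk is linear in CFR and MTTR, so factors multiply."""
    return weekly_k * cfr_factor * mttr_factor


totals = risk_by_squad(SERVICES)
for squad, weekly_k in totals.items():
    print(f"{squad}: ~${weekly_k}K/week")

# Hypothetical scenario: CFR and MTTR both double on Payments Core.
print(f"Payments Core what-if: ~${scaled_risk(totals['Payments Core'], 2, 2):.1f}K/week")
```

In that hypothetical scenario Payments Core would sit at roughly the level Payouts is at today, which is the reason it belongs on the watch list even though its current number looks small.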
Engineering Leverage
Leverage = Revenue generated / Engineering cost.
Costs are estimated from team size and market-average salaries for each seniority level. Incremental revenue comes from the finance team.
| Squad | Weekly cost | Weekly incremental revenue | Leverage |
|---|---|---|---|
| Payments Core | $18K | $67K | 3.7x |
| Onboarding | $14K | $38K | 2.7x |
| Compliance | $10K | $24K | 2.4x |
| Payouts | $10K | $7K | 0.7x |
Leverage isn't the only metric that matters, but it's the one that aligns engineering with finance the fastest. When the CFO asks "why do we need more headcount?", this number answers.
Payouts has leverage below 1: it costs more than it generates. This isn't the team's fault. It's 2 people spending 57% of their time in reactive mode (26% bugs + 31% client requests). It's a process and staffing problem.
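The leverage column is a single division, but it helps to keep it in a script so it can be rerun when finance updates the revenue figures. A minimal sketch, assuming cost and incremental revenue in the table cover the same weekly period; the names are illustrative.

```python
# (squad, weekly cost in $K, incremental revenue in $K), from the table above.
SQUADS = [
    ("Payments Core", 18, 67),
    ("Onboarding", 14, 38),
    ("Compliance", 10, 24),
    ("Payouts", 10, 7),
]


def leverage(revenue_k, cost_k):
    """Revenue generated divided by engineering cost, for the same period."""
    if cost_k <= 0:
        raise ValueError("cost must be positive")
    return revenue_k / cost_k


report = {squad: round(leverage(revenue, cost), 1) for squad, cost, revenue in SQUADS}
below_break_even = [squad for squad, value in report.items() if value < 1.0]

for squad, value in report.items():
    print(f"{squad}: {value}x")
print("Below break-even:", below_break_even)
```

A squad on the break-even list is a prompt to look at staffing and process, in line with the point above, not a verdict on the people in it.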
Misaligned Time Allocation
Payouts spends only 29% on roadmap (should be >50%) and 57% on reactive work. There's no intake process: everything comes in unfiltered and unprioritized.
What to escalate to the CTO (and how)
- Payouts with leverage below 1 (0.7x): it costs more than it generates, with ~$7.1K/week in Revenue at Risk
- Seller Payout Service unstable: 23% CFR, High severity, blocks seller disbursements
- Payments Core is bad but hidden: $9.3M in monthly GMV with 27% flow efficiency
- Client commitments at risk: cycle time P75 of 19.7 days in Payouts is incompatible with SLAs (the service level agreements promised to clients)
First 4 weeks
Step 1: Fix Payouts. Intake process for client requests. Protect at least 50% of time for roadmap. Identify root causes of blockers.
Step 2: Stabilize Seller Payout Service. Review architecture and test coverage. Feature flags for fast rollback. Deploy freeze on Fridays.
Step 3: Investigate Payments Core. Map what's driving the 27% flow efficiency. Slow reviews? Cross-service dependencies? Unclear ownership?
In parallel: Evaluate headcount for Payouts â what profile, context, and justification.
How to replicate this
If you want to run a similar diagnosis:
- Map squads → services → GMV. Without this link between engineering and the business, you can't measure impact.
- Measure flow efficiency and CFR per squad. You don't need sophisticated tooling; Git history plus your issue tracker is enough. This is also where an engineering leader who codes shows their real strength: getting into the code, understanding what breaks, how everything fits together, and how long it takes to deploy something to production. Identify the problems and blockers, but also the flows that actually work.
- Calculate Revenue at Risk to prioritize with data. You can start with the simple formula I propose here and refine it over time based on your team, goals, context, and other variables that change along the way.
- Present leverage as an investment argument, not as a performance evaluation. This metric works as a guide for where to move: where you can generate more impact or where you need to strengthen the team.
This is a first approach with limited context. Once you're inside the organization studying the teams, services, and day-to-day up close, you can go much deeper and refine the decisions made. But as a starting point, this kind of diagnosis already gives you direction.
The diagnosis is about connecting engineering data to business decisions. When you can say "this squad has leverage below 1 and puts ~$28K/month at risk," the conversation with the CTO changes completely.
Next post: how to build the metrics system that sustains this kind of analysis over time.
If you want to dig deeper into the metrics I used as reference, check out the DORA framework.