Originally written in March 2018
A few years into building large-scale systems, I realised our biggest breakthroughs didn’t come from a clever framework or a shiny new cloud service. They came when the technology and the team culture clicked—when our platform made the right things easy, and our people had the space (and the habits) to do them well.
You can assemble the usual ingredients—microservices, APIs, containers, big data tooling, IaC, CI/CD, DevOps, agile—and still end up with a brittle system and a tired team. The difference is almost always balance: choosing the right level of standardisation, the right team structure, and the right boundaries between “build” and “run.” What follows is the shape that balance has taken for me in practice.
When the platform is right, it mostly disappears. Engineers don’t spend their mornings wrestling with bespoke scripts or debating which base image to use. They write a service, describe how it should run, and the platform handles the rest: building, deploying, observing, and scaling it in a consistent way.
For us, that meant containers and a standard runtime. Docker/Kubernetes wasn’t the point; standardisation was. Services spoke to each other through stable, versioned APIs; logs and metrics followed shared conventions; dashboards had a familiar layout; and deployments moved through environments in predictable steps. Moving a service to a different cluster—or even a different cloud—became a conversation about capacity and configuration, not a rewrite.
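To make that "describe how it should run" idea concrete, here is a minimal sketch of a service descriptor in Python. The `ServiceDescriptor` name and its fields are my illustration, not our actual schema; a real platform would translate something like this into Kubernetes manifests and Helm values.

```python
from dataclasses import dataclass, field

@dataclass
class ServiceDescriptor:
    """What a team declares about a service; the platform handles the rest."""
    name: str
    image: str                      # built and tagged by the shared pipeline
    api_version: str                # the versioned contract other services depend on
    replicas: int = 2
    cpu: str = "250m"
    memory: str = "256Mi"
    metrics_path: str = "/metrics"  # shared observability convention
    environments: list = field(default_factory=lambda: ["dev", "staging", "prod"])

# A team describes how the service should run; moving clusters becomes a
# conversation about capacity and configuration, not a rewrite.
orders = ServiceDescriptor(
    name="orders",
    image="registry.example.com/orders:1.4.2",  # hypothetical registry
    api_version="v2",
    replicas=4,
)
print(orders)
```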
The key to keeping that platform consistent was Infrastructure as Code. We treated the platform like a product. Terraform modules, Helm charts, IAM policies, dashboards, and runbooks all lived in version control and travelled through the same pull-request gates. If we changed a node pool or added a WAF rule, it was reviewed, tested, and tagged—not executed as a one-off shell session on a Friday afternoon.
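One way to picture that discipline is a policy gate the pipeline could run over a machine-readable plan of proposed infrastructure changes. The plan structure and the protected resource types below are invented for the sketch; treat it as an illustration of the idea, not our actual gate.

```python
import json
import sys

# Resource types we never let an automated apply delete; an illustrative list.
PROTECTED_TYPES = {"iam_role", "database_instance"}

def destructive_changes(plan_path: str) -> list:
    """Scan a JSON summary of proposed infra changes and flag deletions of
    protected resources. The plan format here is invented for the sketch."""
    with open(plan_path) as handle:
        plan = json.load(handle)
    violations = []
    for change in plan.get("resource_changes", []):
        if "delete" in change.get("actions", []) and change.get("type") in PROTECTED_TYPES:
            violations.append(f"{change.get('address')} would be deleted")
    return violations

if __name__ == "__main__":
    problems = destructive_changes(sys.argv[1])
    if problems:
        print("Change blocked pending human review:")
        print("\n".join(f"  - {p}" for p in problems))
        sys.exit(1)
    print("No destructive changes to protected resources.")
```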
Another lesson: coupling data tightly to services is a great way to slow everyone down. We pushed hard on decoupling: events carried well-defined schemas; storage contracts were explicit; services owned their data but didn’t pretend to be the only consumers of it. “Big data” wasn’t a trophy term—it just meant picking technologies that handled volume and variety while encouraging loose coupling. Once the event streams and schemas stabilised, new use cases stopped requiring permission slips from five different teams.
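As a sketch of what "events carried well-defined schemas" means in code, here is an invented `OrderPlaced` event with an explicit schema version. The field names are illustrative; the point is that the contract, not the producer's database, is what consumers depend on.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass(frozen=True)
class OrderPlaced:
    """A versioned event contract: the service owns the data, any team may consume it."""
    schema_version: str
    order_id: str
    customer_id: str
    total_cents: int
    occurred_at: str

    def to_json(self) -> str:
        return json.dumps(asdict(self))

event = OrderPlaced(
    schema_version="2",
    order_id="o-1042",
    customer_id="c-77",
    total_cents=15900,
    occurred_at=datetime.now(timezone.utc).isoformat(),
)
# Publish to the shared stream; downstream teams build on the schema, not on our tables.
print(event.to_json())
```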
Automation is only useful if people trust it. Our pipelines started simple—build, test, package—and grew to include contract tests, integration checks, security scans, and end-to-end smoke tests that mimicked real user paths. Promotion between environments was automatic but gated: a service could only advance if it met the quality bar and didn’t violate platform policies. Deployments became boring, which is exactly what you want.
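In code, the promotion gate boils down to something like the sketch below. The check names and thresholds are assumptions on my part, but the shape is the point: promotion is automatic only when the quality bar and the platform policies both pass.

```python
from dataclasses import dataclass

@dataclass
class BuildReport:
    unit_tests_passed: bool
    contract_tests_passed: bool
    critical_vulnerabilities: int
    smoke_test_error_rate: float   # fraction of failed end-to-end smoke requests

def may_promote(report: BuildReport, target_env: str) -> tuple:
    """Promotion is automatic but gated: advance only if the quality bar is met."""
    if not (report.unit_tests_passed and report.contract_tests_passed):
        return False, "tests failing"
    if report.critical_vulnerabilities > 0:
        return False, f"{report.critical_vulnerabilities} critical vulnerabilities open"
    if target_env == "prod" and report.smoke_test_error_rate > 0.01:
        return False, f"smoke error rate {report.smoke_test_error_rate:.1%} above the 1% bar"
    return True, "all gates passed"

ok, reason = may_promote(BuildReport(True, True, 0, 0.002), target_env="prod")
print(("promote" if ok else "hold") + " - " + reason)
```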
We also automated rollbacks and made them humane. If a release tripped a canary or SLO, the system reverted quickly and told us why in human language, not just exit codes. That single change de-escalated releases from high-wire acts to daily routines.
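Stripped to its core, the rollback decision looked roughly like this: compare the canary against the SLO, revert on a breach, and explain why in plain words. The metric names and thresholds below are illustrative.

```python
from dataclasses import dataclass

@dataclass
class CanaryStats:
    error_rate: float       # fraction of 5xx responses from the canary
    p99_latency_ms: float

SLO_ERROR_RATE = 0.005      # illustrative SLO: at most 0.5% errors
SLO_P99_MS = 400            # illustrative SLO: p99 under 400 ms

def evaluate_canary(stats: CanaryStats) -> tuple:
    """Return (rollback?, human-readable reason) instead of an opaque exit code."""
    if stats.error_rate > SLO_ERROR_RATE:
        return True, (f"error rate {stats.error_rate:.2%} breached the "
                      f"{SLO_ERROR_RATE:.2%} SLO; reverting to the previous release")
    if stats.p99_latency_ms > SLO_P99_MS:
        return True, (f"p99 latency {stats.p99_latency_ms:.0f} ms exceeded "
                      f"{SLO_P99_MS} ms; reverting to the previous release")
    return False, "canary healthy; continuing the rollout"

rollback, reason = evaluate_canary(CanaryStats(error_rate=0.012, p99_latency_ms=180))
print(("ROLLBACK: " if rollback else "OK: ") + reason)
```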
“DevOps” as a job title suggests a unicorn who writes domain logic in the morning, tunes BGP in the afternoon, and loves on-call at night. Those folks exist, but they’re rare—and they burn out. The team shape that worked for us acknowledged different strengths without building walls.
Service Developers/Architects lived closest to the product. They shaped APIs, wrote the business logic, instrumented meaningful metrics, and defined alarms that reflected what users actually feel. In the pipeline, they leaned toward CI—tests, quality gates, sign-off.
Platform Developers/Architects built the runway: cluster configs, deployment mechanics, security baselines, observability tooling, golden paths for new services. They leaned toward CD—how software gets to production reliably and repeatably.
Service Support lived where the product meets the user. They knew the common failure modes: the 404 that only appears after a specific feature flag, the data drift that shows up at month end, the pagination bug that hurts one particularly large customer.
Platform Support guarded the shared foundation: nodes, networks, storage, IAM, costs, upgrades. They handled incidents that cut across services—noisy neighbours, failing disks, certificate expiries, and the occasional “why is the east region three milliseconds slower today?”
Was this four teams? No. It was one product organisation with a single roadmap and one backlog viewed from different angles. We synchronised on priorities, held joint postmortems, and rotated people across roles enough to build empathy without forcing everyone into everything. Over time, skills overlapped and the boundaries softened, but we never pretended they didn’t exist.
The clearest proof came during an incident that started as a spike in 404s on a public API. Service Support spotted it first in the user-facing dashboards and pulled in the Service Developers. The metrics told a story: requests failing only when a new feature flag was enabled in a specific region. Platform Support checked the release timeline and canary results—no platform anomalies, healthy nodes, normal latencies. We rolled back the feature via the pipeline, errors disappeared, and within an hour we had a focused PR that fixed the routing logic.
Nothing heroic happened. The platform made rollback safe; the pipeline made promotion honest; the dashboards spoke the same language across service and platform; and the roles clicked into place without a turf war. Users barely noticed.
We didn’t arrive at this shape by decree. We wrote things down: API conventions, logging schemas, deployment descriptors, runbook templates. We built starter kits so new services weren’t blank slates. We paired across roles and made post-incident learnings part of the backlog, not a PDF that gets forgotten. And we kept the bar for automation high: if something hurt twice, we didn’t congratulate ourselves for powering through—we automated the pain away.
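The logging schema was one of those written-down conventions, and the starter kits baked it in. Here is a rough sketch of what such a shared convention can look like in Python; the field names are illustrative, and the point is simply that every service emits the same shape, so dashboards and alerts speak one language.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Shared log shape for every service: same fields, same dashboards."""
    def __init__(self, service: str, env: str):
        super().__init__()
        self.service, self.env = service, env

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": datetime.now(timezone.utc).isoformat(),
            "service": self.service,
            "env": self.env,
            "level": record.levelname,
            "message": record.getMessage(),
        })

# Part of the starter kit: a new service gets structured logging for free.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter(service="orders", env="staging"))
logging.basicConfig(level=logging.INFO, handlers=[handler])
logging.info("service started")
```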
Most importantly, we resisted over-engineering. Not every service needs a custom queue. Not every problem needs a new data store. Standards create speed, and speed creates the space to improve the standards.
Working in a genuinely Agile environment made this balance possible. We ran with a single, shared backlog across platform and service work, so priorities were visible and trade-offs were explicit. Regular grooming kept items small, well-scoped, and testable, which made handoffs lighter and knowledge sharing natural—any engineer could pick up a task and understand its context. Good task breakdown turned big initiatives into steady, incremental wins; retros fed improvements back into both platform and product; and shared ceremonies aligned everyone on outcomes rather than ownership lines. Agile wasn’t a process veneer—it was the operating system that kept the system (and the people in it) learning, predictable, and fast.
When the balance is right, you don’t celebrate the platform every week. You celebrate the product shipping features predictably, incidents resolving quickly, and engineers who still have energy at the end of the quarter. Microservices, APIs, containers, IaC, big data tooling, and CI/CD provide the spine. The four-role topology provides the muscle. Culture—shared standards, shared backlogs, shared responsibility—holds it together.
That’s the mix that’s worked for us. Not perfect, not static, but stable enough to build on and flexible enough to evolve. And when it all clicks, the platform fades, the team shines, and the product moves.