skybyte

AWS

EKS / KUBERNETES

MICROSERVICES

FINOPS

DEVSECOPS

86+ Services Migrated to AWS EKS in 8 Weeks - Zero Downtime, Zero Incidents

A Series B SaaS company's deployment process had become its biggest bottleneck. Every release meant a 4-hour maintenance window, a shared RDS instance had turned into an accidental ESB, and the 86 services had no environment parity between dev, QA, and prod.
Industry: B2B SaaS
Cloud: AWS
Timeline: 8 weeks
Services migrated: 86
  • 8 weeks: Dev → Production

  • 86+ Services Migrated

  • 0 Downtime Incidents

  • 38% Infra Cost Reduction

  • 15 min Deploys (Was 4 Hrs)

The Challenge

The problem we were handed

The client ran 86 backend services on bare EC2 instances, with a single shared RDS instance acting as an accidental ESB: services communicated through database rows rather than APIs. Every release required a full-environment freeze. P99 latency spiked to 12 seconds during deployments, and the platform team was spending ~40% of sprint capacity on rollback coordination.

There was no environment parity. Services had inconsistent Dockerfile base images (several still on Node 14), environment variables were managed by hand in AWS Parameter Store with no version-controlled source of truth, and there was zero observability below the ALB layer.

Cloud spend had grown without governance or tagging: over 60% of instances were overprovisioned m5.xlarge or larger, running at single-digit CPU utilization.

  • 86 EC2-managed processes with no container orchestration - several running via cron

  • Shared RDS bottleneck causing cascading failures across unrelated services

  • Every release required a 4-hour maintenance window with manual rollback runbooks

  • No namespace isolation - a misconfigured service could reach production secrets

  • No cost attribution: AWS spend had grown without a tagging policy

  • Dev, QA, and prod environments diverged - "works on staging" was not a guarantee

The Helm library was the thing that clicked for my team. We went from 200 lines of Ansible and bash per service to a 40-line values file. Spinning up a new service used to take three days - now it's under an hour, and the junior engineers can do it on their own.

Our Approach

How We Delivered

Strangler fig migration + parallel EKS cluster, phased traffic rollover

01. Terraform-first EKS foundation with strict IAM

Provisioned a multi-AZ EKS cluster via Terraform with environment-specific node groups. Security built in from day one: IRSA (IAM Roles for Service Accounts) mapped fine-grained IAM policies directly to Kubernetes service accounts, so each of the 86 services had strictly least-privilege access to AWS resources. No shared EC2 instance profiles. No over-permissioned cross-service access.
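To make the IRSA mapping concrete, here is a minimal sketch of the IAM trust policy document that ties one role to one Kubernetes service account. The engagement provisioned this through Terraform; Python is used here only to show the shape of the document. The account ID, OIDC provider ID, namespace, and service account name are hypothetical placeholders.

```python
import json


def irsa_trust_policy(account_id: str, oidc_provider: str,
                      namespace: str, service_account: str) -> dict:
    """Build a trust policy that only one Kubernetes service account can assume.

    oidc_provider is the cluster's OIDC issuer without the https:// prefix.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    # Restrict the role to exactly one namespace/service account pair.
                    f"{oidc_provider}:sub":
                        f"system:serviceaccount:{namespace}:{service_account}",
                    f"{oidc_provider}:aud": "sts.amazonaws.com",
                }
            },
        }],
    }


if __name__ == "__main__":
    # All identifiers below are made-up examples.
    policy = irsa_trust_policy(
        "111122223333",
        "oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE",
        "billing",
        "billing-api",
    )
    print(json.dumps(policy, indent=2))
```

Because the `sub` condition names a single service account, a pod running under any other service account cannot assume the role, which is what removes the need for shared instance profiles.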

02. Helm shared library: one chart, zero config drift

Built a centralized Helm library chart with opinionated defaults: resource limits/requests templates, PodDisruptionBudgets, HPA configs, readiness/liveness probe skeletons, and configmap injection. Each service chart became a ~40-line values.yaml consumer of the library. Integrated directly into GitHub Actions CI/CD - PRs merge to main → auto-deploy to staging → smoke tests → manual gate → QA → soak period → prod. Config drift between environments? Gone.
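The idea of opinionated defaults overlaid by a small per-service values file can be sketched as a recursive merge. This is a conceptual illustration in Python, not Helm's implementation, and every key and value below is a hypothetical example rather than the client's actual schema.

```python
from copy import deepcopy

# Hypothetical defaults a shared library chart might supply.
LIBRARY_DEFAULTS = {
    "replicaCount": 2,
    "resources": {
        "requests": {"cpu": "100m", "memory": "128Mi"},
        "limits": {"cpu": "500m", "memory": "512Mi"},
    },
    "pdb": {"minAvailable": 1},
    "probes": {
        "readiness": {"path": "/healthz"},
        "liveness": {"path": "/healthz"},
    },
}

# Hypothetical per-service overrides: only what differs from the defaults.
SERVICE_VALUES = {
    "replicaCount": 3,
    "resources": {"requests": {"cpu": "250m"}},
}


def merge_values(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay overrides on defaults without mutating either."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


if __name__ == "__main__":
    effective = merge_values(LIBRARY_DEFAULTS, SERVICE_VALUES)
    print(effective["resources"]["requests"])
```

The service file stays short because it states only its deviations; everything else, such as the PodDisruptionBudget and probe paths here, is inherited from one place.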

03. Karpenter for intelligent node provisioning

Replaced Cluster Autoscaler with Karpenter. Defined NodePools scoped to Spot for stateless workloads and On-Demand for stateful services. Karpenter's bin-packing reduced average node count by 31% at peak load. Pending pod times dropped by 80% compared to the legacy autoscaling group response times.
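To illustrate why bin-packing lowers node count, the sketch below packs pod CPU requests onto nodes with a first-fit-decreasing heuristic. This is a textbook algorithm chosen for illustration, not Karpenter's actual scheduling logic, and the pod sizes and node capacity are invented.

```python
def nodes_needed(pod_cpu_millicores: list[int], node_capacity: int) -> int:
    """Pack pods onto nodes with first-fit decreasing; return the node count."""
    free = []  # remaining CPU (millicores) on each node opened so far
    for cpu in sorted(pod_cpu_millicores, reverse=True):
        if cpu > node_capacity:
            raise ValueError(f"pod requesting {cpu}m cannot fit on any node")
        for i, remaining in enumerate(free):
            if cpu <= remaining:
                free[i] -= cpu  # place on the first node with room
                break
        else:
            free.append(node_capacity - cpu)  # no node had room: open a new one
    return len(free)


if __name__ == "__main__":
    # Hypothetical CPU requests in millicores, packed onto 4000m nodes.
    pods = [500, 1500, 1000, 2000, 250, 750]
    print(nodes_needed(pods, node_capacity=4000))
```

In this made-up case six pods totalling 6000m fit on two 4000m nodes, where one pod per node would have used six.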

04. Zero-downtime cutover: Route 53 weighted routing + Argo Rollouts

Used AWS Route 53 weighted routing policies combined with an NGINX Ingress controller to gradually shift live traffic from the legacy EC2 fleet to the new EKS cluster - 10% increments, monitoring error rates and p99 latency at every step. Each service used Argo Rollouts blue-green strategy for the final production flip. We ran the full 86-service cutover during peak business hours. Not a single dropped request.
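The stepwise shift described above can be sketched as a weight schedule plus a health gate. The 10% step comes from the text; the error-rate and p99 thresholds, and the choice to send all traffic back to the legacy fleet on a failed check, are assumptions made for illustration.

```python
def weight_schedule(step: int = 10) -> list[tuple[int, int]]:
    """Return (legacy_weight, eks_weight) pairs from the first step to full cutover."""
    if not 0 < step <= 100 or 100 % step != 0:
        raise ValueError("step must divide 100 evenly")
    return [(100 - eks, eks) for eks in range(step, 101, step)]


def next_eks_weight(current: int, error_rate: float, p99_ms: float,
                    max_error_rate: float = 0.01, max_p99_ms: float = 500.0,
                    step: int = 10) -> int:
    """Advance the EKS weight by one step if healthy, otherwise roll back to 0.

    The thresholds are hypothetical defaults, not values from the engagement.
    """
    healthy = error_rate <= max_error_rate and p99_ms <= max_p99_ms
    if not healthy:
        return 0  # assumed policy: route everything back to the legacy fleet
    return min(100, current + step)


if __name__ == "__main__":
    for legacy, eks in weight_schedule():
        print(f"legacy={legacy:3d}  eks={eks:3d}")
```

Gating each increment on observed error rate and latency is what keeps the blast radius of a bad step to a small share of traffic.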

05. OPA Gatekeeper + External Secrets Operator for security posture

OPA Gatekeeper policies enforced at admission: no privileged containers, no host networking, no images from non-ECR registries. External Secrets Operator synced secrets from AWS Secrets Manager into Kubernetes Secret objects for pods to consume, so no plaintext secret values appeared in manifests, Helm values, or CI configuration. Kubernetes network policies locked east-west traffic to declared service communication paths only.
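The three admission rules can be expressed as plain checks against a pod manifest. Real Gatekeeper constraints are written in Rego; the Python below only mirrors the logic as an illustrative sketch, and the ECR hostname pattern and sample pod are assumptions.

```python
import re

# Assumed shape of a private ECR image reference:
# <12-digit account>.dkr.ecr.<region>.amazonaws.com/<repo>
ECR_IMAGE = re.compile(r"^\d{12}\.dkr\.ecr\.[a-z0-9-]+\.amazonaws\.com/")


def violations(pod: dict) -> list[str]:
    """Return the admission rules a pod manifest would break."""
    spec = pod.get("spec", {})
    found = []
    if spec.get("hostNetwork", False):
        found.append("hostNetwork is not allowed")
    containers = spec.get("containers", []) + spec.get("initContainers", [])
    for container in containers:
        name = container.get("name", "<unnamed>")
        if (container.get("securityContext") or {}).get("privileged", False):
            found.append(f"container {name} is privileged")
        if not ECR_IMAGE.match(container.get("image", "")):
            found.append(f"container {name} uses a non-ECR image")
    return found


if __name__ == "__main__":
    # Hypothetical pod that breaks all three rules.
    bad_pod = {
        "spec": {
            "hostNetwork": True,
            "containers": [{
                "name": "app",
                "image": "docker.io/library/nginx:latest",
                "securityContext": {"privileged": True},
            }],
        }
    }
    for problem in violations(bad_pod):
        print(problem)
```

Enforcing these at admission means a non-compliant manifest is rejected before it ever runs, rather than being flagged by a later scan.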

06. Kubecost: per-service cost accountability from day one

Kubecost was deployed with namespace-level cost allocation. Within two weeks of go-live, every engineering team had a per-service P&L view. It identified 18 services requesting about 3× the resources they actually used, and these were right-sized within the first sprint post-migration. That visibility, alongside Karpenter's Spot usage, drove the 38% infra cost reduction. When engineers see what their service costs, they fix the waste themselves.
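The right-sizing check amounts to comparing what each service requests with what it uses. The sketch below flags services at or above a 3× ratio, the figure cited in the text. The data structure, service names, and numbers are hypothetical, and in practice the inputs would come from a cost or metrics tool rather than hard-coded values.

```python
from dataclasses import dataclass


@dataclass
class ServiceUsage:
    name: str
    requested_cpu_cores: float  # sum of CPU requests across the service's pods
    used_cpu_cores: float       # observed average CPU usage


def over_allocated(services: list[ServiceUsage], ratio: float = 3.0) -> list[str]:
    """Return names of services requesting at least `ratio` times what they use."""
    flagged = []
    for svc in services:
        # Skip services with no measured usage to avoid dividing by zero.
        if svc.used_cpu_cores > 0 and svc.requested_cpu_cores / svc.used_cpu_cores >= ratio:
            flagged.append(svc.name)
    return flagged


if __name__ == "__main__":
    # Invented sample data for illustration only.
    sample = [
        ServiceUsage("checkout", requested_cpu_cores=2.0, used_cpu_cores=0.5),
        ServiceUsage("search", requested_cpu_cores=1.0, used_cpu_cores=0.8),
        ServiceUsage("reports", requested_cpu_cores=3.0, used_cpu_cores=1.0),
    ]
    print(over_allocated(sample))
```

A per-namespace report built on a check like this gives each team a concrete list of services to resize.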

Key Outcomes
  • Zero-downtime production cutover via Route 53 weighted routing + Argo Rollouts blue-green across all 86 services

  • 38% infra cost reduction via Karpenter Spot integration and Kubecost-driven right-sizing of 18 over-allocated services

  • Deployment cycle cut from 4-hour maintenance windows to automated 15-minute pipelines via GitHub Actions + ArgoCD

  • Full IRSA + OPA Gatekeeper security posture - passed internal security audit immediately post-migration

  • Pending pod scale time reduced 80% - Karpenter selects a fitting instance type and launches it directly, avoiding ASG scaling lag

  • New service onboarding reduced from 3 days to <1 hour using the shared Helm library and standardized values schema

Technologies Used

EKS, Karpenter, Helm Library Charts, ArgoCD, Argo Rollouts, GitHub Actions, Kubecost, OPA Gatekeeper, IRSA, External Secrets Operator, NGINX Ingress, AWS Route 53, HPA, PodDisruptionBudgets, Terraform, ECR, AWS Secrets Manager

Frequently Asked Questions

Skybyte completed a full migration of 86 microservices from EC2 to AWS EKS in 8 weeks, including Terraform provisioning, Helm chart standardization, CI/CD pipeline setup with GitHub Actions and ArgoCD, and a phased zero-downtime traffic cutover using Route 53 weighted routing.

Yes. Skybyte used AWS Route 53 weighted routing combined with Argo Rollouts blue-green deployments to gradually shift traffic from EC2 to EKS in 10% increments, monitoring error rates and p99 latency at each step. All 86 services were cut over during peak business hours with zero dropped requests.

In this engagement, Skybyte achieved a 38% infrastructure cost reduction through Karpenter Spot instance optimization, Kubecost-driven right-sizing of 18 over-allocated services, and bin-packing that reduced average node count by 31% at peak load.

Ready to transform your business?

Join the 25+ engineering teams that trust Skybyte with their infrastructure.

© 2026 Skybyte Technologies Private Limited. All Rights Reserved.