skybyte

AZURE

FINOPS

AKS

COSMOS DB

COST OPTIMIZATION

55% Azure Cost Reduction in 11 Weeks - $137K/Month Saved on a $250K/Month SaaS Spend

A high-traffic B2B SaaS platform had watched its Azure bill grow 3× in 18 months as it scaled. 43% of spend was untagged, QA clusters ran at 6% CPU 24/7, and Cosmos DB was provisioned for traffic it had never seen. We fixed it without touching a line of application code.
Industry: B2B SaaS
Cloud: Microsoft Azure
Timeline: 11 weeks
Pre-engagement spend: ~$250K / month
Code changes: None required
55% - Total Cost Reduction
$137K - Monthly Savings
11 wks - Engagement Duration
80% - QA/Staging on Spot
0 - Production Incidents

The Challenge

The problem we were handed

The client's Azure environment had scaled significantly, and cost visibility had not kept pace. 43% of spend carried no cost-center tags, making attribution difficult. Three AKS clusters were running production-sized D8s_v3 node pools 24/7 for QA and staging workloads averaging 6% CPU utilization. Azure SQL Hyperscale had been selected for a database handling 300 req/min. Cosmos DB was provisioned at fixed throughput for a workload that was quiet 14 hours a day and spiked for 4. Premium SSD storage was attached to non-production environments that had no IOPS requirements.

The core challenge: engineering, finance, and platform teams each had visibility into different parts of the spend. Previous optimization efforts had stalled because no single team owned the bill end-to-end.

  • 43% of spend completely untagged - no owner, no environment, no cost center

  • QA/staging AKS clusters on D8s_v3 On-Demand node pools running 24/7 at 6% CPU

  • Azure SQL Hyperscale: 12× over-provisioned for actual query volume and data size

  • Cosmos DB on fixed provisioned RU/s - paying for peak 24 hours a day

  • $34K/month in orphaned resources: unattached disks, idle App Service Plans, unused public IPs

  • Reserved Instance coverage at 0%: all compute on pay-as-you-go after prior RIs expired

  • 18 AKS deployments with no KEDA or HPA - all running at max replica count permanently

We'd tried to cut costs twice internally and both times it fizzled. Three weeks in with Skybyte, we'd already saved more than our best full-year attempt - and they hadn't touched production once. That surprised us.

Our Approach

How We Delivered

Four levers: eliminate waste → right-size → automate scaling → commit on the new baseline

01

Tagging enforcement via Azure Policy - visibility before optimization

Deployed Azure Policy with Deny effect on resources missing required tags (env, team, service, cost-center). Backfilled existing resources via a Terraform import + tag-patch pipeline. Built team-level Cost Management workbooks within 5 days. Engineers who can see their service's line on the cloud bill consistently fix it - this culture shift was the precondition for everything else.
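As a rough illustration of the tagging rule described above, the following minimal sketch builds an Azure Policy rule body with a Deny effect for the four required tags, plus a local pre-check that could be run while backfilling. The helper names `build_require_tags_rule` and `missing_tags` are hypothetical, and the actual policy definition used in the engagement is not shown in the source; this is only one plausible shape.

```python
import json

# Tag names taken from the case study; everything else here is illustrative.
REQUIRED_TAGS = ["env", "team", "service", "cost-center"]


def build_require_tags_rule(required_tags):
    """Return a policy rule body that denies resources missing any required tag."""
    return {
        "if": {
            "anyOf": [
                # Bracket syntax lets the field reference tag names with hyphens.
                {"field": f"tags['{tag}']", "exists": "false"}
                for tag in required_tags
            ]
        },
        "then": {"effect": "deny"},
    }


def missing_tags(resource_tags, required_tags=REQUIRED_TAGS):
    """Local pre-check (hypothetical helper): required tags absent or empty.

    Stricter than the policy rule above, which only tests for existence.
    """
    return [tag for tag in required_tags if not resource_tags.get(tag)]


policy_rule = build_require_tags_rule(REQUIRED_TAGS)
print(json.dumps(policy_rule, indent=2))
print(missing_tags({"env": "qa", "team": "payments"}))
```

A Deny rule like this blocks new untagged resources only; existing resources still need the backfill step described above before the cost workbooks become complete.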

02

Orphan resource sweep - immediate $34K/month recovery

Wrote an Azure Resource Graph KQL query scanning for unattached managed disks, public IPs with no NIC association, empty App Service Plans, and SQL databases with zero connections in 30 days. Cross-validated with the engineering team - deleted or snapshot-archived 847 resources across all subscriptions. Week 1 savings recovered the engagement cost in full.
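The sweep logic can be sketched locally as a classifier over exported resource records. This is a simplified stand-in for the KQL query, not the query itself: it assumes each record is a dict shaped like an Azure Resource Graph row, and the property names checked (`diskState`, `ipConfiguration`, `numberOfSites`) are assumptions about that shape. The sample inventory is invented, and the SQL "zero connections" check is omitted because it would need metrics data rather than resource properties.

```python
def orphan_reason(resource):
    """Return why a resource looks orphaned, or None if it appears to be in use."""
    rtype = resource.get("type", "").lower()
    props = resource.get("properties", {})
    if rtype == "microsoft.compute/disks" and props.get("diskState") == "Unattached":
        return "unattached managed disk"
    if rtype == "microsoft.network/publicipaddresses" and not props.get("ipConfiguration"):
        return "public IP with no NIC association"
    if rtype == "microsoft.web/serverfarms" and props.get("numberOfSites", 0) == 0:
        return "empty App Service Plan"
    return None


# Hypothetical inventory rows for illustration only.
inventory = [
    {"name": "disk-old", "type": "Microsoft.Compute/disks",
     "properties": {"diskState": "Unattached"}},
    {"name": "disk-live", "type": "Microsoft.Compute/disks",
     "properties": {"diskState": "Attached"}},
    {"name": "pip-stale", "type": "Microsoft.Network/publicIPAddresses",
     "properties": {}},
    {"name": "plan-empty", "type": "Microsoft.Web/serverFarms",
     "properties": {"numberOfSites": 0}},
]

# Candidates only: each one should still be validated with its owning team.
candidates = {r["name"]: orphan_reason(r) for r in inventory if orphan_reason(r)}
print(candidates)
```

Treating the output as a candidate list rather than a delete list mirrors the cross-validation step above: a resource can look idle and still be needed.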

03

AKS QA/staging → Spot node pools (80% of non-prod workloads)

Identified 68 of 85 microservices (80%) as stateless and eviction-tolerant. Migrated the QA and staging AKS clusters to Azure Spot node pools (Standard_D4s_v5 Spot at ~73% discount). Implemented Pod Topology Spread Constraints and node affinity rules so workloads degrade gracefully during Spot evictions: critical workers are automatically rescheduled onto on-demand nodes if Spot capacity evaporates. Enabled KEDA scale-to-zero on all non-critical deployments, so the QA cluster scales to 0 after 20 minutes of inactivity. QA cluster cost: $1,200/month, down from $18,500/month.
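To make the scheduling idea concrete, here is a minimal sketch of the pod-spec fragment such a setup might use, written as a Python dict that could be serialized to YAML. It relies on the label and taint AKS applies to Spot node pools (`kubernetes.azure.com/scalesetpriority=spot`). The function name, the `app` label, and the choice of "preferred" affinity for critical workers versus "required" for the rest are assumptions, not the configuration used in the engagement.

```python
SPOT_KEY = "kubernetes.azure.com/scalesetpriority"  # label/taint key on AKS Spot nodes


def spot_pod_spec_fragment(app_label, critical=False):
    """Build scheduling fields for a pod that may run on Spot nodes.

    critical=True  -> prefer Spot, but allow fallback to on-demand nodes.
    critical=False -> require Spot nodes.
    """
    match_spot = {"matchExpressions": [
        {"key": SPOT_KEY, "operator": "In", "values": ["spot"]}
    ]}
    if critical:
        node_affinity = {"preferredDuringSchedulingIgnoredDuringExecution": [
            {"weight": 100, "preference": match_spot}
        ]}
    else:
        node_affinity = {"requiredDuringSchedulingIgnoredDuringExecution": {
            "nodeSelectorTerms": [match_spot]
        }}
    return {
        # Tolerate the NoSchedule taint that Spot node pools carry.
        "tolerations": [{"key": SPOT_KEY, "operator": "Equal",
                         "value": "spot", "effect": "NoSchedule"}],
        "affinity": {"nodeAffinity": node_affinity},
        # Spread replicas across nodes so one eviction removes few of them.
        "topologySpreadConstraints": [{
            "maxSkew": 1,
            "topologyKey": "kubernetes.io/hostname",
            "whenUnsatisfiable": "ScheduleAnyway",
            "labelSelector": {"matchLabels": {"app": app_label}},
        }],
    }


worker = spot_pod_spec_fragment("report-worker", critical=True)
batch = spot_pod_spec_fragment("thumbnailer")
```

With a preferred affinity, the scheduler places critical pods on on-demand nodes when no Spot node fits, which is one way to get the fallback behaviour described above.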

04

KEDA + HPA on production AKS - right-size without risk

Deployed KEDA on 18 event-processing services using Azure Service Bus queue depth as the scaling metric. Replaced fixed-replica configs with HPA on CPU/memory for synchronous services. Average pod count at off-peak dropped 67%. Production node pool baseline reduced from 24 nodes to 9, scaling dynamically to 22 at peak - same performance, 60% fewer idle nodes.
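A KEDA ScaledObject driven by Service Bus queue depth might look roughly like the following, again expressed as a Python dict for serialization to YAML. The deployment name, queue name, TriggerAuthentication name, and all numeric thresholds are placeholders; the source does not publish the actual values. The second call shows how a 20-minute cooldown with a zero minimum would approximate the QA scale-to-zero behaviour mentioned earlier.

```python
def servicebus_scaled_object(deployment, queue_name, auth_ref,
                             min_replicas=1, max_replicas=10,
                             messages_per_replica=50, cooldown_seconds=300):
    """Build a KEDA ScaledObject that scales a Deployment on queue depth."""
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{deployment}-scaler"},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            "minReplicaCount": min_replicas,
            "maxReplicaCount": max_replicas,
            # Wait time after the last active trigger before scaling to zero.
            "cooldownPeriod": cooldown_seconds,
            "triggers": [{
                "type": "azure-servicebus",
                "metadata": {
                    "queueName": queue_name,
                    # Target messages per replica; trigger metadata values are strings.
                    "messageCount": str(messages_per_replica),
                },
                "authenticationRef": {"name": auth_ref},
            }],
        },
    }


# Production-style: never below one replica (all names are hypothetical).
prod = servicebus_scaled_object("invoice-processor", "invoices", "servicebus-auth",
                                min_replicas=1, max_replicas=22)
# QA-style: scale to zero after 20 minutes of inactivity.
qa = servicebus_scaled_object("invoice-processor", "invoices-qa", "servicebus-auth",
                              min_replicas=0, cooldown_seconds=20 * 60)
```

Synchronous services would not use this path; as described above, those were moved to a standard HPA on CPU and memory instead.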

05

PaaS right-sizing: SQL, Cosmos DB Autoscale, and storage tiers

Migrated the over-provisioned Azure SQL Hyperscale to General Purpose tier after load testing confirmed it handled 2× peak traffic on a 4-vCore instance. Migrated spiky Cosmos DB containers from fixed provisioned throughput to Cosmos DB Autoscale - the platform now pays for maximum RU/s only during active usage windows, not overnight. Downgraded Premium SSD to Standard SSD across all non-production storage accounts where high IOPS provided no benefit. Deleted 38TB of stale snapshots predating a product pivot.
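The Cosmos DB change is easiest to reason about with a small cost model. The sketch below compares fixed provisioned throughput with Autoscale for a day shaped like the workload described earlier (quiet for 14 hours, spiking for 4). It assumes Autoscale bills each hour for the highest RU/s reached, with a floor of 10% of the configured maximum and a per-RU/s rate 1.5× the standard provisioned rate; verify these against current Azure pricing. The RU/s figures, the remaining 6 "moderate" hours, and the unit price are invented for illustration.

```python
AUTOSCALE_RATE_MULTIPLIER = 1.5   # assumed Autoscale premium over the standard rate
AUTOSCALE_FLOOR_FRACTION = 0.10   # assumed minimum billed share of max RU/s


def provisioned_daily_cost(max_rus, price_per_100_rus_hour):
    """Fixed throughput: pay for max RU/s in every one of the 24 hours."""
    return 24 * (max_rus / 100) * price_per_100_rus_hour


def autoscale_daily_cost(hourly_peak_rus, max_rus, price_per_100_rus_hour):
    """Autoscale: pay per hour for the peak RU/s, clamped to [floor, max]."""
    floor = max_rus * AUTOSCALE_FLOOR_FRACTION
    billed = [min(max(peak, floor), max_rus) for peak in hourly_peak_rus]
    return sum(billed) / 100 * price_per_100_rus_hour * AUTOSCALE_RATE_MULTIPLIER


# Hypothetical day: 14 quiet hours, 4 spike hours, 6 moderate hours.
MAX_RUS = 40_000
PRICE = 0.008  # placeholder price per 100 RU/s per hour
profile = [1_000] * 14 + [40_000] * 4 + [16_000] * 6

fixed = provisioned_daily_cost(MAX_RUS, PRICE)
auto = autoscale_daily_cost(profile, MAX_RUS, PRICE)
print(f"fixed: {fixed:.2f}/day, autoscale: {auto:.2f}/day")
```

Under these assumptions Autoscale wins only when average billed utilization stays below about two thirds of the maximum, which is why it suits spiky containers and not steadily busy ones.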

06

Reserved Instances + Savings Plans - lock in the optimized baseline

After 6 weeks of right-sizing (giving the new usage patterns time to settle), purchased 1-year Reserved Instances for production AKS node pools and key VMs. Combined with a Compute Savings Plan covering flexible workloads. RI coverage went from 0% to 71% of committed compute spend - additional 34% discount on the already-reduced baseline. The order matters: right-size first, then lock in commitments on the optimized baseline.
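Why the order matters can be shown with a toy model. The sketch below uses the 71% coverage and 34% discount quoted above, but the usage figures are invented and the model is simplified: it treats coverage as the share of the baseline that is committed, and assumes a commitment is paid in full whether or not it is used.

```python
def monthly_compute_cost(actual_usage, committed_usage, discount):
    """Cost in on-demand-equivalent units for one month.

    The committed portion is paid at the discounted rate even if unused;
    usage above the commitment is paid at the on-demand rate.
    """
    committed_cost = committed_usage * (1 - discount)
    on_demand_cost = max(0.0, actual_usage - committed_usage)
    return committed_cost + on_demand_cost


COVERAGE = 0.71   # from the case study
DISCOUNT = 0.34   # from the case study
OLD_BASELINE = 100.0   # hypothetical usage before right-sizing
NEW_BASELINE = 50.0    # hypothetical usage after right-sizing

# Commit on the old baseline, then right-size: part of the commitment sits idle.
commit_first = monthly_compute_cost(NEW_BASELINE, COVERAGE * OLD_BASELINE, DISCOUNT)
# Right-size, then commit on the new baseline.
right_size_first = monthly_compute_cost(NEW_BASELINE, COVERAGE * NEW_BASELINE, DISCOUNT)

print(f"commit first: {commit_first:.2f}, right-size first: {right_size_first:.2f}")
```

In this toy case committing early still beats pure pay-as-you-go, but it locks in spend on capacity that right-sizing later removes.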

Monthly Savings - Broken Down by Lever

  • $34K/mo - orphan cleanup: unattached disks, idle App Service Plans, unused IPs (week 1)

  • $37K/mo - AKS Spot migration + KEDA scale-to-zero across 80% of non-prod workloads

  • $28K/mo - right-sizing: AKS production nodes, SQL Hyperscale → GP, Cosmos DB Autoscale, SSD tiers

  • $38K/mo - Reserved Instances + Savings Plans applied to the already-reduced, right-sized baseline

  • Zero production incidents across the entire 11-week engagement - no latency degradation, no user impact

  • Permanent culture shift: per-team cost dashboards + Azure Policy tagging enforcement prevent regression
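The headline figures can be cross-checked against the per-lever breakdown with a few lines of arithmetic. All inputs below come from this page; only the variable names are made up.

```python
# Monthly savings per lever, in USD, as listed in the breakdown.
savings_by_lever = {
    "orphan cleanup": 34_000,
    "AKS Spot + KEDA scale-to-zero": 37_000,
    "right-sizing": 28_000,
    "Reserved Instances + Savings Plans": 38_000,
}
pre_engagement_spend = 250_000  # approximate monthly spend before the engagement

total_savings = sum(savings_by_lever.values())
reduction_pct = 100 * total_savings / pre_engagement_spend

# QA cluster: $18,500/month before, $1,200/month after.
qa_reduction_pct = 100 * (18_500 - 1_200) / 18_500

print(f"total: ${total_savings:,}/month ({reduction_pct:.1f}% of spend)")
print(f"QA cluster reduction: {qa_reduction_pct:.1f}%")
```

The four levers sum to the quoted $137K/month, which is 54.8% of the ~$250K baseline and rounds to the headline 55%. The QA cluster figure works out to about 93.5%, which the page states as 93%.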

Technologies Used

AKS, Azure Spot Node Pools, KEDA, HPA, Pod Topology Spread Constraints, Azure Policy, Azure Cost Management, Azure Resource Graph (KQL), Azure Monitor, Prometheus, Cosmos DB Autoscale, Azure SQL General Purpose, Reserved Instances, Compute Savings Plans, Azure Service Bus, Terraform

Frequently Asked Questions

Results vary by environment, but in this engagement Skybyte achieved a 55% total cost reduction ($137K/month savings) on a $250K/month Azure spend in 11 weeks - through orphan cleanup, AKS Spot migration, KEDA autoscaling, PaaS right-sizing, and Reserved Instance purchases. No application code changes were required.

Always right-size first. Skybyte spent 6 weeks optimizing workloads before purchasing Reserved Instances, ensuring commitments were based on actual optimized usage rather than locking in overprovisioned baselines at a discount.

KEDA (Kubernetes Event-Driven Autoscaling) scales pods based on event sources like Azure Service Bus queue depth. Skybyte deployed KEDA on 18 AKS services with scale-to-zero for non-production workloads, reducing QA cluster costs from $18,500/month to $1,200/month - a 93% reduction.

© 2026 Skybyte Technologies Private Limited. All Rights Reserved.