Environment on Demand (Part 1): Architecture & Implementation

  1. What Is Environment on Demand?
  2. The GitOps Interface: Argo CD ApplicationSets
  3. Infrastructure Trees: How Environments Become Resources
  4. Provisioning Models: The Good, The Bad, and The Slow
  5. Cloud Implementation: Example with Kubernetes + Serverless + IaC
  6. Performance Implications: When EoD Shines vs. Struggles
  7. Alternative Approaches: The Trade-Offs
  8. Summary: Environment on Demand Architecture

Every pull request your team opens—whether it’s a simple bug fix or a complex feature spanning multiple microservices—can have its own isolated, production-like environment: Environment on Demand (EoD).

This pattern enables teams to:

  • Spin up preview environments in minutes, not days
  • Test changes in isolation before merging to main
  • Validate infrastructure changes safely with Infrastructure as Code
  • Automate cleanup when PRs close, avoiding cost waste

But provisioning a full stack for every PR is also why teams adopting EoD hit similar pain points: provisioning latency, CDN propagation delays, cost overruns, and operational complexity. This deep dive covers what Environment on Demand is, why teams need it, how to architect it, and where reality bites back.


1 What Is Environment on Demand?

Environment on Demand is an infrastructure pattern where development, staging, and preview environments are provisioned automatically via GitOps workflows, typically triggered by pull requests or branch pushes.

Each environment includes:

  • Compute namespace or cluster (Kubernetes with serverless or node groups)
  • Application deployments (microservices, frontend, backend)
  • Supporting infrastructure (managed databases, message queues, object storage)
  • Networking (load balancer ingress, DNS, CDN distributions)
  • Policies (admission control, IAM roles, service mesh)

This pull-based model means infrastructure flows from Git, one commit at a time, until the environment is ready for testing.

Is It a Platform? A Pattern? Something Else?

Environment on Demand is often described using different terms. Here’s the precise classification:

| Term | Is EoD This? | Why |
| --- | --- | --- |
| Deployment Pattern | Most accurate | Defines how environments are created and managed |
| Architectural Pattern | Also correct | Defines high-level structure (GitOps-driven, ephemeral resources) |
| Platform | ⚠️ Partially | Built on top of Kubernetes + CI/CD tools, but more than just a platform |
| Software Architecture | Too broad | It’s part of a team’s DevOps architecture, not the whole architecture |
| Methodology | No | It’s an implementation pattern, not a process methodology |

The Relationship:

GitOps (Methodology)
        ↓
    Enables
        ↓
Environment on Demand (Deployment Pattern / Architectural Pattern)
        ↓
    Implemented with
        ↓
Argo CD + Terraform + Kubernetes (Platform Stack)

Why the Confusion?

| Source | Uses Term | Reason |
| --- | --- | --- |
| DevOps blogs | “Platform” | Marketing; sounds more substantial |
| Engineering teams | “Pattern” | Familiar from architecture vocabulary |
| Vendor docs | “Solution” | Product-focused naming |
| SRE teams | “Workflow” | Operations-focused naming |

The Precise Answer:

Environment on Demand is best described as a deployment pattern for infrastructure that:

  • Uses GitOps methodology as its foundation
  • Defines a provisioning model (automatic, ephemeral, per-PR)
  • Is part of a team’s overall DevOps architecture

Think of it like this:

  • GitOps = “How do I manage infrastructure via Git?”
  • Environment on Demand = “How do I create isolated environments per PR?”
  • Argo CD + Terraform + Kubernetes = “The actual tools that implement EoD”

Simple Example:

# PR #123 opens → GitOps workflow triggers
# Environment: pr-123.neo01.com

Provisions as:

flowchart BT
    A[GitHub PR #123] --> B[Argo CD ApplicationSet]
    B --> C[EKS Namespace: pr-123]
    C --> D[Fargate Pods + Istio Sidecars]
    D --> E[Aurora DB Subset]
    E --> F[S3 + CloudFront]
    F --> G[Route 53: pr-123.neo01.com]
    G --> H[Environment Ready]
    style A fill:#e3f2fd,stroke:#1976d2
    style B fill:#fff3e0,stroke:#f57c00
    style C fill:#e8f5e9,stroke:#388e3c
    style H fill:#fce4ec,stroke:#c2185b

Provisioning Flow:

Developer: "Open PR #123"
  ↓
GitHub Actions: "Trigger Terraform Cloud"
  ↓
Terraform: "Create namespace + resources"
  ↓
Argo CD: "Sync applications to namespace"
  ↓
Environment: "Ready at pr-123.neo01.com"

Each component is independent. The namespace doesn’t know if pods run on Fargate or EC2. The DNS doesn’t know if it’s a preview or staging env. This modularity is EoD’s superpower.


2 The GitOps Interface: Argo CD ApplicationSets

In a GitOps-driven setup, every environment is represented as an Argo CD Application (or ApplicationSet for templated environments):

# Simplified Argo CD ApplicationSet
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: preview-environments
spec:
  generators:
    - pullRequest:
        github:
          api: https://api.github.com
          owner: neo01          # GitHub organization or user
          repo: neo01.com       # Repository name, without the owner
          tokenRef:
            secretName: github-token
            key: token
        # Only PRs that target main (filter available in recent Argo CD releases)
        filters:
          - targetBranchMatch: main
  template:
    metadata:
      name: 'pr-{{number}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/neo01/neo01.com
        # Deploy the commit at the head of the PR
        targetRevision: '{{head_sha}}'
        path: 'environments/preview'
      destination:
        server: https://kubernetes.default.svc
        namespace: 'pr-{{number}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true

The Contract:

| Field | Meaning |
| --- | --- |
| generators.pullRequest | Watch for PRs, create env per PR |
| template.metadata.name | Environment name (e.g., pr-123) |
| destination.namespace | Kubernetes namespace isolation |
| syncPolicy.automated | Auto-sync on Git changes, prune on deletion |
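
To make the contract concrete, here is a minimal Python sketch of what the generator conceptually does: it takes the list of open PRs and stamps out one application per PR by filling in the `{{number}}` placeholder. This only imitates the templating idea and is not Argo CD's implementation; `render` and `applications_for` are hypothetical names introduced for illustration.

```python
import re


def render(template, params):
    """Replace {{key}} placeholders in template with values from params."""
    def substitute(match):
        return str(params[match.group(1)])
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


def applications_for(open_prs):
    """Build one application descriptor per open pull request."""
    apps = []
    for pr in open_prs:
        apps.append({
            "name": render("pr-{{number}}", pr),
            "namespace": render("pr-{{number}}", pr),
        })
    return apps


if __name__ == "__main__":
    # Two open PRs produce two isolated environments.
    for app in applications_for([{"number": 123}, {"number": 124}]):
        print(app)
```

When a PR closes it drops out of the list, so its application is no longer generated and the environment can be pruned.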

Generic Sync Loop:

# Argo CD continuously reconciles (pseudocode)
loop:
  desired_state = git_repo.get_latest()
  current_state = k8s_cluster.get_current()

  if desired_state != current_state:
    k8s_cluster.apply(desired_state)

  sleep 180s  # Reconciliation interval (about 3 minutes by default)

This loop—reconcile Git to cluster, repeat—is the entire GitOps model. Every environment, no matter how complex, reduces to this pattern.
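
As a rough illustration of one pass of that loop, the Python sketch below diffs a desired state against a live state and derives create, update, and delete actions. The dictionaries stand in for Git manifests and cluster objects, and `plan` and `reconcile` are hypothetical helpers, not Argo CD APIs.

```python
def plan(desired, current, prune=True):
    """Compare desired and live state and return the actions to take."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in current:
            actions.append(("create", name))
        elif current[name] != spec:
            actions.append(("update", name))
    if prune:
        # Anything live that Git no longer declares gets removed.
        for name in sorted(current.keys() - desired.keys()):
            actions.append(("delete", name))
    return actions


def reconcile(desired, current, prune=True):
    """Apply the plan to a copy of the live state and return the result."""
    live = dict(current)
    for action, name in plan(desired, current, prune):
        if action == "delete":
            del live[name]
        else:
            live[name] = desired[name]
    return live


if __name__ == "__main__":
    desired = {"frontend": {"image": "v2"}, "backend": {"image": "v1"}}
    current = {"frontend": {"image": "v1"}, "legacy": {"image": "v0"}}
    print(plan(desired, current))
```

With `prune=True` the live state converges exactly to what Git declares, which mirrors the `prune: true` setting in the sync policy.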


3 Infrastructure Trees: How Environments Become Resources

When you open a PR, Terraform builds an infrastructure tree. Each node is a resource type with specific dependencies.

Common Resource Types

| Resource | What It Does | Provisioning Time |
| --- | --- | --- |
| Kubernetes Namespace | Logical isolation in cluster | < 1 minute |
| Serverless Pods | Compute (no node management) | 2-5 minutes |
| Service Mesh Sidecars | mTLS, traffic shaping | 1-2 minutes |
| Admission Policies | Security, compliance | < 1 minute |
| Managed Database | PostgreSQL/MySQL (can be serverless) | 10-15 minutes |
| Message Queue Topics | Kafka/RabbitMQ (or use shared cluster) | 5-10 minutes |
| Object Storage | Buckets (assets, uploads) | < 1 minute |
| CDN | Static asset distribution | 5-15 minutes |
| DNS Records | CNAME/ALIAS per environment | 1-5 minutes |
| Load Balancer | Path-based routing | 2-5 minutes |

Example: Full Preview Environment

# Simplified Terraform module
module "preview_env" {
  source = "./modules/preview"
  
  pr_number      = var.pr_number
  namespace      = "pr-${var.pr_number}"
  domain         = "pr-${var.pr_number}.neo01.com"
  image_tag      = var.image_tag
  cloud_region   = "ap-east-1"
  
  # Shared resources (cheaper, faster)
  shared_mq_cluster_arn = data.aws_mq_broker.shared.arn
  shared_vpc_id         = data.aws_vpc.main.id
  
  # Auto-destroy after inactivity
  ttl_hours = 24
}

Infrastructure Plan (Simplified):

flowchart BT
    A[IaC Apply] --> B[Kubernetes Namespace: pr-123]
    B --> C[Load Balancer + DNS]
    C --> D[CDN Distribution]
    D --> E[Compute Pods + Service Mesh]
    E --> F[Managed Database]
    F --> G[Message Queue Topics]
    G --> H[Object Storage]
    H --> I[Admission Policies]
    I --> J[GitOps Sync]
    J --> K[Environment Ready]
    style A fill:#e3f2fd,stroke:#1976d2
    style B fill:#e8f5e9,stroke:#388e3c
    style D fill:#fff3e0,stroke:#f57c00
    style F fill:#fff3e0,stroke:#f57c00
    style K fill:#fce4ec,stroke:#c2185b

Provisioning Flow (First Environment):

1. Developer opens PR #123
2. CI/CD triggers IaC apply
3. IaC creates namespace (1 min)
4. IaC provisions managed database (10-15 min) ← Blocking
5. IaC creates CDN distribution (5-15 min) ← Blocking
6. IaC sets up DNS records (1-5 min)
7. GitOps syncs applications to namespace (2-5 min)
8. Compute pods start with sidecars (2-5 min)
9. Health checks pass, environment marked ready
10. Notification: "pr-123.neo01.com is ready"

Notice: the database and CDN must finish provisioning before the environment is usable. These are blocking resources: they break the fast feedback loop.

🤔 Why Does This Matter?

Blocking resources like databases, CDN, and message queues force IaC to wait for cloud provider APIs before proceeding. This means:

  • Developer wait time — 15-30 minutes before testing
  • Cost accumulation — Resources bill even while waiting
  • Feedback delay — Can't validate changes quickly

When you see these in your IaC plan, ask: "Can I use shared resources instead of per-env provisioning?"
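
To see how much the blocking resources dominate, the sketch below computes the critical path through a small dependency graph. The durations are the upper bounds from the resource table above; the dependency edges are an assumption made for illustration, since the real graph depends on your modules.

```python
from graphlib import TopologicalSorter

# Upper-bound durations in minutes, taken from the resource table above.
DURATION = {
    "namespace": 1, "database": 15, "cdn": 15,
    "dns": 5, "gitops_sync": 5, "pods": 5,
}

# Hypothetical edges: each resource waits for the resources in its set.
DEPENDS_ON = {
    "namespace": set(),
    "database": {"namespace"},
    "cdn": {"namespace"},
    "dns": {"cdn"},
    "gitops_sync": {"database", "dns"},
    "pods": {"gitops_sync"},
}


def finish_times(duration, depends_on):
    """Earliest finish time of each resource when independent ones run in parallel."""
    finish = {}
    for node in TopologicalSorter(depends_on).static_order():
        start = max((finish[dep] for dep in depends_on[node]), default=0)
        finish[node] = start + duration[node]
    return finish


def critical_path_minutes(duration, depends_on):
    """Total time until the last resource is ready."""
    return max(finish_times(duration, depends_on).values())


if __name__ == "__main__":
    print("parallel:", critical_path_minutes(DURATION, DEPENDS_ON), "min")
    print("serial:  ", sum(DURATION.values()), "min")
```

Under these assumptions the parallel plan takes 31 minutes against 46 minutes if run serially, and most of the path is the CDN and database. Moving either to a shared resource shortens the path directly.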


4 Provisioning Models: The Good, The Bad, and The Slow

The Good: Why EoD Works Well

1. Isolation

Each PR gets its own namespace with dedicated resources:

# pr-123 can't affect pr-124
namespace: pr-123
resources:
  cpu_limit: 2
  memory_limit: 4Gi
  network_policy: deny-cross-namespace

Blast Radius: O(1) per environment (just that namespace)


2. Modularity

Environments compose from reusable modules. The same IaC module works with:

  • Preview environments (per-PR)
  • Staging environments (shared, long-lived)
  • Development environments (persistent, team-specific)

No custom code needed for each tier.


3. Automatic Cleanup

# GitOps + cron job
when PR.closed OR TTL.expired:
  delete namespace
  destroy IaC resources
  invalidate CDN cache (if needed)
  notify team: "Environment pr-123 destroyed"

Environments self-destruct after 24 hours of inactivity—no manual cleanup required.
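
A minimal sketch of the expiry check behind that rule, assuming each environment record carries its last activity time and TTL. The record fields and function names are hypothetical; a real job would read them from namespace labels or annotations.

```python
from datetime import datetime, timedelta, timezone


def is_expired(last_activity, ttl_hours, now):
    """True when the environment has been idle for at least ttl_hours."""
    return now - last_activity >= timedelta(hours=ttl_hours)


def environments_to_destroy(environments, now):
    """Names of environments whose PR is closed or whose TTL has expired."""
    return [
        env["name"]
        for env in environments
        if not env["pr_open"]
        or is_expired(env["last_activity"], env["ttl_hours"], now)
    ]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    envs = [
        {"name": "pr-122", "pr_open": True,
         "last_activity": now - timedelta(hours=30), "ttl_hours": 24},
        {"name": "pr-123", "pr_open": True,
         "last_activity": now - timedelta(hours=2), "ttl_hours": 24},
    ]
    print(environments_to_destroy(envs, now))
```

Passing `now` in explicitly keeps the check deterministic, which makes the cleanup job easy to test.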


4. Audit Trail

Every environment change is tracked in Git:

$ git log --oneline environments/preview/
a1b2c3d  feat: Add payment service to pr-123
e4f5a6b  fix: Update database config for pr-122
c7d8e9f  chore: Bump TTL to 24h for all previews

~3 lines of Git history. Easy to audit. Easy to rollback.

💡 Key Insight: Simplicity Enables Governance

Because every environment is defined in Git, compliance teams can review infrastructure changes just like code changes. This is why EoD works in regulated industries (finance, healthcare, wagering). The GitOps audit trail is what makes EoD compliant.


The Bad: Where EoD Struggles

1. Provisioning Latency

Every environment requires:

  • IaC apply (5-30 minutes)
  • GitOps sync (2-5 minutes)
  • Health checks (1-3 minutes)
  • DNS propagation (1-5 minutes, or 10-15 for CDN)

For 10 concurrent PRs, IaC apply alone adds up to 50-300 minutes of cumulative wait time; counting all four stages, the total is roughly 90-430 minutes.
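
The cumulative figure is just the per-environment range multiplied by the number of PRs. A small sketch of that arithmetic, using the stage ranges listed above and assuming the non-CDN DNS figure of 1-5 minutes:

```python
# (low, high) minutes per stage, copied from the list above.
STAGES = {
    "iac_apply": (5, 30),
    "gitops_sync": (2, 5),
    "health_checks": (1, 3),
    "dns_propagation": (1, 5),
}


def per_environment_wait(stages):
    """Sum the low and high bounds across all stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


def cumulative_wait(stages, concurrent_prs):
    """Total developer wait across all concurrent PRs."""
    low, high = per_environment_wait(stages)
    return low * concurrent_prs, high * concurrent_prs


if __name__ == "__main__":
    print("per environment:", per_environment_wait(STAGES))
    print("10 PRs, IaC only:", cumulative_wait({"iac_apply": STAGES["iac_apply"]}, 10))
    print("10 PRs, all stages:", cumulative_wait(STAGES, 10))
```

IaC apply accounts for most of the spread, so that is the stage worth optimizing first.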


2. CDN Propagation

CDN invalidations for static assets usually complete globally within seconds to about 5 minutes, but they can stretch to 10-15 minutes or more, and occasionally to hours, when the cloud provider is under peak load or throttling API calls.

Challenges in EoD:

Each ephemeral env needs:
  - Custom domain: pr-123.neo01.com
  - CDN behavior: /assets/* → Object Storage
  - DNS record: CNAME to CDN
  - Invalidation: /* (or use versioned paths)

When CI/CD/GitOps flows trigger invalidation per PR:

  • PR merge → deploy → invalidate → user sees old content
  • “Why isn’t my change live?!”

3. Cost Accumulation

10-30 concurrent previews with:
  - Serverless compute: $0.04/vCPU-hour × 2 vCPU × 24h = ~$2/env/day
  - Managed database: $0.12/unit-hour × 2 units × 24h = ~$6/env/day
  - CDN: $0.085/GB (egress) + $0.009/10k requests
  - DNS: $0.50/hosted zone + $0.40/million queries
  - Load Balancer: $0.0225/hour + $0.008/LCU-hour

Monthly cost for 20 envs: ~$500-1500 (if aggressively torn down) to $3000-5000 (if left running)
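
The per-environment figures above follow from rate × units × hours. The sketch below redoes that arithmetic for the time-based line items only (compute, database, load balancer base fee) and leaves out usage-based charges such as CDN egress, DNS queries, and LCUs. The `active_hours_per_day` parameter is an assumption added here to model aggressive teardown.

```python
# (rate per unit-hour in USD, units), copied from the cost list above.
HOURLY_ITEMS = {
    "serverless_compute": (0.04, 2),
    "managed_database": (0.12, 2),
    "load_balancer_base": (0.0225, 1),
}


def daily_cost(active_hours_per_day=24):
    """Time-based cost of one environment for one day."""
    return sum(
        rate * units * active_hours_per_day
        for rate, units in HOURLY_ITEMS.values()
    )


def monthly_cost(envs, active_hours_per_day=24, days=30):
    """Time-based cost of a fleet of environments for one month."""
    return envs * days * daily_cost(active_hours_per_day)


if __name__ == "__main__":
    print(f"per env per day: ${daily_cost():.2f}")
    print(f"20 envs, always on: ${monthly_cost(20):.0f}")
    print(f"20 envs, 6 active hours/day: ${monthly_cost(20, 6):.0f}")
```

Left running around the clock, 20 environments land near the top of the $3000-5000 range; cutting active time to a few hours a day brings the total into the torn-down range.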


4. Dependency Complexity

Some resources must coordinate across services:

| Blocking Dependency | Why It Blocks |
| --- | --- |
| Database init | Must complete migrations before app starts |
| CDN deploy | Must have valid SSL cert (validation can take minutes) |
| Message Queue topic creation | Must exist before producers/consumers start |
| Secrets sync | Must have secrets manager entries before pods start |

When a blocking dependency is in the plan, the resources that depend on it can’t proceed until it completes.
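
The waiting described above can be sketched as a simple readiness gate: a resource may start only when everything it is blocked by reports ready. The dependency names below paraphrase the table, and the mapping itself is a hypothetical example rather than a fixed rule.

```python
# Hypothetical mapping: resource -> dependencies that must be ready first.
BLOCKING = {
    "app": {"database_migrations", "secrets", "mq_topics"},
    "cdn": {"ssl_certificate"},
}


def unmet_dependencies(resource, ready, blocking=BLOCKING):
    """Dependencies of resource that are not ready yet, in sorted order."""
    return sorted(blocking.get(resource, set()) - set(ready))


def can_start(resource, ready, blocking=BLOCKING):
    """True when nothing blocks the resource any more."""
    return not unmet_dependencies(resource, ready, blocking)


if __name__ == "__main__":
    ready = {"secrets", "mq_topics"}
    print(can_start("app", ready), unmet_dependencies("app", ready))
```

Reporting the unmet dependencies, not just a boolean, makes it obvious in the PR status which resource the environment is waiting on.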


5 Cloud Implementation: Example with Kubernetes + Serverless + IaC

In a production setup, the EoD stack typically uses cloud-native services. The following examples use AWS, but the patterns apply to Azure (AKS + Container Apps), GCP (GKE + Cloud Run), or any Kubernetes platform:

# Simplified Terraform for Kubernetes namespace
resource "kubernetes_namespace" "preview" {
  metadata {
    name = "pr-${var.pr_number}"
    
    labels = {
      "app.kubernetes.io/name"       = "preview"
      "app.kubernetes.io/instance"   = "pr-${var.pr_number}"
      "environment.on-demand/owner"  = var.github_user
      "environment.on-demand/ttl"    = var.ttl_hours
    }
    
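    annotations = {
      "environment.on-demand/created-at" = timestamp()
      "environment.on-demand/pr-url"     = var.pr_url
    }
  }

  # timestamp() changes on every run; ignore it after creation to avoid a perpetual diff
  lifecycle {
    ignore_changes = [metadata[0].annotations["environment.on-demand/created-at"]]
  }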
}

resource "kubernetes_pod" "app" {
  # ... Serverless pod spec with service mesh sidecar ...
}

resource "dns_record" "preview" {
  zone_id = data.dns_zone.main.zone_id
  name    = "pr-${var.pr_number}.neo01.com"
  type    = "A"
  
  alias {
    name                   = cdn_distribution.preview.domain_name
    zone_id                = cdn_distribution.preview.hosted_zone_id
    evaluate_target_health = true
  }
}

Each resource type implements its own provisioning logic:

| Resource Type | IaC Resource | Provisioning Complexity |
| --- | --- | --- |
| Kubernetes Namespace | kubernetes_namespace | Low (< 1 min) |
| Serverless Pods | kubernetes_pod | Medium (2-5 min) |
| Managed Database | managed_database_cluster | High (10-15 min) |
| CDN | cdn_distribution | High (5-15 min) |
| DNS | dns_record | Low (1-5 min) |
| Message Queue Topics | mq_topic (or shared) | Medium (5-10 min) |

Example: CDN with Versioned Assets

# Avoid invalidation by using versioned paths
resource "cdn_distribution" "preview" {
  origin {
    domain_name = object_storage.assets.bucket_regional_domain_name
    origin_id   = "Storage-pr-${var.pr_number}"
    
    # Custom origin path per PR
    origin_path = "/pr-${var.pr_number}"
  }
  
  # No invalidation needed if using /assets/v123/ paths
  # Instead of invalidating /*, use immutable caching
  default_cache_behavior {
    # ... cache policy with 1-year TTL for versioned assets ...
  }
  
  # Only invalidate on actual content changes
  # (handled by CI/CD, not per-deploy)
}

Key Observations:

  1. Versioned paths avoid CDN invalidation entirely
  2. Namespace isolation prevents cross-env contamination
  3. TTL annotations enable automatic cleanup
  4. Each module delegates to child resources through IaC dependencies

This pattern repeats across ~20-30 resource types per environment.
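
One common way to get versioned paths is to embed a content hash in the file name, so a changed file always gets a new URL and the old cached copy is never served for it. The sketch below shows that idea; it uses content hashing rather than the `/assets/v123/` directory scheme mentioned in the Terraform comments, and `versioned_path` is a hypothetical helper.

```python
import hashlib
from pathlib import PurePosixPath


def versioned_path(logical_path, content, digest_length=12):
    """Insert a short SHA-256 digest of the content before the file extension."""
    digest = hashlib.sha256(content).hexdigest()[:digest_length]
    path = PurePosixPath(logical_path)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


if __name__ == "__main__":
    old = versioned_path("/assets/app.js", b"console.log('v1')")
    new = versioned_path("/assets/app.js", b"console.log('v2')")
    print(old)
    print(new)
```

Because the path changes whenever the content does, these objects can be cached with a long TTL and no per-deploy invalidation.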


6 Performance Implications: When EoD Shines vs. Struggles

EoD Excels At:

| Workload | Why |
| --- | --- |
| Feature development (isolated testing) | Each dev gets their own env; no conflicts |
| Integration testing (multi-service) | Full stack available per PR |
| Stakeholder demos | Shareable URL (pr-123.neo01.com) |
| Infrastructure changes | IaC plan in PR, apply on merge |

Example: Feature Branch Testing

# PR #123: Add payment gateway
environment: pr-123.neo01.com
services:
  - frontend:v1.2.3-pr123
  - backend:v1.2.3-pr123
  - payment-service:v2.0.0-pr123  # New service
database:
  - Managed database (migrated)
testing:
  - E2E tests pass
  - Stakeholder approval

  • Developer opens PR → env provisions in 15-30 min
  • QA tests on live environment
  • Product owner reviews at shareable URL
  • Total feedback time: 30-60 minutes (including provisioning)
  • EoD overhead: Acceptable for feature work

EoD Struggles At:

| Workload | Why |
| --- | --- |
| Quick fixes (typo, CSS tweak) | 15-30 min provisioning for 5-min change |
| High-frequency iteration (A/B testing) | Provisioning time exceeds dev time |
| Resource-intensive testing (load tests) | Compute/database limits per env |
| Cross-PR dependencies (PR #123 needs PR #124’s changes) | Coordination overhead |

Example: Hotfix Deployment

# Hotfix: Fix typo on homepage
# Expected: Deploy in 5 minutes
# Actual: 22-27 minutes (provisioning + testing + deploy)

  • Developer opens PR → env provisions in 15-20 min
  • QA validates fix (2 min)
  • Merge → production deploy (5 min)
  • Total time: 22-27 minutes
  • EoD overhead: roughly 70% of total time

⚠️ The Hidden Cost: Not Just Wait Time

The provisioning delay is only part of the problem. EoD also introduces:

  1. Context switching — Devs lose momentum waiting for envs
  2. Debugging complexity — Which env has the issue?
  3. Cost uncertainty — Unexpected bills from orphaned envs
  4. Governance friction — Compliance gates slow down "self-service"

For quick iterations, these workflow inefficiencies often matter more than the provisioning time itself.


7 Alternative Approaches: The Trade-Offs

Lightweight alternatives (virtual clusters, namespace-only, shared staging) reduce provisioning time but sacrifice isolation:

# Alternative 1: Namespace-only (no database, no CDN)
environment:
  type: namespace-isolation
  shared_resources:
    - database-cluster: shared-staging
    - cdn: shared-cdn
  provision_time: 2-5 minutes
  isolation: medium

Execution:

// Pseudocode: pull_requests, kubectl, helm, and notify are hypothetical helpers
for (const pr of pull_requests.opened()) {
  // Create namespace only (no new database/CDN)
  kubectl.create_namespace(`pr-${pr.number}`)

  // Deploy apps with shared DB (schema-isolated)
  helm.install(apps, { namespace: `pr-${pr.number}` })

  // Notify when ready (2-5 min total)
  notify(`PR ${pr.number} env ready: pr-${pr.number}.neo01.com`)
}

Comparison:

| Aspect | Full EoD (Per-Env Resources) | Lightweight (Namespace + Shared) |
| --- | --- | --- |
| Provisioning time | 15-30 minutes | 2-5 minutes |
| Isolation | High (dedicated database, MQ) | Medium (shared DB, schema isolation) |
| Cost per env | $25-75/day | $5-15/day |
| Complexity | High (IaC + GitOps) | Medium (GitOps only) |
| Use case | Feature testing, compliance | Quick fixes, high-frequency iteration |

🤔 So Why Not Always Use Lightweight?

If namespace-only is 6x faster and 5x cheaper, why provision full environments?

  • Data isolation — Some tests need dedicated DB (migrations, data seeding)
  • Performance testing — Shared resources skew load test results
  • Compliance — Regulated industries require env isolation (audit trails)
  • Blast radius — Bad config in one env shouldn't affect others

The answer: Tiered environments (lightweight for quick fixes, full EoD for features).


Why Teams Don’t Standardize on One Approach

1. Trade-Off Complexity

Different PRs need different isolation levels:

  • Typo fix → namespace-only (2 min)
  • Database migration → full database (20 min)
  • Frontend tweak → no CDN (direct load balancer)
  • Payment feature → full isolation (compliance)

2. Cost Coupling

Full EoD for every PR costs roughly 5x as much per environment as the lightweight tier (see the comparison above), with no velocity gain for simple changes.

3. Governance Requirements

Regulated industries require audit trails and approval gates for certain resources (e.g., no public storage in previews).

The self-service ideal clashes with “needs review” for sensitive changes.


Summary: Environment on Demand Architecture

flowchart LR
    subgraph "GitOps Workflow"
        A[Pull Request] --> B[GitOps ApplicationSet]
        B --> C[IaC Apply]
    end
    subgraph "Infrastructure"
        C --> D[Kubernetes Namespace]
        D --> E[Compute Pods + Service Mesh]
        D --> F[Managed Database]
        D --> G[CDN + Object Storage]
        D --> H[DNS]
    end
    subgraph "Governance"
        I[Admission Policies] --> D
        J[Compliance Checks] --> F
        K[Cost Allocation Tags] --> D
    end
    style A fill:#e3f2fd,stroke:#1976d2
    style B fill:#fff3e0,stroke:#f57c00
    style D fill:#e8f5e9,stroke:#388e3c

Key Takeaways:

| Aspect | Environment on Demand |
| --- | --- |
| Interface | GitOps (ApplicationSets + IaC) |
| Structure | Tiered environments (lightweight to full) |
| Data flow | Git → IaC → Kubernetes → Environment |
| Lifecycle | Ephemeral (TTL-based cleanup) |
| Best for | Feature testing, integration, stakeholder demos |
| Worst for | Quick fixes, high-frequency iteration |

Environment on Demand isn’t perfect—but for teams shipping fast on Kubernetes (whether EKS, AKS, GKE, or vanilla), it’s the difference between “waiting days for an environment” and “testing in 15 minutes.” Understanding the trade-offs helps you build EoD that works with your team’s workflow, not against it.

✅ Key Takeaway

Environment on Demand is a trade-off, not a silver bullet:

  • Gains: Isolation, self-service, audit trail, faster feedback
  • Losses: Provisioning latency, cost complexity, operational overhead

For ~35-person teams in regulated industries, the gains outweigh the losses. For solo devs or simple apps, it's overkill.

Your job as a platform engineer: Know which tier (lightweight vs. full) fits each use case—and automate the boring stuff (cleanup, notifications, cost tracking).


What’s Next?

In Part 2, we’ll dive into:

  • Environment Lifecycle Management — Why staging should be permanent, previews ephemeral
  • The AI Coding Bottleneck — Why agentic coding makes provisioning the new constraint
  • Deployment Strategies — Blue-green vs. rolling vs. recreate for different tiers
  • ROI & Maturity Model — How to measure success

→ Read Part 2: Lifecycle, AI Coding & Optimization

