NeetoCI Test Optimization

~3× faster runs  ·  677 hrs/month freed  ·  fleet-wide

neeto-ci-web  ·  epic #3799  ·  May 2026

TL;DR

default.yml run (neeto-cal-web, success)
20.5 min → 7.0 min
−13.5 min  ·  −66%
8 apps live, avg run-time (measured)
−36% avg  ·  ~126 hrs/mo saved
across 1,641 runs/mo · first-day post-merge data
cache restore (unpack ~1.4 GB)
26–68 s → ~24 s
±3× variance → ~0 variance (r8gd NVMe)
Per-pod setup overhead removed
~3 min × every pod
Ruby/Node/Postgres/Redis + pgvector baked

Cal-web: full pre/post measurement on add-minitest-distributed (4,024 tests). 8-app measurement: avg of block-pipeline runs (test_pods > 0) from production ci_jobs since 2026-05-19 15:30; baseline = 7–13 May p50.

The Problem

12,068 default.yml CI runs in 30 days across 100 products; the largest suites dominate:

Project | Success p50 | p95 | Runs / 30d | Hrs / mo
neeto-cal-web | 17.5 min | 37.1 min | 1,751 | ~580
neeto-form-web | 16.5 min | 31.8 min | 544 | ~162
neeto-desk-web | 12.6 min | 25.9 min | 954 | ~224
neeto-invoice-web | 14.3 min | 28.6 min | 125 | ~33
neeto-crm-web | 13.8 min | 23.0 min | 110 | ~26
all 100 products | - | - | 12,068 | ~1,505

A successful neeto-cal-web run takes ~20 min today. That's ~580 hours of CI/month on one product alone — and every dev waits on it.

Old architecture — one pod, serial commands

1. CHECKOUT: clone the repo into the pod's working dir
2. SETUP: neetoci-version ruby/node, install services
3. CACHE RESTORE: pull node_modules / vendor/bundle / .nvm from S3 → unpack to EBS root
4. INSTALL: bundle install, yarn install, start postgres + redis (per-pod apt-install pgvector)
5. DB: rake db:create db:schema:load
6. RAILS TEST (← dominant cost): bundle exec rails test, run serially against the suite
7. EPILOGUE: coverage publish

Every command runs in a single Kubernetes pod. No fan-out, no parallelism, no shared cache between blocks. EBS root volume is the only filesystem.

Where time was actually going

Per-step p50 · p95 across 18 pre-optimization neeto-cal-web default.yml runs (7–13 May 2026):

Step | p50 | p95
checkout | 2.1 s | 3.3 s
neetoci-version ruby | 3.1 s | 5.0 s
neetoci-version node | 4.6 s | 6.9 s
postgres start | 42 s | 58 s
redis start | 9.5 s | 14 s
cache restore (EBS) | 63 s | 85 s
bundle install | 1.4 s | 2.9 s
yarn install | 1.0 s | 11 s
db:create + schema:load | 12 s | 21 s
rubocop + erblint | 14 s | 18 s
bundle exec rails test | 9.1 min | 15.2 min
bundle exec rake setup | 126 s | 193 s

rails test ≈ 65% of in-command time (p50). rake setup, cache restore and postgres startup are the next-largest — and every one of them repeats on each parallel pod.
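
The share quoted above can be reproduced from the per-step p50 figures. The sketch below is only an approximation: it assumes that summing per-step p50 values is a fair stand-in for total in-command time, which is not strictly true because medians do not add.

```python
# p50 per-step durations in seconds, copied from the per-step figures above.
P50_SECONDS = {
    "checkout": 2.1,
    "neetoci-version ruby": 3.1,
    "neetoci-version node": 4.6,
    "postgres start": 42.0,
    "redis start": 9.5,
    "cache restore (EBS)": 63.0,
    "bundle install": 1.4,
    "yarn install": 1.0,
    "db:create + schema:load": 12.0,
    "rubocop + erblint": 14.0,
    "bundle exec rails test": 9.1 * 60,
    "bundle exec rake setup": 126.0,
}


def share_of_total(step: str) -> float:
    """Fraction of the summed p50 time spent in one step (an approximation)."""
    return P50_SECONDS[step] / sum(P50_SECONDS.values())


def largest_steps(count: int) -> list:
    """Step names ordered from most to least expensive by p50."""
    return sorted(P50_SECONDS, key=P50_SECONDS.get, reverse=True)[:count]


if __name__ == "__main__":
    for step in largest_steps(4):
        print(f"{step:28s} {share_of_total(step):6.1%}")
```

Under that assumption rails test comes out at roughly two thirds of the total, followed by rake setup, cache restore and postgres start, in that order.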

The architecture rewrite — blocks pipeline (epic #3799)

A CI job is now a DAG of blocks. Each block has named jobs; each job is its own K8s job; pods within a job can fan out.

[Diagram: pipeline blocks]
Block 1, Setup & Checks: Install + ESLint · Auditors + Linters
Block 2, Schema Drift: db:create · incinerator · rake setup
Block 3, Tests: rails test ×4 (parallel + shared_redis)

Block 1 (Setup & Checks) runs 2 K8s jobs in parallel. Once it succeeds, Block 2 and Block 3 fan out together (both depend only on Block 1). Block 3's Run tests job is itself 4 K8s pods coordinated via a shared Redis queue.
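
The dependency-gated fan-out can be sketched as grouping blocks into waves, where every block in a wave has all of its dependencies satisfied. This is a minimal illustration, not the ExecuteService implementation; the function name `execution_waves` is hypothetical, and "Schema Drift" as the name of Block 2 is an assumption taken from the diagram labels.

```python
from typing import Dict, List


def execution_waves(deps: Dict[str, List[str]]) -> List[List[str]]:
    """Group blocks into waves; every block in a wave can start together."""
    remaining = {name: set(needed) for name, needed in deps.items()}
    done = set()
    waves = []
    while remaining:
        # A block is ready once all of its dependencies have finished.
        ready = sorted(n for n, needed in remaining.items() if needed <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown block")
        waves.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return waves


# Block names as described above (Block 2's name is an assumption).
PIPELINE = {
    "Setup & Checks": [],
    "Schema Drift": ["Setup & Checks"],
    "Tests": ["Setup & Checks"],
}

if __name__ == "__main__":
    for index, wave in enumerate(execution_waves(PIPELINE), start=1):
        print(index, wave)
```

With the three blocks above this yields two waves: the entry block alone, then the two dependent blocks together.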

New YAML format

global_job_config:                # runs on EVERY pod, before everything
  env_vars: [{ name: TZ, value: UTC }]
  prologue:
    commands: [neetoci-version ruby 4.0.1, checkout, cache restore, bundle install]
blocks:
  - name: Setup & Checks
    dependencies: []                # entry block
    task:
      jobs:                         # each entry = its own K8s job
        - { name: Install + ESLint,  commands: [...] }
        - { name: Auditors + Linters, commands: [...] }
  - name: Tests
    dependencies: [Setup & Checks]  # fan-out from Block 1
    task:
      prologue:                     # block-level prologue, runs once per pod
        commands: [neetoci-service start postgres 18, ...]
      jobs:
        - name: Run tests
          commands: [bundle exec rails test]
          parallelism: 4
          shared_redis: true        # minitest-distributed coordinator
      epilogue:
        always: { commands: [bundle exec rake simplecov_coverage:publish] }

Fully backward compatible. The parser falls back to legacy behavior when task:, dependencies:, or global_job_config: are absent — old flat commands: configs still work unchanged.
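
One way to read the fallback rule is as a small predicate over the parsed config. The sketch below is an assumption about the rule, not the actual parser: the function name `uses_block_pipeline` is hypothetical, and it treats the presence of any of the new keys as opting in to block mode.

```python
from typing import Any, Dict


def uses_block_pipeline(config: Dict[str, Any]) -> bool:
    """Return True when a parsed config opts into the blocks pipeline.

    Hypothetical rule: any of the new keys switches the parser to block
    mode; otherwise the flat legacy path is used.
    """
    if "global_job_config" in config:
        return True
    blocks = config.get("blocks") or []
    return any("task" in block or "dependencies" in block for block in blocks)


# Illustrative configs, already parsed from YAML into dictionaries.
legacy = {"commands": ["checkout", "bundle exec rails test"]}
modern = {"blocks": [{"name": "Tests", "dependencies": [], "task": {"jobs": []}}]}

if __name__ == "__main__":
    print(uses_block_pipeline(legacy), uses_block_pipeline(modern))
```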

Quick win #1 — pgvector apt-installed at runtime, on every pod

Problem. Every CI pod that started Postgres ran apt-get update && apt-get install postgresql-N-pgvector against the public pgdg repo. Two index updates + a 12 MB package install, ~6–10 s per pod, multiplied by every test pod in the run.

# docker-ci/utils/neetoci-service (pre-fix)
sudo podman exec postgres bash -c \
  "apt-get update -qq && apt-get install -y -qq postgresql-${pg_major}-pgvector"

Fix. Rebuilt all postgres:* images in the internal registry with pgvector baked in (issue #3880). Deleted the runtime apt-install line from neetoci-service.

Result: Postgres start drops from ~10 s to ~3 s per pod. The saving applies to each of the 5–10 pods per run, across thousands of runs/month.

Quick win #2 — Ruby/Node/Postgres/Redis installed at runtime

Problem. Every pod paid neetoci-version ruby 4.0.1 → tarball download from the in-cluster binaries-cache service, unpack into ~/.rbenv/versions/4.0.1. Same for Node via nvm. Postgres + Redis pulled as podman images on first use. ~30–60 s of overhead per fresh pod.

Fix. New declarative docker-ci/dependencies file (#3879 → PR #3907) and Dockerfile bake steps that pre-install everything into the CI image:

RUBY_VERSIONS=(4.0.1)                     # pre-installed under ~/.rbenv
NODE_VERSIONS=(22.13)                     # pre-installed under ~/.nvm, default alias
APT_POSTGRES_VERSIONS=(18 18.3)           # pgdg apt + pgvector
REDIS_VERSIONS=(7.0.5)                    # compiled from source at /opt/redis/7.0.5/

Result: neetoci-version ruby 4.0.1 becomes a pure rbenv switch: 3 s → 0.08 s. neetoci-service start redis 7.0.5 becomes a redis-server --daemonize: 5 s → 30 ms.

Quick win #3 — EBS cache-restore was the wildcard

Problem. Same commit, same image, same node family (r8g) — two consecutive runs of cache restore unpacking the same 1.4 GB of node_modules + vendor/bundle + .nvm:

Run | Download | Unpack node_modules | Unpack vendor/bundle | Total
A | 4.5 s | 16 s | 21 s | 26 s
B | 7.0 s | 58 s | 64 s | 68 s

Download barely moved — variance was entirely in the unpack, i.e. writing 1.4 GB into the pod filesystem. Cause: r8g nodes have an EBS-only root. Every pod's scratch I/O — cache unpack, the Postgres data dir, db:schema:load, log files — lands on one network-attached gp3 volume shared by every pod on the node. Run enough pods at once and its IOPS saturate.

Fix. Moved the CI Karpenter NodePool to the r8gd family — same Graviton4 silicon, but with a physically-attached local NVMe SSD. Setting EC2NodeClass.spec.instanceStorePolicy: RAID0 tells Karpenter to RAID the instance-store NVMe and repoint kubelet + containerd ephemeral storage onto it, so pod scratch I/O hits local NVMe instead of contending for shared EBS.

Quick win #4 — cache restore was slow, not just noisy

Problem. Even on a fast node, cache restore took ~69 s. The Go cache binary — NeetoCI's fork of SemaphoreCI's cache-cli — restored every key (nvm, gems, yarn-cache, node_modules) one after another. And yarn.lock mapped to a redundant ~/.cache/yarn archive — 1.4 GB, ~30 s — that bought nothing: node_modules is already cached, so yarn install is instant on a hit.

Fix. Rebuilt the binary from toolbox PR #2 — shipped as static arm64/amd64 binaries in PR #3892:

  • Parallel restore: goroutines + sync.WaitGroup; independent keys download concurrently, taking restore from ~70 s to ~30 s (bound by the slowest key, gems).
  • Dropped the redundant yarn-cache: yarn.lock no longer caches ~/.cache/yarn → −30 s/job and −1.4 GB of S3 per run.
  • Parallel store: same goroutine pattern around compress + upload; faster cache-miss runs.
  • S3 downloader: concurrency 5 → 10, part size 5 → 10 MiB.

cache-cli cut how long restore takes; r8gd removed how unpredictable the unpack is. Same 1.4 GB, two independent fixes. (Upstream Semaphore still has this bottleneck, see issue #357.)
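
The effect of restoring keys concurrently can be sketched with a thread pool, the closest Python analogue of the goroutine + sync.WaitGroup pattern. This is an illustration only: the real binary is written in Go, the `restore` function is a stand-in that sleeps instead of downloading, and the per-key durations are hypothetical values chosen to match the "~30 s, bound by gems" figure.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-key restore times in seconds (not measured values).
KEY_SECONDS = {"nvm": 4.0, "gems": 30.0, "node_modules": 6.0}
SCALE = 0.001  # shrink the sleeps so the demo finishes in milliseconds


def restore(key: str) -> str:
    """Stand-in for download + unpack of one cache key."""
    time.sleep(KEY_SECONDS[key] * SCALE)
    return key


def restore_all_parallel(keys: list) -> list:
    """Restore independent keys concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=len(keys)) as pool:
        return list(pool.map(restore, keys))


def modeled_seconds(keys: list, parallel: bool) -> float:
    """Serial restore costs the sum of key times; parallel costs the slowest."""
    times = [KEY_SECONDS[key] for key in keys]
    return max(times) if parallel else sum(times)


if __name__ == "__main__":
    keys = list(KEY_SECONDS)
    print(restore_all_parallel(keys))
    print(modeled_seconds(keys, parallel=False), modeled_seconds(keys, parallel=True))
```

The model ignores shared bandwidth and disk contention, so the real gain is somewhat smaller than sum versus max suggests.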

Quick win #5 — parallelism: 4 but one pod ran the entire suite

Problem. Setting parallelism: 4 spawned 4 test pods, but with no work distributor the suite was never split: one pod ran every test and the other three exited without running any. A 4,024-test run on the buggy build:

Pod | Tests run | Duration
pod 0 | 4,024 (all) | ~14 min
pod 1 | 0 | 0.06 s
pod 2 | 0 | 0.06 s
pod 3 | 0 | 0.06 s

Fix. Added the minitest-distributed gem (loaded conditionally via MINITEST_COORDINATOR env var) and a new shared_redis: true per-job flag. NeetoCI provisions a per-job Redis; pods enqueue/work-steal tests until the queue drains.

Result (4-pod run, same suite): 4,024 tests split as 974–1,050 per pod (~10% spread). Wall-clock 14 min → 5 min.
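
The queue-draining behaviour can be sketched in-process. This is not minitest-distributed itself: a `queue.Queue` stands in for the shared Redis, threads stand in for pods, and `run_distributed` is a hypothetical name. It only shows the invariant that matters, namely that every test is taken by exactly one worker.

```python
import queue
import threading


def run_distributed(test_ids, workers: int) -> list:
    """Each worker pulls the next test from a shared queue until it drains."""
    pending = queue.Queue()
    for test_id in test_ids:
        pending.put(test_id)
    executed = [[] for _ in range(workers)]

    def worker(index: int) -> None:
        while True:
            try:
                test_id = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is finished
            executed[index].append(test_id)  # stand-in for running the test

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return executed


if __name__ == "__main__":
    per_worker = run_distributed(range(4024), workers=4)
    print([len(tests) for tests in per_worker])
```

How evenly the counts come out depends on how long each test takes; in this toy version the tests are instantaneous, so the split says nothing about real balance.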

Investigation — the smoking guns in the logs

Four findings from instrumenting the pod logs (JSON-event stream into the UI accordion):

  1. cache restore variance: 26 s ↔ 68 s on identical inputs (slide 10). Concurrent pods on the same node fighting for EBS IOPS.
  2. parallelism: 4 → only 1 pod ran tests: minitest had no distributor; the first pod that reached rails test finished the whole run before others joined (slide 12).
  3. runtime apt-get for pgvector: the same apt-get update + install postgresql-N-pgvector ran in every postgres start, every pod, every run. ~10 s of pure waste (slide 8).
  4. epilogue logs missing from UI: bundle exec rake simplecov_coverage:publish ran but never appeared in the pipeline view — the post-deployment script bypassed the JSON-event logger.

Root cause — five stacked issues

  1. Flat YAML → flat execution. One pod ran every command serially. No way to fan out, no way to run two things at once.
  2. No parallel-test distributor. Even when N pods were spawned, each had its own copy of the suite — no work-stealing.
  3. EBS root volume. Per-pod scratch I/O (cache unpack, db init, log files) all went to a contended network-attached disk.
  4. Runtime-installed dependencies. Every pod paid Ruby/Node/Postgres/Redis install on cold start — they should have been in the image.
  5. Runtime-installed pgvector. The pgvector apt-install ran inside the postgres container on every pod that started a database.

Each one was a multi-second tax; combined they were the difference between a 7-minute and 20-minute run.

The fix — five-part rollout

SHIPPED Blocks pipeline + parser + orchestration — new CiJobBlock/CiJobBlockJob models, YAML blocks:/task:/dependencies, ExecuteService/SpawnBlockService/SyncPodService
SHIPPED Cache CLI parallel restore/store — concurrent S3 transfers, redundant yarn-cache removed
FLAGSHIP CI image bake (:v62) — Ruby 4.0.1, Node 22.13, Postgres 18 + pgvector, Redis 7.0.5 all pre-installed via the new declarative docker-ci/dependencies file
SHIPPED Postgres registry images rebuilt: postgres:{13,14,15,15.1,18,18.3} all carry pgvector baked in; runtime apt-install removed from neetoci-service
SHIPPED Karpenter NodePool → r8gd — local NVMe via instanceStorePolicy: RAID0; pod scratch I/O off EBS
SHIPPED minitest-distributed + shared_redis — work-stealing test queue, per-pod test count even within 10%
SHIPPED UI: Pipeline view + dependency DAG + live status — vertical/horizontal layouts, "Depends on …" labels, ActionCable status refresh, epilogue logs in the accordion

Tracking issue: neeto-ci-web#3799 · 14 sub-issues, 13 PRs merged

Result #1 — per-step timing (neeto-cal-web)

Step | Before (p50 · p95) | After (:v62, r8gd + bake) | Δ p50
neetoci-version ruby 4.0.1 | 3.1 s · 5.0 s | 0.08 s | −97%
neetoci-version node 22.13 | 4.6 s · 6.9 s | ~1 s | −78%
cache restore (1.4 GB) | 63 s · 85 s | ~24 s | −62%
bundle install --jobs 2 | 1.4 s · 2.9 s | 0.8 s | −43%
neetoci-service start postgres 18 | 42 s · 58 s | ~3 s | −93%
neetoci-service start redis 7.0.5 | 9.5 s · 14 s | 0.03 s | −99%
bundle exec rake db:create db:schema:load | 12 s · 21 s | ~8 s | −35%
bundle exec rails test | 9.1 min · 15.2 min | ~5 min (4 pods) | −45%

Before = p50 · p95 of 18 production neeto-cal-web default.yml runs, 7–13 May 2026 (per-command durations parsed from job logs). After = median of 3+ runs on add-minitest-distributed, commit dd261621.

Result #2 — measured wall-clock, 8 apps (1st day post-merge)

[Chart: wall-clock per app. Series: Baseline p50 (7-day, pre-merge) vs Post-merge avg (block pipeline, this session)]

Baseline = success p50 from production ci_jobs, 7–13 May 2026. Post-merge = avg of block-pipeline runs only (test_pods > 0), 19 May 15:30 onward. n = 2–6/app, still early.

Result #3 — measured fleet savings (8 apps live, 7 projected)

Top 15 projects by default.yml avg run-time. Unstarred rows = block pipeline live; avg from production logs since merge.

# | Project | Runs/mo | Baseline p50 | Post-merge | Δ | hrs saved/mo
1 | neeto-cal-web | 763 | 19.8 min | ~7.0 min | −65% | ~165
2 | neeto-form-web | 292 | 16.5 min | 6.5 min | −61% | ~49
3 | neeto-invoice-web | 70 | 15.0 min | 7.9 min | −48% | ~8
4 | neeto-crm-web | 69 | 13.8 min | 7.7 min | −44% | ~7
5 | neeto-desk-web | 507 | 12.6 min | 8.2 min | −35% | ~37
6 | neeto-chat-web | 178 | 10.3 min | 7.5 min | −28% | ~9
7 | neeto-monitor-ruby | 133 | 9.7 min | ~5.5 min* | −43%* | ~9*
8 | neeto-deploy-web | 118 | 8.2 min | 5.6 min | −32% | ~5
9 | neeto-planner-web | 115 | 7.8 min | 7.2 min | −7% | ~1
10 | neeto-auth-web | 292 | 7.5 min | 5.3 min | −29% | ~11
+ 5 nanos at 5.5–6.3 min p50 (block pipeline not yet shipped, pending minitest-distributed wiring) | ~7*
TOTAL (8 measured + 7 projected) | 2,637 | ~830 hrs/mo | ~520 hrs/mo | −37% | ~308 hrs/mo

measured: avg of block-pipeline runs (test_pods > 0) since 2026-05-19 15:30, n = 2–6/app, baseline = success p50 of 7–13 May 2026. * = projected, suite-size model: large (≥14 min) −45%, medium (8–14 min) −35%, small (<8 min) −15%.
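
The hrs saved/mo column appears to follow from runs × (baseline − post-merge) / 60. The sketch below assumes that formula, which the deck does not state, and applies it to three of the measured rows; the function name is hypothetical.

```python
def hours_saved_per_month(runs_per_month: int, baseline_min: float, post_min: float) -> float:
    """Assumed formula: minutes saved per run times runs, converted to hours."""
    return runs_per_month * (baseline_min - post_min) / 60.0


# (runs/mo, baseline p50 in minutes, post-merge average in minutes)
ROWS = {
    "neeto-cal-web": (763, 19.8, 7.0),
    "neeto-form-web": (292, 16.5, 6.5),
    "neeto-desk-web": (507, 12.6, 8.2),
}

if __name__ == "__main__":
    for project, (runs, baseline, post) in ROWS.items():
        print(f"{project:18s} {hours_saved_per_month(runs, baseline, post):6.1f} hrs/mo")
```

Under this formula the three rows come to about 163, 49 and 37 hours, close to the ~165, ~49 and ~37 shown; the small gap on cal-web is consistent with its post-merge figure being approximate.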

Result #4 — variance collapse

Two consistency improvements that don't show up in averages but matter every day:

cache restore (unpack 1.4 GB)
±20 s → ±2 s
EBS contention → local NVMe
Parallel test pods (per-pod runtime)
~5 min → ~5 min ±5 s
4-way distribution: 974–1,050 tests/pod
"Stuck" jobs (≥1 hr wall-clock)
~6,984 / 30 d → expected ↓
Driven by pod restarts on EBS contention
Failure recovery time
~20 min → ~7 min
Re-running a failed PR is now sub-10-min

Predictable CI is more valuable than fast CI. The new pipeline is both.

Result #5 — cluster compute (honest framing)

Parallelism reshapes wall-time but doesn't reduce total CPU-minutes much by itself:

Phase | Before | After
rails test (compute) | 1 pod × 14 min = 14 pod-min | 4 pods × 5 min = 20 pod-min
Setup (compute) | ~2 min per pod × 1 pod = 2 pod-min | ~30 s per pod × 5 pods = 2.5 pod-min

Total pod-minutes per run: roughly the same.
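
The pod-minute totals can be tallied directly from the figures above. This is plain arithmetic on the slide's own numbers; the helper name is hypothetical.

```python
def pod_minutes(pods: int, minutes_per_pod: float) -> float:
    """Compute consumed by a phase: pod count times minutes per pod."""
    return pods * minutes_per_pod


# Before: one pod runs the tests (14 min) and its own setup (~2 min).
before_total = pod_minutes(1, 14) + pod_minutes(1, 2)

# After: 4 test pods at ~5 min each, plus ~30 s of setup on each of 5 pods.
after_total = pod_minutes(4, 5) + pod_minutes(5, 0.5)

if __name__ == "__main__":
    print(before_total, after_total)
```

The totals are 16 and 22.5 pod-minutes, so parallelism by itself costs slightly more compute per run; the saving comes from the shorter per-pod setup.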

The real compute savings come from the bake — ~3 min of setup × every pod removed:

  • ~150–200 node-hours/month saved across the fleet
  • ~$100–200 / month in raw EC2 (r8gd.2xlarge on-demand baseline)
  • No autoscale spikes from EBS-bound cache-restore stalls

The headline win is wall-clock, not cost. But the cluster also stops thrashing — that has its own quiet value.

Why faster CI matters operationally

CI wait-time isn't a storage line item. Every minute is a developer either waiting on a green build, or context-switching away and losing flow.

Engineering productivity
~$34k+ / month
677 hrs/mo × $50/hr loaded cost
Branch merge latency
~13 min sooner
PRs land that much faster after green
Iteration throughput
~2.8× more runs/day
Same cluster, no autoscale spikes
Flaky-test recovery
7 min vs 20 min
Cheap to re-run → easier to land fixes

Compounding effect on team rhythm: when CI is fast and predictable, smaller PRs become viable. Smaller PRs land faster, are easier to review, and break less. The 20 min → 7 min cycle is what makes the whole loop tighter.

What we actually shipped

SHIPPED CiJobBlock + CiJobBlockJob models · PR #3884 · per-block / per-job execution state
SHIPPED YAML parser: blocks key · PR #3885
SHIPPED Block orchestration · PR #3886 · ExecuteService / SpawnBlockService / SyncPodService
SHIPPED Pipeline UI (vertical + horizontal) · PR #3889
SHIPPED Cache CLI parallel restore/store · PR #3892
SHIPPED Prologue execution fix · PR #3896
SHIPPED Truncated names in horizontal view · PR #3901
FLAGSHIP global_job_config + task: + dependencies · PR #3903 · Semaphore-aligned YAML, parallel jobs in a block, dependency-gated blocks
FLAGSHIP CI image bake (:v62) · PR #3907 · Ruby + Node + Postgres + pgvector + Redis 7.0.5 pre-installed via declarative docker-ci/dependencies
SHIPPED Epic merge into main · PR #3910 · 14 commits, 65 files, +3,089 / −210
OPS Postgres registry images rebuilt with pgvector baked (postgres:{13,14,15,15.1,18,18.3})
OPS Karpenter NodePool → r8gd + RAID0 NVMe

What we evaluated and skipped (honestly)

Option | Original idea | Honest take | Verdict
Phase 2: shared EFS workspace across blocks | Run setup once, share to all pods | Breaks per-pod isolation; minitest-distributed needs isolated FS; concurrent pods clobber each other in cache restore | SKIP · Issue #3895 closed
gp3 IOPS bump (5000 → 16000) | Cheaper than r8gd | Still EBS-bound; doesn't fix variance under concurrent pods; ~$40/node/mo surcharge | DROPPED · r8gd wins
Redis from apt repo | Lighter than source-compile | packages.redis.io ships only current latest; can't pin 7.0.5 | REJECTED · Compiled from source instead
True DAG fan-out UI (arrows for arbitrary deps) | Render directed edges between every block pair | Linear depth-column layout reads cleanly for current configs; full DAG drawing adds complexity for no UX gain | DEFERRED
EFS CSI driver for shared scratch | One file system, all pods see it | Provisioned + tested; per-pod isolation broke parallel tests; tore it down | REJECTED · r8gd local NVMe instead

Going broad first kept the design honest. Knowing when to stop is part of the work.

Artifacts shipped

ECR (728988564940.dkr.ecr.us-east-1.amazonaws.com)
└── neeto-ci-deployment-image:v62
    ├── Ruby     4.0.1      pre-installed via rbenv
    ├── Node     22.13.1    pre-installed via nvm, default alias
    ├── Postgres 18.4 + pgvector 0.8.2  (apt cluster, fresh per pod)
    ├── Redis    7.0.5      compiled from source at /opt/redis/7.0.5/
    └── docker-ci/dependencies (declarative, easy to extend)

Internal registry (10.100.0.20:5000)
└── postgres:{13,14,15,15.1,18,18.3}     (rebuilt with pgvector baked)

Cluster (EKS, neeto-ci)
├── EC2NodeClass arm64
│   ├── instanceStorePolicy: RAID0       (NVMe → kubelet + containerd)
│   └── blockDeviceMappings              200Gi gp3, 8000 IOPS, 500 MB/s
└── NodePool default
    └── instance-family: r8gd            (Graviton4 + local NVMe SSD)

Code
└── PR #3910 → main                       Epic merged (14 commits, 65 files)
PRs (merged): #3884 #3885 #3886 #3889 #3892 #3896 #3901 #3903 #3907 #3910 + bug-fix follow-ups
Tracking issue: neeto-ci-web #3799

Bottom line

66%
faster runs (cal-web)
36%
avg across 8 apps (measured)
~10×
less cache-restore variance
~126 hrs/mo
measured, 8 apps · 1,641 runs/mo
9 PRs
block pipeline shipped (cal + 8 web)
100 products
in scope, fleet-wide

Thanks 🙌

Questions?

Detailed report: ci-test-optimization-results.md
Deck source: github.com/vishal24367/ci-test-optimization-deck
Gist: gist.github.com/vishal24367/674d77e…
Epic: neeto-ci-web #3799