Core concepts
clearvoiance is four things that combine into one workflow. Most of the confusion about the project evaporates once the four are distinct in your head.
1. Capture
The SDK wraps your framework's request lifecycle and streams events to the engine. Every inbound HTTP request, socket message, cron tick, queue job, outbound HTTP call, and DB query becomes an event with a unique id, a timestamp, and a payload.
browser → express → captureHttp(middleware) → gRPC → engine → ClickHouse
Key properties:
- AsyncLocalStorage context. Every inbound event opens a scope keyed on its id. Child operations (outbound fetches, DB queries, emitted socket messages) read `currentEventId()` so they can be correlated back to the originating request (see the wiring sketch after this list).
- Write-ahead log. If the engine's unreachable, the client persists events to `wal.dir` and drains them automatically when the connection comes back. No lost events on deploys / restarts.
- Backpressure-aware. Large response bodies get hashed + offloaded to blob storage (MinIO / S3) instead of traveling the gRPC stream.
- Full-fidelity by default. The SDK ships with an empty header denylist: Authorization, Cookie, and Set-Cookie all flow through as captured, so replay works without auth-strategy acrobatics. For production captures, opt into `RECOMMENDED_HEADER_DENY_PRODUCTION` per adapter to redact secrets.
- Capture modes. Two ways to run the SDK:
  - Auto-session: `client.start()` opens a session immediately and runs until process exit. Good for dev + CI.
  - Remote control (Monitors): the SDK subscribes to the engine's ControlService and waits idle. The dashboard's Monitors page drives Start / Stop. Zero overhead while idle. The right default for production services. See Monitors.
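To make the wiring concrete, here is a minimal sketch for an Express app. `createClient`, `captureHttp`, `currentEventId`, `wal.dir`, and `client.start()` are named in this page; the import path and option shapes (`engineUrl`, the `wal` object, the `captureHttp(client)` call) are assumptions for illustration, not the SDK's exact surface.

```ts
import express from "express";
// Illustrative import path and options; check the SDK reference for the real signature.
import { createClient, captureHttp, currentEventId } from "@clearvoiance/sdk";

const client = createClient({
  engineUrl: "grpc://engine:50051",   // assumed option name
  wal: { dir: "./clearvoiance-wal" }, // write-ahead log drained when the engine comes back
});

const app = express();

// Wrap the request lifecycle: every inbound request becomes a captured event
// and opens an AsyncLocalStorage scope keyed on its event id.
app.use(captureHttp(client));

app.get("/orders/:id", async (_req, res) => {
  // Child operations (fetches, DB queries) read the originating event id
  // so they can be correlated back to this request.
  console.log("handling event", currentEventId());
  res.json({ ok: true });
});

// Auto-session mode: open a capture session immediately (good for dev + CI).
// In production, omit this and let the Monitors page drive Start / Stop.
client.start();

app.listen(3000);
```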
2. Replay
The replay engine reads captured events from ClickHouse and fires them at a target URL at a configurable speedup. A 1-hour capture can replay in 5 minutes at 12×.
ClickHouse → scheduler → dispatcher matrix → SUT (staging)
├── HTTPDispatcher
├── SocketIODispatcher
├── CronDispatcher → invoke-server on SUT
└── QueueDispatcher → invoke-server on SUT
Key properties:
- Per-protocol dispatchers. HTTP fires real requests at the target; cron + queue go through the SUT's hermetic invoke server (see below) so handlers run locally with the captured args.
- Virtual users. Fan out each captured event to N concurrent copies to push load past the captured baseline.
- Time-window selection. Replay only the 10 minutes around a known incident without re-running the full capture.
- Target-duration mode. Say "replay this 1-hour capture in 5 minutes"; the engine derives the speedup automatically.
- Auth strategies. `jwt_resign`, `static_swap`, or a custom callback: replay traffic with valid-at-target-time tokens (see the config sketch after this list).
- Body mutators. Starlark scripts that rewrite JSON bodies between capture and dispatch (unique emails per VU, customer-id swap, etc.) so the SUT doesn't key-conflict on itself.
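These knobs compose into a single replay run. The shape below is a hypothetical illustration of how they fit together; every field name is invented for the example and is not the engine's actual API, though `sess_abc`, the strategy names, and the 12× / 5-minute numbers come from this page.

```ts
// Hypothetical run definition tying the replay options together.
interface ReplayRun {
  sourceSessionId: string;                 // which captured session to replay
  targetUrl: string;                       // the SUT (staging) base URL
  window?: { from: string; to: string };   // time-window selection around an incident
  speedup?: number;                        // e.g. 12 for 12x
  targetDurationMinutes?: number;          // alternative to speedup: engine derives the rate
  virtualUsers?: number;                   // fan each captured event out to N concurrent copies
  auth?: { strategy: "jwt_resign" | "static_swap"; secret?: string };
  bodyMutator?: string;                    // path to a Starlark script that rewrites JSON bodies
}

const run: ReplayRun = {
  sourceSessionId: "sess_abc",
  targetUrl: "https://staging.example.com",
  window: { from: "2024-05-01T14:00:00Z", to: "2024-05-01T14:10:00Z" },
  targetDurationMinutes: 5,                // "replay this capture in 5 minutes"
  virtualUsers: 3,
  auth: { strategy: "jwt_resign" },
  bodyMutator: "./mutators/unique_emails.star",
};
```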
3. Hermetic mode
The "safe replay" piece. Without it, replaying production traffic against staging would send real emails, charge real Stripe cards, and hammer real third-party APIs. With it, none of that happens.
captured outbounds → mock pack → in-memory LRU on SUT
↑
SUT.fetch() → intercept → lookup (event_id, signature) ──┘
(found: serve)
(not found, strict: throw)
Key properties:
- Record + serve. During capture, outbound HTTP + fetch calls become `OutboundEvent` records. During replay (`CLEARVOIANCE_HERMETIC=true`), the SDK swaps the capturing patches for mock-serving ones: every outbound is looked up by `(event_id, signature)` and served from the captured response (sketched below).
- Strict vs. loose policy. Strict throws on unmocked outbounds (catches real drift where the SUT now calls something it didn't call during capture). Loose returns `200 {}` (quieter during development).
- Cron killer. Replaces `node-cron.schedule` with a registry that never auto-fires. The replay engine's cron dispatcher triggers registered handlers via a POST to an in-process invoke-server.
- Invoke server. Listens on `127.0.0.1:7777` (or a Sheet-style middleware mounted on your existing HTTP server). Optional Bearer token for cross-container setups.
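Conceptually, the hermetic patch is a lookup table in front of `fetch`. The stripped-down sketch below shows the mechanism only; the `MockPack` class, the key format, and the `signatureOf` helper are assumptions, and the real SDK derives the signature from the captured outbound request rather than just method + URL.

```ts
// Minimal sketch of hermetic mock serving (not the SDK's actual code).
type MockKey = string; // `${eventId}:${signature}`

class MockPack {
  constructor(private responses: Map<MockKey, { status: number; body: string }>) {}

  lookup(eventId: string, signature: string) {
    return this.responses.get(`${eventId}:${signature}`);
  }
}

// Assumed helper: build a stable signature from method + URL.
function signatureOf(input: RequestInfo | URL, init?: RequestInit): string {
  const url =
    typeof input === "string" ? input : input instanceof URL ? input.href : input.url;
  return `${init?.method ?? "GET"} ${url}`;
}

function hermeticFetch(pack: MockPack, strict: boolean, currentEventId: () => string) {
  return async (input: RequestInfo | URL, init?: RequestInit): Promise<Response> => {
    const hit = pack.lookup(currentEventId(), signatureOf(input, init));
    if (hit) {
      // Found: serve the captured response instead of calling the real service.
      return new Response(hit.body, { status: hit.status });
    }
    if (strict) {
      // Strict policy: an unmocked outbound means the SUT drifted since capture.
      throw new Error(`hermetic: no captured response for ${String(input)}`);
    }
    // Loose policy: quietly return 200 {} during development.
    return new Response("{}", { status: 200 });
  };
}
```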
4. DB observer
The killer feature. During replay, a sidecar binary polls the SUT's Postgres `pg_stat_activity` and correlates every slow query back to the replay event that caused it.
SUT.pg_query — application_name = 'clv:<event_id>'
↑ set by instrumentPg on pool.connect
or by instrumentPrisma in $extends.query
observer poll → pg_stat_activity WHERE application_name LIKE 'clv:%'
→ DbObservation{event_id, duration_ns, query_text, ...}
→ ClickHouse.db_observations
Key properties:
- `application_name` as the correlation key. A connection-level setting that every Postgres connection carries. `instrumentPg` (node-postgres), `instrumentKnex` (tarn.js-managed pool), and `instrumentPrisma` all emit the same `clv:<replayId?>:<eventId>` shape (see the sketch after this list). See Database correlation for the full adapter story, including Mongoose (which takes an SDK-side path since MongoDB has no equivalent introspection view).
- Threshold-based emission. Queries slower than `--slow-threshold-ms` (default 100 ms) become SlowQuery observations; queries in `wait_event_type = 'Lock'` become LockWait observations.
- Debounced. A 5-second slow query doesn't emit 50 observations (one per 100 ms poll); it emits one record when the query completes.
- Rollups. The UI shows top slow queries by p95, DB time by endpoint, and lock-wait summaries per replay — one click away from the replay detail page.
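The instrumentation side of that correlation is small: the adapter sets `application_name` on each checked-out connection so the observer's `pg_stat_activity` poll can attribute the query. The sketch below shows the idea with node-postgres; it is a conceptual illustration, not `instrumentPg` itself, and the real adapter also handles replay ids, escaping, and pool reuse.

```ts
import { Pool } from "pg";

// Conceptual sketch of application_name correlation (not the SDK's actual code).
// currentEventId() is the capture SDK's AsyncLocalStorage accessor.
async function queryWithCorrelation(
  pool: Pool,
  currentEventId: () => string | undefined,
  text: string,
  values?: unknown[]
) {
  const client = await pool.connect();
  try {
    const eventId = currentEventId();
    if (eventId) {
      // The observer polls pg_stat_activity WHERE application_name LIKE 'clv:%'
      // and attributes slow queries and lock waits back to this event id.
      await client.query(`SET application_name = 'clv:${eventId}'`);
    }
    return await client.query(text, values);
  } finally {
    client.release();
  }
}
```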
The workflow end-to-end
Putting it together:
- Register your prod service as a monitor with `remote: { clientName: "my-api-prod" }` on `createClient`. The SDK subscribes and waits idle: zero capture overhead at rest.
- In the dashboard's Monitors page, click Start capture when you want to record a window. Events land in ClickHouse, blobs in MinIO.
- Click Stop capture. You now have a replayable snapshot.
- Stand up staging with the SDK in hermetic mode pointing at the same engine: `CLEARVOIANCE_HERMETIC=true CLEARVOIANCE_SOURCE_SESSION_ID=sess_abc` (see the sketch after this list).
- Kick off a replay at 12×. The engine fires captured HTTP against staging and captured crons/queues against the invoke server; captured outbounds are served from the mock pack.
- The observer watches staging's Postgres and writes per-event DB observations.
- Six minutes later, the dashboard shows every slow query, every lock, every deadlock — correlated to the exact captured event that caused it.
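As a compact recap of the two SDK configurations involved: `remote.clientName` and the two CLEARVOIANCE_* environment variables are from the steps above, while the import path and the `engineUrl` option name are illustrative assumptions.

```ts
import { createClient } from "@clearvoiance/sdk"; // illustrative import path

// Production service (step 1): register as a monitor and wait idle.
// The dashboard's Monitors page drives Start / Stop, so no client.start() here.
const client = createClient({
  engineUrl: process.env.CLEARVOIANCE_ENGINE_URL, // assumed option name
  remote: { clientName: "my-api-prod" },
});

// Staging (step 4) runs the same SDK against the same engine with the
// documented switches set in the service's environment:
//   CLEARVOIANCE_HERMETIC=true
//   CLEARVOIANCE_SOURCE_SESSION_ID=sess_abc
```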
That's what makes clearvoiance useful for finding the problems synthetic load tests miss.