Adding an ANN Vector Index to MIF Scenario Search

Pre-implementation research. No code changes made. Findings dated 2026-05-25.

TL;DR

Pick TreeAH (ScaNN-based) over IVF — better fit for our 12.4M-row, L2-normalized, cosine-distance workload, and refreshes automatically every 5–15 min.
Non-disruptive: index creation is fully asynchronous and runs in the background. It does not block existing queries. Better still, the current MIF query computes cosine with hand-rolled UNNEST + dot product, which does not benefit from the index at all. Creating the index changes nothing in production until we also change the SQL.
Once built, switch the dense ranking path in mif_admin.py to VECTOR_SEARCH(...). That part does require a code change. Gate it behind a flag like the existing SCENARIO_SEARCH_HYBRID.
Dev caveat: encoded-stage-394013.analytics.fct_calls_scenario_embeddings does not exist. The Dagster asset hardcodes the prod project. There is a 100-row sandbox test_hybrid_scenario_search in dev. Index validation effectively has to happen on prod (or via a manually copied partition in dev).

1. ANN options in BigQuery — what exists and what to pick

Property	IVF	TreeAH (recommended)
Algorithm	Inverted-file via k-means clustering	Google ScaNN — tree-quantized with asymmetric hashing (product quantization)
Best for	Smaller datasets, smaller query batches, when "stored column" optimization matters	Large vector tables, large query batches; "orders of magnitude faster and more cost-effective"
Distance types	COSINE / EUCLIDEAN / DOT_PRODUCT	COSINE / EUCLIDEAN / DOT_PRODUCT
Incremental refresh	Async, periodic	Automatic background refresh — typically 5–15 min after writes
Stored-column join elimination	Yes (helps if your SELECT only needs indexed columns)	No — base table join is preserved
Partitioning	Supported	Supported (2026 feature) — enables partition pruning
Recall vs latency knob	`fraction_lists_to_search`	`fraction_leaf_nodes_to_search`

Why TreeAH for us

Scale fits the sweet spot. 12.4M rows / 92 GB / 768-dim is exactly the regime where ScaNN's product-quantized search wins over IVF's flat k-means.
Embeddings are already L2-normalized (see utils.py / asset SQL), so cosine ≡ dot product. TreeAH's asymmetric hashing is designed for this exact case.
Auto-refresh in 5–15 min means our daily Dagster partition writes flow into the index with no extra orchestration. IVF refresh is more periodic / coarser.
"Stored column" optimization doesn't help us. The MIF page query already LEFT JOINs the calls table for preset/system_prompt enrichment, so we'd need a base-table join anyway. The one IVF advantage doesn't apply.
Trade-off acknowledged: ANN gives approximate recall. For scenario exploration (humans skimming top-K) this is fine; for any "count above threshold" / histogram metric we should keep a separate exact code path (see §4 below).
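
The "cosine ≡ dot product" point above can be checked with a few lines of Python. This is a minimal sketch on toy 3-dim vectors, not the 768-dim production embeddings; the helper names are hypothetical and do not come from utils.py.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity computed the long way, with explicit norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([1.0, 2.0, 2.0])

# For unit vectors both norms are 1, so the denominator drops out.
assert math.isclose(cosine_similarity(a, b), dot(a, b), rel_tol=1e-9)

# Cosine distance is 1 - cosine similarity.
cosine_distance = 1.0 - dot(a, b)
```

Because the norms are already 1, the norm terms in the current hand-rolled query are redundant work for these embeddings.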

2. How the index is created — operational behavior

Key properties: async, non-blocking, coverage observable.

CREATE VECTOR INDEX is a DDL statement that returns immediately; BigQuery builds the index in the background using free background slots.
Existing queries are not blocked or affected. The base table behaves identically while the index builds.
While the index is still building, VECTOR_SEARCH already returns correct results: it serves indexed rows from the index and brute-forces the not-yet-indexed remainder. No code-level fallback logic needed.
Status is observable in INFORMATION_SCHEMA.VECTOR_INDEXES — watch coverage_percentage and last_refresh_time.
One hard rule: the indexed base table must be ≥ 10 MB or the index stays at coverage 0. Our prod table is 92 GB, so this is moot.
Cost: you pay for index storage. TreeAH with PQ compresses heavily — expect single-digit GB, not 92 GB.
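
The status check described above could be wrapped in two small helpers. This is a sketch only: the function names are hypothetical, the query text is built but not executed here, and the readiness rule (ACTIVE plus 100% coverage) is the one this memo proposes, not a BigQuery-defined notion of "ready".

```python
def index_status_query(project: str, dataset: str, index_name: str) -> str:
    """Build the status query text.

    Identifiers are interpolated directly, so pass trusted values only.
    """
    return (
        "SELECT index_name, index_status, coverage_percentage, last_refresh_time\n"
        f"FROM `{project}`.{dataset}.INFORMATION_SCHEMA.VECTOR_INDEXES\n"
        f"WHERE index_name = '{index_name}'"
    )


def is_index_ready(row: dict) -> bool:
    """Return True once the index is ACTIVE and fully covers the base table."""
    coverage = row.get("coverage_percentage") or 0
    return row.get("index_status") == "ACTIVE" and coverage >= 100


sql = index_status_query("sesame-prod-426417", "analytics", "scenario_embedding_idx")
```

The row passed to is_index_ready is assumed to be a dict-like result row from whichever BigQuery client the caller already uses.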

Proposed DDL (do not run yet)

CREATE VECTOR INDEX scenario_embedding_idx
ON `sesame-prod-426417.analytics.fct_calls_scenario_embeddings`(embedding)
OPTIONS (
  index_type    = 'TREE_AH',
  distance_type = 'COSINE'
);

We can also pass tree_ah_options = '{"normalization_type":"NONE"}' — our vectors are already L2-normalized so skipping the built-in normalization is correct. Default leaf size is usually fine; tune only if recall is poor.

How to "index without exposing the search" — staging the rollout safely

This is the answer to your "is there a way to index the data without giving people access?" question. There are two layers:

Index creation is invisible to users by itself. Today's MIF search uses hand-written UNNEST cosine, which the planner cannot route to a vector index. So you can create the index in prod and nobody's query path changes. It just sits there, populated and idle.
Query path adoption is gated separately. Mirror the existing SCENARIO_SEARCH_HYBRID pattern: add a SCENARIO_SEARCH_ANN env var (or a Statsig flag) that flips handle_scenario_search from the brute-force similarities CTE to a VECTOR_SEARCH-based version. Off-by-default → no user-visible change.
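
A minimal sketch of the second layer, assuming the flag is read from an environment variable. How SCENARIO_SEARCH_HYBRID is actually parsed in mif_admin.py is not shown in this memo, so the accepted truthy values and both function names below are assumptions.

```python
import os


def ann_search_enabled(env=None) -> bool:
    """Read the (proposed) SCENARIO_SEARCH_ANN flag; unset or empty means off."""
    env = os.environ if env is None else env
    value = env.get("SCENARIO_SEARCH_ANN", "")
    return value.strip().lower() in {"1", "true", "yes", "on"}


def choose_dense_path(env=None) -> str:
    """Pick which dense-ranking SQL the handler should build."""
    return "vector_search" if ann_search_enabled(env) else "brute_force"
```

With the variable unset, choose_dense_path returns "brute_force", so deploying the code changes nothing until the flag is set.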

Validation playbook (no production impact):

CREATE VECTOR INDEX ... on prod table.
Poll INFORMATION_SCHEMA.VECTOR_INDEXES until coverage_percentage = 100 and index_status = 'ACTIVE'.
Run a side-by-side comparison query in the BQ console: top-K from brute-force vs top-K from VECTOR_SEARCH for a handful of representative queries. Measure recall@K and wall time. No app code touched.
If happy, ship the gated code path and flip the flag in dev / internal / prod in that order.
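
For the side-by-side comparison step, recall@K can be computed offline from the two result lists. The sketch below assumes each result is identified by a (call_id, window_idx) pair; that key choice and the sample IDs are illustrative, not taken from real query output.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k that also appears in the approximate top-k."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    approx_top = set(approx_ids[:k])
    return len(exact_top & approx_top) / len(exact_top)


# Illustrative IDs only: (call_id, window_idx) pairs in ranked order.
brute_force_ids = [(1, 0), (2, 0), (3, 1), (4, 0)]
vector_search_ids = [(1, 0), (3, 1), (9, 2), (2, 0)]

recall = recall_at_k(brute_force_ids, vector_search_ids, k=4)
```

Here three of the four exact results are recovered, so recall is 0.75. Averaging this over the handful of representative queries gives one number to weigh against wall time.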

3. Dev table structure — what's there and what's not

Where	Table	Status	Notes
Prod `sesame-prod-426417`	`analytics.fct_calls_scenario_embeddings`	Exists	12,366,493 rows · 912,458 calls · 75 partitions · 2026-03-12 → 2026-05-24 · 92 GB logical · clustered on `(context_mode, user_key)`
Dev `encoded-stage-394013`	`analytics.fct_calls_scenario_embeddings`	Does NOT exist	The Dagster asset hardcodes the prod project (see `fct_calls_scenario_embeddings.py`). There is no dev-side embeddings table being populated.
Dev `encoded-stage-394013`	`analytics.test_hybrid_scenario_search`	Sandbox	100-row toy table used during hybrid-search bring-up. Schema matches prod minus a couple of nullability annotations.

Prod table schema (the one we'd index)

Column	Type	Notes
`ds`	DATE	Partition key
`call_id`	INT64
`call_uuid`	STRING
`window_idx`	INT64
`user_key`	STRING	Clustering key
`character_name`	STRING
`call_duration_s` / `start_time_ms` / `end_time_ms` / `num_utterances`	FLOAT64 / INT64	Window metadata
`value`	STRING	Window text with prepended search keys. `scenario_value_idx` SEARCH INDEX is built on this column (BM25-ish via `SEARCH()`).
`embedding`	ARRAY<FLOAT64> NOT NULL	768-dim, L2-normalized. This is the column we vector-index.
`embedding_model` / `context_mode` / `batch_job_name`	STRING	Clustering: `context_mode`
`created_at`	TIMESTAMP
`content_deleted`	BOOL	Retention nulls out `value` when true; embedding stays.

Existing indexes on the prod table

Index	Type	Column	Status	Coverage
`scenario_value_idx`	SEARCH (BM25-ish)	`value`	ACTIVE	100% (last refresh 2026-05-25 08:57 UTC)
none	VECTOR	`embedding`	—	The gap this work fills.

Practical consequence for "alter the dev table first": we can't, because the dev table doesn't exist. Two options:

Option A: Index directly in prod. Safe because (a) creation is async and non-blocking, (b) no query uses VECTOR_SEARCH today so the index sits idle until we flip a flag. This is the lowest-friction path.
Option B: Copy one partition into dev first. e.g. CREATE TABLE `encoded-stage-394013.analytics.scenario_embeddings_ann_test` AS SELECT * FROM `sesame-prod-426417.analytics.fct_calls_scenario_embeddings` WHERE ds = '2026-05-24', then CREATE VECTOR INDEX against it. Gives an independent sandbox to validate the DDL options and the VECTOR_SEARCH rewrite end-to-end before touching prod. Adds ~1 GB of dev storage, which clears the 10 MB minimum from §2; otherwise free. The existing 100-row sandbox is almost certainly below that minimum, so it cannot be used to validate the index.

Recommendation: do Option B first to validate DDL + the new SQL, then Option A in prod under the flag.

4. Does the existing search logic need to change?

Yes — the dense ranking path in mif_admin.py needs to be rewritten to use VECTOR_SEARCH. Creating the index alone does nothing for it.

Today the similarities CTE does this (paraphrased):

SELECT
  e.*,
  (SELECT SUM(ev*qv) FROM UNNEST(e.embedding) ev WITH OFFSET i
   JOIN UNNEST(q.embedding) qv WITH OFFSET j ON i = j)
  / NULLIF(SQRT(...) * SQRT(...), 0) AS similarity
FROM `fct_calls_scenario_embeddings` e
CROSS JOIN query_embedding q
LEFT JOIN `calls` c ON ...
WHERE e.ds BETWEEN @start_date AND @end_date

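The planner cannot route hand-written UNNEST arithmetic to a vector index. To get acceleration we'd need something like the following (this sketch omits the ds date-range filter and the calls join from the current query; both would still have to be applied):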

WITH query_embedding AS (
  SELECT ml_generate_embedding_result AS embedding
  FROM ML.GENERATE_EMBEDDING(...)
)
SELECT
  base.call_id, base.window_idx, base.ds, base.value,
  base.character_name, base.start_time_ms,
  1 - distance AS similarity
FROM VECTOR_SEARCH(
  TABLE `sesame-prod-426417.analytics.fct_calls_scenario_embeddings`,
  'embedding',
  (SELECT embedding FROM query_embedding),
  top_k          => 200,
  distance_type  => 'COSINE',
  options        => '{"fraction_leaf_nodes_to_search": 0.05}'
)
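
On the application side, the options string and the distance-to-similarity conversion could be kept in small helpers so the recall/latency knob is set in one place. This is a sketch: both function names are hypothetical, and the (0, 1] range check is an assumption about valid values for fraction_leaf_nodes_to_search, not something verified here.

```python
import json


def vector_search_options(fraction_leaf_nodes_to_search: float) -> str:
    """Serialize the VECTOR_SEARCH options argument as a JSON string."""
    if not 0.0 < fraction_leaf_nodes_to_search <= 1.0:
        raise ValueError("fraction_leaf_nodes_to_search must be in (0, 1]")
    return json.dumps(
        {"fraction_leaf_nodes_to_search": fraction_leaf_nodes_to_search}
    )


def similarity_from_cosine_distance(distance: float) -> float:
    """Mirror the SQL expression `1 - distance AS similarity`."""
    return 1.0 - distance


options = vector_search_options(0.05)
```

Raising the fraction searches more leaf nodes, which should trade latency for recall; the recall@K comparison from the validation playbook is the way to pick a value.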

Gotcha to design around: the current query computes more than top-K

The existing handler returns three things from one SQL family:

Page results: top-K ranked windows. Maps cleanly to VECTOR_SEARCH.
Distance histogram across all rows in the date range. Cannot use VECTOR_SEARCH — needs every row's distance, not top-K.
"Above threshold" count for the date range. Same issue.

Options for the stats query:

Keep stats brute-force. Stats and page are already split (asyncio.gather). Accelerate only the page query and leave stats as is. Simplest, smallest blast radius. Stats latency stays where it is today; stats remains the bottleneck, but nothing regresses.
Replace stats with index-only approximation. Pull top-N (large N) from VECTOR_SEARCH and compute histogram on that. Faster but the histogram becomes "histogram of top-N by ANN", not "histogram of all in range" — a semantic change.
Drop stats entirely for ANN mode. Worst UX impact, simplest code.

My recommendation: do the first one — keep brute-force stats unchanged, accelerate only the page query. The page query is what drives perceived latency; stats can run in parallel and finish whenever.
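
To make the semantic change in the second option concrete, the sketch below compares a histogram over all rows with a histogram over only the top-N. The similarities are synthetic uniform random numbers, not real query output, and the bin count, N, and threshold are arbitrary illustrative choices.

```python
import random


def histogram(values, bins=10):
    """Count values in [0, 1) into equal-width bins."""
    counts = [0] * bins
    for v in values:
        idx = min(int(v * bins), bins - 1)
        counts[idx] += 1
    return counts


rng = random.Random(0)
similarities = [rng.random() for _ in range(10_000)]  # synthetic stand-in
top_n = sorted(similarities, reverse=True)[:500]      # what a top-N query returns

full_hist = histogram(similarities)
top_n_hist = histogram(top_n)

threshold = 0.5
above_full = sum(1 for s in similarities if s >= threshold)
above_top_n = sum(1 for s in top_n if s >= threshold)  # capped at N
```

The top-N histogram is empty everywhere except the highest bin, and the "above threshold" count saturates at N. Both numbers stop describing the date range and start describing the cutoff.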

Hybrid path interaction

The hybrid (RRF) path has a dense_top CTE that already takes LIMIT _HYBRID_TOP_K_PER_SIDE from similarities. That maps perfectly to VECTOR_SEARCH(top_k => _HYBRID_TOP_K_PER_SIDE). The lex side is unchanged. The BM25 search index (scenario_value_idx) already exists, so once we add the vector index both sides of RRF are accelerated.
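
For reference, reciprocal rank fusion over the two ranked ID lists can be sketched as follows. The constant k=60 is the commonly used default and an assumption here; the value the existing hybrid path uses is not stated in this memo, and the function name is hypothetical.

```python
def rrf_fuse(dense_ids, lexical_ids, k=60):
    """Fuse two ranked ID lists with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per ID, with rank starting at 1.
    """
    scores = {}
    for ranking in (dense_ids, lexical_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], str(d)))


dense = ["a", "b", "c"]    # would come from VECTOR_SEARCH
lexical = ["b", "d", "a"]  # would come from the SEARCH-index side
fused = rrf_fuse(dense, lexical)
```

Only ranks enter the formula, so swapping the dense side from exact to approximate ranking changes the fused result only where the ANN ordering itself differs.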

5. Proposed sequence (no work to start yet)

Copy one prod partition into encoded-stage-394013.analytics.scenario_embeddings_ann_test.
Run the TreeAH DDL against that sandbox; verify coverage_percentage reaches 100 and a hand-written VECTOR_SEARCH returns sensible top-K with acceptable recall vs brute-force.
Apply the same DDL to prod. Watch INFORMATION_SCHEMA.VECTOR_INDEXES until ACTIVE. No code change yet — index is dormant.
Add a SCENARIO_SEARCH_ANN flag in mif_admin.py. New code path uses VECTOR_SEARCH for the page (and the dense side of hybrid). Stats query stays brute-force. Default off.
Flip the flag in dev / internal, validate UX, then prod.

Sources: BigQuery — Manage vector indexes · BigQuery — Search embeddings with vector search · INFORMATION_SCHEMA.VECTOR_INDEXES · Google Cloud blog — TreeAH / ScaNN in BigQuery · Intro to vector search