Adding an ANN Vector Index to MIF Scenario Search

Pre-implementation research. No code changes made. Findings dated 2026-05-25.

TL;DR

1. ANN options in BigQuery — what exists and what to pick

IVFTreeAH recommended
AlgorithmInverted-file via k-means clusteringGoogle ScaNN — tree-quantized with asymmetric hashing (product quantization)
Best forSmaller datasets, smaller query batches, when "stored column" optimization mattersLarge vector tables, large query batches; "orders of magnitude faster and more cost-effective"
Distance typesCOSINE / EUCLIDEAN / DOT_PRODUCTCOSINE / EUCLIDEAN / DOT_PRODUCT
Incremental refreshAsync, periodicAutomatic background refresh — typically 5–15 min after writes
Stored-column join eliminationYes (helps if your SELECT only needs indexed columns)No — base table join is preserved
PartitioningSupportedSupported (2026 feature) — enables partition pruning
Recall vs latency knobfraction_lists_to_searchfraction_leaf_nodes_to_search

Why TreeAH for us

2. How the index is created — operational behavior

Async Non-blocking Coverage observable

Proposed DDL (do not run yet)

CREATE VECTOR INDEX scenario_embedding_idx
ON `sesame-prod-426417.analytics.fct_calls_scenario_embeddings`(embedding)
OPTIONS (
  index_type    = 'TREE_AH',
  distance_type = 'COSINE'
);

We can also pass tree_ah_options = '{"normalization_type":"NONE"}' — our vectors are already L2-normalized so skipping the built-in normalization is correct. Default leaf size is usually fine; tune only if recall is poor.

How to "index without exposing the search" — staging the rollout safely

This is the answer to your "is there a way to index the data without giving people access?" question. There are two layers:

  1. Index creation is invisible to users by itself. Today's MIF search uses hand-written UNNEST cosine, which the planner cannot route to a vector index. So you can create the index in prod and nobody's query path changes. It just sits there, populated and idle.
  2. Query path adoption is gated separately. Mirror the existing SCENARIO_SEARCH_HYBRID pattern: add a SCENARIO_SEARCH_ANN env var (or a Statsig flag) that flips handle_scenario_search from the brute-force similarities CTE to a VECTOR_SEARCH-based version. Off-by-default → no user-visible change.
Validation playbook (no production impact):
  1. CREATE VECTOR INDEX ... on prod table.
  2. Poll INFORMATION_SCHEMA.VECTOR_INDEXES until coverage_percentage = 100 and index_status = 'ACTIVE'.
  3. Run a side-by-side comparison query in the BQ console: top-K from brute-force vs top-K from VECTOR_SEARCH for a handful of representative queries. Measure recall@K and wall time. No app code touched.
  4. If happy, ship the gated code path and flip the flag in dev / internal / prod in that order.

3. Dev table structure — what's there and what's not

WhereTableStatusNotes
Prod
sesame-prod-426417
analytics.fct_calls_scenario_embeddings Exists 12,366,493 rows · 912,458 calls · 75 partitions · 2026-03-12 → 2026-05-24 · 92 GB logical · clustered on (context_mode, user_key)
Dev
encoded-stage-394013
analytics.fct_calls_scenario_embeddings Does NOT exist The Dagster asset hardcodes the prod project (see fct_calls_scenario_embeddings.py). There is no dev-side embeddings table being populated.
Dev
encoded-stage-394013
analytics.test_hybrid_scenario_search Sandbox 100-row toy table used during hybrid-search bring-up. Schema matches prod minus a couple of nullability annotations.

Prod table schema (the one we'd index)

ColumnTypeNotes
dsDATEPartition key
call_idINT64
call_uuidSTRING
window_idxINT64
user_keySTRINGClustering key
character_nameSTRING
call_duration_s / start_time_ms / end_time_ms / num_utterancesFLOAT64 / INT64Window metadata
valueSTRINGWindow text with prepended search keys. scenario_value_idx SEARCH INDEX is built on this column (BM25-ish via SEARCH()).
embeddingARRAY<FLOAT64> NOT NULL768-dim, L2-normalized. This is the column we vector-index.
embedding_model / context_mode / batch_job_nameSTRINGClustering: context_mode
created_atTIMESTAMP
content_deletedBOOLRetention nulls out value when true; embedding stays.

Existing indexes on the prod table

IndexTypeColumnStatusCoverage
scenario_value_idxSEARCH (BM25-ish)valueACTIVE100% (last refresh 2026-05-25 08:57 UTC)
noneVECTORembeddingThe gap this work fills.
Practical consequence for "alter the dev table first": we can't, because the dev table doesn't exist. Two options: Recommendation: do Option B first to validate DDL + the new SQL, then Option A in prod under the flag.

4. Does the existing search logic need to change?

Yes — the dense ranking path in mif_admin.py needs to be rewritten to use VECTOR_SEARCH. Creating the index alone does nothing for it.

Today the similarities CTE does this (paraphrased):

SELECT
  e.*,
  (SELECT SUM(ev*qv) FROM UNNEST(e.embedding) ev WITH OFFSET i
   JOIN UNNEST(q.embedding) qv WITH OFFSET j ON i = j)
  / NULLIF(SQRT(...) * SQRT(...), 0) AS similarity
FROM `fct_calls_scenario_embeddings` e
CROSS JOIN query_embedding q
LEFT JOIN `calls` c ON ...
WHERE e.ds BETWEEN @start_date AND @end_date

The planner cannot route hand-written UNNEST arithmetic to a vector index. To get acceleration we'd need something like:

WITH query_embedding AS (
  SELECT ml_generate_embedding_result AS embedding
  FROM ML.GENERATE_EMBEDDING(...)
)
SELECT
  base.call_id, base.window_idx, base.ds, base.value,
  base.character_name, base.start_time_ms,
  1 - distance AS similarity
FROM VECTOR_SEARCH(
  TABLE `sesame-prod-426417.analytics.fct_calls_scenario_embeddings`,
  'embedding',
  (SELECT embedding FROM query_embedding),
  top_k          => 200,
  distance_type  => 'COSINE',
  options        => '{"fraction_leaf_nodes_to_search": 0.05}'
)

Gotcha to design around: the current query computes more than top-K

The existing handler returns three things from one SQL family:

  1. Page results — top-K ranked windows. Maps cleanly to VECTOR_SEARCH
  2. Distance histogram across all rows in the date range. Cannot use VECTOR_SEARCH — needs every row's distance, not top-K.
  3. "Above threshold" count for the date range. Same issue.

Options for the stats query:

My recommendation: do the first one — keep brute-force stats unchanged, accelerate only the page query. The page query is what drives perceived latency; stats can run in parallel and finish whenever.

Hybrid path interaction

The hybrid (RRF) path has a dense_top CTE that already takes LIMIT _HYBRID_TOP_K_PER_SIDE from similarities. That maps perfectly to VECTOR_SEARCH(top_k => _HYBRID_TOP_K_PER_SIDE). The lex side is unchanged. The BM25 search index (scenario_value_idx) already exists, so once we add the vector index both sides of RRF are accelerated.

5. Proposed sequence (no work to start yet)

  1. Copy one prod partition into encoded-stage-394013.analytics.scenario_embeddings_ann_test.
  2. Run the TreeAH DDL against that sandbox; verify coverage hits 100 and a hand-written VECTOR_SEARCH returns sensible top-K with acceptable recall vs brute-force.
  3. Apply the same DDL to prod. Watch INFORMATION_SCHEMA.VECTOR_INDEXES until ACTIVE. No code change yet — index is dormant.
  4. Add a SCENARIO_SEARCH_ANN flag in mif_admin.py. New code path uses VECTOR_SEARCH for the page (and the dense side of hybrid). Stats query stays brute-force. Default off.
  5. Flip the flag in dev / internal, validate UX, then prod.

Sources: BigQuery — Manage vector indexes · BigQuery — Search embeddings with vector search · INFORMATION_SCHEMA.VECTOR_INDEXES · Google Cloud blog — TreeAH / ScaNN in BigQuery · Intro to vector search