
Database Design

RepoRelay uses a content-addressable storage model. Files are keyed by SHA-256 hash, so identical content across branches and tags shares parsed symbols, chunks, and embeddings automatically. Every ref gets a complete ref_files set via git ls-tree — no incremental diffs, no ordering dependencies.
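The deduplication described above can be illustrated with a minimal sketch. It is not RepoRelay's actual implementation: the in-memory maps stand in for the file_contents and ref_files tables, and the names `indexRef` and `sha256Hex` are hypothetical. File contents are passed in directly here, whereas the real indexer obtains the file listing via git ls-tree.

```typescript
import { createHash } from "node:crypto";

// Hypothetical in-memory stand-ins for the file_contents and ref_files tables.
type Sha256 = string;
const fileContents = new Map<Sha256, string>();          // hash -> content
const refFiles = new Map<string, Map<string, Sha256>>(); // ref -> (path -> hash)

function sha256Hex(content: string): Sha256 {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

// Index one ref from a complete file listing.
// Returns how many files carried content that was not stored yet.
function indexRef(ref: string, files: Record<string, string>): number {
  let newlyStored = 0;
  const paths = new Map<string, Sha256>();
  for (const [path, content] of Object.entries(files)) {
    const hash = sha256Hex(content);
    if (!fileContents.has(hash)) {
      // Parsing, chunking, and embedding would happen only on this branch.
      fileContents.set(hash, content);
      newlyStored++;
    }
    paths.set(path, hash);
  }
  // The ref always receives a complete path set; no diff against other refs is needed.
  refFiles.set(ref, paths);
  return newlyStored;
}

const main = {
  "src/a.ts": "export const a = 1;\n",
  "src/b.ts": "export const b = 2;\n",
};
const feature = { ...main, "src/b.ts": "export const b = 3;\n" };

const storedForMain = indexRef("main", main);          // both files are new
const storedForFeature = indexRef("feature", feature); // only the changed file is new
```

Because each ref carries its own complete path set, refs can be indexed in any order, and the unchanged file is stored only once.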

Schema

repos (1) ──< repo_refs (1) ──< ref_files >── (1) file_contents
                                                       │
                                                       ├──< symbols
                                                       ├──< chunks  (with embedding vectors)
                                                       └──< imports

Tables (7)

Table          Description
repos          Registered repositories (name, path/URL, glob patterns)
repo_refs      Indexed refs (branches/tags) per repo
file_contents  Content-addressable file storage (keyed by SHA-256)
ref_files      Junction: maps ref+path → file_contents
symbols        Extracted symbols per file_contents
chunks         Chunked content with 768-dim embedding vectors
imports        Parsed import statements per file_contents
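To show how the junction table ties a ref and path to shared per-content data, here is a small sketch using plain TypeScript types. The row shapes, column names, and the `symbolsAt` helper are assumptions made for illustration; the real definitions live in the Drizzle schema file and may differ.

```typescript
// Hypothetical row shapes for two of the seven tables; real column names may differ.
interface RefFileRow { refId: number; path: string; contentHash: string }
interface SymbolRow { contentHash: string; name: string; kind: "function" | "class" | "const" }

// Follow ref_files -> file_contents -> symbols for one path on one ref.
function symbolsAt(
  refFiles: RefFileRow[],
  symbols: SymbolRow[],
  refId: number,
  path: string,
): SymbolRow[] {
  const link = refFiles.find((row) => row.refId === refId && row.path === path);
  if (!link) return [];
  return symbols.filter((s) => s.contentHash === link.contentHash);
}

// "abc123" is a placeholder, not a real SHA-256 digest.
const refFileRows: RefFileRow[] = [
  { refId: 1, path: "src/util.ts", contentHash: "abc123" }, // e.g. a branch
  { refId: 2, path: "src/util.ts", contentHash: "abc123" }, // e.g. a tag with identical content
];
const symbolRows: SymbolRow[] = [
  { contentHash: "abc123", name: "slugify", kind: "function" },
];

const onBranch = symbolsAt(refFileRows, symbolRows, 1, "src/util.ts");
const onTag = symbolsAt(refFileRows, symbolRows, 2, "src/util.ts");
```

Both lookups return the same symbol row, because symbols hang off the content hash rather than off the ref.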

Search combines two retrieval approaches and fuses their results in a single SQL query:

  • BM25 full-text search via ParadeDB's pg_search extension, using a source_code tokenizer that handles camelCase and snake_case
  • Vector similarity search via pgvector, using an HNSW index and cosine distance

The two result lists are merged with Reciprocal Rank Fusion (RRF). Because RRF works on ranks rather than raw scores, it is scale-invariant and needs no weight tuning.
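As an illustration of Reciprocal Rank Fusion, the sketch below fuses two ranked lists in memory. RepoRelay performs the fusion inside SQL, so this is only a model of the formula, not the project's query. The function name `rrfFuse`, the chunk IDs, and the constant k = 60 (a commonly used default) are assumptions.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)),
// where rank starts at 1 and documents missing from a list contribute nothing.
function rrfFuse(rankings: string[][], k = 60): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    // Highest fused score first; ties are broken by id for a stable order.
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
}

// Hypothetical result lists, best match first.
const bm25Hits = ["chunk-7", "chunk-2", "chunk-9"];
const vectorHits = ["chunk-2", "chunk-4", "chunk-7"];

const fused = rrfFuse([bm25Hits, vectorHits]);
console.log(fused.map((hit) => hit.id));
```

Only rank positions enter the score, so BM25 scores and cosine distances never need to be normalized against each other.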

Migrations

The worker runs migrations automatically on startup. The schema is defined in src/storage/schema/schema.ts using Drizzle ORM. After changing the schema, generate and apply a migration:

```bash
pnpm db:generate   # Create migration SQL
pnpm db:migrate    # Apply migrations
```

Released under the MIT License.