Code Intelligence

Auto-triage, gold artifacts, and intent-driven curation

Reviewing 50k+ chunks by hand is impossible. The Codebases code-intelligence pipeline turns every parsed snapshot into ranked, explainable, and routable knowledge: deterministic auto-triage, Symbol Cards, Module Briefs, a Repo Map, and an intent-driven curation surface that targets exactly what an agent or a human needs.

Pipeline

Mechanical signals, no LLM hot path

File roles, PageRank centrality, complexity, safety tags, exportness, and clusters all combine into one explainable significance score per chunk.

Auto-triage

Pre-route 90% of the queue

Every chunk gets a deterministic dismiss / memory / review hint with a human-readable reason, so reviewers focus only on the gray zone.

Gold Artifacts

Symbol Cards, Module Briefs, Repo Map

After triage we materialize three structured surfaces tuned for AI agents and developers, not opaque embeddings.

Curate by Intent

Goal-shaped retrieval over the snapshot

Type an intent, get a ranked cluster + symbol bundle that explains why each result matched. Built on code-aware embeddings.

The Core Promise

If you only remember one thing, remember this:

  • the pipeline scores every chunk with explainable signals
  • auto-triage pre-routes the obvious 90% so humans only see the gray zone
  • gold artifacts turn the snapshot into structured AI-grade knowledge
  • intent-driven curation lets coders ask the snapshot for exactly what they need

Why It Matters

| Pain in raw chunk review | What the code-intel pipeline does |
| --- | --- |
| 50k+ chunks make manual triage impractical | Deterministic auto-routing pre-classifies the obvious dismissals and obvious memory-grade chunks |
| parse_confidence only means "did the parser succeed" | significance_score means "is this chunk valuable for memory" |
| Reviewers cannot tell why a chunk matters | Every chunk exposes a significance_components breakdown and an auto_route_reason string |
| Embeddings alone do not explain the repo | Repo Map, Symbol Cards, Module Briefs surface structural understanding |
| Search is one-shot semantic similarity | Intent-driven curation re-ranks against the user's goal and includes path/text overlap explanations |

Pipeline At A Glance

The pipeline runs once per parsed snapshot, alongside the existing ASD parse. It is pure-Python, deterministic, and degrades gracefully when optional dependencies are missing.

Significance Score

Every chunk is scored in [0..1] from ten weighted components. The score is then compared against the codebase's triage_settings thresholds to produce a deterministic route hint.

Component Weights

| Component | Weight | Signal source | What it captures |
| --- | --- | --- | --- |
| pagerank_centrality | 0.20 | Repo-map PageRank over the symbol graph | How structurally central the chunk's symbol is in the project |
| role_weight | 0.18 | file_role.role_weight() | Importance of the file the chunk lives in (entrypoint, api_route, etc.) |
| symbol_kind_weight | 0.12 | Chunk kind + lines | Functions and classes outrank import blocks and trivial code |
| exportness | 0.10 | Heuristic on text + symbol name + language | Whether the symbol is part of the public surface (export, public name) |
| safety_priority | 0.10 | safety.scan_text_for_safety_tags() | Auth, crypto, SQL, eval, deserialization, subprocess, network |
| fanin | 0.08 | Repo-map fan-in count | How many other places reference this symbol |
| complexity_density | 0.06 | lizard cyclomatic complexity / NLOC | Density of branching logic |
| cluster_representativeness | 0.06 | Embedding cluster centroid pick | Whether the chunk represents a near-duplicate group |
| change_boost | 0.05 | Diff against previous snapshot | New / modified files get a small lift |
| parse_confidence | 0.05 | Existing AST confidence | Tie-breaker only; no longer the headline number |

Total weight = 1.00. The score is min(1.0, weighted_sum), so a chunk lands near 1.0 only when it scores high on almost every component, the high-weight ones above all.
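As a minimal sketch of the weighted sum described above: the weights are copied from the Component Weights table, while the function name, the clamping of each component to [0, 1], and the treatment of missing components as 0 are assumptions made for illustration.

```python
# Weights copied from the Component Weights table; they sum to 1.00.
WEIGHTS = {
    "pagerank_centrality": 0.20,
    "role_weight": 0.18,
    "symbol_kind_weight": 0.12,
    "exportness": 0.10,
    "safety_priority": 0.10,
    "fanin": 0.08,
    "complexity_density": 0.06,
    "cluster_representativeness": 0.06,
    "change_boost": 0.05,
    "parse_confidence": 0.05,
}


def significance_score(components: dict) -> float:
    """Weighted sum of the components, capped at 1.0.

    Assumption: each component is clamped to [0, 1] and a missing
    component counts as 0.
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        value = min(1.0, max(0.0, components.get(name, 0.0)))
        total += weight * value
    return min(1.0, total)


# A well-parsed, exported function in an api_route file with moderate centrality.
example = {
    "pagerank_centrality": 0.5,
    "role_weight": 0.90,
    "symbol_kind_weight": 1.0,
    "exportness": 1.0,
    "parse_confidence": 1.0,
}
score = significance_score(example)
print(round(score, 3))
```

Even with four components at or near their maximum, this chunk stays below the default score_threshold_high of 0.62, which illustrates why no single signal can push a chunk into memory on its own.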

File Roles

classify_file_role() is a pure path/name heuristic so it is fast, deterministic, and language-agnostic.

| FileRole | Examples | role_weight | Default route bias |
| --- | --- | --- | --- |
| entrypoint | main.py, index.ts, app/main.go, cli.py | 0.95 | memory |
| api_route | routes/, controllers/, api/, *_handler.py | 0.90 | memory |
| public_lib | lib/, sdk/, pkg/, public packages | 0.80 | memory |
| schema_model | models/, schemas/, *.proto, *.graphql | 0.75 | memory |
| shared_util | utils/, helpers/, common/ | 0.65 | review |
| config | *.yaml, *.toml, *.json config | 0.55 | review |
| migration | migrations/, alembic/ | 0.50 | review |
| test | tests/, *_test.py, *.spec.ts | 0.30 | review |
| fixture | fixtures/, __snapshots__/ | 0.20 | review |
| docs | *.md, docs/ | 0.20 | review |
| boilerplate | __init__.py (empty), index.ts re-exports | 0.05 | dismiss |
| generated | _pb2.py, *.generated.*, dist/, build/ | 0.00 | dismiss |
| vendored | node_modules/, vendor/, third_party/ | 0.00 | dismiss |
| unknown | Everything else | 0.40 | review |
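To make the path/name heuristic concrete, here is a rough sketch of how such a classifier could look. It is not the project's classify_file_role(); the function name classify_file_role_sketch, the rule ordering, and the exact glob patterns are assumptions, and the boilerplate role is left out because detecting an empty __init__.py needs file contents.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# (role, directory names, filename globs), checked top to bottom.
# Assumed ordering: dismiss-style roles come first, so vendor/lib/x.py
# is classified as vendored rather than public_lib.
RULES = [
    ("vendored", {"node_modules", "vendor", "third_party"}, []),
    ("generated", {"dist", "build"}, ["*_pb2.py", "*.generated.*"]),
    ("test", {"tests"}, ["*_test.py", "*.spec.ts"]),
    ("fixture", {"fixtures", "__snapshots__"}, []),
    ("migration", {"migrations", "alembic"}, []),
    ("api_route", {"routes", "controllers", "api"}, ["*_handler.py"]),
    ("schema_model", {"models", "schemas"}, ["*.proto", "*.graphql"]),
    ("shared_util", {"utils", "helpers", "common"}, []),
    ("public_lib", {"lib", "sdk", "pkg"}, []),
    ("entrypoint", set(), ["main.py", "index.ts", "main.go", "cli.py"]),
    ("docs", {"docs"}, ["*.md"]),
    ("config", set(), ["*.yaml", "*.toml", "*.json"]),
]

# role_weight values copied from the table above.
ROLE_WEIGHT = {
    "entrypoint": 0.95, "api_route": 0.90, "public_lib": 0.80,
    "schema_model": 0.75, "shared_util": 0.65, "config": 0.55,
    "migration": 0.50, "test": 0.30, "fixture": 0.20, "docs": 0.20,
    "generated": 0.00, "vendored": 0.00, "unknown": 0.40,
}


def classify_file_role_sketch(path: str) -> str:
    """Return the first role whose directory or filename rule matches."""
    p = PurePosixPath(path)
    dirs = set(p.parts[:-1])
    for role, dir_names, globs in RULES:
        if dirs & dir_names or any(fnmatchcase(p.name, g) for g in globs):
            return role
    return "unknown"


for sample in ["app/main.go", "src/api/users.py", "node_modules/lib/index.ts"]:
    role = classify_file_role_sketch(sample)
    print(sample, role, ROLE_WEIGHT[role])
```

Because the sketch looks only at path segments and filenames, it never opens a file, which is what keeps this kind of heuristic fast and independent of the source language.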

Auto-Triage Route Decision

Route Reason Reference

| auto_route_reason | Meaning | Default route_target |
| --- | --- | --- |
| dismiss_role | File is generated, vendored, fixture, or boilerplate | dismissed |
| trivial_chunk | Pure imports / tiny non-symbol fragment | dismissed |
| high_significance | Composite score crossed the high threshold | memory |
| high_value_role | Entrypoint / API route / public lib symbol | memory |
| safety_sensitive | Auth, crypto, SQL, deserialization etc. detected | review (not auto-memory by design) |
| structural_centrality | Symbol is highly connected but not high-scoring overall | review |
| gray_zone | Mid-significance, no other strong signal | review |

Safety-tagged chunks never auto-promote to memory. They always land in review so a human can confirm intent.
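The route reasons can be read as a short decision cascade. The sketch below is one plausible ordering, not the pipeline's actual code: the Chunk fields, the route() function, the is_trivial flag, and the comparison of safety_priority against safety_threshold are all assumptions. The threshold defaults are the documented triage_settings defaults.

```python
from dataclasses import dataclass

# Documented triage_settings defaults.
DEFAULTS = {
    "score_threshold_high": 0.62,
    "centrality_threshold": 0.35,
    "safety_threshold": 0.25,
}

# Role groups taken from the dismiss_role and high_value_role rows.
DISMISS_ROLES = {"generated", "vendored", "fixture", "boilerplate"}
HIGH_VALUE_ROLES = {"entrypoint", "api_route", "public_lib"}


@dataclass
class Chunk:
    """Hypothetical view of the fields the router needs."""
    file_role: str
    significance_score: float
    pagerank_centrality: float = 0.0
    safety_priority: float = 0.0
    is_trivial: bool = False


def route(chunk: Chunk, settings: dict = DEFAULTS) -> tuple:
    """Return (route_target, auto_route_reason) for one chunk."""
    if chunk.file_role in DISMISS_ROLES:
        return "dismissed", "dismiss_role"
    if chunk.is_trivial:
        return "dismissed", "trivial_chunk"
    # Checked before any memory route, so safety-tagged chunks never auto-promote.
    if chunk.safety_priority >= settings["safety_threshold"]:
        return "review", "safety_sensitive"
    if chunk.significance_score >= settings["score_threshold_high"]:
        return "memory", "high_significance"
    if chunk.file_role in HIGH_VALUE_ROLES:
        return "memory", "high_value_role"
    if chunk.pagerank_centrality >= settings["centrality_threshold"]:
        return "review", "structural_centrality"
    return "review", "gray_zone"


print(route(Chunk("api_route", 0.9, safety_priority=0.5)))
```

Placing the safety check ahead of both memory routes is what enforces the rule that a high-scoring auth handler still goes to a human first.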

Triage Settings

triage_settings is a JSONB column on codebases. It can be edited from the Code Map tab via Tune triage settings. Defaults are tuned for production review loops.

| Setting | Default | Range | Effect |
| --- | --- | --- | --- |
| score_threshold_high | 0.62 | [0..1] | Lower → more chunks land in memory; raise to keep gold layer tighter |
| centrality_threshold | 0.35 | [0..1] | Lower → more pure-graph chunks pulled into review |
| safety_threshold | 0.25 | [0..1] | Lower → more aggressive safety tagging surfaces in review |
| enable_safety_scan | true | bool | Master switch for built-in safety regex scanner |
| enable_semgrep | false | bool | Opt-in Semgrep subprocess (heavier) |
| semgrep_rulepack | null | string | Semgrep ruleset slug or path |
| embedding_provider | jina_local | jina_local / voyage_code_3 | Selects code-aware embedding backend for curation |
| scip_index_path | null | string | Path to a SCIP index.scip for precise xrefs |
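A settings payload might be assembled and sanity-checked as follows. The defaults and ranges come from the table above; the merge_triage_settings() helper and its validation rules are hypothetical and only illustrate how a partial override could be layered over the defaults.

```python
# Defaults copied from the Triage Settings table.
DEFAULT_TRIAGE_SETTINGS = {
    "score_threshold_high": 0.62,
    "centrality_threshold": 0.35,
    "safety_threshold": 0.25,
    "enable_safety_scan": True,
    "enable_semgrep": False,
    "semgrep_rulepack": None,
    "embedding_provider": "jina_local",
    "scip_index_path": None,
}

THRESHOLD_KEYS = ("score_threshold_high", "centrality_threshold", "safety_threshold")
BOOL_KEYS = ("enable_safety_scan", "enable_semgrep")
EMBEDDING_PROVIDERS = {"jina_local", "voyage_code_3"}


def merge_triage_settings(overrides: dict) -> dict:
    """Layer overrides on the defaults and reject out-of-range values."""
    unknown = set(overrides) - set(DEFAULT_TRIAGE_SETTINGS)
    if unknown:
        raise ValueError(f"unknown settings: {sorted(unknown)}")
    merged = {**DEFAULT_TRIAGE_SETTINGS, **overrides}
    for key in THRESHOLD_KEYS:
        value = merged[key]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{key} must be a number")
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{key} must be within [0..1]")
    for key in BOOL_KEYS:
        if not isinstance(merged[key], bool):
            raise ValueError(f"{key} must be a bool")
    if merged["embedding_provider"] not in EMBEDDING_PROVIDERS:
        raise ValueError("embedding_provider must be jina_local or voyage_code_3")
    return merged


# A stricter gold layer: raise the memory threshold, keep everything else.
strict = merge_triage_settings({"score_threshold_high": 0.75})
print(strict["score_threshold_high"], strict["centrality_threshold"])
```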

Gold Artifacts

After scoring runs, the pipeline materializes three artifact families into codebase_intel_artifacts. These are the AI-grade outputs.

Symbol Card

A structured payload per high-significance symbol.

| Field | Description |
| --- | --- |
| symbol_id | Fully qualified symbol name (pkg.module.Class.method) |
| path, start_line, end_line | Location in the snapshot |
| kind | function / method / class / interface / type |
| signature | Extracted signature line |
| purpose | One-line summary derived from docstring (docstring_parser) when available |
| pagerank | Normalized centrality |
| fanin, fanout | Caller and callee counts |
| top_callers, top_callees | Up to 5 connected symbol IDs |
| complexity | lizard cyclomatic complexity if available |
| safety_tags | List of safety tags from the body |
| chunk_keys | Linked chunk IDs for drill-down |

Module Brief

A summary of a folder/package.

| Field | Description |
| --- | --- |
| module_path | Folder path (e.g. atulya_api/engine/code_intel) |
| summary_role | Dominant FileRole in the module |
| public_surface | Top exported symbols ranked by PageRank |
| dependencies_in | Modules that import from this one |
| dependencies_out | Modules this one imports from |
| top_symbols | Top-N symbols by composite significance |
| chunk_count, loc | Volume signals |

Repo Map

A ranked symbol table inspired by Aider's RepoMap algorithm.

| Field | Description |
| --- | --- |
| top_symbols[] | Symbols sorted by pagerank desc |
| total_symbols, total_files | Snapshot footprint |
| algorithm | "pagerank-aider" or "power-iteration-fallback" |
| generated_at | UTC timestamp |
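For intuition about how a symbol graph becomes a ranked table, here is a generic power-iteration PageRank in plain Python. It is a textbook formulation, not the project's fallback implementation; the edge direction (referrer to referenced symbol), the damping factor of 0.85, the fixed iteration count, and the symbol names are all assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source, target) reference edges."""
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src in nodes:
            targets = out_links[src]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical symbol graph: two callers reach db.query through auth.login.
edges = [
    ("cli.main", "auth.login"),
    ("api.handler", "auth.login"),
    ("auth.login", "db.query"),
]
ranks = pagerank(edges)
top_symbols = sorted(ranks, key=ranks.get, reverse=True)
print(top_symbols)
```

In this toy graph the symbol that everything eventually depends on ends up first, which is the property a Repo Map relies on when it sorts top_symbols by pagerank.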

Intent-Driven Curation

The /curate endpoint takes a free-form intent and returns ranked clusters + Symbol Cards. The ranker combines:

  1. Embedding similarity between the intent and chunk embeddings (code-aware model)
  2. Token overlap against chunk text and symbol names
  3. Path match against the optional scope_hint
  4. Significance prior so high-value symbols dominate ties
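The four signals listed above could be blended along these lines. This is only a sketch: the source does not state how the signals are weighted, so the BLEND weights, the rank_chunk() function, the chunk dictionary keys, and the prefix-based path match are all hypothetical.

```python
import math
import re

# Hypothetical blend weights; the four signals are documented, the weights are not.
BLEND = {"embedding": 0.50, "overlap": 0.25, "path": 0.10, "significance": 0.15}


def tokenize(text):
    """Lowercase alphanumeric tokens; snake_case names split on underscores."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_chunk(intent, intent_vec, chunk, scope_hint=None):
    """Return (score, explain) for one candidate chunk."""
    intent_tokens = tokenize(intent)
    shared = sorted(intent_tokens & tokenize(chunk["text"] + " " + chunk["symbol"]))
    overlap = len(shared) / len(intent_tokens) if intent_tokens else 0.0
    path_hit = bool(scope_hint) and chunk["path"].startswith(scope_hint)

    score = (
        BLEND["embedding"] * cosine(intent_vec, chunk["embedding"])
        + BLEND["overlap"] * overlap
        + BLEND["path"] * (1.0 if path_hit else 0.0)
        + BLEND["significance"] * chunk["significance_score"]
    )

    reasons = []
    if shared:
        joined = ", ".join(shared)
        reasons.append(f"text overlap on '{joined}'")
    if path_hit:
        reasons.append(f"path match on '{scope_hint}'")
    return score, "; ".join(reasons)


chunk = {
    "path": "api/auth.py",
    "symbol": "login",
    "text": "def login(user): return issue_token(user)",
    "embedding": [1.0, 0.0],
    "significance_score": 0.8,
}
score, explain = rank_chunk("auth login", [1.0, 0.0], chunk, scope_hint="api/")
print(round(score, 3), explain)
```

Building the explain string from the same intermediate values that feed the score keeps the explanation honest: a reason appears only when its signal actually contributed.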

Result Shape

| Field | Description |
| --- | --- |
| total_candidates | How many chunks the ranker evaluated |
| clusters[] | Top clusters with cluster_id, representative_chunk, members, score, explain |
| symbol_cards[] | Top Symbol Cards with score and explain |
| explain | Per-result string like "text overlap on 'auth, login'; path match on 'api/'" |

Database Surface

A single Alembic migration adds the new columns and tables.

| Table / Column | Type | Purpose |
| --- | --- | --- |
| codebase_chunks.significance_score | double precision | Final composite score ([0..1]) |
| codebase_chunks.significance_components | jsonb | Per-component breakdown for the Why? tooltip |
| codebase_chunks.file_role | text | One of the FileRole enum values |
| codebase_chunks.auto_route_reason | text | Human-readable reason for the route |
| codebase_chunks.complexity_score | double precision | lizard cyclomatic complexity |
| codebase_chunks.safety_tags | text[] | Safety tags from regex + Semgrep |
| codebase_chunks.pagerank_centrality | double precision | Per-symbol PageRank |
| codebase_chunks.fanin_count | integer | Inbound reference count |
| codebases.triage_settings | jsonb | Per-codebase tuning |
| codebase_intel_artifacts | new table | Stores Symbol Cards, Module Briefs, Repo Map |
| codebase_auto_triage_overrides | new table | Audit log of operator overrides on auto-routes |
| codebase_saved_intents | new table | Reusable intents per codebase |

API Surface

All endpoints are namespaced under the existing codebase routes.

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /banks/{bank_id}/codebases/{codebase_id}/chunks | Now accepts min_significance, max_significance, file_role, auto_route_reason, has_safety_tag, route_source, order_by |
| GET | /banks/{bank_id}/codebases/{codebase_id}/artifacts/repo-map | Top symbols by PageRank |
| GET | /banks/{bank_id}/codebases/{codebase_id}/artifacts/modules | Module Briefs |
| GET | /banks/{bank_id}/codebases/{codebase_id}/artifacts/symbols | List Symbol Cards |
| GET | /banks/{bank_id}/codebases/{codebase_id}/artifacts/symbols/{symbol_id:path} | Single Symbol Card |
| POST | /banks/{bank_id}/codebases/{codebase_id}/curate | Intent-driven curation |
| GET | /banks/{bank_id}/codebases/{codebase_id}/triage-settings | Read current settings |
| PUT | /banks/{bank_id}/codebases/{codebase_id}/triage-settings | Update settings (applies on next index pass) |

chunks query params (new)

| Parameter | Type | Notes |
| --- | --- | --- |
| min_significance | float | Exclude chunks that score below this value |
| max_significance | float | Upper bound on the score; useful for surfacing the gray zone |
| file_role | string | One of the FileRole values |
| auto_route_reason | string | Filter by route reason |
| has_safety_tag | string | e.g. auth, crypto, sql_string |
| route_source | string | auto, user, or none |
| order_by | string | significance (default), pagerank, complexity, path, review |
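A gray-zone query against the chunks endpoint could be built like this. The path and parameter names come from the tables above; the base URL, the bank and codebase IDs, and the chunks_url() helper are placeholders, and the sketch stops at constructing the URL rather than sending a request.

```python
from urllib.parse import quote, urlencode


def chunks_url(base_url, bank_id, codebase_id, **filters):
    """Build the chunks listing URL, skipping filters that are None."""
    path = (
        f"/banks/{quote(bank_id, safe='')}"
        f"/codebases/{quote(codebase_id, safe='')}/chunks"
    )
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    url = base_url.rstrip("/") + path
    return f"{url}?{query}" if query else url


# Placeholder host and IDs; the score window brackets the mid-significance band.
url = chunks_url(
    "http://localhost:8000",
    "bank-1",
    "cb-42",
    auto_route_reason="gray_zone",
    min_significance=0.4,
    max_significance=0.62,
    order_by="significance",
)
print(url)
```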

Control Plane Surface

The Codebases page now exposes two new tabs and a richer review queue.

| Surface | What it shows | When to use |
| --- | --- | --- |
| Review Queue (existing, enhanced) | Significance score, file role, route source chip, safety tag chips, Why? tooltip with components, route reason filter chips, second filter row (role / route source / safety / order / min significance) | Daily triage, focusing only on the gray zone |
| Code Map (new) | Repo Map top symbols, Module Briefs, Tune triage settings dialog | Onboarding to a new repo, planning refactors |
| Curate by Intent (new) | Intent input, scope hint, ranked clusters and Symbol Cards with per-result explanations | Feature work, bug investigation, "where do I start" questions |

Optional Dependencies

The pipeline ships with sensible defaults but supports lazy-loaded upgrades.

| Capability | Default | Upgrade | When to enable |
| --- | --- | --- | --- |
| Code embeddings | jina-embeddings-v2-base-code (local) | voyage-code-3 (paid API) | When higher recall on intent curation justifies the spend |
| Safety scanning | Built-in regex pattern set | semgrep subprocess | When you want CI-grade rules on memory-bound chunks |
| Cross-references | tree-sitter heuristic edges | scip-protobuf reading index.scip | When you have a SCIP indexer in CI for the repo's languages |
| Complexity | lizard (default, multi-language) | n/a | Always on |
| Graph centrality | networkx PageRank | Pure-Python power-iteration fallback | Auto-fallback if networkx unavailable |
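Lazy loading with a fallback usually comes down to attempting the import only when the capability is needed. The sketch below shows that pattern under stated assumptions: load_optional() and centrality_backend() are hypothetical helpers, and the backend labels are illustrative rather than values the pipeline emits.

```python
import importlib


def load_optional(module_name):
    """Import a module on demand; return None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def centrality_backend():
    """Pick a graph-centrality backend based on what is importable."""
    if load_optional("networkx") is not None:
        return "networkx"
    return "power-iteration"


print(centrality_backend())
```

Because the import happens inside the function, an environment without the optional package pays nothing at startup and simply takes the fallback branch.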

Recommended Workflow

  1. Index the repo (ZIP or GitHub).
  2. Open the Code Map tab to learn the shape of the codebase before triaging.
  3. (Optional) Open Tune triage settings to make auto-routing more or less aggressive for your team.
  4. Switch to Review Queue and filter by auto_route_reason = gray_zone first.
  5. Spot-check a few safety_sensitive and structural_centrality items.
  6. Use Curate by Intent before starting any feature ("auth refresh flow", "where is rate limiting handled") to pull a focused bundle.
  7. Approve via Retain Pipeline for high-value Symbol Cards, and via ASD Direct for bulk memory hydration.

Why The Pipeline Stays Fast

| Choice | Effect |
| --- | --- |
| All scoring is pure-Python with no LLM in the hot path | Indexing cost stays bounded by the number of chunks, not tokens |
| Heavy optional deps are lazy-loaded | A repo without Semgrep / SCIP / Voyage still gets the full default experience |
| Embeddings cached by content hash | Re-indexing a snapshot reuses prior work for unchanged chunks |
| lizard results cached per file content hash | Repeated parses of the same content are free |
| Repo-map PageRank uses networkx with a pure-Python fallback | Works in environments without networkx installed |
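Caching by content hash can be illustrated with a small in-memory wrapper. This is a sketch only: the ContentHashCache class, the choice of SHA-256, and the in-memory dictionary are assumptions, and the stand-in compute function merely counts lines where a real pipeline would run an embedding or complexity pass.

```python
import hashlib


class ContentHashCache:
    """Memoize an expensive per-content computation by SHA-256 of the text."""

    def __init__(self, compute):
        self._compute = compute
        self._store = {}
        self.misses = 0

    def get(self, content: str):
        key = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._compute(content)
        return self._store[key]


# Stand-in for an expensive analysis step.
def count_lines(text):
    return len(text.splitlines())


cache = ContentHashCache(count_lines)
first = cache.get("def f():\n    return 1\n")
again = cache.get("def f():\n    return 1\n")   # unchanged content: cache hit
other = cache.get("def g():\n    return 2\n")   # new content: recomputed
print(first, again, other, cache.misses)
```

Keying on the content rather than the path means a renamed or moved file with identical text still hits the cache.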

What This Is Not

  • It is not an LLM-powered summarizer. Symbol Card purpose lines come from extracted docstrings only; no model generates them at index time.
  • It is not a replacement for recall or reflect. Curate-by-intent is scoped to the current snapshot; recall and reflect still operate over the approved memory bank.
  • It is not a substitute for human review on safety-tagged chunks. Auto-routing intentionally defers those to the review queue.

That's the contract: deterministic mechanics first, structured artifacts second, human approval last.