Bring Back the Math: Why Future LLMs Need Mechanical Algorithms

April 29, 2026 · 6 min read

LLM impressive. No argument.

LLM also hallucinate. Confidently. Fluently. Wrong answer look identical to right answer. No internal alarm. No "I am not sure." Just smooth wrong answer.

Why? LLM is distribution over tokens. Not truth machine. When distribution confident but wrong — output wrong, tone confident.

Math does not do this. BM25(q, d) returns score. Deterministic. Run again — same score. No drift. No confidence without basis.

Future of reliable AI: put math and LLMs in correct relationship. Not replace one with other. Each doing what it is designed for.

Where LLMs Fail and Math Doesn't

Industry mistake: saw LLMs succeed at generation → assumed they succeed at everything. Replaced BM25 with semantic search. Replaced parsers with "just ask the LLM." Replaced deterministic pipelines with "agent will figure it out."

Result: systems that feel smarter but are less reliable. Fail in unpredictable ways.

BM25: The Algorithm Nobody Wants to Talk About

Invented 1994. Still best-in-class for keyword retrieval.

BM25(q, d) = Σᵢ IDF(qᵢ) · (f(qᵢ,d) · (k₁+1)) / (f(qᵢ,d) + k₁·(1 − b + b·|d|/avgdl))

Parameter	Meaning	Typical value
`f(qᵢ, d)`	Term frequency in document	—
`IDF(qᵢ)`	Rarity of term across corpus	`log((N-n+0.5)/(n+0.5))`
`k₁`	Term frequency saturation	1.2
`b`	Document length normalization	0.75

Why still relevant:

Query type	Semantic search	BM25
"atulya.retain method"	Returns docs about storing, saving, persistence	Finds "atulya.retain" exactly — milliseconds
"error code ERR_AUTH_403"	Returns docs about auth failures — maybe	Finds "ERR_AUTH_403" exactly — always
"Anurag's PR from March 14"	Semantic match — unreliable	Keyword + date filter — precise

BM25 and semantic search not competitors. Complementary. Different failure modes. Run both. Fuse results.

Reciprocal Rank Fusion: Simple Math, Surprising Power

Run 4 retrieval methods in parallel. Each return ranked list. How to combine?

Problem with averaging scores: incompatible scales. Semantic similarity: 0.0–1.0. BM25: 0–50. Graph relevance: 0–∞. Cannot meaningfully average.

RRF solution:

RRF(d) = Σᵢ 1 / (k + rank(d, Lᵢ))

Scenario	Single method	RRF
Query = exact error code	BM25 wins, semantic fails	BM25 rank 1 → high RRF
Query = "how does Anurag usually approach this?"	Semantic wins, BM25 fails	Semantic rank 1 → high RRF
Query = "what changed in auth last week?"	Temporal wins	Temporal + graph both contribute
Query = complex technical concept	All contribute	Stable fusion — no single method dominates

RRF adds ~2ms. Meaningfully better than any single strategy. No learned weights. No parameters to tune. Pure math.

Cross-Encoder: When Neural Networks Earn Their Place

Two-stage retrieval: recall first, precision second.

Stage	Method	Scope	Speed	Quality
Stage 1	BM25 + semantic + graph + temporal via RRF	Full corpus	Fast	High recall
Stage 2	Cross-encoder reranking	Top-50 candidates only	Slower	High precision

Bi-encoder (stage 1): encode query and document separately → dot product. Fast. But misses subtle interactions.

Cross-encoder (stage 2): encode query-document pair jointly → full attention across both. Sees negation, conditional relevance, exact overlap. Applied to only 50 candidates — tractable.

Final score with temporal boost:

score_final = σ(cross_encoder) · e^(−λ·age) · temporal_match_boost

Math (BM25 + RRF) did heavy lifting on recall. Neural did precision work on small candidate set. Each doing job it is designed for.

HMM + Kalman: Tracking State Over Time

Memory not static. Entity state changes. Tracking current state — not just current facts — requires temporal modeling.

Hidden Markov Model for state sequences:

P(s₁…sₙ | o₁…oₙ) ∝ P(o₁|s₁) · Π P(oₜ|sₜ) · P(sₜ|sₜ₋₁)

Viterbi algorithm finds most likely state path. Deterministic given parameters. No hallucination possible.

Kalman filter for continuous signals:

x̂ₜ = x̂ₜ|ₜ₋₁ + Kₜ(zₜ − H·x̂ₜ|ₜ₋₁)
Kₜ = Pₜ|ₜ₋₁·Hᵀ · (H·Pₜ|ₜ₋₁·Hᵀ + R)⁻¹

K_t = Kalman gain. Balances trust in prediction vs new observation.

Signal	LLM approach	Math approach
"What is Anurag's current state?"	Generates plausible continuation	HMM posterior over state sequence — actual evidence
"Is this influence score trending up?"	Guesses from context	Kalman-smoothed estimate — bounded by math
"Anomaly in access pattern?"	Cannot reliably detect	Robust z-score — deterministic

The Right Architecture

LLM not doing retrieval. LLM reasoning about what the mechanical pipeline surfaced. This is the correct role.

Why "Just Use a Bigger LLM" Fails

Concern	Bigger model	Hybrid pipeline
Hallucination	Less often — but more convincingly wrong	Math layer catches before output
Cost at scale	Increases with usage	Math cost constant regardless of scale
Auditability	Black box	Every mechanical step inspectable
Consistency	Non-deterministic — same query, different answers	Mechanical stages deterministic
Runs locally	Requires cloud for large models	BM25 + RRF runs on CPU, no GPU needed

Scale math. Do not scale away from math.

Math as Integrity Layer for LLM Outputs

One more role: check LLM outputs mechanically.

Check	Mechanism
Answer references entity not in retrieved context	Flag — possible hallucination
Answer contradicts observation in memory bank	Detect contradiction score — surface for review
Claimed fact has no evidence trace in memory	Mark as generated, not recalled
Answer inconsistent with prior answer to same query	Hash comparison — flag inconsistency

LLM cannot reliably self-check. Math can check LLM. Deterministic rules applied to LLM output create integrity layer that meaningfully reduces hallucination reaching user.

This is BRAIN integrity vision. Not replacing LLM. Supervising it with mechanical precision.

The Principle

Creative where creativity required. Exact where exactness required. Honest about which is which.

LLMs brilliant at generation. Cannot tell you when they are wrong.
Math exact at what math does. Cannot reason about ambiguous open-ended problems.

Together: reliable, auditable, scalable AI systems. That is where engineering needs to go. Not bigger models. Smarter pipelines.

Where LLMs Fail and Math Doesn't​

BM25: The Algorithm Nobody Wants to Talk About​

Reciprocal Rank Fusion: Simple Math, Surprising Power​

Cross-Encoder: When Neural Networks Earn Their Place​

HMM + Kalman: Tracking State Over Time​

The Right Architecture​

Why "Just Use a Bigger LLM" Fails​

Math as Integrity Layer for LLM Outputs​

The Principle​