Analytics AI Has an Access Problem. The Real Problem Is Judgment.

Most analytics AI tools are built to give models better access to data. The actual gap is upstream of access — whether the system knows when it doesn't know enough, and what to do about that.


There is a version of the current enthusiasm about AI and analytics that I think misses the point entirely, and it goes like this: if we just give the model more context — more tables, more metadata, more lineage documentation — it will answer business questions reliably.

It won’t.

What I’ve seen, after spending a year building analytics systems that use language models to answer questions from data warehouses, is that access is the easy part. The hard part is what happens before and after the query: knowing whether you have enough context to write one at all, and knowing whether the answer that comes back is actually right.

The best analytics engineers I’ve worked with don’t distinguish themselves by writing SQL faster. They distinguish themselves by knowing when they don’t have enough information to write SQL at all.

They ask the clarifying question before touching a keyboard. They check whether a 30% revenue drop is a real business event or a backfill delay before escalating to leadership. They hold competing hypotheses until the evidence actually converges.

None of this is addressed by giving a model a bigger pile of table schemas.


What’s Actually Missing

Three things, I think, are missing from nearly every analytics AI tool being built today.

1. Separation between generation and evaluation

Analytics agents have a self-evaluation problem: when asked to judge their own output, they tend to find it reasonable. This is not a model limitation — it’s a structural one. The same agent that generated the query will defend the query.

The answer to this is not a smarter model. It’s an architecture that separates the agent doing the work from the agent auditing it — checking row counts against expectations, comparing to known benchmarks, flagging the 40% week-over-week jump in a metric that has never moved.

Key insight

The evaluator doesn’t ask “did the query run?” It asks “does this result look like what an experienced analyst would expect?”
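
To make that separation concrete, here is a minimal sketch in Python of an evaluator auditing a result against expectations the generating agent never sees. EvaluationReport, expected_row_range, and the 15% week-over-week tolerance are all invented for illustration, not a prescription:

```python
from dataclasses import dataclass, field


@dataclass
class EvaluationReport:
    """Findings from the auditing agent, kept separate from the agent that wrote the query."""
    findings: list[str] = field(default_factory=list)

    @property
    def suspicious(self) -> bool:
        return bool(self.findings)


def evaluate(result_rows: list[dict], *, expected_row_range: tuple[int, int],
             metric: str, last_week_value: float, max_wow_change: float = 0.15) -> EvaluationReport:
    """Audit a query result against expectations the generator never sees."""
    report = EvaluationReport()

    # Row-count sanity: far more or fewer rows than expected usually means a bad join or filter.
    lo, hi = expected_row_range
    if not (lo <= len(result_rows) <= hi):
        report.findings.append(f"row count {len(result_rows)} outside expected range {lo}-{hi}")

    # Benchmark comparison: a 40% week-over-week jump in a metric that never moves is a red flag.
    current = sum(row[metric] for row in result_rows)
    if last_week_value and abs(current - last_week_value) / last_week_value > max_wow_change:
        report.findings.append(f"{metric} moved {current / last_week_value - 1:+.0%} week over week")

    return report
```

The property that matters is that evaluate gets its expectations from somewhere other than the query’s author, whether that is a metric registry, historical values, or a human-maintained benchmark.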

2. Durable artifacts that survive the conversation

An analytics investigation typically spans multiple sessions, multiple people, multiple shifts in hypothesis. Yet most AI analytics tools treat each interaction as a fresh start. The institutional knowledge built up — which tables were considered, which assumptions were made, which interpretations were rejected and why — evaporates when the chat window closes.

A well-designed system leaves behind structured artifacts:

  • A brief capturing the request and its interpretations
  • An evidence trace recording what was considered and discarded
  • A plausibility report documenting what was checked

These turn “a conversation with an AI” into “a reviewable investigation.” They make it possible for a human, or a future session, to pick up exactly where the work left off.
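
What these artifacts look like matters less than the fact that they are plain, serializable records rather than chat history. A rough sketch, with every field name invented for illustration:

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Brief:
    """The request and how it was interpreted, including readings that were rejected."""
    question: str
    chosen_interpretation: str
    rejected_interpretations: list[str] = field(default_factory=list)


@dataclass
class EvidenceTrace:
    """What was considered and discarded on the way to the answer."""
    tables_considered: list[str] = field(default_factory=list)
    tables_used: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)


@dataclass
class PlausibilityReport:
    """What was checked about the result, and anything that was flagged."""
    checks_run: list[str] = field(default_factory=list)
    findings: list[str] = field(default_factory=list)


def save_investigation(path: Path, brief: Brief, trace: EvidenceTrace,
                       report: PlausibilityReport) -> None:
    """Persist the investigation so a human, or a future session, can pick it up later."""
    path.write_text(json.dumps({
        "brief": asdict(brief),
        "evidence_trace": asdict(trace),
        "plausibility_report": asdict(report),
    }, indent=2))
```

Because they are ordinary files, they can be diffed, reviewed like code, and reloaded by a later session.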

3. The right question about a successful query

A query that executes without error tells you almost nothing about whether the analysis is correct. The wrong timestamp column. A join on the wrong key. A filter that silently excludes a key population. Metric definitions mixed from two different source tables. All of these produce queries that run cleanly and results that look plausible until someone with domain expertise squints at them.

Key insight

A query succeeding is the analytical equivalent of code compiling — it’s a necessary condition, not a sufficient one.
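
Some of the failure modes above are cheap to catch mechanically even though the query itself runs cleanly. A sketch of two such checks, again with hypothetical names:

```python
from datetime import date, timedelta


def audit_result(rows: list[dict], *, pre_join_rows: int,
                 requested_window: tuple[date, date],
                 date_field: str = "event_date") -> list[str]:
    """Cheap checks on a query that ran without error."""
    findings = []

    # A join on the wrong (non-unique) key rarely errors; it silently multiplies rows.
    # Comparing row counts before and after the join catches most of these.
    if len(rows) > pre_join_rows:
        findings.append(f"join fan-out: {pre_join_rows} rows became {len(rows)}")

    # A filter that silently drops a population often shows up as missing days in the window.
    start, end = requested_window
    expected_days = {start + timedelta(days=i) for i in range((end - start).days + 1)}
    missing = expected_days - {row[date_field] for row in rows}
    if missing:
        findings.append(f"{len(missing)} day(s) in the requested window have no rows")

    return findings
```

Neither check proves the answer is right; they only catch the class of mistakes that pass silently because the query compiled and returned rows.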


What This Means for Practice

I build analytics AI tools, and I use them on real problems daily. The architecture I keep arriving at is not a smarter chat interface. It’s a control layer that governs the full loop from a vague business question to a validated answer.

That loop has explicit checkpoints:

  • Is there enough context to proceed, or does something need to be clarified first?
  • Does this result pass basic plausibility checks, or does something look suspicious?
  • Has the system documented its assumptions in a form that a human can read, audit, and challenge?

These checkpoints are not constraints on intelligence. They are the structure within which intelligence becomes trustworthy.
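
Reduced to code, that loop looks something like the sketch below, assuming Python and a handful of stubs standing in for the model, the warehouse, and the evaluator. Every name in it is hypothetical; the shape of the checkpoints is the only thing the sketch is claiming:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Brief:
    question: str
    interpretation: str = ""
    open_questions: list[str] = field(default_factory=list)
    assumptions: list[str] = field(default_factory=list)


@dataclass
class Clarification:
    questions: list[str]


@dataclass
class Answer:
    rows: list[dict]
    brief: Brief
    plausibility_findings: list[str]


def interpret(question: str) -> Brief:
    # Stub: a real system would use the model plus a semantic layer here.
    if "revenue" in question and "net" not in question and "gross" not in question:
        return Brief(question, open_questions=["Gross or net revenue?"])
    return Brief(question, interpretation=question, assumptions=["excluding test accounts"])


def generate_and_run(brief: Brief) -> list[dict]:
    # Stub standing in for SQL generation and execution against the warehouse.
    return [{"week": "2024-W10", "revenue": 1200.0}]


def evaluate(rows: list[dict], brief: Brief) -> list[str]:
    # Stub standing in for the separate auditing agent from section 1.
    return [] if rows else ["query returned no rows"]


def answer_question(question: str) -> Answer | Clarification:
    brief = interpret(question)

    # Checkpoint 1: enough context to proceed, or does something need clarifying first?
    if brief.open_questions:
        return Clarification(brief.open_questions)

    rows = generate_and_run(brief)

    # Checkpoint 2: does the result pass basic plausibility checks, or does something look off?
    findings = evaluate(rows, brief)
    if findings:
        return Clarification(findings)

    # Checkpoint 3: assumptions travel with the answer so a human can read, audit, and challenge them.
    return Answer(rows, brief=brief, plausibility_findings=findings)
```

Asked “How did revenue do last week?”, this version comes back with a Clarification asking gross or net before any SQL exists, which is exactly the behavior the first checkpoint is there to force.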

A senior analyst operates this way naturally — freely exploring competing hypotheses on the way in, rigorously converging on a conclusion on the way out. The craft is knowing the difference between the two modes, and switching deliberately. A well-designed system encodes this discipline into the architecture so it doesn’t depend on the model having learned it by luck from training data.

The tools that will matter are not the ones with the best SQL generation. They’re the ones that know when SQL isn’t the answer yet.

When the question is ambiguous enough to produce three different valid queries. When the result looks off in a way that statistics can’t catch but judgment can. When the right move is to stop, surface an assumption, and ask a human.


The Gap

Access to data is an input layer. Judgment about data is the expertise layer. Right now, nearly all the investment is going into the input layer — more metadata, better embeddings, richer semantic context. The expertise layer remains almost entirely unmapped.

That’s the gap. And it’s the one that actually matters.
