2026-02-24

The QA Shift for Probabilistic Systems

With the rapid progress of coding agents, getting a working prototype or MVP out quickly has become significantly easier, even for small teams.

At the same time, baseline expectations for software have risen sharply. What felt like a delighter in 2023 barely meets expectations today. And that is assuming the product avoids the usual blind spots.

When building a product meant for real business users, those blind spots cannot be treated as afterthoughts.

In finance, legal, healthcare, or enterprise workflows, failures are expensive. Traditional software breaks and throws errors. AI breaks differently: confident nonsense, risky actions, silent data leaks. That is a fundamentally different failure mode.

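Classic QA assumed determinism: the same input produces the same output, so a test either passes or fails. When the same prompt can yield different responses, that binary logic no longer holds. You are evaluating behavior quality, not just code correctness.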

Beyond informal testing, robust evals become essential. Think 50–100 structured inputs covering easy cases, edge cases, ambiguous phrasing, and messy real-world queries. Run them before every release. Use LLMs to generate and expand the test cases, and to grade outputs against defined criteria.
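
As a rough illustration, an eval set like the one described above could be wired into a tiny harness that reports a pass rate per category and gates a release on it. This is a minimal sketch, not a prescribed design: `EvalCase`, `run_evals`, `release_gate`, the 0.9 threshold, and the stand-in `fake_agent` are all hypothetical names and values. The `grade` callables are plain string checks here, and in practice they might wrap an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    prompt: str
    category: str                 # e.g. "easy", "edge", "ambiguous", "messy"
    grade: Callable[[str], bool]  # criterion; could wrap an LLM judge (assumption)


def run_evals(generate: Callable[[str], str], cases: List[EvalCase]) -> Dict[str, float]:
    """Run every case through `generate` and return the pass rate per category."""
    totals: Dict[str, int] = {}
    passed: Dict[str, int] = {}
    for case in cases:
        output = generate(case.prompt)
        totals[case.category] = totals.get(case.category, 0) + 1
        if case.grade(output):
            passed[case.category] = passed.get(case.category, 0) + 1
    return {cat: passed.get(cat, 0) / n for cat, n in totals.items()}


def release_gate(rates: Dict[str, float], threshold: float = 0.9) -> bool:
    """Allow a release only if every category meets the (hypothetical) threshold."""
    return all(rate >= threshold for rate in rates.values())


def fake_agent(prompt: str) -> str:
    """Hypothetical stand-in for the real agent under test."""
    return "42" if "6 * 7" in prompt else "I am not sure."


CASES = [
    EvalCase("What is 6 * 7?", "easy", lambda out: "42" in out),
    EvalCase("whats 6 * 7 ??", "messy", lambda out: "42" in out),
    EvalCase("What is six times seven?", "ambiguous", lambda out: "42" in out),
]
```

Calling `run_evals(fake_agent, CASES)` and passing the result to `release_gate` would block this release, because the stand-in agent fails the ambiguous phrasing. Reporting per category rather than one overall number makes it easier to see which kind of input regressed.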

Hallucination risk is another area that demands attention. If your agent interacts with business data, it will be judged on accuracy. For RAG systems, track whether the right context is retrieved, whether responses stay grounded in it, and whether answers actually address the question. It is far better to detect accuracy gaps before customers do.
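
To make the first two RAG checks concrete, here is a deliberately crude sketch using set overlap. It assumes you have labeled which document IDs are relevant for each test question. The function names are hypothetical, and token overlap is only a cheap proxy for groundedness; a real pipeline would more likely use an LLM judge or an entailment model. The third check, whether the answer addresses the question, usually needs such a judge and is not covered here.

```python
import re
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieval_recall(retrieved_ids: List[str], relevant_ids: List[str]) -> float:
    """Fraction of the labeled relevant documents that were actually retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(set(retrieved_ids) & relevant) / len(relevant)


def groundedness(answer: str, context: str) -> float:
    """Crude proxy: share of answer tokens that also appear in the context."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokens(context)) / len(answer_tokens)
```

With the context "Q3 revenue was 10 million USD", the answer "Revenue was 10 million" scores fully grounded, while "Revenue was 12 million" scores lower because "12" never appears in the retrieved text.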

Privacy and security are non-negotiable. Sensitive data can easily leak through prompts, logs, and third-party tools if safeguards are weak.
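
One basic safeguard is to redact obvious identifiers before text reaches a log line or a third-party tool. The sketch below is an assumption-heavy illustration: the two patterns (email addresses and 13 to 19 digit runs that look like card numbers) are hypothetical choices and nowhere near a complete PII detector.

```python
import re

# Simple email pattern; intentionally loose, not RFC-complete.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# 13 to 19 digits, optionally separated by single spaces or hyphens.
LONG_DIGITS = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def redact(text: str) -> str:
    """Replace emails and long digit runs with placeholders before logging."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_DIGITS.sub("[NUMBER]", text)
```

For example, `redact("Contact jane@example.com about card 4111 1111 1111 1111.")` returns `"Contact [EMAIL] about card [NUMBER]."`. Applying this at the logging boundary rather than inside the prompt logic keeps the safeguard in one place.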

Explainability matters. Enterprise users will ask why the system produced a response. In some cases, it may even be a legal requirement. Log inputs, retrieved context, and tool usage. Opaque AI is fragile AI.
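
A trace record for each response might look something like the following. The field names and the JSON-lines format are assumptions made for illustration; the point is only that input, retrieved context, and tool usage are captured together under one ID so a response can be reconstructed later.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class TraceRecord:
    user_input: str
    retrieved_context: List[str]      # e.g. IDs of the retrieved documents
    tool_calls: List[Dict[str, Any]]  # tool name plus arguments, in call order
    response: str
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def to_json_line(record: TraceRecord) -> str:
    """Serialize one trace as a single JSON line, suitable for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)
```

If these records contain user text, they should go through the same redaction as any other log.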

Model drift is real. Something that worked perfectly last week can silently degrade. Sample outputs in production, rerun evals regularly, and treat poor responses as bugs, not anomalies.
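
A minimal sketch of both habits, sampling production outputs for review and comparing a fresh eval run against a stored baseline, could look like this. The function names and the 0.05 tolerance are hypothetical; pick a tolerance that matches how noisy your evals are.

```python
import random
from typing import List, Sequence


def sample_for_review(outputs: Sequence[str], k: int, seed: int = 0) -> List[str]:
    """Pick up to k production outputs at random for manual or LLM review."""
    rng = random.Random(seed)  # seeded so a review batch can be reproduced
    return rng.sample(list(outputs), min(k, len(outputs)))


def has_drifted(baseline_pass_rate: float, current_pass_rate: float,
                tolerance: float = 0.05) -> bool:
    """Flag a regression when the pass rate drops by more than the tolerance."""
    return (baseline_pass_rate - current_pass_rate) > tolerance
```

A drop from 0.95 to 0.85 would be flagged here, while a drop to 0.93 would not. Flagged runs can then be filed like any other bug.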

Even a lightweight eval pipeline and basic monitoring can dramatically reduce risk.

Ship fast if you want. Just remember that cutting corners on QA often becomes the most expensive shortcut.