The public debate on artificial intelligence is shifting away from the question of whether models are impressive. The real question is increasingly whether they are reliable enough for real decision-making. Large language models can write fluently, reason plausibly, and answer convincingly. But that is precisely where the risk lies: a model can appear strong while remaining insufficiently grounded in reality, giving answers without a firm evidence basis in the data. Independent tests show that recent frontier models without web access still answer incorrectly remarkably often in difficult high-stakes settings. In the HalluHard benchmark, the hallucination rate without web access was above 50% for the tested top models from OpenAI, Anthropic, and Google.
Beyond accuracy, model reliability is becoming an increasingly important concern for decision-makers in the public sector, investment, risk management, and project contexts. NIST accordingly describes “trustworthy AI” more broadly than predictive performance alone, naming validity, reliability, transparency, and risk management among the core requirements. In Europe, the same shift can be seen in the EU AI Act, which explicitly regulates AI through a risk-based framework and imposes additional obligations around transparency and high-risk applications.
What is trustworthy AI?
For serious decision-making, trustworthy AI is best understood as a combination of five properties:
- Empirical grounding — the model stays close to observable data.
- Low assumption burden — it imposes as little additional structure as possible.
- Explainability and controllability — it is understandable why a prediction arises.
- Honest uncertainty — the model does not pretend to be more certain than is justified.
- Restraint beyond the data — it remains cautious outside the historical data space.
This definition fits well with the broader literature on interpretable machine learning. In that literature, explainability is not an optional extra, but a necessary property for responsible use in applications with major consequences.
Scientific validity and EBMAR
A scientifically strong model is characterized by high predictive value, few assumptions, little extrapolation, and limited dependence on latent variables. The main practical measure of scientific validity is therefore robust out-of-sample predictive performance. In high-stakes environments, particular emphasis should be placed on the components that make AI trustworthy. Here, we assess that through the lens of the EBMAR framework.
EBMAR stands for Evidence-Bound, Mechanism-Agnostic Reasoning. Its core idea is simple: the data provides the ground truth, everything beyond that is uncertainty. Models are therefore not answers, but hypotheses that must first prove themselves on new data. Dependence on latent variables should be minimized as much as possible and made explicit. Extrapolation beyond the data space must be explicitly bounded.
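The practical test EBMAR points to can be made concrete. The sketch below, using synthetic data and illustrative names, treats a model as a hypothesis: it only earns trust if it beats a naive baseline on data it has never seen.

```python
# Minimal out-of-sample validation sketch: fit on the past, test on the
# "future", and compare against a naive baseline. Data is synthetic and
# purely illustrative.
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "historical projects": one observable driver plus noise.
X = rng.uniform(0, 10, size=200)
y = 3.0 * X + rng.normal(0, 2, size=200)

# Chronological split: the model never sees the held-out cases.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Candidate model: ordinary least squares (few assumptions, no latent layers).
slope, intercept = np.polyfit(X_train, y_train, deg=1)
pred = slope * X_test + intercept

# Naive baseline: always predict the historical mean.
baseline = np.full_like(y_test, y_train.mean())

mae_model = np.mean(np.abs(pred - y_test))
mae_base = np.mean(np.abs(baseline - y_test))
print(f"model MAE {mae_model:.2f} vs baseline MAE {mae_base:.2f}")
```

Only if the model's out-of-sample error is clearly below the baseline's does the hypothesis survive; otherwise, in EBMAR terms, it has not yet earned its place.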
This also explains why not all model types are epistemically equally strong. Human estimates and many rule-based models often rely heavily on preselected assumptions, expert intuition, and fixed decision rules. As a result, implicit choices about drivers are baked in before the data is consulted, and those choices may not be correct. Which features truly matter is then determined not primarily through data-driven feature engineering or systematic out-of-sample validation, but through prior assumptions.
Deep learning models sit at the other end of the spectrum. They can perform strongly, but they use many hidden representations and latent layers. As a result, the direct link between input data, underlying drivers, and prediction becomes less clear. Uncertainty also becomes less directly observable. That is precisely why, in high-stakes applications, not only performance matters, but also how closely a model stays tied to the data, how much hidden structure it adds and how honestly it reflects its uncertainty.
Why deep hidden layers can be a problem
When models become very complex (with many hidden layers), it gets harder to understand how they turn inputs into predictions.
Because of that:
- It’s difficult to tell whether a prediction is really based on known data
- It’s also hard to see when the model is being used in a situation it hasn’t seen before
If the model is trained only to give the “best” answer (i.e., minimize error), it may:
- focus on being right on average
- but not recognize when it actually does not know enough to be confident
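Honest uncertainty can be engineered in rather than hoped for. One standard technique is split conformal prediction, sketched below on synthetic data: instead of a single point estimate, the model returns an interval whose width is calibrated on held-out cases. All names and figures are illustrative.

```python
# Split conformal prediction sketch: calibrate interval width on held-out
# residuals so the model states uncertainty it can actually justify.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 300)
y = 2.0 * X + rng.normal(0, 1.5, 300)

# Three disjoint roles: fitting, calibration, and the new case to predict.
X_fit, y_fit = X[:150], y[:150]
X_cal, y_cal = X[150:250], y[150:250]
x_new = 5.0

slope, intercept = np.polyfit(X_fit, y_fit, deg=1)

def predict(x):
    return slope * x + intercept

# Absolute calibration residuals give the 90% interval half-width.
scores = np.abs(y_cal - predict(X_cal))
q = np.quantile(scores, 0.9)

lo, hi = predict(x_new) - q, predict(x_new) + q
print(f"point {predict(x_new):.1f}, 90% interval [{lo:.1f}, {hi:.1f}]")
```

The point here is not the specific method but the design principle: the interval width comes from observed errors on data the model did not fit, not from the model's own self-confidence.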
Alternative: Empirically grounded AI architecture
If pure model complexity is not enough, what is needed instead? The answer lies not in less AI, but in a different structure of AI. Reliable AI for decision-making ideally has:
- an empirically validated core,
- an explainable decision layer,
- and only after that, where relevant, a context- or regime-dependent enrichment.
That is exactly what makes the RCF-AI approach interesting. In our application for dyke cost estimation, AI is used to combine three perspectives: predictive modelling, explainability, and reference class (i.e. "similar cases") benchmarking. The model uses more than 50 project features, shows predicted versus realized costs for completed projects, and automatically identifies similar historical projects based on multidimensional similarity.
That is substantively stronger than a model that only provides a point prediction. The question is then not only: what does the model predict? The question is also: what did comparable cases actually cost? This way, the model remains anchored in actual cases, while the contribution of features and feature combinations remains directly transparent.
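The idea of multidimensional similarity can be sketched in a few lines. The feature names, figures, and the three-nearest-neighbour rule below are invented for illustration, not taken from the RCF-AI model itself.

```python
# Hypothetical data-driven reference class selection: find the completed
# projects closest to a new one in standardized feature space.
import numpy as np

# Completed projects: [length_km, soil_index, urban_density] and realized cost (M EUR).
features = np.array([
    [4.0, 0.20, 0.10],
    [12.0, 0.80, 0.30],
    [5.5, 0.30, 0.20],
    [11.0, 0.70, 0.40],
    [6.0, 0.25, 0.15],
])
realized_cost = np.array([18.0, 95.0, 24.0, 88.0, 26.0])

new_project = np.array([5.8, 0.28, 0.18])

# Standardize so no single feature dominates the distance.
mu, sigma = features.mean(axis=0), features.std(axis=0)
z = (features - mu) / sigma
z_new = (new_project - mu) / sigma

# Euclidean distance in standardized space = multidimensional similarity.
dist = np.linalg.norm(z - z_new, axis=1)
nearest = np.argsort(dist)[:3]

print("reference class (row indices):", nearest)
print(f"benchmark cost range: {realized_cost[nearest].min():.0f}"
      f"-{realized_cost[nearest].max():.0f} M EUR")
```

The selected cases and their realized costs then serve as the empirical benchmark against which the model's point prediction can be checked.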
Reference class forecasting: strengths and weaknesses
Reference Class Forecasting (RCF) starts from reality, not assumptions.
Instead of modeling how things should work, it asks:
What actually happened in similar cases?
By grounding predictions in real outcomes, RCF is often more reliable than purely theoretical or assumption-driven models.
In practice, the reference class is usually selected manually. This means experts decide which cases are “similar”—often based on implicit judgment. That introduces hidden assumptions about what really drives outcomes.
AI doesn’t replace RCF—it strengthens it.
It helps make the reference class:
- more systematic (data-driven selection)
- more precise (better matching of comparable cases)
- more transparent (clearer why cases are included)
- more testable (validated against historical data)
In this way, the empirical strength of RCF is preserved while subjectivity is reduced. The outcome is a forecasting approach that is more consistent, more transparent, and more firmly grounded in data.
Explainability is not optional
In many AI applications, explainability is presented as something extra. In a truthfulness framework, that is too weak. Explainability is not cosmetic, but a control mechanism. Models in high-consequence applications must not only be accurate. They must also be transparent enough for assumptions, sensitivities, and errors to be identified and corrected.
That aligns with the RCF-AI setup. The dyke cost application explicitly includes an explainability layer with a waterfall chart and top contributing factors. Users can therefore see which features push a prediction upward or downward. As a result, the model does not remain stuck at black-box output. It becomes a testable system for decision support.
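For a linear model, the numbers behind such a waterfall chart are easy to compute: each feature's contribution to one prediction is its coefficient times the feature's deviation from the training mean. The sketch below uses invented names and coefficients; it illustrates the mechanics, not the RCF-AI model itself.

```python
# Per-prediction feature contributions for a linear model: the raw numbers
# a waterfall chart would display. All values are illustrative assumptions.
import numpy as np

feature_names = ["length_km", "soil_index", "urban_density"]
coef = np.array([6.5, 40.0, 25.0])        # fitted coefficients (assumed)
train_mean = np.array([7.7, 0.45, 0.23])  # training feature means (assumed)
base_value = 50.2                          # mean predicted cost in M EUR

x = np.array([5.8, 0.28, 0.18])            # the project being explained

contrib = coef * (x - train_mean)
prediction = base_value + contrib.sum()

# Top contributing factors, largest absolute effect first.
order = np.argsort(-np.abs(contrib))
for i in order:
    sign = "+" if contrib[i] >= 0 else "-"
    print(f"{feature_names[i]:>14}: {sign}{abs(contrib[i]):5.1f} M EUR")
print(f"base {base_value:.1f} -> prediction {prediction:.1f} M EUR")
```

Because the contributions sum exactly to the gap between the base value and the prediction, a user can audit every step from input to output, which is what makes the explanation a control mechanism rather than decoration.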
Context matters—when it adds real value
Another common mistake in AI and modelling discussions is that "more context" is automatically assumed to be better. That is not always true. Context models are useful, but only if they add something on top of the empirical core, and that added value must be visible in validation.
That is exactly why the regime-based approach in the land price risk tool is substantively relevant. There, context is not only added theoretically. A regime-based approach is explicitly used, and its added value is assessed through out-of-sample backtests. That is the right order: first an empirical core, then add context only insofar as it actually holds up.
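That ordering can be tested directly. The sketch below, on synthetic data with two invented regimes, fits a pooled empirical model first and a regime-conditional variant second, and lets the out-of-sample backtest decide whether the regimes earn their place.

```python
# Backtest sketch: does regime context improve out-of-sample error over a
# pooled model? Regimes and data are synthetic, purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 400
regime = rng.integers(0, 2, n)       # e.g. two market regimes
x = rng.uniform(0, 10, n)
# The true relationship differs by regime.
y = np.where(regime == 0, 1.0 * x, 3.0 * x) + rng.normal(0, 1, n)

train, test = slice(0, 300), slice(300, n)

def mae(pred):
    return np.mean(np.abs(pred - y[test]))

# Pooled model: ignores regimes entirely.
s, b = np.polyfit(x[train], y[train], deg=1)
mae_pooled = mae(s * x[test] + b)

# Regime-conditional model: one fit per regime.
pred = np.empty(n - 300)
for r in (0, 1):
    m_tr = regime[train] == r
    sr, br = np.polyfit(x[train][m_tr], y[train][m_tr], deg=1)
    m_te = regime[test] == r
    pred[m_te] = sr * x[test][m_te] + br

mae_regime = mae(pred)
print(f"pooled MAE {mae_pooled:.2f} vs regime MAE {mae_regime:.2f}")
```

If the regime model's held-out error is not clearly lower, the extra context structure should be dropped: the empirical core comes first, context only insofar as it holds up.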
The right lesson from the LLM discussion
Public doubt about LLMs is not only about isolated mistakes, but about the question of when a model gives confident answers without a sufficient empirical basis. HalluHard makes that concrete: without web access, more than half of the tested answers in difficult high-stakes settings were incorrect. That makes clear that high language ability is not the same as reliable decision support.
The lesson, then, is not that AI is unusable. The lesson is that different AI architectures are needed: with maximum empirical anchoring, explicit uncertainty, transparent comparability, and contextual structure that must first prove itself through data.
In that light, a system such as RCF-AI is interesting. It combines the strongest elements of AI with an empirical anchor. It does not replace Reference Class Forecasting with black-box AI, but strengthens it through better comparability, explainability, and validated context modelling.
Conclusion
The future of reliable AI does not lie in ever-larger models alone. It lies in models that better distinguish between:
- what they know,
- what they can probably estimate,
- and what they cannot claim with sufficient certainty.
That requires a shift from performance thinking to truthfulness thinking. For decision-making, that means: start with an empirically validated core, add explainability, and only allow context where it can be robustly validated out of sample. In that respect, RCF-AI is not merely an AI application. It is an example of a stronger design principle for reliable decision support.
Sources
- [1] HalluHard (2026). A Hard Multi-Turn Hallucination Benchmark.
- [2] Huang, L., et al. (2024). "A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions." ACM Transactions on Information Systems.
- [3] NIST. Artificial Intelligence Risk Management Framework (AI RMF 1.0).
- [4] Molnar, C. Interpretable Machine Learning.
- [5] European Union. Regulation (EU) 2024/1689 (Artificial Intelligence Act). Official Journal of the European Union.
- [6] Asset Mechanics (2026). AI-enhanced dyke cost estimation with RCF-AI.
- [7] Asset Mechanics (2024). Land price risk: a regime-based approach.