Trustworthy AI: why predictive power alone is not enough

The public debate on artificial intelligence is increasingly moving away from the question of whether models are impressive. The real question is whether they are reliable enough for real decision-making. Large language models can write fluently, reason plausibly, and answer convincingly. But that is precisely where the risk lies. A model can appear strong yet still be insufficiently truthful, giving confident answers without a firm basis in the data. Independent tests show that recent frontier models, when denied web access, still answer incorrectly remarkably often in difficult high-stakes settings. In the HalluHard benchmark, the hallucination rate without web access for the tested top models from OpenAI, Anthropic, and Google was above 50%.

For policy, investments, risk management, and project decisions, a different question is therefore more important than accuracy alone: how reliable is the model? NIST's AI Risk Management Framework accordingly describes trustworthy AI more broadly than predictive performance alone, naming validity, reliability, transparency, and risk management among the core requirements. In Europe, the same shift can be seen in the EU AI Act, which explicitly regulates AI through a risk-based framework and imposes additional obligations around transparency and high-risk applications.

What is truthfulness?

For serious decision-making, truthfulness is best understood as a combination of five properties:

  1. Empirical grounding — the model stays close to observable data.
  2. Low assumption burden — it imposes as little additional structure as possible.
  3. Explainability and controllability — it is understandable why a prediction arises.
  4. Honest uncertainty — the model does not pretend to be more certain than is justified.
  5. Restraint beyond the data — it remains cautious outside the historical data space.
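Property 4, honest uncertainty, can be made concrete with a split-conformal prediction interval: the interval width is calibrated on held-out residuals rather than asserted by the model. The data and the linear fit below are entirely made-up; this is a minimal sketch of the idea, not any particular tool's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: cost grows roughly linearly with size, plus noise.
size = rng.uniform(1, 10, 400)
cost = 3.0 * size + rng.normal(0, 2.0, 400)

# Split: fit the model on one half, calibrate uncertainty on the other.
fit_x, fit_y = size[:200], cost[:200]
cal_x, cal_y = size[200:], cost[200:]

# Least-squares fit on the fitting half only.
slope, intercept = np.polyfit(fit_x, fit_y, 1)
predict = lambda x: slope * x + intercept

# Split-conformal step: the 90th percentile of calibration residuals
# gives an interval half-width that is honest by construction.
residuals = np.abs(cal_y - predict(cal_x))
half_width = np.quantile(residuals, 0.9)

x_new = 5.0
lo, hi = predict(x_new) - half_width, predict(x_new) + half_width
print(f"prediction: {predict(x_new):.1f}, 90% interval: [{lo:.1f}, {hi:.1f}]")
```

The width is not a claim by the model about itself; it is read off from how wrong the model actually was on data it never fit.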

This definition fits well with the broader literature on interpretable machine learning. In that literature, explainability is not an optional extra, but a necessary property for responsible use in applications with major consequences.

Scientific validity and EBMAR

A scientifically strong model is characterized by high predictive value, few assumptions, little extrapolation, and limited dependence on latent variables. The main practical measure of scientific validity is therefore robust out-of-sample predictive performance. In high-stakes environments, truthfulness is also needed. Here, we assess that using the EBMAR framework.

EBMAR stands for Evidence-Bound, Mechanism-Agnostic Reasoning. Its core idea is simple: the data provide the ground truth; everything beyond that is uncertainty. Models are therefore not answers, but hypotheses that must first prove themselves on new data. Dependence on latent variables should be minimized as much as possible and made explicit. Extrapolation beyond the data space must be explicitly bounded.

This also makes clear why not all model types are epistemically equally strong. Human estimates and many rule-based models rely heavily on preselected assumptions, expert intuition, and fixed decision rules. Implicit choices are thus made in advance about which drivers matter, and those may not in fact be the most important ones. Which features truly matter is then determined not by data-driven feature engineering or systematic out-of-sample validation, but by prior assumptions.

Deep learning models sit at the other end of the spectrum. They can perform strongly, but they use many hidden representations and latent layers. As a result, the direct link between input data, underlying drivers, and prediction becomes less clear. Uncertainty also becomes less directly observable. That is precisely why, in high-stakes applications, not only performance matters, but also how closely a model stays tied to the data, how much hidden structure it adds, and how honestly it reflects its uncertainty.

Why deep hidden layers can be a problem

As models make greater use of deep hidden layers, the relationship between input, internal processing, and output becomes less visible. That makes it harder to tell whether a prediction is still firmly grounded in real data and when a model is operating outside its familiar domain. If a model is also trained only for “the best prediction,” information about uncertainty is easily lost. It then learns mainly which answer reduces error on average, not when it actually lacks a sufficient basis to be confident.
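One way to make "operating outside its familiar domain" operational is to compare a new input's distance to the training data against typical in-sample nearest-neighbour distances. The sketch below uses synthetic standardized features standing in for real project data; the 99th-percentile threshold is an illustrative choice, not a standard.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical training inputs: 500 projects, 2 standardized features.
train = rng.normal(0.0, 1.0, size=(500, 2))

# Distance from each training point to its nearest neighbour:
# a yardstick for what "inside the data" looks like.
diffs = train[:, None, :] - train[None, :, :]
dist = np.linalg.norm(diffs, axis=2)
np.fill_diagonal(dist, np.inf)
nn_train = dist.min(axis=1)
threshold = np.quantile(nn_train, 0.99)

def out_of_domain(x):
    """True when x lies farther from the data than almost any training point does."""
    return np.linalg.norm(train - x, axis=1).min() > threshold

print(out_of_domain(np.array([0.1, -0.2])))  # a point near the data
print(out_of_domain(np.array([8.0, 8.0])))   # a point far outside it
```

A flag like this does not fix extrapolation; it makes extrapolation visible, which is the EBMAR requirement.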

The alternative: an AI architecture with an empirical core

If pure model complexity is not enough, what is needed instead? The answer lies not in less AI, but in a different structure of AI. Reliable AI for decision-making ideally has:

  • an empirically validated core,
  • an explainable decision layer,
  • and only after that, where relevant, a context- or regime-dependent enrichment.

That is exactly what makes an approach such as RCF-AI interesting. In the Asset Mechanics application for dike cost estimation, AI is not presented as a replacement for reality. It is used to strengthen three perspectives: predictive modelling, explainability, and reference class benchmarking. The tool uses more than 50 project features, shows predicted versus realized costs for completed projects, and automatically identifies similar historical projects based on multidimensional similarity.

That is substantively stronger than a model that only provides a point prediction. The question is then not only: what does the model predict? The question is also: what did comparable cases actually cost? In this way, the model remains anchored in realized cases, while the contribution of features and feature combinations remains directly transparent.
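The reference-class step can be sketched as a nearest-neighbour lookup on standardized features: what did the most similar completed projects actually cost? The feature names, numbers, and the simple Euclidean similarity below are illustrative assumptions, not the actual RCF-AI implementation.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical completed projects: [length_km, soil_difficulty, urban_share]
features = rng.uniform(0, 1, size=(60, 3))
realized_cost = 10 + 25 * features[:, 0] + 8 * features[:, 1] + rng.normal(0, 2, 60)

def reference_class(new_project, k=5):
    """Return realized costs of the k most similar historical projects."""
    # Standardize so no single feature dominates the distance.
    mu, sigma = features.mean(axis=0), features.std(axis=0)
    z = (features - mu) / sigma
    z_new = (new_project - mu) / sigma
    order = np.argsort(np.linalg.norm(z - z_new, axis=1))
    return realized_cost[order[:k]]

costs = reference_class(np.array([0.5, 0.5, 0.5]))
print(f"comparable projects cost {costs.mean():.1f} on average "
      f"(range {costs.min():.1f} to {costs.max():.1f})")
```

The output is not a model opinion but a set of realized outcomes, which is what keeps the estimate anchored.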

Why Reference Class Forecasting is strong — and where classical RCF falls short

Reference Class Forecasting is strong because it does not begin with a theory about how the world ought to work. It begins with the question of which comparable cases have already occurred and what actually happened there. That empirical orientation makes it methodologically stronger than many purely mechanistic or speculative models.

But classical RCF also has a weakness. The reference class is often selected manually. In doing so, experts are already implicitly making assumptions about which features are the real drivers. That is precisely where AI can have legitimate added value. Not by replacing the reference class logic, but by making it more systematic, sharper, and more testable. In the RCF-AI setup, this is done through a combination of systematic feature discovery, explainability, validation, and historically comparable projects.

Why explainability here is not cosmetic

In many AI applications, explainability is presented as something extra. In a truthfulness framework, that is too weak. Explainability is not cosmetic, but a control mechanism. Models in high-consequence applications must not only be accurate. They must also be transparent enough for assumptions, sensitivities, and errors to be discussed and checked.

That aligns with the RCF-AI setup. The dike cost application explicitly includes an explainability layer with a waterfall chart and top contributing factors. Users can therefore see which features push a prediction upward or downward. As a result, the model does not remain stuck at black-box output. It becomes a testable system for decision support.
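For a linear model, such a waterfall decomposition is exact: each feature's contribution is its coefficient times the feature's deviation from the average project, and the contributions sum to the prediction. The coefficients, names, and numbers below are invented for illustration; real tools typically use SHAP-style attributions for nonlinear models.

```python
import numpy as np

# Hypothetical fitted linear cost model on three drivers.
feature_names = ["length_km", "soil_difficulty", "urban_share"]
coefs = np.array([25.0, 8.0, 4.0])
baseline = 30.0                      # prediction for the average project
means = np.array([0.5, 0.5, 0.5])    # average feature values

def waterfall(x):
    """Per-feature contributions that sum exactly to the prediction."""
    contributions = coefs * (x - means)
    prediction = baseline + contributions.sum()
    # Print top contributing factors, largest absolute effect first.
    for name, c in sorted(zip(feature_names, contributions),
                          key=lambda t: -abs(t[1])):
        print(f"{name:>16}: {c:+.1f}")
    print(f"{'prediction':>16}: {prediction:.1f}")
    return prediction, contributions

pred, contrib = waterfall(np.array([0.8, 0.4, 0.5]))
```

Because the pieces add up exactly, a reviewer can challenge any single contribution without losing track of the total.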

Context matters — but only if it proves itself first

A second mistake in many AI and modelling discussions is that “more context” is automatically seen as better. That is not always true. Context models are useful, but only if they add something on top of the empirical core. That added value must also be visible in validation.

That is exactly why the regime-based approach in the land price risk tool is substantively relevant. There, context is not only added theoretically. A regime-based approach is explicitly used, and its added value is assessed through out-of-sample backtests. That is the right order: first an empirical core, then add context only insofar as it actually holds up.
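That order can be checked with a simple walk-forward backtest: forecast each step using only past data, once with and once without the regime split, and keep the regime layer only if out-of-sample error actually improves. The series and regimes below are synthetic, and the mean-based forecaster is deliberately minimal.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical series whose level shifts with a known, alternating regime.
n = 120
regime = (np.arange(n) // 40) % 2
y = 1.0 * regime + rng.normal(0, 0.5, n)

def walk_forward_error(use_regime, warmup=60):
    """Mean absolute error of one-step-ahead forecasts using only past data."""
    errors = []
    for t in range(warmup, n):
        if use_regime:
            # Condition the forecast on past observations from the same regime.
            past = y[:t][regime[:t] == regime[t]]
            forecast = past.mean() if past.size else y[:t].mean()
        else:
            forecast = y[:t].mean()
        errors.append(abs(y[t] - forecast))
    return float(np.mean(errors))

base = walk_forward_error(use_regime=False)
with_regime = walk_forward_error(use_regime=True)
print(f"baseline MAE: {base:.2f}, regime MAE: {with_regime:.2f}")
```

If the regime variant did not beat the baseline out of sample, the honest conclusion would be to drop it, however plausible the regime story sounds.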

The right lesson from the LLM discussion

Public doubt about LLMs is not only about isolated mistakes, but about the question of when a model gives confident answers without a sufficient empirical basis. HalluHard makes that concrete: without web access, more than half of the tested answers in difficult high-stakes settings were incorrect. That makes clear that high language ability is not the same as reliable decision support.

The lesson, then, is not that AI is unusable. The lesson is that different AI architectures are needed: with maximum empirical anchoring, explicit uncertainty, transparent comparability, and contextual structure that must first prove itself through data.

In that light, a system such as RCF-AI is interesting. It combines the strongest elements of AI with an empirical anchor. It does not replace Reference Class Forecasting with black-box AI, but strengthens it through better comparability, explainability, and validated context modelling.

Conclusion

The future of reliable AI does not lie in ever-larger models alone. It lies in models that better distinguish between:

  • what they know,
  • what they can probably estimate,
  • and what they cannot claim with sufficient certainty.

That requires a shift from performance thinking to truthfulness thinking. For decision-making, that means: start with an empirically validated core, add explainability, and only allow context where it can be robustly validated out of sample. In that respect, RCF-AI is not merely an AI application. It is an example of a stronger design principle for reliable decision support.

Risk and Data scientist at Asset Mechanics | https://assetmechanics.org/