
Trustworthy AI: why predictive power alone is not enough

Estimated reading time: 8 minutes

The public debate on artificial intelligence is shifting away from the question of whether models are impressive. The real question is increasingly whether they are reliable enough for real decision-making. Large language models can write fluently, reason plausibly, and answer convincingly. But that is precisely where the risk lies: a model can appear strong while remaining poorly grounded in reality, giving answers without a firm evidence basis in the data. Independent tests show that recent frontier models without web access still answer incorrectly remarkably often in difficult, high-stakes settings. On HalluHard, the hallucination rate without web access was above 50% for the tested top models from OpenAI, Anthropic, and Google.

Beyond accuracy, model reliability is becoming an increasingly important concern for decision-makers in the public sector, investment, risk management, and project contexts. NIST accordingly describes “trustworthy AI” more broadly than predictive performance alone, naming validity, reliability, transparency, and risk management among the core requirements. In Europe, the same shift can be seen in the EU AI Act, which explicitly regulates AI through a risk-based framework and imposes additional obligations around transparency and high-risk applications.

Scientific validity and truthfulness

A scientifically strong model is often characterized by high predictive value. The main practical measure of scientific validity is therefore robust out-of-sample predictive performance.

However, in high-stakes environments, where data scarcity is often an issue, we face additional statistical pitfalls. In those settings, truthfulness requires the following properties:

  • Empirical grounding: the model stays close to observable data.
  • Restraint beyond the data: the model remains cautious outside the (historical) data space. Predictions outside the training data carry a different epistemic status than interpolated ones, and that difference should be made explicit [Molnar, Interpretable Machine Learning] (a minimal sketch of such an extrapolation check follows this list).
  • Honest uncertainty: express only as much confidence as the data justify, rather than optimizing it away in pursuit of point-prediction accuracy [Gelman et al., Bayesian Data Analysis].
  • Low assumption burden: impose as little structure as possible beyond what the data support; complex models with many latent variables fail to generalize, especially when historical cases are scarce [Hastie, Tibshirani & Friedman, The Elements of Statistical Learning].
  • Mechanism agnosticism: avoid presupposing which drivers matter. Expert-selected features and fixed causal assumptions introduce implicit choices that may not reflect reality; systematic feature discovery and out-of-sample validation should determine what actually drives outcomes [Breiman, Statistical Modeling: The Two Cultures].
  • Explainability as a control mechanism: a model that cannot be interrogated cannot be challenged or corrected, which makes explainability a precondition for responsible use rather than a cosmetic extra [Molnar; Mitchell et al., Model Cards for Model Reporting].
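To make the "restraint beyond the data" principle concrete, here is a minimal Python sketch that flags query points falling outside the observed range of the training features, so extrapolated predictions can be labelled as such. The feature names and values are hypothetical; the point is only that an extrapolation check can be explicit and cheap.

    import pandas as pd

    def extrapolation_flags(X_train: pd.DataFrame, X_query: pd.DataFrame) -> pd.DataFrame:
        """Mark, per feature, whether a query point lies outside the observed training range."""
        lo, hi = X_train.min(), X_train.max()
        outside = (X_query < lo) | (X_query > hi)
        # A prediction counts as 'extrapolated' if any feature falls outside the historical range.
        outside["extrapolated"] = outside.any(axis=1)
        return outside

    # Hypothetical project features, purely for illustration
    X_train = pd.DataFrame({"length_km": [1.2, 3.5, 7.0], "soil_class": [1, 2, 3]})
    X_query = pd.DataFrame({"length_km": [12.0], "soil_class": [2]})
    print(extrapolation_flags(X_train, X_query))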

Together these properties define what we mean by a truthful model. We refer to this set of principles collectively as Evidence-Bound, Mechanism-Agnostic Reasoning (EBMAR). In summary, its core idea is simple:

  • The data provide the ground truth; everything beyond that is uncertainty.
  • Models are therefore not answers but hypotheses that must first prove themselves on new data (a minimal validation sketch follows below).
  • Dependence on latent variables should be minimized as far as possible and made explicit.
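As a minimal sketch of what "proving itself on new data" looks like in practice, the snippet below scores a model with out-of-sample cross-validation in scikit-learn. The model choice and the synthetic data are illustrative assumptions, not the method behind any specific application.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(120, 5))               # illustrative features
    y = 2.0 * X[:, 0] + rng.normal(size=120)    # illustrative outcome

    model = GradientBoostingRegressor(random_state=0)
    # Score the model only on folds it has not seen during fitting.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
    print("out-of-sample MAE per fold:", -scores)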

This also explains why not all model types are epistemically equally strong. Human estimates and many rule-based models rely heavily on preselected assumptions, expert intuition, and fixed decision rules. As a result, implicit choices about drivers, choices that may not be entirely correct, are baked in from the start. Which features truly matter is then determined not by data-driven feature engineering or systematic out-of-sample validation, but by prior assumptions.

Deep learning models sit at the other end of the spectrum. They can perform strongly, but they use many hidden representations and latent layers. As a result, the direct link between input data, underlying drivers, and prediction becomes less clear. Uncertainty also becomes less directly observable. That is precisely why, in high-stakes applications, not only performance matters, but also how closely a model stays tied to the data, how much hidden structure it adds and how honestly it reflects its uncertainty.

Why deep hidden layers can be a problem

When models become very complex (with many hidden layers), it gets harder to understand how they turn inputs into predictions. Because of that:

  • It’s difficult to tell whether a prediction is really based on known data
  • It’s also hard to see when the model is being used in a situation it hasn’t seen before

If the model is trained only to give the “best” answer (i.e., minimize error), it may:

  • focus on being right on average,
  • but fail to recognize when it actually does not know enough to be confident (the sketch below contrasts a point prediction with an honest prediction interval).
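One way to make that limitation visible is to train, next to the point predictor, models for a lower and an upper quantile and report the resulting interval. The sketch below uses scikit-learn's quantile loss on synthetic data; it is purely illustrative.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(1)
    X = rng.uniform(0, 10, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

    point = GradientBoostingRegressor(loss="squared_error", random_state=0).fit(X, y)
    lower = GradientBoostingRegressor(loss="quantile", alpha=0.1, random_state=0).fit(X, y)
    upper = GradientBoostingRegressor(loss="quantile", alpha=0.9, random_state=0).fit(X, y)

    x_new = np.array([[5.0]])
    # A point prediction alone hides how wide the plausible range actually is.
    print("point:", point.predict(x_new)[0])
    print("80% interval:", lower.predict(x_new)[0], "to", upper.predict(x_new)[0])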

Alternative: Empirically grounded AI architecture

If pure model complexity is not enough, what is needed instead? The answer lies not in less AI, but in a different structure of AI. Reliable AI for decision-making ideally has:

  • an empirically validated core,
  • an explainable decision layer,
  • and only after that, where relevant, a context- or regime-dependent enrichment.

That is exactly what makes the RCF-AI approach interesting. In our application for dyke cost estimation, AI is used to enforce three perspectives: predictive modelling, explainability and reference class (aka “similar cases”) benchmarking. The model uses more than 50 project features, shows predicted versus realized costs for completed projects and automatically identifies similar historical projects based on multidimensional similarity.

That is substantively stronger than a model that only provides a point prediction. The question is then not only: what does the model predict? The question is also: what did comparable cases actually cost? This way, the model remains anchored in actual cases, while the contribution of features and feature combinations remains directly transparent.
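As a minimal sketch of how "similar historical projects" can be identified through multidimensional similarity, the snippet below standardizes a few features and retrieves nearest neighbours with scikit-learn. The feature names are hypothetical and the Euclidean distance on standardized features is an assumption; the actual application may weight or select features differently.

    import pandas as pd
    from sklearn.neighbors import NearestNeighbors
    from sklearn.preprocessing import StandardScaler

    # Hypothetical historical project features (illustrative only)
    projects = pd.DataFrame({
        "length_km":   [1.2, 3.5, 7.0, 2.8, 5.1],
        "soil_class":  [1, 2, 3, 2, 1],
        "urban_share": [0.1, 0.4, 0.2, 0.6, 0.3],
    }, index=["P1", "P2", "P3", "P4", "P5"])

    scaler = StandardScaler().fit(projects)
    nn = NearestNeighbors(n_neighbors=3).fit(scaler.transform(projects))

    new_project = pd.DataFrame({"length_km": [3.0], "soil_class": [2], "urban_share": [0.5]})
    _, idx = nn.kneighbors(scaler.transform(new_project))
    print("reference class:", list(projects.index[idx[0]]))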

Reference class forecasting: strengths and weaknesses

Reference Class Forecasting (RCF) starts from reality, not assumptions. Instead of modeling how things should work, it asks: What actually happened in similar cases?

By grounding predictions in real outcomes, RCF is often more reliable than purely theoretical or assumption-driven models.

In practice, the reference class is usually selected manually. This means experts decide which cases are “similar”—often based on implicit judgment. That introduces hidden assumptions about what really drives outcomes.

AI doesn’t replace RCF—it strengthens it. It helps make the reference class:

  • more systematic (data-driven selection)
  • more precise (better matching of comparable cases)
  • more transparent (clearer why cases are included)
  • more testable (validated against historical data)

In this way, one keeps the empirical strength of RCF while reducing subjectivity. The outcome is a forecasting approach that is more consistent, transparent, and grounded in data.

Explainability is not optional

In many AI applications, explainability is presented as something extra. In a truthfulness framework, that is too weak. Explainability is not cosmetic, but a control mechanism. Models in high-consequence applications must not only be accurate. They must also be transparent enough for assumptions, sensitivities, and errors to be identified and corrected.

That aligns with the RCF-AI setup. The dyke cost application explicitly includes an explainability layer with a waterfall chart and top contributing factors. Users can therefore see which features push a prediction upward or downward. As a result, the model does not remain stuck at black-box output. It becomes a testable system for decision support.
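To illustrate what such an explainability layer can look like technically, the sketch below computes per-feature contributions with SHAP and renders a waterfall chart for a single prediction. SHAP is used here as a common, illustrative choice; it is an assumption, not a statement about the exact method behind the dyke cost application.

    import numpy as np
    import shap
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(2)
    X = rng.normal(size=(150, 4))
    y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=150)

    model = GradientBoostingRegressor(random_state=0).fit(X, y)
    explainer = shap.Explainer(model, X)   # dispatches to a tree explainer for this model
    shap_values = explainer(X)

    # Waterfall of the top contributing features for one prediction
    shap.plots.waterfall(shap_values[0])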

Context matters—when it adds real value

A second mistake in many AI and modelling discussions is that “more context” is automatically seen as better. That is not always true. Context models are useful, but only if they add something on top of the empirical core. That added value must also be visible in validation.

That is exactly why the regime-based approach in land price risk is substantively relevant. There, context is not only added theoretically. A regime-based approach is explicitly used, and its added value is assessed through out-of-sample backtests. That is the right order: first an empirical core, then add context only insofar as it actually holds up.
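A minimal sketch of that order of operations: fit a baseline model and a regime-enriched variant, and keep the regime feature only if it improves out-of-sample error in a chronological backtest. The regime indicator and data below are hypothetical; they only illustrate the validation logic.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import TimeSeriesSplit

    rng = np.random.default_rng(3)
    n = 200
    X_core = rng.normal(size=(n, 3))                 # core empirical features
    regime = (np.arange(n) > n // 2).astype(float)   # hypothetical regime indicator
    y = X_core[:, 0] + 0.5 * regime + rng.normal(scale=0.5, size=n)

    def backtest(X):
        """Average out-of-sample MAE over chronological splits."""
        errs = []
        for train, test in TimeSeriesSplit(n_splits=5).split(X):
            model = GradientBoostingRegressor(random_state=0).fit(X[train], y[train])
            errs.append(mean_absolute_error(y[test], model.predict(X[test])))
        return np.mean(errs)

    print("MAE without regime:", backtest(X_core))
    print("MAE with regime:   ", backtest(np.column_stack([X_core, regime])))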

Choosing the right tool for the right problem

Public doubt about LLMs in high-stakes settings is not only about isolated mistakes. It reflects a deeper mismatch: these models are optimized for language fluency and broad generalization across massive data, not for the calibrated, auditable reasoning that consequential decisions require. HalluHard makes the performance gap concrete, but the more fundamental issue is architectural — general-purpose language models are simply not designed for the structural characteristics of high-stakes, data-scarce domains.

The lesson is therefore not that LLMs need fixing. It is that different problems require different tools. For policy, investment, risk management, and project decisions — where historical cases are limited, uncertainty must be honest, and assumptions need to be challengeable — the right architecture is one built around EBMAR principles from the ground up, not adapted from a paradigm optimized for entirely different conditions.

In that light, a system such as RCF-AI is interesting. It combines the strongest elements of AI with an empirical anchor. It does not replace Reference Class Forecasting with black-box AI, but strengthens it through better comparability, explainability, and validated context modelling.

Conclusion

The future of reliable AI does not lie in ever-larger models alone. It lies in models that better distinguish between:

  • what they know,
  • what they can probably estimate,
  • and what they cannot claim with sufficient certainty.

That requires a shift from performance thinking to truthfulness thinking. For decision-making, that means: start with an empirically validated core, add explainability, and only allow context where it can be robustly validated out of sample. In that respect, RCF-AI is not merely an AI application. It is an example of a stronger design principle for reliable decision support.

Sources

  • Breiman, L. (2001). Statistical Modeling: The Two Cultures. Statistical Science, 16(3), 199–231.
  • Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. Bayesian Data Analysis. CRC Press.
  • Hastie, T., Tibshirani, R., & Friedman, J. The Elements of Statistical Learning. Springer.
  • Mitchell, M., et al. (2019). Model Cards for Model Reporting. Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAT*).
  • Molnar, C. Interpretable Machine Learning.
  • NIST. Artificial Intelligence Risk Management Framework (AI RMF 1.0).
  • European Union. Artificial Intelligence Act (Regulation (EU) 2024/1689).