The truth behind generative AI hallucinations

Author: Paul Berenguer, Business Innovation Manager
Even the most advanced large language models generate false but plausible statements without recognising their own uncertainty. This phenomenon is known as hallucination: when there is insufficient evidence, the model speculates rather than admitting that ‘it does not know’. In the paper ‘Why Language Models Hallucinate’, OpenAI argues that these hallucinations are not a software quirk or a moral defect of the machine, but rather a consequence of the way the models are trained and evaluated. If the system is rewarded more for providing a response than for abstaining, it will tend to respond even when in doubt. It is like a multiple-choice test in which incorrect answers are not penalised and stating ‘I don’t know’ scores zero: guessing maximises the expected score.
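The incentive can be made concrete with a little arithmetic. A minimal sketch, assuming a binary grading scheme in which a correct answer earns one point and everything else earns zero (the scheme and probabilities are illustrative, not taken from the paper):

```python
# Expected score under binary grading: 1 point for a correct answer,
# 0 for a wrong answer, and 0 for saying "I don't know".
def expected_score(p_correct: float, abstain: bool) -> float:
    """Expected points for one question.

    p_correct: the model's probability of answering correctly.
    abstain: whether the model says "I don't know" instead of guessing.
    """
    if abstain:
        return 0.0          # abstention scores zero
    return p_correct * 1.0  # wrong answers are not penalised

# Even a long-shot guess beats abstaining under this scheme.
print(expected_score(0.25, abstain=False))  # 0.25
print(expected_score(0.25, abstain=True))   # 0.0
```

However small `p_correct` is, guessing has a non-negative expected score and abstaining never does better, which is exactly the incentive the article describes.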
The paper demonstrates that a model’s generative behaviour can be analysed as a binary classification problem (valid vs. invalid responses): any error rate in that classification translates into an error rate in generation. As a general premise, there are simple cases where the model effectively distinguishes correct from incorrect responses, cases where the model itself is inadequate, and cases where the facts follow no discernible pattern.
From the model’s perspective, many facts in the world are ‘arbitrary’. For instance, the title of a thesis, the date of a minor event, or a specific alphanumeric code may only appear once in the training corpus. When such singletons are frequent, the system lacks a sufficient statistical basis to generalise, increasing the temptation to guess when faced with questions involving rare events. Even if the corpus were perfect, the statistical objective of predicting the next word would not eliminate a certain percentage of errors in domains with low redundancy.
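The intuition about singletons can be illustrated by counting them directly. A hypothetical sketch over a toy corpus of key–value ‘facts’ (the function name and data are mine, not the paper’s):

```python
from collections import Counter

def singleton_rate(facts: list[str]) -> float:
    """Fraction of distinct facts that appear exactly once in the corpus.

    The paper ties the error floor for arbitrary facts to this kind of
    singleton rate; here it is simply counted on a toy example.
    """
    counts = Counter(facts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(counts)

# Toy corpus: two facts repeat, two appear only once.
corpus = ["birthday:A=1 May", "birthday:A=1 May",
          "capital:FR=Paris", "capital:FR=Paris",
          "thesis:X=Title Y", "code:Z=9f3a"]
print(singleton_rate(corpus))  # 0.5
```

When half the distinct facts occur only once, the model has no second observation against which to check them, which is the low-redundancy regime the paragraph describes.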
The second aspect of this phenomenon emerges during post-training evaluation. Most benchmarks and standard tests use a binary grading system: correct or incorrect. Responding with ‘I don’t know’ is counted as an incorrect answer, prompting models to provide an answer even when the probability of being correct is low. The result is that calibration (the correspondence between system confidence and actual accuracy) deteriorates. This bias towards responding persists even when techniques designed to reduce errors are incorporated, such as retrieval augmentation via document search or longer chains of reasoning. While these techniques are useful, if the evaluation criteria reward responding but give no credit for justified abstention, the system will continue to ‘take risks’ when the evidence is insufficient.
The central proposal of the paper is as straightforward as it is unusual: introduce explicit confidence targets and award credit for abstention when appropriate, so that instructions and metrics establish operational confidence thresholds. If the model’s probability of success does not exceed a given threshold, the expected response should be ‘I don’t know’, and this abstention should be scored neutrally or even positively, as opposed to an erroneous guess, depending on the context and the cost of the error. The aim is not to ask the system to report perfect probability numbers, but rather to ensure that its behaviour aligns with understandable and verifiable reliability goals.
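One concrete grading rule along these lines: answer only above a confidence threshold t, and penalise a wrong answer t/(1−t) points, so that a guess made at exactly the threshold breaks even with abstaining. A sketch (the function name is mine):

```python
def graded_score(p_correct: float, threshold: float) -> float:
    """Expected score for one question under threshold-based grading.

    The model answers only if its confidence exceeds the threshold;
    abstention ('I don't know') is scored neutrally (zero), and a wrong
    answer costs t/(1-t) points, so guessing at the threshold breaks even.
    """
    if p_correct <= threshold:
        return 0.0  # abstain
    penalty = threshold / (1.0 - threshold)
    return p_correct - (1.0 - p_correct) * penalty

# With t = 0.75, a wrong answer costs 3 points: only confident
# answers are worth the risk.
print(round(graded_score(0.9, 0.75), 2))  # 0.6
print(graded_score(0.5, 0.75))            # 0.0 (abstains)
```

Under this rule the model maximises its expected score by abstaining whenever its confidence falls below the stated threshold, which is the behavioural alignment the paragraph asks for.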
A threshold-based evaluation scheme reveals the trade-offs between coverage and accuracy and enables the system to adapt to the risks of each use case. Including explicit penalties for serious errors and recognising abstention as a legitimate decision encourage safer behaviours during learning and model selection. This logic is relevant not only for research but also for product engineering: it can be implemented in prompts, decision policies, and production metrics, by monitoring the abstention rate, accuracy conditional on high confidence, and the highest-impact errors.
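The coverage–accuracy trade-off can be swept directly from evaluation logs. A minimal sketch with made-up (confidence, correctness) pairs; the data and names are illustrative:

```python
def coverage_accuracy(preds, threshold):
    """preds: list of (confidence, was_correct) pairs from an eval set.

    Returns (coverage, accuracy-on-answered) at a given confidence
    threshold: raising the threshold trades coverage for accuracy.
    """
    answered = [(c, ok) for c, ok in preds if c >= threshold]
    if not answered:
        return 0.0, None  # the model abstains on everything
    coverage = len(answered) / len(preds)
    accuracy = sum(ok for _, ok in answered) / len(answered)
    return coverage, accuracy

# Hypothetical eval results: (model confidence, answer was correct).
preds = [(0.95, True), (0.9, True), (0.8, True), (0.7, False),
         (0.6, True), (0.5, False), (0.4, False), (0.3, False)]
for t in (0.0, 0.5, 0.8):
    print(t, coverage_accuracy(preds, t))
# at t=0.8: coverage 0.375, accuracy 1.0
```

Sweeping the threshold like this makes the operating points explicit, so each domain can pick the point that matches its tolerance for error.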
Instructions can be drafted to incorporate operational confidence thresholds and standard messages for abstention. Internal evaluation processes can move from a binary scoring system to one that penalises errors and recognises correct rejections. Pipelines using document retrieval should require explicit confirmation when evidence is weak. System observability should also include calibration metrics, to detect deviations and adjust thresholds using real data. None of the above implies that models ‘lie’ in the human sense, nor that hallucination is an accidental defect that will disappear with more data or computation.
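Calibration monitoring can start from something as simple as a binned calibration-error metric. A sketch of expected calibration error (ECE) over logged (confidence, outcome) pairs; the binning choice is illustrative:

```python
def expected_calibration_error(preds, n_bins=5):
    """Simple ECE: bucket predictions by confidence and average the
    gap between mean confidence and observed accuracy per bucket,
    weighted by bucket size.

    preds: list of (confidence, was_correct) pairs.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in preds:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / len(preds)) * abs(mean_conf - acc)
    return ece
```

A well-calibrated system scores near zero; a drifting one (say, 90% stated confidence but 60% observed accuracy) shows a growing gap, which is the signal for adjusting thresholds with real data.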
Furthermore, it is not suggested that innovation should slow down, but rather that incentives should be rebalanced so that technical progress leads to systems that are more responsive and recognise when they should remain silent. And what about coverage? By making thresholds explicit, we can select the most suitable operating point for each domain. In low-risk environments, we might prefer greater coverage with more relaxed thresholds; in sensitive applications, stricter thresholds and frequent abstentions are a sign of responsibility, not weakness. In all cases, the metric for success will be ‘saying something reliable’ rather than ‘saying something’.