From making sure you have selected web search to learning how to engineer your prompts, Nnamdi Odozi and the AI Ethics Working Party explore six ways to work around LLM weaknesses and limitations. This is the fifth in a series of blogs by the working party.
Every week we read of the advances being made by large language models (LLMs) and their increasing power. However, in recent months there have been some cautionary tales in the news, including the two AI-assisted reports – one from Deloitte and one from a Canadian academic team – discussed later in this piece.
LLMs have achieved remarkable progress in understanding, processing, and generating human-like language. Yet they come with clear weaknesses and deficiencies. One is the knowledge cut-off date: models are trained only on internet data up to a point in time, so they are ignorant of events that take place afterwards.
LLMs also have a limited context window (the model’s temporary working memory): because the model has no inherent memory, it can only process a certain amount of input data in one pass. Inputs larger than this cause performance to degrade, for example because the model loses track of items in the input.
Other LLM weaknesses include difficulty with numeracy and arithmetic, and hallucinations – confidently producing content that is fabricated or inaccurate.
Principle 2 of the Actuaries’ Code, ‘Competence and Care’, requires members to carry out work competently and with care. This duty does not diminish when using new technologies such as generative AI.
Similarly, Section 2.15 of the IFoA’s Ethical and Professional Guidance on Data Science and Artificial Intelligence – A Guide for Members provides, as an example of professional competence within data and modelling ethics, that “care should be taken when using third-party generative AI tools (including LLMs), in relation to veracity of output and privacy and copyright risks.”
In practical terms, this means actuaries remain responsible for the quality and reliability of any work that includes AI-generated content. Whether a model produces a report, a chart, or a paragraph of text, the member who signs off on that work must ensure its accuracy and appropriateness.
Careful checking, critical reading, and validation are therefore not optional. They are essential steps in fulfilling our professional and ethical duties, as well as sound governance and risk management practice.
The cautionary headlines are therefore reminders of our ongoing accountability to verify before we rely.
The following six techniques work reliably in practice and require no technical expertise. Most can be implemented immediately. Pick a couple to start with rather than trying all at once.
Most advanced LLMs now come with web browsing or web search mode. However, some models default to using their own internal ‘parametric’ knowledge to reduce cost and response time. This increases the likelihood of hallucination, especially where up-to-date information matters such as when the question concerns current affairs or recently released models and libraries.
Where possible, enable web search explicitly. This can often be done by selecting ‘web search mode’ in the chat interface or cueing it in the prompt with ‘search for…’. Paid tiers usually provide more reliable access. Web search also yields citations and you should open the links to see the full source.
In professional contexts, you can further reduce hallucinations by supplementing the model’s search with your own internal sources (such as company policies, research notes, or actuarial assumptions). But before doing so, always confirm compliance with your firm’s data-privacy and security policies. Indeed, many organisations restrict the AI to a narrow or specific area of access.
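For readers who access models through an API rather than a chat window, the same idea – grounding the model in material you trust – can be applied by including the source text in the prompt. Below is a minimal sketch, assuming the OpenAI Python SDK; the model name, file name and prompts are illustrative placeholders, and any internal document would first need clearance under your firm’s data-privacy policies.

```python
# Minimal sketch: grounding an answer in an internal source document.
# Assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set;
# the model name and file path are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

# An internal reference (e.g. a reserving policy extract) that has been
# cleared for use with the AI tool under your firm's data-privacy rules.
with open("reserving_policy_extract.txt", "r", encoding="utf-8") as f:
    internal_source = f.read()

question = "What discount rate does our reserving policy require for long-tail classes?"

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[
        {"role": "system",
         "content": "Answer only from the reference text provided. "
                    "If the answer is not in the reference, say you do not know."},
        {"role": "user",
         "content": f"Reference text:\n{internal_source}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```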
Many platforms offer modes that allow more deliberate, multi-step reasoning rather than producing an answer in a single pass. In practice, this means the LLM spends more time deliberating, planning, and iterating before producing a final answer. It can feel slower, but the answers are usually better grounded, better substantiated, and better thought through.
Where possible select ‘thinking/reasoning’ mode or even ‘extended thinking’ mode for nuanced or high-stakes queries or when you notice that the LLM has already come back with a hallucination. In my experience, activating ‘thinking’ mode in ChatGPT also triggers web search seemingly to feed the model with enough context for its deliberations, or perhaps because by doing so the user has signalled a greater tolerance for latency.
It also works to include the words ‘think carefully’ or ‘think hard about this’ in the prompt. If you can afford to wait 10 to 20 minutes, you could use ‘deep research’ mode, now available on most major platforms, for drafting longer research-type pieces running to thousands of words.
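If you are working through an API rather than a chat interface, the equivalent is to choose a reasoning-tuned model where your provider offers one and to include the ‘think carefully’ cue in the prompt itself. A minimal sketch, assuming the OpenAI Python SDK; the model name is an illustrative placeholder.

```python
# Minimal sketch: requesting more deliberate reasoning via the prompt
# and (where available) a reasoning-tuned model.
# Assumes the OpenAI Python SDK; the model name is an illustrative placeholder.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o3-mini",  # placeholder for a reasoning-tuned model offered by your provider
    messages=[
        {"role": "user",
         "content": "Think carefully and step by step. "
                    "Check your working before giving a final answer.\n\n"
                    "A pension scheme pays £10,000 a year for 20 years. "
                    "What is the present value at a 4% discount rate?"},
    ],
)
print(response.choices[0].message.content)
```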
For work requiring accuracy over creativity, explicitly ask the model to ‘be cautious and factual’ or ‘prioritise accuracy over creativity’. Some AI platforms, such as Google AI Studio, allow you to adjust a ‘temperature’ setting that controls how deterministic or creative the model’s responses are. Lowering the temperature makes the model favour the most probable words and phrases from its training data, which can reduce hallucinations and produce more consistent answers.
While the ChatGPT web interface does not expose this control directly, it can be prompted to behave as if operating at a lower temperature by asking for concise, factual, or cautious responses. The trade-off is that lower temperatures make the model less imaginative and more likely to reply with ‘I don’t know’ when uncertain. But this is often desirable when accuracy matters more than creativity.
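For API users, the temperature can be set directly, alongside a cautious system prompt. A minimal sketch, assuming the OpenAI Python SDK; the model name and the value of 0.2 are illustrative.

```python
# Minimal sketch: lowering temperature and asking for cautious, factual output.
# Assumes the OpenAI Python SDK; the model name and the value 0.2 are illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",      # placeholder model name
    temperature=0.2,     # lower values favour the most probable tokens
    messages=[
        {"role": "system",
         "content": "Be cautious and factual. Prioritise accuracy over creativity. "
                    "Say 'I don't know' if you are not sure."},
        {"role": "user",
         "content": "Summarise the key assumptions in a standard UK pension scheme valuation."},
    ],
)
print(response.choices[0].message.content)
```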
Simple prompting adjustments can reduce errors. For example, few-shot prompting – providing a handful of example questions and answers before your real question – helps make responses more consistent and contextually aligned, especially when you want a particular tone, structure, or reasoning method.
Few-shot prompting example:
You are an actuary explaining technical concepts to non-technical audiences. Use simple language, relevant analogies, and avoid jargon.
Here are examples of how you should respond:
Q1: What is a discount rate?
A1: It’s like comparing money today versus money tomorrow. £100 today is worth more than £100 in five years, partly because you could invest that £100 now and earn interest. The discount rate helps us put a ‘today value’ on future money, so we can compare them fairly.
Q2: What does ‘longevity risk’ mean for our pension scheme?
A2: It’s the risk that our pension scheme members live longer than we’ve planned for. Imagine budgeting for a 20-year retirement, but people actually live 25 years in retirement. We’d need more money than we set aside. That’s longevity risk.
Q3: Why do we need to hold capital reserves?
A3: Think of reserves like an emergency fund for unexpected events. Just as you might keep three months’ salary in savings for emergencies, insurers hold extra capital to cover claims that turn out higher than expected or investments that perform worse than planned.
Now answer the next question in the same style:
Q4: What is experience analysis and why do we do it?
This illustrates how the model learns by example. By seeing a few pairs that demonstrate the desired behaviour – simple language, short analogies, no jargon – it continues the pattern for the final question.
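If you call models through an API rather than pasting the prompt into a chat window, the same few-shot pattern can be expressed as alternating example messages. A minimal sketch, assuming the OpenAI Python SDK; the model name is an illustrative placeholder and the examples are abbreviated versions of those above.

```python
# Minimal sketch: the few-shot prompt above expressed as API messages.
# Assumes the OpenAI Python SDK; the model name is an illustrative placeholder.
from openai import OpenAI

client = OpenAI()

system = ("You are an actuary explaining technical concepts to non-technical "
          "audiences. Use simple language, relevant analogies, and avoid jargon.")

# Each example is a user question followed by the assistant answer we want imitated.
examples = [
    ("What is a discount rate?",
     "It's like comparing money today versus money tomorrow. £100 today is worth "
     "more than £100 in five years, partly because you could invest it now and "
     "earn interest. The discount rate puts a 'today value' on future money."),
    ("What does 'longevity risk' mean for our pension scheme?",
     "It's the risk that members live longer than we've planned for, so we'd "
     "need more money than we set aside."),
]

messages = [{"role": "system", "content": system}]
for question, answer in examples:
    messages.append({"role": "user", "content": question})
    messages.append({"role": "assistant", "content": answer})
messages.append({"role": "user",
                 "content": "What is experience analysis and why do we do it?"})

response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)
```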
Here the output of one LLM is fed into the input of a second LLM with an instruction to critique or improve on it. This ‘two-model’ workflow helps catch reasoning gaps, missing citations, or weak explanations that the first model might overlook. The second model can be asked to evaluate accuracy, clarity, or tone, or to rewrite the answer more precisely according to your criteria (for example, ‘Check this summary for factual consistency and improve conciseness’).
This approach doesn’t require two different vendors. You can even run the same model twice with different prompts, such as one acting as author and the other as reviewer. It mirrors human peer review and can significantly reduce hallucinations or stylistic inconsistencies, especially for technical or compliance-sensitive writing.
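A minimal sketch of this author/reviewer workflow, assuming the OpenAI Python SDK; the model names, prompts and review criteria are illustrative, and in practice the reviewer could be a different model or vendor.

```python
# Minimal sketch: an author/reviewer workflow using two model calls.
# Assumes the OpenAI Python SDK; model names and prompts are illustrative.
from openai import OpenAI

client = OpenAI()

task = ("Draft a one-paragraph summary of longevity risk for a pension "
        "scheme trustee board.")

# First call: the 'author' model produces a draft.
draft = client.chat.completions.create(
    model="gpt-4o",  # placeholder author model
    messages=[{"role": "user", "content": task}],
).choices[0].message.content

# Second call: the 'reviewer' model critiques and rewrites the draft.
review_prompt = (
    "You are a careful technical reviewer. Check this summary for factual "
    "consistency, flag anything that looks fabricated or unsupported, and "
    "rewrite it to be more concise:\n\n" + draft
)
reviewed = client.chat.completions.create(
    model="gpt-4o",  # could equally be a different model or vendor
    messages=[{"role": "user", "content": review_prompt}],
).choices[0].message.content

print("DRAFT:\n", draft)
print("\nREVIEWED:\n", reviewed)
```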
Models have limits on how much conversation they can handle effectively. Leading models today advertise context windows of about a million tokens (a token is around 0.75 words and so this is about 750,000 words) but in practice the effective context window might just be a third or half of that.
Once you go beyond that, performance tends to degrade. If you are working on a long thread, it can be worth starting a new session to maintain clarity. When working with large documents that might exceed the context window, for example to generate a summary, it helps to divide up the documents and send them to separate conversations.
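The divide-and-summarise approach can also be scripted. A minimal sketch, assuming the OpenAI Python SDK; the model name, chunk size and file name are illustrative, and a real workflow would split on natural section boundaries and validate the outputs.

```python
# Minimal sketch: summarising a long document in chunks to stay within the
# context window, then combining the chunk summaries.
# Assumes the OpenAI Python SDK; model name, chunk size and file name are illustrative.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"          # placeholder model name
CHUNK_CHARS = 40_000      # rough chunk size, well inside the model's context window

with open("long_report.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Naive fixed-size chunking; in practice you might split on section boundaries.
chunks = [text[i:i + CHUNK_CHARS] for i in range(0, len(text), CHUNK_CHARS)]

chunk_summaries = []
for n, chunk in enumerate(chunks, start=1):
    summary = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user",
                   "content": f"Summarise part {n} of {len(chunks)} of a report, "
                              f"keeping key figures and caveats:\n\n{chunk}"}],
    ).choices[0].message.content
    chunk_summaries.append(summary)

# Final pass: combine the partial summaries into one overall summary.
combined = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user",
               "content": "Combine these partial summaries into one coherent "
                          "summary of the full report:\n\n" + "\n\n".join(chunk_summaries)}],
).choices[0].message.content
print(combined)
```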
Alternatively, you can use an agentic AI application that manages context automatically – chunking the input, using retrieval-augmented generation, extracting and summarising key facts and storing them to disk, applying input/output templates, and building in auditing and validation steps.
Both the Deloitte and the Canadian academic reports mentioned at the start of this piece were large. The Deloitte report ran to 237 pages and around 101,000 words – about 135,000 tokens, which exceeded the 128,000-token context window of the Azure OpenAI (GPT-4o) model that was used. Less information is publicly available about the LLM employed for the Canadian report, or the workflow around it, but at 418 pages it is likely that it too exceeded the model’s context window.
Handled carefully, LLMs can be a powerful complement to actuarial judgment. They can accelerate research, summarisation, first-draft writing, and even power agentic workflows such as reserving and experience calculations.
However, they have clear weaknesses and limitations which we as actuaries, with accountability for the outputs and advice we give, need to be mindful of.
In the cases mentioned in this piece, independent checking and validation of the deliverables, together with the techniques discussed here – such as agentic workflows, web search, and careful management of the context window – would have provided valuable safeguards around the LLM’s output.