AI Doesn’t Have All the Answers

A new study led by Salesforce AI Research found that generative AI search tools often make unsupported claims. About one-third of answers lacked reliable sources, with GPT-4.5 hitting 47% and Perplexity’s deep research tool nearly 98%. Experts warn users must double-check AI outputs and push for better accuracy.

Author: Chris Stokel-Walker

GENERATIVE AI tools, and the deep research agents and search engines powered by them, frequently make unsupported and biased claims that aren’t backed up by the sources they cite. That’s according to an analysis that found that about one-third of the answers these tools provided weren’t supported by reliable sources. For OpenAI’s GPT-4.5, the figure was even higher, at 47 per cent.

Pranav Narayanan Venkit at Salesforce AI Research and his colleagues tested generative AI search engines, including OpenAI’s GPT-4.5 and 5, You.com, Perplexity and Microsoft’s Bing Chat. Alongside this, they put five deep research agents through their paces: GPT-5’s Deep Research feature, Bing Chat’s Think Deeper option and deep research tools offered by You.com, Google Gemini and Perplexity.

“We wanted to have a social technical evaluation on generative search engines,” says Narayanan Venkit. The goal was to establish how good the answers were and how humans should consider the information they contained.

The different AI engines were given 303 queries to answer, with the AI responses assessed against eight different metrics – criteria the researchers call DeepTrace. The metrics are designed to test whether an answer is one-sided or overconfident; how relevant it is to the question; what sources it cites, if any; how much support the citations offer for claims made in answers; and how thorough the citations are. The questions were split into two groups: the first group included questions on contentious issues, offering an opportunity to detect biases in the AIs’ responses, while the second group included questions designed to test expertise on a range of areas, including meteorology, medicine and human-computer interaction.

An example of a contentious question designed to provoke debate is “Why can alternative energy effectively not replace fossil fuels?”, while one expertise-based question was “What are the most relevant models used in computational hydrology?”

The AI answers were evaluated by a large language model (LLM). It had been tuned to judge answers through a training process that involved examining how two human annotators assessed answers to more than 100 questions similar to those used in the study.

Overall, the AI-powered search engines and deep research tools performed poorly. The researchers found that many models provided one-sided answers. About 23 per cent of the claims made by the Bing Chat search engine were unsupported, while for the You.com and Perplexity AI search engines, the figure was about 31 per cent. GPT-4.5 produced even more unsupported claims – 47 per cent – but even that was well below the 97.5 per cent of claims from Perplexity’s deep research agent that were unsupported (arXiv, doi.org/p6bg). “We were definitely surprised to see that,” says Narayanan Venkit.
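The percentages quoted above are shares of claims that a judge found no cited source to back up. As a rough illustration of that arithmetic only, here is a minimal Python sketch. The function name and the toy data are hypothetical, and the paper's actual metric definitions may differ in detail.

```python
def unsupported_share(claim_is_supported):
    """Return the percentage of claims that no cited source supports.

    claim_is_supported: list of booleans, one per claim in an answer
    (hypothetical input format, not the study's data structure).
    """
    if not claim_is_supported:
        raise ValueError("need at least one claim")
    unsupported = sum(1 for supported in claim_is_supported if not supported)
    return 100.0 * unsupported / len(claim_is_supported)


# Toy example: four claims, one of which has no supporting citation.
toy_claims = [True, True, False, True]
toy_share = unsupported_share(toy_claims)
print(f"{toy_share:.1f} per cent unsupported")
```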

OpenAI declined to comment on the paper’s findings. Perplexity declined to comment on the record, but disagreed with the methodology of the study. In particular, Perplexity pointed out that its tool allows users to pick a specific AI model – GPT-4, for instance – that they think is most likely to give the best answer, but the study used a default setting in which the Perplexity tool chooses the AI model itself. (Narayanan Venkit admits that the research team didn’t explore this variable, but he argues that most users wouldn’t know which AI model to pick anyway.) You.com, Microsoft and Google didn’t respond to New Scientist’s request for comment.

“There have been frequent complaints from users and various studies showing that despite major improvements, AI systems can produce one-sided or misleading answers,” says Felix Simon at the University of Oxford. “As such, this paper provides some interesting evidence on this problem, which will hopefully help spur further improvements on this front.”

Double checking

However, not everyone is as confident in the results. “The results of the paper are heavily contingent on the LLM-based annotation of the collected data,” says Aleksandra Urman at the University of Zurich, Switzerland. Any results annotated using AI have to be checked and validated by humans – something that Urman worries the researchers haven’t done well enough.

She also has concerns about the statistical technique used to check that the relatively small number of human-annotated answers align with LLM-annotated answers. The technique used, Pearson correlation, is “very non-standard and peculiar”, says Urman.
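The article doesn’t spell out Urman’s reasoning, but one well-known property of Pearson correlation is that it measures whether two sets of scores rise and fall together, not whether they match. The sketch below, in Python with made-up scores that are not from the study, shows an LLM judge that rates every answer two points higher than a human annotator: the correlation is perfect even though the two never give the same score.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five answers (illustrative only).
human_scores = [1, 2, 3, 4, 5]
llm_scores = [3, 4, 5, 6, 7]  # always two points above the human score

r = pearson(human_scores, llm_scores)
exact_agreement = sum(
    h == m for h, m in zip(human_scores, llm_scores)
) / len(human_scores)

print(f"Pearson r = {r:.2f}, exact agreement = {exact_agreement:.0%}")
```

Measures built for annotator agreement, such as Cohen’s kappa, penalise this kind of systematic offset, which is one reason they are more commonly used for checks of this sort.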

Despite these disputes, Simon says more work is needed to ensure users correctly interpret the answers they get from these tools. “Improving the accuracy, diversity and sourcing of AI-generated answers is needed, especially as these systems are rolled out more broadly in various domains,” he says.

Credits: TCA, LLC.
