Why Language Models Fail: Ways to Enhance AI for Effective Deployments

Researchers explore the potential for ‘irreversible damage’ when AI models generate false information, including issues around outdated training data and a lack of model evaluation.

Ben Wodecki, AI Business

October 20, 2023


This article was originally published on AI Business.

The reliability of large language models (LLMs) is under scrutiny as a new study explores the ability of models like ChatGPT to produce factual, trustworthy content.

A group of U.S. and Chinese researchers from Microsoft, Yale and other institutions evaluated large language models across domains ranging from health care to finance to determine their reliability. In a research survey titled Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity, they found that flawed reasoning and misinterpretation of retrieved data are among the prime causes of factual errors.

Such errors could lead to a health care chatbot providing incorrect information to a patient. Or a finance-focused AI system could provide false reporting on stocks, leading to potentially bad investments. Such blunders could harm users and even cause reputational damage to companies using them, like when Google launched Bard, only for the chatbot to produce a factual error in one of its earliest demos.

Another issue affecting the reliability of large language models outlined in the research was reliance on outdated information. An LLM's training data typically only goes up to a certain date, which forces businesses to continually update their models.


Evaluate Before Deployment

The researchers warn that factual mistakes generated by LLMs could cause “irreversible damage.” For businesses looking to deploy such systems, the authors stress the need for careful evaluation of a model’s factuality before deployment.

They wrote that utilizing evaluation techniques like FActScore would allow businesses to measure the factual accuracy of LLM-generated content. FActScore, proposed by a group of researchers from Meta, the University of Washington and the Allen Institute for AI, is a metric for scoring how factually accurate a model's output is.
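For illustration, here is a minimal sketch of the general idea behind FActScore-style evaluation: split a model's output into atomic claims and score the fraction supported by a trusted knowledge source. The `call_llm` and `supported_by_knowledge_source` helpers are hypothetical placeholders, not the official FActScore implementation.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for any chat/completion API client."""
    raise NotImplementedError

def supported_by_knowledge_source(fact: str) -> bool:
    """Placeholder: check the claim against a trusted corpus (e.g., retrieval over Wikipedia)."""
    raise NotImplementedError

def factual_precision(generation: str) -> float:
    # Ask a model to break the output into short, self-contained factual claims.
    raw = call_llm(f"List each atomic factual claim in the text, one per line:\n{generation}")
    facts = [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]
    if not facts:
        return 0.0
    # Score = fraction of claims supported by the knowledge source.
    return sum(supported_by_knowledge_source(f) for f in facts) / len(facts)
```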

The researchers also referenced the idea of using benchmarks like TruthfulQA, C-Eval and RealTimeQA that are capable of quantifying factuality. Such benchmarks are largely open source and easily accessible via GitHub, meaning businesses can use free tools to check their models.

Other strategies for improving an LLM's factuality included continual training and retrieval augmentation, which enhance the learning of long-tail knowledge in LLMs.
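As a rough illustration of retrieval augmentation, the sketch below pulls passages from a small corpus and instructs the model to answer only from that context. The toy keyword retriever, the sample documents and the `call_llm` placeholder are assumptions for demonstration, not the survey's method.

```python
# A toy in-memory corpus; in practice this would be a document or vector index.
DOCUMENTS = [
    "Acme Corp reported third-quarter revenue of $1.2 billion.",
    "Regulator X approved treatment Y for adults in 2022.",
]

def call_llm(prompt: str) -> str:
    """Placeholder for any chat/completion API client."""
    raise NotImplementedError

def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    return sorted(DOCUMENTS, key=lambda d: -len(terms & set(d.lower().split())))[:k]

def answer_with_retrieval(question: str) -> str:
    context = "\n".join(retrieve(question))
    return call_llm(
        "Answer using only the context below; say 'unknown' if it is insufficient.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```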

Multi-Agent Systems

The survey references the reliance on historical data used to train models. For a long time, the basic version of OpenAI's ChatGPT was limited to data up until September 2021, though this was brought up to January 2022 in a recent update.

An AI system that produces outputs based on outdated information could prove detrimental to users. A system unable to provide relevant information makes for an ineffective deployment. AI models rely on data to learn and make decisions; if the data they are trained on is largely outdated, their outputs may be inaccurate. Outdated information can also allow historical biases in older data to resurface in responses.

There are ways around this, such as using API calls, as seen in Meta's Toolformer, to improve a model's access to external information. However, such systems don't produce real-time information.
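In practice, that pattern looks roughly like the sketch below: the model emits a marker requesting an external lookup, the application executes the call and feeds the result back into a second prompt. The marker format and the `call_llm` and `search_wikipedia` placeholders are illustrative assumptions, not Toolformer's actual interface.

```python
import re

def call_llm(prompt: str) -> str:
    """Placeholder for any chat/completion API client."""
    raise NotImplementedError

def search_wikipedia(query: str) -> str:
    """Placeholder: call an external search/QA API and return a short snippet."""
    raise NotImplementedError

def answer_with_api_calls(question: str) -> str:
    # First pass: the model may request a lookup with a marker like [SEARCH: query].
    draft = call_llm(
        f"Answer the question. If you need to look something up, write [SEARCH: query].\n{question}"
    )
    match = re.search(r"\[SEARCH:\s*(.+?)\]", draft)
    if match is None:
        return draft
    # Execute the requested call and let the model finish with the retrieved snippet.
    snippet = search_wikipedia(match.group(1))
    return call_llm(f"{question}\nRetrieved: {snippet}\nAnswer using this information.")
```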

The paper refers to the idea of using a multi-agent approach, where multiple AI systems are used to generate an output, instead of just one. A team from MIT and Google DeepMind recently proposed such a system, dubbing the concept a “Multiagent Society.”

The researchers were in favor of using a multi-agent approach to improve a system's factuality. They wrote that engaging multiple models collaboratively or competitively could enhance factuality “through their collective prowess and help address issues like reasoning failures or forgetting of facts."

Several multi-agent system concepts were explored by the researchers to help improve LLMs. Among them was multi-agent debate, where different LLM agents debate answers and iteratively refine their responses to converge on a factual consensus. Such an approach could also improve mathematical and logical reasoning abilities.
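A compact sketch of how such a debate loop might look is below. The number of agents, the number of rounds and the prompts are illustrative assumptions, and `call_llm` again stands in for any model client rather than the paper's implementation.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for any chat/completion API client."""
    raise NotImplementedError

def debate(question: str, n_agents: int = 3, rounds: int = 2) -> str:
    # Each agent answers independently first.
    answers = [call_llm(f"Answer concisely: {question}") for _ in range(n_agents)]
    for _ in range(rounds):
        # Each agent revises after reading the other agents' current answers.
        answers = [
            call_llm(
                f"Question: {question}\n"
                "Other agents answered:\n"
                + "\n".join(a for j, a in enumerate(answers) if j != i)
                + f"\nYour previous answer: {answers[i]}\n"
                "Revise your answer, correcting any factual or reasoning errors."
            )
            for i in range(n_agents)
        ]
    # A final judge call converges on a single consensus answer.
    return call_llm("Give the single best-supported answer from:\n" + "\n".join(answers))
```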

There was also multi-role fact-checking, where separate LLM agents are tasked with generating statements or verifying outputs collaboratively to detect potential factual inaccuracies.
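In code, the role split might look like the following sketch, with one call drafting, a second fact-checking and a third revising. The prompts and the `call_llm` placeholder are assumptions rather than the survey's exact setup.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for any chat/completion API client."""
    raise NotImplementedError

def draft_and_verify(task: str) -> str:
    # Role 1: a generator agent drafts an answer.
    draft = call_llm(f"Write a factual, well-sourced answer to: {task}")
    # Role 2: a verifier agent flags claims that look unsupported or wrong.
    review = call_llm(
        "You are a fact-checker. List any claims in the text below that are "
        f"unsupported or likely wrong, each with a short reason:\n{draft}"
    )
    # Role 3: the generator revises the draft using the fact-check notes.
    return call_llm(
        f"Revise the answer to fix the flagged issues.\nAnswer:\n{draft}\n\nFact-check notes:\n{review}"
    )
```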

Such an approach is model-agnostic, meaning users could plug any existing LLM in as one of the agents in their multi-agent setup.

Domain-Specific Training

Using a more general AI model for a hyper-specific use case or sector was another issue highlighted in the research. The authors suggest that while effective at general tasks, a model like ChatGPT lacks domain-specific factual knowledge in fields such as medicine.

Domain-specific models do exist. There is Harvey, designed to automate tasks in the legal sector; Owl, built with IT tasks in mind; and BloombergGPT, trained on the financial data giant's vast trove of information.

The research states that domain-specific LLMs provide factual improvements in their outputs compared to more general LLMs. They contend that models trained using knowledge-rich data tend to be more factual.

They suggest that domain-specific training and evaluation could “transform” deployments, as in the case of HuatuoGPT, a medical language model trained on data from both ChatGPT and doctors for clinical decision-making.

The survey paper discussed several promising methods for domain-specific training and evaluation of large language models. Among them were continual pretraining, where models are routinely fed a stream of domain-specific data to keep them up to date, and supervised fine-tuning, where labeled domain-specific datasets are used to refine a model's performance on specialized tasks such as answering legal questions.
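As a hedged illustration of the supervised fine-tuning step, the sketch below fine-tunes a small open model on a handful of labeled domain examples using Hugging Face's Trainer. The GPT-2 stand-in, the toy legal Q&A pairs and the hyperparameters are assumptions for demonstration only, not the survey's recipe.

```python
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # small open model as a stand-in for a production LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical labeled domain-specific examples (legal Q&A pairs).
examples = [
    {"text": "Question: What is consideration in contract law? Answer: Something of value exchanged between the parties."},
    {"text": "Question: What does a force majeure clause excuse? Answer: Non-performance caused by events beyond a party's control."},
]
dataset = Dataset.from_list(examples).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-legal", num_train_epochs=1, per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),  # causal LM objective
)
trainer.train()
```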

The paper also outlined domain-specific benchmarks that businesses can use to evaluate such models, like CMB for health care or LawBench for legal use cases.


About the Authors

Ben Wodecki

Assistant Editor, AI Business

Ben Wodecki is assistant editor at AI Business, a publication dedicated to the latest trends in artificial intelligence.

AI Business

AI Business, an ITPro Today sister site, is the leading content portal for artificial intelligence and its real-world applications. With its exclusive access to the global c-suite and the trendsetters of the technology world, it brings readers up-to-the-minute insights into how AI technologies are transforming the global economy - and societies - today.
