Is Prompt Engineering Dead? How To Scale Enterprise GenAI Adoption

Prompt engineering is unreliable because businesses lack control over off-the-shelf LLMs. Fine-tuning with domain-specific data and using RAG provide more sustainable, accurate, and scalable AI systems for specific business needs.

Industry Perspectives

September 24, 2024

Companies have been rolling out prompt-engineering products and services to capitalize on the AI and LLMOps hype cycle. Existing tools have been retrofitted for prompt engineering and marketed as generative AI capabilities. However, this approach is proving to be a band-aid solution.

That’s because when connecting to an off-the-shelf LLM through an API, businesses lose control over the model. As models are updated, businesses cannot keep track of changes to datasets or retraining processes, leading to inconsistent responses. This unpredictability can make prompt engineering an unreliable and non-scalable solution. Imagine if you used a widely available LLM behind the scenes of your website’s support chatbot, and suddenly it didn’t understand your customers’ questions – or worse, started giving wrong answers. For example, it might be retrained to block offensive answers but begin blocking correct ones as well, or it might learn a new language and lose the ability to understand nuance. Improving one capability can often degrade another, and that’s what can happen when models are updated.

Prompt engineering attempts to set specific contexts and manipulate the query. However, the core issue remains: Readily available LLMs are too general to customize responses effectively for specific enterprise use cases. (Let’s face it: Many of them have been trained on internet content such as Reddit, including webpages, fictional books, movie scripts, medical articles, and Wikipedia, among many other sources. How many of them are relevant to your business?) Additionally, the lack of control over the model and the constant updates from LLM providers lead to maintenance challenges and unreliable, sometimes disastrous, outcomes.

Fine-Tuning: Addressing the Root Cause

To achieve accurate and reliable results, businesses must address the root cause of the problem: LLMs are generalists. Fine-tuning an LLM with domain-specific data can enable it to perform certain tasks well, but it may still fall short in other business areas. For example, the HR, Sales, and Marketing departments require different datasets. Therefore, businesses must run multiple fine-tuned LLMs to support various use cases within their organization. Fine-tuning ensures that each model is trained on a specific dataset, with specific data owners providing feedback (RLHF), thereby minimizing confusion and maximizing accuracy.
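
To make the per-department approach concrete, here is a minimal fine-tuning sketch using the open source Hugging Face transformers library. The base model, the hr_tickets.jsonl dataset file, and the hyperparameters are illustrative assumptions, not recommendations:

# Minimal sketch: fine-tuning a small base model on one department's
# data with Hugging Face transformers. Model choice, dataset file, and
# hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # small placeholder; substitute your chosen base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# One dataset per department, e.g., HR support tickets (hypothetical file
# with one {"text": ...} record per line).
dataset = load_dataset("json", data_files="hr_tickets.jsonl", split="train")
dataset = dataset.map(
    lambda row: tokenizer(row["text"], truncation=True, max_length=512),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="hr-model", num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # checkpoints for the HR-specific model land in ./hr-model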

Retrieval Augmented Generation (RAG): A Better Alternative?

In an era of fast pivots, Retrieval Augmented Generation (RAG) has gained traction as a promising alternative. RAG combines embedding models with vector databases, enabling LLMs to reference specific, easily updatable information. You can fine-tune the embedding model and update the vector database instead of retraining the LLM itself. This minimizes hallucinations and provides more accurate responses.
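
To make the moving parts concrete, here is a minimal retrieval sketch using the open source sentence-transformers and FAISS libraries. The embedding model, the two sample documents, and the commented-out answer_with_llm helper are illustrative assumptions:

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Embed a small document corpus and index it for similarity search.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["Our refund window is 30 days.", "Support hours are 9-5 ET."]
doc_vectors = embedder.encode(docs, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_vectors.shape[1])  # inner product = cosine on unit vectors
index.add(np.asarray(doc_vectors, dtype="float32"))

# Embed the query, retrieve the nearest documents, and ground the prompt.
query = "How long do customers have to request a refund?"
q_vec = embedder.encode([query], normalize_embeddings=True)
scores, ids = index.search(np.asarray(q_vec, dtype="float32"), 2)

context = "\n".join(docs[i] for i in ids[0])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# answer = answer_with_llm(prompt)  # hypothetical call to your deployed LLM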

However, RAG is not a panacea. For RAG to work effectively, the query context and data corpus must be closely aligned. Broadening the scope with larger data corpora can confuse the model, leading to nonsensical answers.

Key Considerations

If you pick a model that is a poor fit for your use case, it will struggle to determine the context of the question and will fail to retrieve a reference point for the response. In those situations, the lack of reference data needed for an accurate response contributes to a hallucination. In many situations you would prefer the model to give no response at all rather than fabricate one; in practice, however, when no exact answer is available, the model will take data points it thinks are contextually relevant to the query and return an inaccurate answer (known as a generation failure).

That’s problematic for several reasons. Wrong information may have serious consequences and require the user to double-check every answer, which invalidates the productivity gains of the LLM. Using RAG to teach a model to recognize that it does not know something can be challenging, but it is easier when the scope of knowledge is smaller, as you can more easily control the context of the answer and the response. Businesses must select suitable embedding models to ensure accurate context matching and avoid generation failures.
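
One common mitigation is to refuse to answer when retrieval confidence falls below a cutoff, as in the sketch below. The 0.6 threshold and the refusal message are illustrative assumptions you would tune against your own corpus:

NO_ANSWER = "I don't have enough information to answer that."
MIN_SIMILARITY = 0.6  # assumed cutoff; tune per corpus and embedding model

def grounded_answer(best_score: float, context: str, query: str) -> str:
    """Refuse instead of fabricating when the best retrieved match is weak."""
    if best_score < MIN_SIMILARITY:
        return NO_ANSWER  # no reliable reference: prefer silence to a wrong answer
    # Otherwise, constrain the LLM to the retrieved context (LLM call omitted).
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(grounded_answer(0.31, "", "What is our refund policy for B2B?"))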

Moreover, the cost of implementing RAG can vary. Depending on whether you are using an open source embedding model or an LLM provider, the cost of fine-tuning could range from immaterial to a hefty line on your monthly invoice. The desired speed of the LLM response is also a significant cost driver as companies consider their inference investment. The bottom line? The faster, the more expensive.

Lastly, you should consider data privacy when using an LLM provider. Ensure that the data used for fine-tuning or retraining the embedding model does not leak back to the LLM provider.

Implementing LLMs for Enterprise-Grade Scale

To leverage LLMs effectively at an enterprise scale, businesses need to understand their limitations. Prompt engineering and RAG can improve accuracy, but LLMs must be tightly limited in domain knowledge and scope. Each LLM should be trained for a specific use case, using a specific dataset with data owners providing feedback. (We’re big believers in RLHF.) This prevents the model from being confused by information from different domains. The training process for LLMs differs from traditional machine learning, requiring human oversight and quality assurance by data owners.

In this process, the data owners are the subject-matter experts, as they are the ones who can assess the accuracy of the responses and determine whether the model can be used by the business. For accurate LLM responses, you have to provide feedback to the LLM on whether the answer is bad, good, or great. And that can only be done with human oversight and quality assurance by data owners.
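
As a sketch of what capturing that feedback might look like, here is a minimal log that mirrors the bad/good/great scale above; the schema and file format are illustrative assumptions, not a specific RLHF pipeline:

import json
from dataclasses import asdict, dataclass
from enum import Enum

class Rating(str, Enum):
    BAD = "bad"
    GOOD = "good"
    GREAT = "great"

@dataclass
class Feedback:
    prompt: str
    response: str
    rating: Rating
    reviewer: str  # the data owner / subject-matter expert

def log_feedback(fb: Feedback, path: str = "feedback.jsonl") -> None:
    # Append one JSON record per review for later reward-model training.
    with open(path, "a") as f:
        f.write(json.dumps(asdict(fb)) + "\n")

log_feedback(Feedback("What is our PTO policy?", "20 days per year.",
                      Rating.GOOD, reviewer="hr_lead"))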

In addition, for developing, QAing, and monitoring the deployed model, you also need a human in the loop, preferably using a unified platform for LLM development. We’ve found that a streamlined, end-to-end workflow coordinated through a single pane of glass greatly increases the chances of accuracy and success.

Building a Future-Proof GenAI Tech Stack

That’s why I believe businesses should consider a flexible, unified platform for LLM development, one that supports the following:

  • Data ingestion (which can be challenging due to the different types of data that need to be turned into references for your RAG)

  • LLM selection and development (including having easy ways for data owners to provide feedback during model development so a winner can be selected for deployment)

  • Secure deployment that ensures those using the model are also only querying the data they are permitted to access (role-based access control; see the sketch after this list)

  • Easy updating to ensure that the RAG has accurate and up-to-date information for fetching response references

  • Automated orchestration to make it all happen frictionlessly
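
On the secure-deployment point, here is a minimal sketch of role-based filtering at retrieval time, so retrieved passages never reach the prompt unless the caller may see them. The documents, role tags, and data model are illustrative assumptions, not a specific vector database's API:

# Each document carries an allowed-roles tag (illustrative data model).
DOCS = [
    {"text": "Q3 salary bands by level", "roles": {"hr"}},
    {"text": "Public holiday calendar", "roles": {"hr", "sales", "marketing"}},
]

def retrieve_for_user(candidate_ids, user_roles):
    """Keep only retrieved documents the caller is permitted to access."""
    return [DOCS[i]["text"] for i in candidate_ids
            if DOCS[i]["roles"] & set(user_roles)]

print(retrieve_for_user([0, 1], {"sales"}))  # -> ['Public holiday calendar']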

Additionally, organizations should consider the total cost of ownership (TCO), including people, process, technology, and time, for building and supporting generative AI use cases. Avoid one-offs; think in terms of replicable processes across multiple use cases and business units.

Consider the costs and lack of control of using LLM-as-a-service versus having your own LLMs. Remember that LLM-as-a-service charges are based on tokens and context provided, which scales linearly with usage. Having an in-house team manage LLMs could provide economies of scale by sharing the investments in talent and compute.
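
A back-of-the-envelope calculation shows the linear scaling; every price and volume below is an assumption, not any provider's actual rate:

# Hypothetical per-token pricing and usage; costs scale linearly with volume.
price_per_1k_input = 0.01   # USD per 1,000 input tokens (assumed)
price_per_1k_output = 0.03  # USD per 1,000 output tokens (assumed)
queries_per_month = 500_000
tokens_in, tokens_out = 1_500, 300  # avg prompt (with RAG context) and answer

monthly_cost = queries_per_month * (
    tokens_in / 1000 * price_per_1k_input
    + tokens_out / 1000 * price_per_1k_output
)
print(f"${monthly_cost:,.0f} per month")  # $12,000; doubles if volume doubles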

Technology is constantly evolving (and more quickly than previously predicted!), so don’t put all your eggs in one basket. Ultimately, you want to be in the best position to take advantage of innovation rather than locking yourself in.

In conclusion, while prompt engineering provided a temporary solution during the initial push for LLM adoption, it has proven unsustainable and unreliable. Fine-tuning and RAG offer more robust approaches, allowing businesses to leverage LLMs effectively for specific enterprise use cases. By focusing on these methods and by including humans in the loop, I believe organizations can achieve scalable, accurate, and reliable generative AI, ultimately realizing the full potential of LLM technology.

About the Author

Noam Harel is co-founder and general manager, North America, at ClearML.
