What do we expect from a well-operationalized GenAI application? The asks are the same as for any predictive or traditional ML application: is the system accurate, efficient, effective, safe, reliable, and scalable? In this blog, we discuss some key operationalization aspects to consider when building production-ready, end-to-end LLM pipelines.
1. Designing the data pipeline
It is important to identify and map the data sources that will feed the LLM. These could be documents, code, structured data, etc. The data pipeline, including ingestion and preprocessing, then needs to be built so that this data can be supplied to the LLM as context. For certain use cases, the end user's input at inference time must be matched against existing documents to retrieve the most relevant chunks of data. These chunks are then added to the context given to the LLM, along with the original user question and the prompt. Vector databases are well suited to such pipelines because they store embeddings and support similarity search over them. Examples include Pinecone and Weaviate.
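To make the retrieval step concrete, here is a minimal sketch of matching a user question to stored chunks and building the final prompt. It is illustrative only: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and all function names and sample chunks are assumptions made for this example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_rag_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine the retrieved chunks and the user question into one prompt."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical document chunks that would normally live in a vector database.
chunks = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Shipping is free for orders above 50 dollars.",
]
top_chunks = retrieve("how long do refunds take", chunks, k=1)
prompt = build_rag_prompt("how long do refunds take", top_chunks)
print(prompt)
```

In a real pipeline, the embedding call and the similarity search would be delegated to the chosen embedding model and vector database. The control flow (embed, rank, take the top k, add to the prompt) stays the same.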
2. Dealing with ambiguity of LLMs
The output of LLMs is stochastic and may differ even with the same input, prompt, and context. This may not be acceptable in certain use cases and needs to be managed carefully by setting the model's generation hyperparameters (such as temperature and top_k). Another way is to engineer workarounds and design workflows in the application that handle such variation. This is one of the key differences between handling the output of traditional ML models and LLMs!
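The effect of temperature and top_k can be seen in a small sampling sketch. This is not how any particular model implements decoding. It is a simplified illustration over a made-up list of logits, and the function name `sample_token` is an assumption for this example.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token index from raw logits using temperature and top-k."""
    if temperature <= 0:
        # Treat zero temperature as greedy decoding: always pick the best token.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Keep only the top_k highest-scoring candidates, if requested.
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]

    # Softmax weights over the temperature-scaled logits of the candidates.
    scaled = [logits[i] / temperature for i in candidates]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(candidates, weights=weights, k=1)[0]


logits = [2.0, 1.5, 0.3, -1.0]  # made-up scores for four candidate tokens

# Greedy decoding gives the same token on every call.
greedy = [sample_token(logits, temperature=0) for _ in range(5)]

# Sampling with top_k=3 varies between calls but never picks the last token.
generator = random.Random(0)
sampled = [sample_token(logits, temperature=1.0, top_k=3, rng=generator) for _ in range(200)]
print(greedy, sorted(set(sampled)))
```

Lower temperature and smaller top_k make the output more repeatable at the cost of variety, which is the tradeoff to tune per use case.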
3. Selecting the model
The model best suited to the task needs to be selected. Generally, we need models for two purposes: embedding generation and query completion via an LLM. Various choices are available for both. Factors that affect model selection include embedding dimension, sequence length, cost, infrastructure needs, latency, and the key benchmark metric for the task. Study each candidate's benchmark results before selecting it, and refer to the relevant public leaderboard. For example, the Massive Text Embedding Benchmark (MTEB) leaderboard helps in choosing an embedding model. For Q&A use cases, check the TruthfulQA benchmark, which is designed to evaluate how truthfully language models answer a diverse set of questions.
Also, model selection needs to be revisited periodically, as newer models are released frequently! The model’s performance also needs to be reviewed periodically, as we will see in the section on review and evaluation.
Some models can be hosted by the enterprise itself (on self-owned infrastructure) and then served, while others are already hosted and are consumed via an API. For example, hosted models are available through API calls from OpenAI, Hugging Face, and Cohere, and from cloud providers such as AWS, Azure, and GCP. Open-source models such as Falcon and Llama are an option if models must be hosted on self-owned infrastructure, or if one needs to do extensive, continuous fine-tuning for various tasks and maintain local model versions. Many MLOps tools (MLflow, cloud provider services, etc.) provide features for this.
Model size is also a factor to consider. There is a tradeoff between model size and accuracy. The more parameters a model has, the larger it is, and hence the more memory and compute it needs, which increases costs. This affects latency as well.
4. Selecting the design pattern
The architecture pattern for the LLM application needs to be decided based on the type of use case, the data sources (internal or external), etc. Options include simple inference calls with minimal prompts, where the model relies on its existing knowledge, or implementations that use the in-context learning ability of LLMs, such as zero-shot and few-shot learning. Another option is a Retrieval Augmented Generation (RAG) architecture, which retrieves passages from internal documents and limits the context given to the model for a single query to those passages. This pattern also uses vector stores and is implemented as a pipeline.
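The difference between zero-shot and few-shot prompting is easiest to see in how the prompt is assembled. The sketch below is one possible layout, not a standard format. The `make_prompt` helper, the "Input:/Output:" labels, and the sentiment examples are all assumptions for illustration.

```python
def make_prompt(instruction, query, examples=()):
    """Build a zero-shot prompt (no examples) or a few-shot prompt."""
    parts = [instruction]
    # Each worked example shows the model the expected input/output pattern.
    for text, label in examples:
        parts.append(f"Input: {text}\nOutput: {label}")
    # The actual query comes last, with the output left for the model to fill.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


instruction = "Classify the sentiment of the input as positive or negative."

zero_shot = make_prompt(instruction, "The delivery was late again.")

few_shot = make_prompt(
    instruction,
    "The delivery was late again.",
    examples=[
        ("I love this product.", "positive"),
        ("The screen cracked in a week.", "negative"),
    ],
)
print(few_shot)
```

In both cases the model weights are untouched. Only the prompt changes, which makes these patterns cheap to try before moving to RAG or fine-tuning.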
In multi-step applications, AI assistants or AI agents can be used. They can perform tasks such as web search and browser-based actions, and can invoke code interpreters, SQL executors, etc. These tasks can be carried out in conjunction with the LLM output, allowing the application to perform a series of automated tasks.
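At its simplest, an agent step means the application parses the LLM output and dispatches it to a registered tool. The sketch below assumes the model has been prompted to reply with a JSON object containing "tool" and "arguments" keys. That format, the tool names, and the `run_tool_call` helper are hypothetical and not taken from any specific framework.

```python
import json


def add(a, b):
    """Example tool: add two numbers."""
    return a + b


def word_count(text):
    """Example tool: count whitespace-separated words."""
    return len(text.split())


# Registry of tools the application is willing to run on the model's behalf.
TOOLS = {"add": add, "word_count": word_count}


def run_tool_call(llm_output: str):
    """Parse a JSON tool request and run the matching registered tool."""
    request = json.loads(llm_output)
    name = request["tool"]
    if name not in TOOLS:
        # Refuse anything that is not explicitly registered.
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**request.get("arguments", {}))


# Pretend this string came back from the LLM.
llm_output = '{"tool": "add", "arguments": {"a": 2, "b": 3}}'
result = run_tool_call(llm_output)
print(result)
```

Restricting execution to an explicit registry is a deliberate choice here: the application, not the model, decides which actions are possible.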
5. Experimenting with prompts
This is one of the most vital aspects of getting the right output. Prompt evaluation involves checking whether the model generalizes from the examples given in the prompt or overfits to the few-shot examples! Further, prompt versioning and prompt optimization are required for good system output. Tools such as the MLflow Prompt Engineering UI or Azure Prompt flow can be used to run experiments, iterate, and tune the prompts.
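Prompt versioning can start very simply, even before adopting a dedicated tool. The sketch below keeps a history of prompt templates keyed by a content hash. The `PromptRegistry` class and its methods are hypothetical names for this example and do not mirror the API of MLflow or Prompt flow.

```python
import hashlib


class PromptRegistry:
    """Keep an ordered history of prompt templates per prompt name."""

    def __init__(self):
        self._versions = {}  # name -> list of (version_id, template)

    def register(self, name, template):
        """Store a template and return its short content-hash version id."""
        version_id = hashlib.sha256(template.encode("utf-8")).hexdigest()[:8]
        history = self._versions.setdefault(name, [])
        # Skip the append when the latest version is already this template.
        if not history or history[-1][0] != version_id:
            history.append((version_id, template))
        return version_id

    def latest(self, name):
        """Return (version_id, template) for the most recent version."""
        return self._versions[name][-1]

    def history(self, name):
        """Return all version ids for a prompt, oldest first."""
        return [version_id for version_id, _ in self._versions[name]]


registry = PromptRegistry()
v1 = registry.register("summarize", "Summarize: {text}")
v2 = registry.register("summarize", "Summarize in one line: {text}")
version_id, template = registry.latest("summarize")
print(version_id, template.format(text="LLMOps covers the full lifecycle."))
```

Logging the version id alongside each model response makes it possible to trace any output back to the exact prompt that produced it.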
Prompt tuning is another technique, usable only with models whose weights are accessible, such as open-source models (not models behind an API). It involves programmatically learning the embeddings of the prompt and providing them as input to the model, whose own weights stay frozen. It adds training cost, as each task requires its own learned prompt.
6. LLMOps orchestration tooling
This is one of the most critical components in building the end-to-end pipeline. The tool acts as a central orchestrator, interacting with the other components and controlling the flow of actions and data. Various LLMOps tools are available in the market. A few are open source, and many are paid. Important ones include LangChain, Auto-GPT, LlamaIndex, and AutoAI APIs. The choice should be based on ease of use and the features offered (such as a playground, extensible APIs, and how quickly the tool keeps up with changes in the industry).
7. Fine-tuning the models
When the enterprise has a good amount of data in the form of “provided input” vs “to be generated output” pairs, it can be used to fine-tune the model, which then learns from the examples given. Fine-tuning is the process of updating the weights of the model by briefly training certain layers of the neural network on the additional data only. The new model can then be hosted and used for subsequent inference. The disadvantage is the higher cost of fine-tuning and of then using that model for inference. The advantage is that one can feed in specific data and make the model behave accordingly.
But what if we have only a few examples to show the model? In such cases, we use prompting techniques such as few-shot learning and make use of in-context learning. Here the weights of the model remain unchanged, and the model follows the instructions given in the context. But this is limited by the context length (the maximum number of tokens) of the model. Also, beyond a point, the longer the input context, the greater the chance of the model overlooking a part of it! So, consider fine-tuning when the examples no longer fit comfortably in the context.
On the engineering side, fine-tuning self-hosted models also involves optimization techniques such as quantization and Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA).
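To see why LoRA is considered parameter-efficient, consider one weight matrix W of shape d_out x d_in. LoRA keeps W frozen and trains two small matrices, B (d_out x r) and A (r x d_in), so the layer computes W x + (alpha / r) * B (A x). The sketch below is a pure-Python illustration with tiny made-up matrices. The function names and the layer size used for the parameter count are assumptions for this example.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, r):
    """Frozen base output plus the scaled low-rank update B (A x)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_params(d_in, d_out, r):
    """Number of trainable values in A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)


# Tiny example: 2x2 frozen weight, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[1.0], [2.0]]
x = [1.0, 2.0]
y = lora_forward(W, A, B, x, alpha=2, r=1)
print(y)

# Parameter count for a hypothetical 4096 x 4096 layer with rank 8.
full = 4096 * 4096
adapter = lora_trainable_params(4096, 4096, r=8)
print(full, adapter, f"{100 * adapter / full:.2f}%")
```

For that hypothetical layer, the adapter trains 65,536 values instead of roughly 16.8 million, which is where the savings in memory and training cost come from.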
8. Planning for deployment/inference
Once the model is ready to be deployed in production for its end use, the design of a scalable application architecture comes into play. Model size, user load on the application, and latency and throughput requirements determine the system infrastructure (compute and memory). For example, we could select a GPU cluster to load a 70-billion-parameter model!
These factors also influence the choice of serving architecture. The application could range from simple Python script calls to full-fledged server applications (for example, built with Django and served by Gunicorn). Most providers, such as Azure or Databricks, offer hosted model serving with horizontal scalability. Frameworks such as Ray Serve help achieve optimized inference for self-hosted applications as well.
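A quick back-of-the-envelope calculation shows why a 70-billion-parameter model needs more than one GPU. The sketch below estimates memory for the weights alone from the bytes used per parameter. The 20% overhead factor (for activations, KV cache, and so on) and the 80 GB GPU size are assumptions for illustration, not measured values.

```python
import math

# Bytes needed to store one parameter at each numeric precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}


def weight_memory_gb(num_params, precision):
    """Memory for the model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def min_gpus(num_params, precision, gpu_memory_gb, overhead=1.2):
    """Rough GPU count, padding weights by an assumed overhead factor."""
    needed_gb = weight_memory_gb(num_params, precision) * overhead
    return math.ceil(needed_gb / gpu_memory_gb)


params_70b = 70e9
for precision in ("fp16", "int8", "int4"):
    gb = weight_memory_gb(params_70b, precision)
    gpus = min_gpus(params_70b, precision, gpu_memory_gb=80)
    print(f"{precision}: {gb:.0f} GB of weights, at least {gpus} x 80 GB GPU(s)")
```

At 16-bit precision the weights alone take about 140 GB, which is why quantization and multi-GPU serving come up so quickly in deployment planning.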
9. Review, evaluation, and governance
Once the application is developed, certain metrics need to be checked, and the cycle needs to be iterated until we reach the desired numbers. Quality metrics for evaluating functional aspects include “exact match” for Q&A models, ROUGE (Recall-Oriented Understudy for Gisting Evaluation) for summarization, etc. We can also evaluate the model output using another evaluator LLM. Metrics that can be measured this way include answer correctness, answer faithfulness, etc. LangChain and MLflow are some of the tools that provide a framework for evaluating model performance. Test datasets play a key role in automated measurement. Human feedback is important in evaluation as well. In cases where the LLM is used for prediction or classification, traditional metrics also apply.
Governance of the models is vital to make sure they are not misused and do not pose a threat to processes or people. Steps should be taken to minimize bias and toxicity. Also, relevant guardrails for quality, security, transparency, equity in responses, etc. should be in place to mitigate risks arising from LLM use. These guardrails can be implemented through prompt engineering, model hyperparameters, the choice of model, explicit checks, limiting the input context, etc.
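The two reference-based metrics mentioned above can be sketched in a few lines. This is a simplified illustration: the ROUGE function below computes only unigram recall (ROUGE-1 recall) with basic lowercasing and punctuation stripping, whereas established ROUGE packages offer more variants and options. The function names are assumptions for this example.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase, replace punctuation with spaces, and split into tokens."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def exact_match(prediction, reference):
    """True when prediction and reference match after normalization."""
    return normalize(prediction) == normalize(reference)


def rouge1_recall(candidate, reference):
    """Share of reference unigrams that also appear in the candidate."""
    cand = Counter(normalize(candidate))
    ref = Counter(normalize(reference))
    if not ref:
        return 0.0
    # Clip each overlap by how often the token appears in the candidate.
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    return overlap / sum(ref.values())


em = exact_match("Paris.", "paris")
recall = rouge1_recall("the cat sat", "the cat sat on the mat")
print(em, recall)
```

Running such functions over a fixed test dataset after every prompt or model change gives a repeatable number to compare across iterations.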
10. Monitoring of models
For model observability, it is important to log and track various parameters of the application after deployment, such as the number of requests, response time, token usage, costs, and CPU/GPU and memory metrics. This helps keep the system up and the infrastructure optimized. Logging the entire chain from input (prompt, context) to response output also provides transparency, auditability, and functional insights into the performance of the model.
The metrics tracked during development need to be monitored periodically using the test datasets and the new inference data available. Any drift in performance needs to be flagged using an alerting mechanism. Alerts triggered when metrics cross certain thresholds are an effective way to point to aberrations.
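A minimal sketch of logging the chain from input to output might look like the following. The `fake_model` function is a hypothetical stand-in for a real LLM call, the in-memory list stands in for a log store, and the token figure is a crude word count rather than a real tokenizer count.

```python
import json
import time


def logged_call(model_fn, prompt, context, log):
    """Call the model and append one structured record to the log."""
    start = time.perf_counter()
    response = model_fn(prompt, context)
    latency_s = time.perf_counter() - start
    log.append({
        "timestamp": time.time(),
        "prompt": prompt,
        "context": context,
        "response": response,
        "latency_s": round(latency_s, 4),
        # Crude proxy: whitespace-separated words, not real tokenizer counts.
        "approx_tokens": len(f"{prompt} {context} {response}".split()),
    })
    return response


def fake_model(prompt, context):
    """Hypothetical stand-in for a real LLM call."""
    return "Refunds take 5 business days."


request_log = []
answer = logged_call(
    fake_model,
    "How long do refunds take?",
    "Refunds: 5 business days.",
    request_log,
)
print(json.dumps(request_log[-1], indent=2))
```

Keeping each request as one structured record makes it straightforward to aggregate counts, latency, and usage later, and to audit individual responses.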
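Threshold-based alerting on metric drift can be sketched as a comparison between baseline values from development and the latest measurements. The metric names, values, and thresholds below are made up for illustration, and `check_drift` is a hypothetical helper. It assumes higher metric values are better.

```python
def check_drift(baseline, current, max_drop):
    """Return an alert for every metric that fell by more than its threshold."""
    alerts = []
    for metric, base_value in baseline.items():
        value = current.get(metric)
        if value is None:
            continue  # metric not measured in this run
        drop = base_value - value
        if drop > max_drop.get(metric, 0.0):
            alerts.append({
                "metric": metric,
                "baseline": base_value,
                "current": value,
                "drop": round(drop, 4),
            })
    return alerts


# Made-up numbers: baseline from development, current from a recent run.
baseline = {"exact_match": 0.82, "faithfulness": 0.90}
current = {"exact_match": 0.80, "faithfulness": 0.78}
thresholds = {"exact_match": 0.05, "faithfulness": 0.05}

alerts = check_drift(baseline, current, thresholds)
for alert in alerts:
    print(f"ALERT: {alert['metric']} dropped by {alert['drop']}")
```

In practice the alert would be routed to whatever notification channel the team already uses. The point is that the thresholds are explicit and reviewed, not implicit.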
Closing notes
It is important for enterprises to have a mechanism in place to ensure diligent implementation of the steps above. While this is a multi-team effort (DevOps, infrastructure, business, etc.), a centralized and homogeneous approach to managing the lifecycle of all LLM applications, right from pilot to post-production, is recommended. Finally, as the GenAI space evolves rapidly, it is important to evolve these steps along with it to reap the maximum and timely benefits of GenAI.