Overcome Failing Document Ingestion & RAG Strategies with Agentic Knowledge Distillation
https://towardsdatascience.com/overcome-failing-document-ingestion-rag-strategies-with-agentic-knowledge-distillation/
Introducing the pyramid search approach

Introduction

Many generative AI use cases still revolve around Retrieval Augmented Generation (RAG), yet consistently fall short of user expectations. Despite the growing body of research on RAG improvements and even adding Agents into the process, many solutions still fail to return exhaustive results, miss information that is critical but infrequently mentioned in the documents, require multiple search iterations, and generally struggle to reconcile key themes across multiple documents. To top it all off, many implementations still rely on cramming as much “relevant” information as possible into the model’s context window alongside detailed system and user prompts. Reconciling all this information often exceeds the model’s cognitive capacity and compromises response quality and consistency.

This is where our Agentic Knowledge Distillation + Pyramid Search Approach comes into play. Instead of chasing the best chunking strategy, retrieval algorithm, or inference-time reasoning method, my team, Jim Brown, Mason Sawtell, Sandi Besen, and I, take an agentic approach to document ingestion.

We leverage the full capability of the model at ingestion time to focus exclusively on distilling and preserving the most meaningful information from the document dataset. This fundamentally simplifies the RAG process by allowing the model to direct its reasoning abilities toward addressing the user/system instructions rather than struggling to understand formatting and disparate information across document chunks. 

We specifically target high-value questions that are often difficult to evaluate because they have multiple correct answers or solution paths. These cases are where traditional RAG solutions struggle most and where existing RAG evaluation datasets are largely insufficient for testing this problem space. For our research implementation, we downloaded annual and quarterly reports from the last year for the 30 companies in the DOW Jones Industrial Average. These documents can be found through the SEC EDGAR website. The information on EDGAR can be downloaded for free or queried through EDGAR public searches. See the SEC privacy policy for additional details; information on the SEC website is “considered public information and may be copied or further distributed by users of the web site without the SEC’s permission”. We selected this dataset for two key reasons: first, it falls outside the knowledge cutoff for the models evaluated, ensuring that the models cannot respond to questions based on their knowledge from pre-training; second, it’s a close approximation of real-world business problems while allowing us to discuss and share our findings using publicly available data. 

While typical RAG solutions excel at factual retrieval where the answer is easily identified in the document dataset (e.g., “When did Apple’s annual shareholder’s meeting occur?”), they struggle with nuanced questions that require a deeper understanding of concepts across documents (e.g., “Which of the DOW companies has the most promising AI strategy?”). Our Agentic Knowledge Distillation + Pyramid Search Approach addresses these types of questions with much greater success compared to other standard approaches we tested and overcomes limitations associated with using knowledge graphs in RAG systems. 

In this article, we’ll cover how our knowledge distillation process works, key benefits of this approach, examples, and an open discussion on the best way to evaluate these types of systems where, in many cases, there is no singular “right” answer.

Building the pyramid: How Agentic Knowledge Distillation works

Image by author and team depicting the pyramid structure for document ingestion; the robots represent agents building the pyramid.

Overview

Our knowledge distillation process creates a multi-tiered pyramid of information from the raw source documents. Our approach is inspired by the image pyramids used in deep learning computer vision tasks, which allow a model to analyze an image at multiple scales. We take the contents of the raw document, convert it to markdown, and distill the content into a list of atomic insights, related concepts, document abstracts, and general recollections/memories. During retrieval, it’s possible to access any or all levels of the pyramid to respond to the user request. 

How to distill documents and build the pyramid: 

  1. Convert documents to Markdown: Convert all raw source documents to Markdown. We’ve found that models process markdown best for this task compared to other formats like JSON, and it is more token-efficient. We used Azure Document Intelligence to generate the markdown for each page of the document, but there are many open-source libraries, like MarkItDown, that do the same thing. Our dataset included 331 documents and 16,601 pages. 
  2. Extract atomic insights from each page: We process documents using a two-page sliding window, which allows each page to be analyzed twice (a sketch of this sliding-window loop appears after this list). This gives the agent the opportunity to correct any potential mistakes when processing the page initially. We instruct the model to create a numbered list of insights that grows as it processes the pages in the document. The agent can overwrite insights from the previous page if they were incorrect since it sees each page twice. We instruct the model to extract insights in simple sentences following the subject-verb-object (SVO) format and to write sentences as if English is the second language of the user. This significantly improves performance by encouraging clarity and precision. Rolling over each page multiple times and using the SVO format also solves the disambiguation problem, which is a huge challenge for knowledge graphs. The insight generation step is also particularly helpful for extracting information from tables since the model captures the facts from the table in clear, succinct sentences. Our dataset produced 216,931 total insights, about 13 insights per page and 655 insights per document.
  3. Distilling concepts from insights: From the detailed list of insights, we identify higher-level concepts that connect related information about the document. This step significantly reduces noise and redundant information in the document while preserving essential information and themes. Our dataset produced 14,824 total concepts, about 1 concept per page and 45 concepts per document. 
  4. Creating abstracts from concepts: Given the insights and concepts in the document, the LLM writes an abstract that appears both better than any abstract a human would write and more information-dense than any abstract present in the original document. The LLM-generated abstract provides incredibly comprehensive knowledge about the document in a small number of tokens that carry a significant amount of information. We produce one abstract per document, 331 total.
  5. Storing recollections/memories across documents: At the top of the pyramid we store critical information that is useful across all tasks. This can be information that the user shares about the task or information the agent learns about the dataset over time by researching and responding to tasks. For example, we can store the current 30 companies in the DOW as a recollection since this list is different from the 30 companies in the DOW at the time of the model’s knowledge cutoff. As we conduct more and more research tasks, we can continuously improve our recollections and maintain an audit trail of which documents these recollections originated from. For example, we can keep track of AI strategies across companies, where companies are making major investments, etc. These high-level connections are super important since they reveal relationships and information that are not apparent in a single page or document.
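Here is a minimal sketch of the sliding-window insight extraction from step 2, using the OpenAI Python client for illustration. The prompt wording, model name, and the way the running insight list is passed back to the model are simplified assumptions, not our production pipeline.

import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

SYSTEM_PROMPT = (
    'You maintain a numbered list of atomic insights for a document. '
    'Each insight is one simple subject-verb-object sentence, written as if English '
    'is the reader\'s second language. You may correct or overwrite insights from '
    'earlier pages if they turn out to be wrong.'
)

def extract_insights(pages: list[str]) -> str:
    """Two-page sliding window: each page is seen twice, giving the model a chance to self-correct."""
    insights = ''
    for i in range(len(pages) - 1):
        window = pages[i] + '\n\n' + pages[i + 1]
        response = client.chat.completions.create(
            model='gpt-4o-2024-08-06',
            messages=[
                {'role': 'system', 'content': SYSTEM_PROMPT},
                {'role': 'user', 'content': (
                    f'Current insight list:\n{insights}\n\n'
                    f'Pages:\n{window}\n\n'
                    'Return the full, updated numbered list of insights.'
                )},
            ],
        )
        insights = response.choices[0].message.content  # the model returns the updated list
    return insights

The distilled insights from this step then become the input to the concept-distillation step described above.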
Sample subset of insights extracted from IBM 10Q, Q3 2024 (page 4)

We store the text and embeddings for each layer of the pyramid (pages and up) in Azure PostgreSQL. We originally used Azure AI Search, but switched to PostgreSQL for cost reasons. This required us to write our own hybrid search function since PostgreSQL doesn’t yet natively support this feature. This implementation would work with any vector database or vector index of your choosing. The key requirement is to store and efficiently retrieve both text and vector embeddings at any level of the pyramid. 
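As an illustration of the storage layer, here is a rough sketch of what a hand-rolled hybrid search over one pyramid level could look like with pgvector and PostgreSQL full-text search. The table and column names, the embedding model, and the naive score fusion are assumptions for this example; our actual schema and ranking logic differ.

import os
import psycopg
from openai import OpenAI

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

# Assumed table: insights(id bigint, content text, content_tsv tsvector, embedding vector(1536))
HYBRID_SQL = """
WITH vector_hits AS (
    SELECT id, content, 1 - (embedding <=> %(qvec)s::vector) AS vec_score
    FROM insights
    ORDER BY embedding <=> %(qvec)s::vector
    LIMIT 50
),
text_hits AS (
    SELECT id, content, ts_rank(content_tsv, plainto_tsquery('english', %(qtext)s)) AS txt_score
    FROM insights
    WHERE content_tsv @@ plainto_tsquery('english', %(qtext)s)
    ORDER BY txt_score DESC
    LIMIT 50
)
SELECT COALESCE(v.id, t.id) AS id,
       COALESCE(v.content, t.content) AS content,
       COALESCE(v.vec_score, 0) + COALESCE(t.txt_score, 0) AS score  -- naive score fusion
FROM vector_hits v
FULL OUTER JOIN text_hits t ON v.id = t.id
ORDER BY score DESC
LIMIT 20;
"""

def hybrid_search(query: str) -> list[tuple]:
    # Embed the query with the same embedding model assumed at ingestion time.
    emb = client.embeddings.create(model='text-embedding-3-small', input=query).data[0].embedding
    qvec = '[' + ','.join(str(x) for x in emb) + ']'  # pgvector accepts this bracketed literal, cast with ::vector
    with psycopg.connect(os.getenv('POSTGRES_DSN')) as conn:
        return conn.execute(HYBRID_SQL, {'qvec': qvec, 'qtext': query}).fetchall()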

This approach essentially captures the essence of a knowledge graph but stores the information in natural language, the way an LLM natively wants to interact with it, and is more token-efficient at retrieval. We also let the LLM pick the terms used to categorize each level of the pyramid; this seemed to let the model decide for itself the best way to describe and differentiate between the information stored at each level. For example, the LLM preferred “insights” to “facts” as the label for the first level of distilled knowledge. Our goal in doing this was to better understand how an LLM thinks about the process by letting it decide how to store and group related information. 

Using the pyramid: How it works with RAG & Agents

At inference time, both traditional RAG and agentic approaches benefit from the pre-processed, distilled information ingested in our knowledge pyramid. The pyramid structure allows for efficient retrieval both in the traditional RAG case, where only the top X related pieces of information are retrieved, and in the agentic case, where the agent iteratively plans, retrieves, and evaluates information before returning a final response. 

The benefit of the pyramid approach is that information at any and all levels of the pyramid can be used during inference. For our implementation, we used PydanticAI to create a search agent that takes in the user request, generates search terms, explores ideas related to the request, and keeps track of information relevant to the request. Once the search agent determines there’s sufficient information to address the user request, the results are re-ranked and sent back to the LLM to generate a final reply. Our implementation allows a search agent to traverse the information in the pyramid as it gathers details about a concept/search term. This is similar to walking a knowledge graph, but in a way that’s more natural for the LLM since all the information in the pyramid is stored in natural language.

Depending on the use case, the agent could access information at all levels of the pyramid or only at specific levels (e.g., retrieving information only from the concepts). For our experiments, we did not retrieve raw page-level data since we wanted to focus on token efficiency and found the LLM-generated information for the insights, concepts, abstracts, and recollections was sufficient for completing our tasks. In theory, the agent could also have access to the page data; this would provide additional opportunities for the agent to re-examine the original document text, but it would also significantly increase the total tokens used. 
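For our implementation, we used PydanticAI to wire this up. The snippet below is a rough sketch of how a pyramid search tool can be exposed to such an agent; the exact PydanticAI API may vary between versions, the system prompt is illustrative, and it reuses the hypothetical hybrid_search function from the storage sketch above rather than our production retrieval code.

from pydantic_ai import Agent, RunContext

# Illustrative system prompt; a real agent's instructions would be more detailed.
search_agent = Agent(
    'openai:gpt-4o',
    system_prompt=(
        'You research user requests against a knowledge pyramid of insights, '
        'concepts, abstracts, and recollections. Call the search tool with focused '
        'terms, keep track of relevant facts, and stop once you have enough '
        'information to answer.'
    ),
)

@search_agent.tool
def search_pyramid(ctx: RunContext[None], query: str) -> str:
    """Hybrid search over the pyramid layers (reuses the hybrid_search sketch above)."""
    rows = hybrid_search(query)
    return '\n'.join(content for _id, content, _score in rows)

result = search_agent.run_sync('Which of the DOW companies has the most promising AI strategy?')
print(result.output)  # `result.data` in older PydanticAI releases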

Here is a high-level visualization of our Agentic approach to responding to user requests:

Image created by author and team providing an overview of the agentic research & response process

Results from the pyramid: Real-world examples

To evaluate the effectiveness of our approach, we tested it against a variety of question categories, including typical fact-finding questions and complex cross-document research and analysis tasks. 

Fact-finding (spear fishing): 

These tasks require identifying specific information or facts that are buried in a document. These are the types of questions typical RAG solutions target but often require many searches and consume lots of tokens to answer correctly. 

Example task: “What was IBM’s total revenue in the latest financial reporting?”

Example response using the pyramid approach: “IBM’s total revenue for the third quarter of 2024 was $14.968 billion [ibm-10q-q3-2024.pdf, pg. 4]”

Total tokens used to research and generate response

This result is correct (human-validated) and was generated using only 9,994 total tokens, with 1,240 tokens in the generated final response. 

Complex research and analysis: 

These tasks involve researching and understanding multiple concepts to gain a broader understanding of the documents and make inferences and informed assumptions based on the gathered facts.

Example task: “Analyze the investments Microsoft and NVIDIA are making in AI and how they are positioning themselves in the market. The report should be clearly formatted.”

Example response:

Response generated by the agent analyzing AI investments and positioning for Microsoft and NVIDIA.

The result is a comprehensive report that executed quickly and contains detailed information about each of the companies. 26,802 total tokens were used to research and respond to the request with a significant percentage of them used for the final response (2,893 tokens or ~11%). These results were also reviewed by a human to verify their validity.

Snippet indicating total token usage for the task

Example task: “Create a report on analyzing the risks disclosed by the various financial companies in the DOW. Indicate which risks are shared and unique.”

Example response:

Part 1 of response generated by the agent on disclosed risks.
Part 2 of response generated by the agent on disclosed risks.

Similarly, this task was completed in 42.7 seconds and used 31,685 total tokens, with 3,116 tokens used to generate the final report. 

Snippet indicating total token usage for the task

These results for both fact-finding and complex analysis tasks demonstrate that the pyramid approach efficiently creates detailed reports with low latency using a minimal amount of tokens. The tokens used for the tasks carry dense meaning with little noise, allowing for high-quality, thorough responses across tasks.

Benefits of the pyramid: Why use it?

Overall, we found that our pyramid approach provided a significant boost in response quality and overall performance for high-value questions. 

Some of the key benefits we observed include: 

  • Reduced model’s cognitive load: When the agent receives the user task, it retrieves pre-processed, distilled information rather than the raw, inconsistently formatted, disparate document chunks. This fundamentally improves the retrieval process since the model doesn’t waste its cognitive capacity on trying to break down the page/chunk text for the first time. 
  • Superior table processing: By breaking down table information and storing it in concise but descriptive sentences, the pyramid approach makes it easier to retrieve relevant information at inference time through natural language queries. This was particularly important for our dataset since financial reports contain lots of critical information in tables. 
  • Improved response quality for many types of requests: The pyramid enables more comprehensive, context-aware responses to both precise, fact-finding questions and broad, analysis-based tasks that involve many themes across numerous documents. 
  • Preservation of critical context: Since the distillation process identifies and keeps track of key facts, important information that might appear only once in the document is easier to maintain. For example, noting that all tables are represented in millions of dollars or in a particular currency. Traditional chunking methods often cause this type of information to slip through the cracks. 
  • Optimized token usage, memory, and speed: By distilling information at ingestion time, we significantly reduce the number of tokens required during inference, are able to maximize the value of information put in the context window, and improve memory use. 
  • Scalability: Many solutions struggle to perform as the size of the document dataset grows. This approach provides a much more efficient way to manage a large volume of text by only preserving critical information. It also allows for more efficient use of the LLM’s context window by sending it only useful, clear information.
  • Efficient concept exploration: The pyramid enables the agent to explore related information similar to navigating a knowledge graph, but does not require ever generating or maintaining relationships in the graph. The agent can use natural language exclusively and keep track of important facts related to the concepts it’s exploring in a highly token-efficient and fluid way. 
  • Emergent dataset understanding: An unexpected benefit of this approach emerged during our testing. When asking questions like “what can you tell me about this dataset?” or “what types of questions can I ask?”, the system is able to respond and suggest productive search topics because it has a more robust understanding of the dataset context by accessing higher levels in the pyramid like the abstracts and recollections. 

Beyond the pyramid: Evaluation challenges & future directions

Challenges

While the results we’ve observed when using the pyramid search approach have been nothing short of amazing, finding ways to establish meaningful metrics to evaluate the entire system both at ingestion time and during information retrieval is challenging. Traditional RAG and Agent evaluation frameworks often fail to address nuanced questions and analytical responses where many different responses are valid.

Our team plans to write a research paper on this approach in the future, and we are open to any thoughts and feedback from the community, especially when it comes to evaluation metrics. Many of the existing datasets we found were focused on evaluating RAG use cases within one document or precise information retrieval across multiple documents rather than robust concept and theme analysis across documents and domains. 

The main use cases we are interested in relate to broader questions that are representative of how businesses actually want to interact with GenAI systems. For example, “tell me everything I need to know about customer X” or “how do the behaviors of Customer A and B differ? Which am I more likely to have a successful meeting with?”. These types of questions require a deep understanding of information across many sources. The answers to these questions typically require a person to synthesize data from multiple areas of the business and think critically about it. As a result, the answers to these questions are rarely written or saved anywhere which makes it impossible to simply store and retrieve them through a vector index in a typical RAG process. 

Another consideration is that many real-world use cases involve dynamic datasets where documents are consistently being added, edited, and deleted. This makes it difficult to evaluate and track what a “correct” response is since the answer will evolve as the available information changes. 

Future directions

In the future, we believe that the pyramid approach can address some of these challenges by enabling more effective processing of dense documents and storing learned information as recollections. However, tracking and evaluating the validity of the recollections over time will be critical to the system’s overall success and remains a key focus area for our ongoing work. 

When applying this approach to organizational data, the pyramid process could also be used to identify and assess discrepancies across areas of the business. For example, uploading all of a company’s sales pitch decks could surface where certain products or services are being positioned inconsistently. It could also be used to compare insights extracted from various line of business data to help understand if and where teams have developed conflicting understandings of topics or different priorities. This application goes beyond pure information retrieval use cases and would allow the pyramid to serve as an organizational alignment tool that helps identify divergences in messaging, terminology, and overall communication. 

Conclusion: Key takeaways and why the pyramid approach matters

The knowledge distillation pyramid approach is significant because it leverages the full power of the LLM at both ingestion and retrieval time. Our approach allows you to store dense information in fewer tokens, which has the added benefit of reducing noise in the dataset at inference. Our approach also runs very quickly and is incredibly token-efficient: we are able to generate responses within seconds, explore potentially hundreds of searches, and, on average, use <40K tokens for the entire search, retrieval, and response generation process (this includes all the search iterations!). 

We find that the LLM is much better at writing atomic insights as sentences and that these insights effectively distill information from both text-based and tabular data. This distilled information, written in natural language, is very easy for the LLM to understand and navigate at inference since it does not have to expend unnecessary energy reasoning about and breaking down document formatting or filtering through noise.

The ability to retrieve and aggregate information at any level of the pyramid also provides significant flexibility to address a variety of query types. This approach offers promising performance for large datasets and enables high-value use cases that require nuanced information retrieval and analysis. 


Note: The opinions expressed in this article are solely my own and do not necessarily reflect the views or policies of my employer.

Interested in discussing further or collaborating? Reach out on LinkedIn!

Generative AI Is Declarative
https://towardsdatascience.com/generative-ai-is-declarative/
And how to order a cheeseburger with an LLM

ChatGPT launched in 2022 and kicked off the generative AI boom. In the two years since, academics, technologists, and armchair experts have written libraries’ worth of articles on the technical underpinnings of generative AI and about the potential capabilities of both current and future generative AI models.

Surprisingly little has been written about how we interact with these tools—the human-AI interface. The point where we interact with AI models is at least as important as the algorithms and data that create them. “There is no success where there is no possibility of failure, no art without the resistance of the medium” (Raymond Chandler). In that vein, it’s useful to examine human-AI interaction and the strengths and weaknesses inherent in that interaction. If we understand the “resistance in the medium” then product managers can make smarter decisions about how to incorporate generative AI into their products. Executives can make smarter decisions about what capabilities to invest in. Engineers and designers can build around the tools’ limitations and showcase their strengths. Everyday people can know when to use generative AI and when not to.

How to order a cheeseburger with AI

Imagine walking into a restaurant and ordering a cheeseburger. You don’t tell the chef how to grind the beef, how hot to set the grill, or how long to toast the bun. Instead, you simply describe what you want: “I’d like a cheeseburger, medium rare, with lettuce and tomato.” The chef interprets your request, handles the implementation, and delivers the desired outcome. This is the essence of declarative interaction—focusing on the what rather than the how.

Now, imagine interacting with a Large Language Model (LLM) like ChatGPT. You don’t have to provide step-by-step instructions for how to generate a response. Instead, you describe the result you’re looking for: “A user story that lets us implement A/B testing for the Buy button on our website.” The LLM interprets your prompt, fills in the missing details, and delivers a response. Just like ordering a cheeseburger, this is a declarative mode of interaction.

Explaining the steps to make a cheeseburger is an imperative interaction. Our LLM prompts sometimes feel imperative. We might phrase our prompts like a question: ”What is the tallest mountain on earth?” This is equivalent to describing “the answer to the question ‘What is the tallest mountain on earth?’” We might phrase our prompt as a series of instructions: ”Write a summary of the attached report, then read it as if you are a product manager, then type up some feedback on the report.” But, again, we’re describing the result of a process with some context for what that process is. In this case, it is a sequence of descriptive results—the report then the feedback.

This is a more useful way to think about LLMs and generative AI. In some ways it is more accurate; the neural network model behind the curtain doesn’t explain why or how it produced one output instead of another. More importantly though, the limitations and strengths of generative AI make more sense and become more predictable when we think of these models as declarative.

LLMs as a declarative mode of interaction

Computer scientists use the term “declarative” to describe coding languages. SQL is one of the most common. The code describes the output table and the procedures in the database figure out how to retrieve and combine the data to produce the result. LLMs share many of the benefits of declarative languages like SQL or declarative interactions like ordering a cheeseburger.

  1. Focus on desired outcome: Just as you describe the cheeseburger you want, you describe the output you want from the LLM. For example, “Summarize this article in three bullet points” focuses on the result, not the process.
  2. Abstraction of implementation: When you order a cheeseburger, you don’t need to know how the chef prepares it. When submitting SQL code to a server, the server figures out where the data lives, how to fetch it, and how to aggregate it based on your description. You as the user don’t need to know how. With LLMs, you don’t need to know how the model generates the response. The underlying mechanisms are abstracted away.
  3. Filling in missing details: If you don’t specify onions on your cheeseburger, the chef won’t include them. If you don’t specify a field in your SQL code, it won’t show up in the output table. This is where LLMs differ slightly from declarative coding languages like SQL. If you ask ChatGPT to create an image of “a cheeseburger with lettuce and tomato” it may also show the burger on a sesame seed bun or include pickles, even if that wasn’t in your description. The details you omit are inferred by the LLM using the “average” or “most likely” detail depending on the context, with a bit of randomness thrown in. Ask for the cheeseburger image six times; it may show you three burgers with cheddar cheese, two with Swiss, and one with pepper jack.

Like other forms of declarative interaction, LLMs share one key limitation. If your description is vague, ambiguous, or lacks enough detail, then the result may not be what you hoped to see. It is up to the user to describe the result with sufficient detail.

This explains why we often iterate to get what we’re looking for when using LLMs and generative AI. Going back to our cheeseburger analogy, the process to generate a cheeseburger from an LLM may look like this.

  • “Make me a cheeseburger, medium rare, with lettuce and tomatoes.” The result also has pickles and uses cheddar cheese. The bun is toasted. There’s mayo on the top bun.
  • “Make the same thing but this time no pickles, use pepper jack cheese, and a sriracha mayo instead of plain mayo.” The result now has pepper jack, no pickles. The sriracha mayo is applied to the bottom bun and the bun is no longer toasted.
  • “Make the same thing again, but this time, put the sriracha mayo on the top bun. The buns should be toasted.” Finally, you have the cheeseburger you’re looking for.

This example demonstrates one of the main points of friction with human-AI interaction. Human beings are really bad at describing what they want with sufficient detail on the first attempt.

When we asked for a cheeseburger, we had to refine our description to be more specific (the type of cheese). In the second generation, some of the inferred details (whether the bun was toasted) changed from one iteration to the next, so then we had to add that specificity to our description as well. Iteration is an important part of AI-human generation.

Insight: When using generative AI, we need to design an iterative human-AI interaction loop that enables people to discover the details of what they want and refine their descriptions accordingly.

To iterate, we need to evaluate the results. Evaluation is extremely important with generative AI. Say you’re using an LLM to write code. You can evaluate the code quality if you know enough to understand it or if you can execute it and inspect the results. On the other hand, hypothetical questions can’t be tested. Say you ask ChatGPT, “What if we raise our product prices by 5 percent?” A seasoned expert could read the output and know from experience if a recommendation doesn’t take into account important details. If your product is property insurance, then increasing premiums by 5 percent may mean pushback from regulators, something an experienced veteran of the industry would know. For non-experts in a topic, there’s no way to tell if the “average” details inferred by the model make sense for your specific use case. You can’t test and iterate.

Insight: LLMs work best when the user can evaluate the result quickly, whether through execution or through prior knowledge.

The examples so far involve general knowledge. We all know what a cheeseburger is. When you start asking about non-general information—like when you can make dinner reservations next week—you delve into new points of friction.

In the next section we’ll think about different types of information, what we can expect the AI to “know”, and how this impacts human-AI interaction.

What did the AI know, and when did it know it?

Above, I explained how generative AI is a declarative mode of interaction and how that helps understand its strengths and weaknesses. Here, I’ll identify how different types of information create better or worse human-AI interactions.

Understanding the information available

When we describe what we want to an LLM, and when it infers missing details from our description, it draws from different sources of information. Understanding these sources of information is important. Here’s a useful taxonomy for information types:

  • General information used to train the base model.
  • Non-general information that the base model is not aware of.
    • Fresh information that is new or changes rapidly, like stock prices or current events.
    • Non-public information, like facts about you and where you live or about your company, its employees, its processes, or its codebase.

General information vs. non-general information

LLMs are built on a massive corpus of written word data. A large part of GPT-3 was trained on a combination of books, journals, Wikipedia, Reddit, and CommonCrawl (an open-source repository of web crawl data). You can think of the models as a highly compressed version of that data, organized in a gestalt manner—all the like things are close together. When we submit a prompt, the model takes the words we use (and any words added to the prompt behind the scenes) and finds the closest set of related words based on how those things appear in the data corpus. So when we say “cheeseburger” it knows that word is related to “bun” and “tomato” and “lettuce” and “pickles” because they all occur in the same context throughout many data sources. Even when we don’t specify pickles, it uses this gestalt approach to fill in the blanks.

This training information is general information, and a good rule of thumb is this: if it was in Wikipedia a year ago, then the LLM “knows” about it. There could be new articles on Wikipedia that didn’t exist when the model was trained; the LLM doesn’t know about those unless told.

Now, say you’re a company using an LLM to write a product requirements document for a new web app feature. Your company, like most companies, is full of its own lingo. It has its own lore and history scattered across thousands of Slack messages, emails, documents, and some tenured employees who remember that one meeting in Q1 last year. The LLM doesn’t know any of that. It will infer any missing details from general information. You need to supply everything else. If it wasn’t in Wikipedia a year ago, the LLM doesn’t know about it. The resulting product requirements document may be full of general facts about your industry and product but could lack important details specific to your firm.

This is non-general information. This includes personal info, anything kept behind a log-in or paywall, and non-digital information. This non-general information permeates our lives, and incorporating it is another source of friction when working with generative AI.

Non-general information can be incorporated into a generative AI application in three ways:

  • Through model fine-tuning (supplying a large corpus to the base model to expand its reference data).
  • Retrieved and fed to the model at query time (e.g., the retrieval augmented generation or “RAG” technique; a minimal sketch of this pattern appears after this list).
  • Supplied by the user in the prompt.
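As a minimal sketch of the second option, RAG boils down to fetching relevant text at query time and placing it in the prompt. The retrieval step below is left abstract because it depends on your search index; the model name and prompt wording are assumptions.

import os
from openai import OpenAI

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

def answer_with_rag(question: str, retrieved_docs: list[str]) -> str:
    """Stuff text retrieved from your own search index into the prompt at query time."""
    context = '\n\n'.join(retrieved_docs)
    response = client.chat.completions.create(
        model='gpt-4o',
        messages=[
            {'role': 'system', 'content': 'Answer the question using only the provided context.'},
            {'role': 'user', 'content': f'Context:\n{context}\n\nQuestion: {question}'},
        ],
    )
    return response.choices[0].message.content

# retrieved_docs would come from a vector database, keyword search, or both.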

Insight: When designing any human-AI interactions, you should think about what non-general information is required, where you will get it, and how you will expose it to the AI.

Fresh information

Any information that changes in real-time or is new can be called fresh information. This includes new facts like current events but also frequently changing facts like your bank account balance. If the fresh information is available in a database or some searchable source, then it needs to be retrieved and incorporated into the application. To retrieve the information from a database, the LLM must create a query, which may require specific details that the user didn’t include.

Here’s an example. I have a chatbot that gives information on the stock market. You, the user, type the following: “What is the current price of Apple? Has it been increasing or decreasing recently?”

  • The LLM doesn’t have the current price of Apple in its training data. This is fresh, non-general information. So, we need to retrieve it from a database.
  • The LLM can read “Apple”, know that you’re talking about the computer company, and that the ticker symbol is AAPL. This is all general information.
  • What about the “increasing or decreasing” part of the prompt? You did not specify over what period—increasing in the past day, month, year? In order to construct a database query, we need more detail. LLMs are bad at knowing when to ask for detail and when to fill it in. The application could easily pull the wrong data and provide an unexpected or inaccurate answer. Only you know what these details should be, depending on your intent. You must be more specific in your prompt.

A designer of this LLM application can improve the user experience by specifying required parameters for expected queries. We can ask the user to explicitly input the time range or design the chatbot to ask for more specific details if not provided. In either case, we need to have a specific type of query in mind and explicitly design how to handle it. The LLM will not know how to do this unassisted.
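One way to enforce this with the chat completions API is to expose the database lookup as a tool whose schema marks the time range as required, so the model must either infer it from the conversation or ask the user for it. The function name and fields below are assumptions for illustration.

# Passed as tools=[PRICE_HISTORY_TOOL] in client.chat.completions.create(...)
PRICE_HISTORY_TOOL = {
    'type': 'function',
    'function': {
        'name': 'get_price_history',  # hypothetical backend function
        'description': 'Get price history for a stock ticker over an explicit time range.',
        'parameters': {
            'type': 'object',
            'properties': {
                'ticker': {'type': 'string', 'description': 'Stock ticker symbol, e.g. AAPL.'},
                'period': {
                    'type': 'string',
                    'enum': ['1d', '1mo', '1y'],
                    'description': 'Time range the user wants. Ask the user if they did not specify one.',
                },
            },
            'required': ['ticker', 'period'],
        },
    },
}

If the model still produces a call without a usable period, the application can treat that as a signal to ask the user a clarifying question before running the query.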

Insight: If a user is expecting a more specific type of output, you need to explicitly ask for enough detail. Too little detail could produce a poor quality output.

Non-public information

Incorporating non-public information into an LLM prompt can be done if that information can be accessed in a database. This introduces privacy issues (should the LLM be able to access my medical records?) and complexity when incorporating multiple non-public sources of information.

Let’s say I have a chatbot that helps you make dinner reservations. You, the user, type the following: “Help me make dinner reservations somewhere with good Neapolitan pizza.”

  • The LLM knows what a Neapolitan pizza is and can infer that “dinner” means this is for an evening meal.
  • To do this task well, it needs information about your location, the restaurants near you and their booking status, or even personal details like dietary restrictions. Assuming all that non-public information is available in databases, bringing them all together into the prompt takes a lot of engineering work.
  • Even if the LLM could find the “best” restaurant for you and book the reservation, can you be confident it has done that correctly? You never specified how many people you need a reservation for. Since only you know this information, the application needs to ask for it upfront.

If you’re designing this LLM-based application, you can make some thoughtful choices to help with these problems. We could ask about a user’s dietary restrictions when they sign up for the app. Other information, like the user’s schedule that evening, can be given in a prompting tip or by showing the default prompt option “show me reservations for two for tomorrow at 7PM”. Prompting tips may not feel as automagical as a bot that does it all, but they are a straightforward way to collect and integrate the non-public information.

Some non-public information is large and can’t be quickly collected and processed when the prompt is given. This information needs to be incorporated either by batch fine-tuning or by retrieval at prompt time. A chatbot that answers questions about a company’s HR policies can obtain this information from a corpus of non-public HR documents. You can fine-tune the model ahead of time by feeding it the corpus. Or you can implement a retrieval augmented generation technique, searching the corpus for relevant documents and summarizing the results. Either way, the response will only be as accurate and up-to-date as the corpus itself.

Insight: When designing an AI application, you need to be aware of non-public information and how to retrieve it. Some of that information can be pulled from databases. Some needs to come from the user, which may require prompt suggestions or explicitly asking.

If you understand the types of information and treat human-AI interaction as declarative, you can more easily predict which AI applications will work and which ones won’t. In the next section we’ll look at OpenAI’s Operator and deep research products. Using this framework, we can see where these applications fall short, where they work well, and why.

Critiquing OpenAI’s Operator and deep research through a declarative lens

I have now explained how thinking of generative AI as declarative helps us understand its strengths and weaknesses. I also identified how different types of information create better or worse human-AI interactions.

Now I’ll apply these ideas by critiquing two recent products from OpenAI—Operator and deep research. It’s important to be honest about the shortcomings of AI applications. Bigger models trained on more data or using new techniques might one day solve some issues with generative AI. But other issues arise from the human-AI interaction itself and can only be addressed by making appropriate design and product choices.

These critiques demonstrate how the framework can help identify where the limitations are and how to address them.

The limitations of Operator

Journalist Casey Newton of Platformer reviewed Operator in an article that was largely positive. Newton has covered AI extensively and optimistically. Still, Newton couldn’t help but point out some of Operator’s frustrating limitations.

[Operator] can take action on your behalf in ways that are new to AI systems — but at the moment it requires a lot of hand-holding, and may cause you to throw up your hands in frustration. 

My most frustrating experience with Operator was my first one: trying to order groceries. “Help me buy groceries on Instacart,” I said, expecting it to ask me some basic questions. Where do I live? What store do I usually buy groceries from? What kinds of groceries do I want? 

It didn’t ask me any of that. Instead, Operator opened Instacart in the browser tab and began searching for milk in grocery stores located in Des Moines, Iowa.

The prompt “Help me buy groceries on Instacart,” viewed declaratively, describes groceries being purchased using Instacart. It doesn’t have a lot of the information someone would need to buy groceries, like what exactly to buy, when it would be delivered, and to where.

It’s worth repeating: LLMs are not good at knowing when to ask additional questions unless explicitly programmed to do so in the use case. Newton gave a vague request and expected follow-up questions. Instead, the LLM filled in all the missing details with the “average”. The average item was milk. The average location was Des Moines, Iowa. Newton doesn’t mention when it was scheduled to be delivered, but if the “average” delivery time is tomorrow, then that was likely the default.

If we engineered this application specifically for ordering groceries, keeping in mind the declarative nature of AI and the information it “knows”, then we could make thoughtful design choices that improve functionality. We would need to prompt the user to specify when and where they want groceries up front (non-public information). With that information, we could find an appropriate grocery store near them. We would need access to that grocery store’s inventory (more non-public information). If we have access to the user’s previous orders, we could also pre-populate a cart with items typical to their order. If not, we may add a few suggested items and guide them to add more. By limiting the use case, we only have to deal with two sources of non-public information. This is a more tractable problem than Operator’s “agent that does it all” approach.

Newton also mentions that this process took eight minutes to complete, and “complete” means that Operator did everything up to placing the order. This is a long time with very little human-in-the-loop iteration. Like we said before, an iteration loop is very important for human-AI interaction. A better-designed application would generate smaller steps along the way and provide more frequent interaction. We could prompt the user to describe what to add to their shopping list. The user might say, “Add barbecue sauce to the list,” and see the list update. If they see a vinegar-based barbecue sauce, they can refine that by saying, “Replace that with a barbecue sauce that goes well with chicken,” and might be happier when it’s replaced by a honey barbecue sauce. These frequent iterations make the LLM a creative tool rather than a does-it-all agent. The does-it-all agent looks automagical in marketing, but a more guided approach provides more utility with a less frustrating and more delightful experience.

Elsewhere in the article, Newton gives an example of a prompt that Operator performed well: “Put together a lesson plan on the Great Gatsby for high school students, breaking it into readable chunks and then creating assignments and connections tied to the Common Core learning standard.” This prompt describes an output using much more specificity. It also solely relies on general information—the Great Gatsby, the Common Core standard, and a general sense of what assignments are. The general-information use case lends itself better to AI generation, and the prompt is explicit and detailed in its request. In this case, very little guidance was given to create the prompt, so it worked better. (In fact, this prompt comes from Ethan Mollick who has used it to evaluate AI chatbots.)

This is the risk of general-purpose AI applications like Operator. The quality of the result relies heavily on the use case and specificity provided by the user. An application with a more specific use case allows for more design guidance and can produce better output more reliably.

The limitations of deep research

Newton also reviewed deep research, which, according to OpenAI’s website, is an “agent that uses reasoning to synthesize large amounts of online information and complete multi-step research tasks for you.”

Deep research came out after Newton’s review of Operator. Newton chose an intentionally tricky prompt that prods at some of the tool’s limitations regarding fresh information and non-general information: “I wanted to see how OpenAI’s agent would perform given that it was researching a story that was less than a day old, and for which much of the coverage was behind paywalls that the agent would not be able to access. And indeed, the bot struggled more than I expected.”

Near the end of the article, Newton elaborates on some of the shortcomings he noticed with deep research.

OpenAI’s deep research suffers from the same design problem that almost all AI products have: its superpowers are completely invisible and must be harnessed through a frustrating process of trial and error.

Generally speaking, the more you already know about something, the more useful I think deep research is. This may be somewhat counterintuitive; perhaps you expected that an AI agent would be well suited to getting you up to speed on an important topic that just landed on your lap at work, for example. 

In my early tests, the reverse felt true. Deep research excels for drilling deep into subjects you already have some expertise in, letting you probe for specific pieces of information, types of analysis, or ideas that are new to you.

The “frustrating trial and error” shows a mismatch between Newton’s expectations and a necessary aspect of many generative AI applications. A good response requires more information than the user will probably give in the first attempt. The challenge is to design the application and set the user’s expectations so that this interaction is not frustrating but exciting.

Newton’s more poignant criticism is that the application requires already knowing something about the topic for it to work well. From the perspective of our framework, this makes sense. The more you know about a topic, the more detail you can provide. And as you iterate, having knowledge about a topic helps you observe and evaluate the output. Without the ability to describe it well or evaluate the results, the user is less likely to use the tool to generate good output.

A version of deep research designed for lawyers to perform legal research could be powerful. Lawyers have an extensive and common vocabulary for describing legal matters, and they’re more likely to see a result and know if it makes sense. Generative AI tools are fallible, though. So, the tool should focus on a generation-evaluation loop rather than writing a final draft of a legal document.

The article also highlights many improvements compared to Operator. Most notably, the bot asked clarifying questions. This is the most impressive aspect of the tool. Undoubtedly, it helps that deep research has a focused use case of retrieving and summarizing general information instead of a does-it-all approach. Having a focused use case narrows the set of likely interactions, letting you design better guidance into the prompt flow.

Good application design with generative AI

Designing effective generative AI applications requires thoughtful consideration of how users interact with the technology, the types of information they need, and the limitations of the underlying models. Here are some key principles to guide the design of generative AI tools:

1. Constrain the input and focus on providing details

Applications are inputs and outputs. We want the outputs to be useful and pleasant. By giving a user a conversational chatbot interface, we allow for a vast surface area of potential inputs, making it a challenge to guarantee useful outputs. One strategy is to limit or guide the input to a more manageable subset.

For example, FigJam, a collaborative whiteboarding tool, uses pre-set template prompts for timelines, Gantt charts, and other common whiteboard artifacts. This provides some structure and predictability to the inputs. Users still have the freedom to describe further details like color or the content for each timeline event. This approach ensures that the AI has enough specificity to generate meaningful outputs while giving users creative control.

2. Design frequent iteration and evaluation into the tool

Iterating in a tight generation-evaluation loop is essential for refining outputs and ensuring they meet user expectations. OpenAI’s DALL·E is great at this. Users quickly iterate on image prompts and refine their descriptions to add additional detail. If you type “a picture of a cheeseburger on a plate”, you may then add more detail by specifying “with pepper jack cheese”.

AI code generating tools work well because users can run a generated code snippet immediately to see if it works, enabling rapid iteration and validation. This quick evaluation loop produces better results and a better coder experience. 

Designers of generative AI applications should pull the user into the loop early and often, in a way that is engaging rather than frustrating. Designers should also consider the user’s knowledge level. Users with domain expertise can iterate more effectively.

Referring back to the FigJam example, the prompts and icons in the app quickly communicate “this is what we call a mind map” or “this is what we call a Gantt chart” for users who want to generate these artifacts but don’t know the terms for them. Giving the user some basic vocabulary can help them better generate desired results quickly with less frustration.

3. Be mindful of the types of information needed

LLMs excel at tasks involving general knowledge already in the base training set. For example, writing class assignments involves absorbing general information, synthesizing it, and producing a written output, so LLMs are very well-suited for that task.

Use cases that require non-general information are more complex. Some questions the designer and engineer should ask include:

  • Does this application require fresh information? Maybe this is knowledge of current events or a user’s current bank account balance. If so, that information needs to be retrieved and incorporated into the model.
  • How much non-general information does the LLM need to know? If it’s a lot of information—like a corpus of company documentation and communication—then the model may need to be fine-tuned in batch ahead of time. If the information is relatively small, a retrieval augmented generation (RAG) approach at query time may suffice. 
  • How many sources of non-general information are there—a small, finite set or a potentially infinite one? General purpose agents like Operator face the challenge of potentially infinite non-general information sources. Depending on what the user requires, it could need to access their contacts, restaurant reservation lists, financial data, or even other people’s calendars. A single-purpose restaurant reservation chatbot may only need access to Yelp, OpenTable, and the user’s calendar. It’s much easier to reconcile access and authentication for a handful of known data sources.
  • Is there context-specific information that can only come from the user? Consider our restaurant reservation chatbot. Is the user making reservations for just themselves? Probably not. “How many people and who” is a detail that only the user can provide, an example of non-public information that only the user knows. We shouldn’t expect the user to provide this information upfront and unguided. Instead, we can use prompt suggestions so they include the information. We may even be able to design the LLM to ask these questions when the detail is not provided.

4. Focus on specific use cases

Broad, all-purpose chatbots often struggle to deliver consistent results due to the complexity and variability of user needs. Instead, focus on specific use cases where the AI’s shortcomings can be mitigated through thoughtful design.

Narrowing the scope helps us address many of the issues above.

  • We can identify common requests for the use case and incorporate those into prompt suggestions.
  • We can design an iteration loop that works well with the type of thing we’re generating.
  • We can identify sources of non-general information and devise solutions to incorporate it into the model or prompt.

5. Translation or summary tasks work well

A common task for ChatGPT is to rewrite something in a different style, explain what some computer code is doing, or summarize a long document. These tasks involve converting a set of information from one form to another.

We have the same concerns about non-general information and context. For instance, a chatbot asked to explain a code script doesn’t know the system that script is part of unless that information is provided.

But in general, the task of transforming or summarizing information is less prone to missing details. By definition, you have provided the details it needs. The result should have the same information in a different or more condensed form.

The exception to the rules

There is a case when it doesn’t matter if you break any or all of these rules—when you’re just having fun. LLMs are creative tools by nature. They can be an easel to paint on, a sandbox to build in, a blank sheet to scribe. Iteration is still important; the user wants to see the thing they’re creating as they create it. But unexpected results due to lack of information or omitted details may add to the experience. If you ask for a cheeseburger recipe, you might get some funny or interesting ingredients. If the stakes are low and the process is its own reward, don’t worry about the rules.

Avoidable and Unavoidable Randomness in GPT-4o
https://towardsdatascience.com/avoidable-and-unavoidable-randomness-in-gpt-4o/
Exploring the sources of randomness in GPT-4o, from the known and controllable to the opaque and uncontrollable.

Of course there is randomness in GPT-4o’s outputs. After all, the model samples from a probability distribution when choosing each token. But what I didn’t understand was that those very probabilities themselves are not deterministic. Even with consistent prompts, fixed seeds, and temperature set to zero, GPT-4o still introduces subtle, frustrating randomness.

There’s no fix for this, and it might not even be something OpenAI could fix if they wanted to, just so we’re clear up front about where this article is headed. Along the way, we’ll examine all the sources of randomness in GPT-4o output, which will require us to break down the sampling process to a low level. We’ll point at the issue—the probabilities vary—and critically examine OpenAI’s official guidance on determinism.

First, though, let’s talk about why determinism matters. Determinism means that the same input always produces the same output, like a mathematical function. While LLM creativity is often desirable, determinism serves crucial purposes: researchers need it for reproducible experiments, developers for verifying reported results, and prompt engineers for debugging their changes. Without it, you’re left wondering if different outputs stem from your tweaks or just the random number generator’s mood swings.

Flipping a coin

We’re going to keep things extremely simple here and prompt the most recent version of GPT-4o (gpt-4o-2024-08-06 in the API) with this:

 Flip a coin. Return Heads or Tails only.

Flipping a coin with LLMs is a fascinating topic in itself (see for example Van Koevering & Kleinberg, 2024 in the references), but here, we’ll use it as a simple binary question with which to explore determinism, or the lack thereof.

This is our first attempt.

import os
from openai import OpenAI
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

response = client.chat.completions.create(
    model='gpt-4o-2024-08-06',
    messages=[{'role': 'user', 'content': prompt}],
)

print(response.choices[0].message.content)

Running the code gave me Heads. Maybe you’ll get Tails, or if you’re really lucky, something far more interesting.

The code first initializes an OpenAI client with an API key set in the environment variable OPENAI_API_KEY (to avoid sharing billing credentials here). The main action happens with client.chat.completions.create, where we specify the model to use and send the prompt (as a part of a very simple conversation named messages) to the server. We get an object called response back from the server. This object contains a lot of information, as shown below, so we need to dig into it to extract GPT-4o’s actual response to the message, which is response.choices[0].message.content.

>>> response
ChatCompletion(id='chatcmpl-B48EqZBLfUWtp9H7cwnchGTJbBDwr', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Heads', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1740324680, model='gpt-4o-2024-08-06', object='chat.completion', service_tier='default', system_fingerprint='fp_eb9dce56a8', usage=CompletionUsage(completion_tokens=2, prompt_tokens=18, total_tokens=20, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)))

Now let’s flip the coin ten times. If this were a real, fair coin, of course, we would expect roughly equal heads and tails over time thanks to the law of large numbers. But GPT-4o’s coin doesn’t work quite like that.

import os
from openai import OpenAI
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

for _ in range(10):
    response = client.chat.completions.create(
        model='gpt-4o-2024-08-06',
        messages=[{'role': 'user', 'content': prompt}],
    )
    print(response.choices[0].message.content)

Running this code gave me the following output, although you might get different output, of course.

Heads
Heads
Heads
Heads
Heads
Heads
Tails
Heads
Heads
Heads

GPT-4o’s coin is clearly biased, but so are humans. Bar-Hillel, Peer, and Acquisti (2014) found that people flipping imaginary coins choose “heads” 80% of the time. Maybe GPT-4o learned that from us. But whatever the reason, we’re just using this simple example to explore determinism.

Just how biased is GPT-4o’s coin?

Let’s say we wanted to know precisely what percentage of GPT-4o coin flips land Heads.

Rather than the obvious (but expensive) approach of flipping it a million times, there’s a smarter way. For classification tasks with a small set of possible answers, we can extract token probabilities instead of generating full responses. With the right prompt, the first token carries all the necessary information, making these API calls incredibly cheap: around 30,000 calls per dollar, since each requires just 18 (cached) input tokens and 1 output token.

OpenAI gives us (natural) log probabilities. These are called logprobs in the code, and we convert them to regular probabilities by exponentiation. (We’ll discuss temperature soon, but note that exponentiating logprobs directly like this corresponds to a temperature setting of 1.0, and is how we calculate probabilities throughout this article). OpenAI lets us request logprobs for the top 20 most likely tokens, so we do that.

import os
import math
from openai import OpenAI
from tabulate import tabulate

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

response = client.chat.completions.create(
    model='gpt-4o-2024-08-06',
    max_tokens=1,
    logprobs=True,
    top_logprobs=20,
    messages=[{'role': 'user', 'content': prompt}],
)

logprobs_list = response.choices[0].logprobs.content[0].top_logprobs

data = []
total_pct = 0.0

for logprob_entry in logprobs_list:
    token = logprob_entry.token
    logprob = logprob_entry.logprob
    pct = math.exp(logprob) * 100  # Convert logprob to a percentage
    total_pct += pct
    data.append([token, logprob, pct])

print(
    tabulate(
        data,
        headers=["Token", "Log Probability", "Percentage (%)"],
        tablefmt="github",
        floatfmt=("s", ".10f", ".10f")
    )
)
print(f"\nTotal probabilities: {total_pct:.6f}%")

If you run this, you’ll get something like the following output, but actual numbers will vary.

| Token     |   Log Probability |   Percentage (%) |
|-----------|-------------------|------------------|
| Heads     |     -0.0380541235 |    96.2660836887 |
| T         |     -3.2880542278 |     3.7326407467 |
| Sure      |    -12.5380544662 |     0.0003587502 |
| Head      |    -12.7880544662 |     0.0002793949 |
| Tail      |    -13.2880544662 |     0.0001694616 |
| Certainly |    -13.5380544662 |     0.0001319768 |
| "T        |    -14.2880544662 |     0.0000623414 |
| I'm       |    -14.5380544662 |     0.0000485516 |
| heads     |    -14.5380544662 |     0.0000485516 |
| Heads     |    -14.9130544662 |     0.0000333690 |
| "         |    -15.1630544662 |     0.0000259878 |
| _heads    |    -15.1630544662 |     0.0000259878 |
| tails     |    -15.5380544662 |     0.0000178611 |
| HEAD      |    -15.7880544662 |     0.0000139103 |
| TAIL      |    -16.2880535126 |     0.0000084370 |
| T         |    -16.7880535126 |     0.0000051173 |
| ```       |    -16.7880535126 |     0.0000051173 |
| Here's    |    -16.9130535126 |     0.0000045160 |
| I         |    -17.2880535126 |     0.0000031038 |
| As        |    -17.2880535126 |     0.0000031038 |

Total probabilities: 99.999970%

Looking at these probabilities, we see Heads at ≈96% and T at ≈4%. Our prompt is doing pretty well at constraining the model’s responses. Why T and not Tails? This is the tokenizer splitting Tails into T + ails, while keeping Heads as one piece, as we can see in this Python session:

>>> import tiktoken
>>> encoding = tiktoken.encoding_for_model("gpt-4o-2024-08-06")
>>> encoding.encode('Tails')
[51, 2196]
>>> encoding.decode([51])
'T'
>>> encoding.encode('Heads')
[181043]

These probabilities are not deterministic

Run the code to display the probabilities for the top 20 tokens again, and you’ll likely get different numbers. Here’s what I got on a second running.

| Token     |   Log Probability |   Percentage (%) |
|-----------|-------------------|------------------|
| Heads     |     -0.0110520627 |    98.9008786933 |
| T         |     -4.5110521317 |     1.0986894433 |
| Certainly |    -14.0110521317 |     0.0000822389 |
| Head      |    -14.2610521317 |     0.0000640477 |
| Sure      |    -14.2610521317 |     0.0000640477 |
| Tail      |    -14.3860521317 |     0.0000565219 |
| heads     |    -15.3860521317 |     0.0000207933 |
| Heads     |    -15.5110521317 |     0.0000183500 |
| ```       |    -15.5110521317 |     0.0000183500 |
| _heads    |    -15.6360521317 |     0.0000161938 |
| tails     |    -15.6360521317 |     0.0000161938 |
| I'm       |    -15.8860521317 |     0.0000126117 |
| "T        |    -15.8860521317 |     0.0000126117 |
| As        |    -16.3860511780 |     0.0000076494 |
| "         |    -16.5110511780 |     0.0000067506 |
| HEAD      |    -16.6360511780 |     0.0000059574 |
| TAIL      |    -16.7610511780 |     0.0000052574 |
| Here's    |    -16.7610511780 |     0.0000052574 |
| ``        |    -17.1360511780 |     0.0000036133 |
| T         |    -17.6360511780 |     0.0000021916 |

Total probabilities: 99.999987%

In their cookbook, OpenAI offers the following advice on receiving “mostly identical” outputs:

If the seed, request parameters, and system_fingerprint all match across your requests, then model outputs will mostly be identical. There is a small chance that responses differ even when request parameters and system_fingerprint match, due to the inherent non-determinism of our models.

They also give “mostly identical” advice in the reproducible outputs section of their documentation.

The request parameters that could affect randomness are temperature and seed. OpenAI also suggests we track system_fingerprint, because differences here might cause differences in output. We’ll examine each of these below, but spoiler: none of them will fix or even explain this non-determinism.

Temperature, and why it won’t fix this

Temperature controls how random the model’s responses are. Low temperatures (<0.5) make it robotic and predictable, medium temperatures (0.7–1.3) allow some creativity, and high temperatures (>1.5) produce gibberish. Temperature is often called the “creativity parameter”, but this is an oversimplification. In their analysis, Peeperkorn, Kouwenhoven, Brown, and Jordanous (2024) evaluated LLM outputs across four dimensions of creativity: novelty (originality), coherence (logical consistency), cohesion (how well the text flows), and typicality (how well it fits expected patterns). They observed that:

temperature is weakly correlated with novelty, and unsurprisingly, moderately correlated with incoherence, but there is no relationship with either cohesion or typicality.

But this is beside the point for coin flipping. Under the hood, the log probabilities are divided by the temperature before they’re exponentiated and renormalized into probabilities. This creates a non-linear effect: temperature=0.5 squares the probabilities, making likely tokens dominate, while temperature=2.0 applies a square root, flattening the distribution.
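
To make this concrete, here is a minimal sketch (my own illustration, not OpenAI’s internal code) that rescales the two dominant logprobs from the first table above at a few temperatures:

import math

def probs_with_temperature(logprobs, T):
    # Divide each logprob by T, exponentiate, then renormalize so they sum to 1.
    exps = [math.exp(lp / T) for lp in logprobs]
    total = sum(exps)
    return [e / total for e in exps]

# The Heads and T logprobs from the first table above. Renormalizing over
# just these two tokens is a simplification for illustration.
logprobs = [-0.0380541235, -3.2880542278]

for T in (0.5, 1.0, 2.0):
    heads, t = probs_with_temperature(logprobs, T)
    print(f'T={T}: Heads={heads:.4%}, T(ails)={t:.4%}')

At T=0.5 the Heads probability climbs above 99.8%, while at T=2.0 it drops to roughly 84%, which is exactly the sharpening and flattening described above.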

What about temperature=0.0? Instead of breaking the math by dividing by zero, the model simply picks the highest-probability token. Sounds deterministic, right? Not quite. Here’s the catch: temperature only comes into play after the log probabilities are computed, when we convert them to probabilities.

In summary: if the logprobs aren’t deterministic, setting temperature to 0.0 won’t make the model deterministic.

In fact, since we’re just asking the model for the raw logprobs directly rather than generating full responses, the temperature setting doesn’t come into play in our code at all.

Seeds, and why they won’t fix this

After temperature is used to compute probabilities, the model samples from these probabilities to pick the next token. OpenAI gives us a little control over the sampling process by letting us set the seed parameter for the random number generator. In an ideal world, setting a seed would give us determinism at any temperature. But seeds only affect sampling, not the log probabilities before sampling.

In summary: if the logprobs aren’t deterministic, setting a seed won’t make the model deterministic.

In fact, seed only matters with non-zero temperatures. With temperature=0.0, the model is always choosing the highest probability token regardless of the seed. Again, since we’re just asking the model for the raw logprobs directly rather than sampling, neither of these settings can help us achieve determinism.

System fingerprints, our last hope

The system_fingerprint identifies the current combination of model weights, infrastructure, and configuration options in OpenAI’s backend. At least, that’s what OpenAI tells us. Variations in system fingerprints might indeed explain variations in logprobs. Except that they don’t, as we will verify below.

Nothing can get you determinism

Let’s confirm what we’ve been building toward. We’ll run the same request 10 times with every safeguard in place. Even though neither of these parameters should matter for what we’re doing, you can never be too safe, so we’ll set temperature=0.0 and seed=42. And to see if infrastructure differences explain our varying logprobs, we’ll print system_fingerprint. Here’s the code:

import os
import math
from openai import OpenAI
from tabulate import tabulate
from tqdm import tqdm

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

data = []

for _ in tqdm(range(10), desc='Generating responses'):
    response = client.chat.completions.create(
        model='gpt-4o-2024-08-06',
        temperature=0.0,
        seed=42,
        max_tokens=1,
        logprobs=True,
        top_logprobs=20,
        messages=[{'role': 'user', 'content': prompt}],
    )

    fingerprint = response.system_fingerprint
    logprobs_list = response.choices[0].logprobs.content[0].top_logprobs
    heads_logprob = next(
        entry.logprob for entry in logprobs_list if entry.token == 'Heads'
    )
    pct = math.exp(heads_logprob) * 100
    data.append([fingerprint, heads_logprob, f"{pct:.10f}%"])

headers = ["Fingerprint", "Logprob", "Probability"]
print(tabulate(data, headers=headers, tablefmt="pipe"))

Running this 10 times, here are the logprobs and probabilities for the token Heads:

| Fingerprint   |    Logprob | Probability    |
|---------------|------------|----------------|
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.160339  | 85.1854886858% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0110521 | 98.9008786933% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |

Mixture-of-experts makes determinism impossible

OpenAI is decidedly not open about the architecture behind GPT-4o. However, it’s widely believed that GPT-4o uses a mixture-of-experts (MoE) architecture with either 8 or 16 experts.

According to a paper by Google DeepMind researchers Puigcerver, Riquelme, Mustafa, and Houlsby (hat tip to user elmstedt on the OpenAI forum), mixture-of-experts architectures may add an unavoidable level of non-determinism:

Under capacity constraints, all Sparse MoE approaches route tokens in groups of a fixed size and enforce (or encourage) balance within the group. When groups contain tokens from different sequences or inputs, these tokens compete for available spots in expert buffers. Therefore, the model is no longer deterministic at the sequence-level, but only at the batch-level.

In other words, when your prompt (a sequence of tokens, in the quote above) reaches OpenAI’s servers, it gets batched with a group of other prompts (OpenAI isn’t open about how many other prompts). Each prompt in the batch is then routed to an “expert” within the model. However, since only so many prompts can be routed to the same expert, the expert your prompt gets routed to will depend on all the other prompts in the batch.

This “competition” for experts introduces a real-world randomness completely beyond our control.

Non-determinism beyond mixture-of-experts

While non-determinism may be inherent to real-world mixture-of-experts models, that does not seem to be the only source of non-determinism in OpenAI’s models.

Making a few changes to our code above (switching to gpt-3.5-turbo-0125, looking for the token He since GPT-3.5’s tokenizer splits “Heads” differently, and ignoring system_fingerprint because this model doesn’t have it) reveals that GPT-3.5-turbo also exhibits non-deterministic logprobs:

|     Logprob | Probability    |
|-------------|----------------|
| -0.00278289 | 99.7220983436% |
| -0.00415331 | 99.5855302068% |
| -0.00258838 | 99.7414961980% |
| -0.00204034 | 99.7961735289% |
| -0.00240277 | 99.7600117933% |
| -0.00204034 | 99.7961735289% |
| -0.00204034 | 99.7961735289% |
| -0.00258838 | 99.7414961980% |
| -0.00351419 | 99.6491976144% |
| -0.00201214 | 99.7989878007% |

No one is claiming that GPT-3.5-turbo uses a mixture-of-experts architecture. Thus, there must be additional factors beyond mixture-of-experts contributing to this non-determinism.

What 10,000 GPT-4o coin flip probabilities tell us

To better understand the patterns and magnitude of this non-determinism, I conducted a more extensive experiment with GPT-4o, performing 10,000 “coin flips” while recording the probability assigned to “Heads” in each case.

The results reveal something fascinating. Across 10,000 API calls with identical parameters, GPT-4o produced not just a few different probability values, but 42 distinct probabilities. If the mixture-of-experts hypothesis were the complete explanation for non-determinism in GPT-4o, we might expect to see one distinct probability for each expert. But GPT-4o is believed to have either 8 or 16 experts, not 42.

In the output below, I clustered these probabilities, ensuring that each cluster was separated from the others by 0.01 (as a raw percentage). This groups the output into 12 clusters.

Probability          Count           Fingerprints
------------------------------------------------------------------
85.1854379113%       5               fp_eb9dce56a8, fp_f9f4fb6dbf
85.1854455275%       74              fp_eb9dce56a8, fp_f9f4fb6dbf
85.1854886858%       180             fp_eb9dce56a8, fp_f9f4fb6dbf
------------------------------------------------------------------
88.0662448207%       31              fp_eb9dce56a8, fp_f9f4fb6dbf
88.0678628883%       2               fp_f9f4fb6dbf
------------------------------------------------------------------
92.3997629747%       1               fp_eb9dce56a8
92.3997733012%       4               fp_eb9dce56a8
92.3997836277%       3               fp_eb9dce56a8
------------------------------------------------------------------
92.4128943690%       1               fp_f9f4fb6dbf
92.4129143363%       21              fp_eb9dce56a8, fp_f9f4fb6dbf
92.4129246643%       8               fp_eb9dce56a8, fp_f9f4fb6dbf
------------------------------------------------------------------
93.9906837191%       4               fp_eb9dce56a8
------------------------------------------------------------------
95.2569999350%       36              fp_eb9dce56a8
------------------------------------------------------------------
96.2660836887%       3391            fp_eb9dce56a8, fp_f9f4fb6dbf
96.2661285161%       2636            fp_eb9dce56a8, fp_f9f4fb6dbf
------------------------------------------------------------------
97.0674551052%       1               fp_eb9dce56a8
97.0674778863%       3               fp_eb9dce56a8
97.0675003058%       4               fp_eb9dce56a8
97.0675116963%       1               fp_eb9dce56a8
97.0680739932%       19              fp_eb9dce56a8, fp_f9f4fb6dbf
97.0681293191%       6               fp_eb9dce56a8, fp_f9f4fb6dbf
97.0681521003%       74              fp_eb9dce56a8, fp_f9f4fb6dbf
97.0682421405%       4               fp_eb9dce56a8
------------------------------------------------------------------
97.7008960695%       1               fp_f9f4fb6dbf
97.7011122645%       3               fp_eb9dce56a8
97.7011462953%       3               fp_eb9dce56a8
97.7018178132%       1               fp_eb9dce56a8
------------------------------------------------------------------
98.2006069902%       426             fp_eb9dce56a8, fp_f9f4fb6dbf
98.2006876548%       6               fp_f9f4fb6dbf
98.2007107019%       1               fp_eb9dce56a8
98.2009525133%       5               fp_eb9dce56a8
98.2009751945%       1               fp_eb9dce56a8
98.2009867181%       1               fp_eb9dce56a8
------------------------------------------------------------------
98.5930987656%       3               fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931104270%       235             fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931222721%       4               fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931340253%       9               fp_eb9dce56a8
98.5931571644%       159             fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931805790%       384             fp_eb9dce56a8
------------------------------------------------------------------
98.9008436920%       95              fp_eb9dce56a8, fp_f9f4fb6dbf
98.9008550214%       362             fp_eb9dce56a8, fp_f9f4fb6dbf
98.9008786933%       1792            fp_eb9dce56a8, fp_f9f4fb6dbf

(With a threshold of 0.001 there are 13 clusters, and with a threshold of 0.0001 there are 17 clusters.)
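
For reference, the clustering step itself is just a greedy pass over the sorted probabilities; here is a minimal sketch (not the exact script used for the experiment):

def cluster_probabilities(probabilities, threshold=0.01):
    # Group sorted probabilities (as raw percentages) into clusters, starting
    # a new cluster whenever the gap to the previous value exceeds `threshold`.
    clusters = []
    for p in sorted(probabilities):
        if clusters and p - clusters[-1][-1] <= threshold:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters

# A few of the observed values from above fall into three clusters at 0.01:
observed = [96.2660836887, 96.2661285161, 98.9008786933, 85.1854886858]
print(len(cluster_probabilities(observed)))  # 3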

As the results above demonstrate, this multitude of results cannot be explained by system_fingerprint values. Across all 10,000 calls, I received only two different system fingerprints: 4488 results with fp_f9f4fb6dbf and 5512 with fp_eb9dce56a8, and for the most part the two system fingerprints returned the same sets of probabilities, rather than each fingerprint producing its own distinct set of probabilities.

It could be that these 12 clusters of probabilities represent 12 different experts. Even assuming that, the variations within the clusters remain puzzling. These don’t seem likely to be simple rounding errors, because they are too systematic and consistent. Take the giant cluster at around 96.266% with two distinct probabilities representing over half of our coin flips. The difference between these two probabilities, 0.0000448274%, is tiny but persistent.

Conclusion: Non-determinism is baked in

There is an underlying randomness in the log probabilities returned by all currently available non-thinking OpenAI models: GPT-4o, GPT-4o-mini, and the two flavors of GPT-3.5-turbo. Because this non-determinism is baked into the log probabilities, there’s no way for a user to get around it. Temperature and seed values have no effect, and system fingerprints don’t explain it.

While mixture-of-experts architectures inherently introduce some randomness in the competition for experts, the non-determinism in GPT-4o seems to go far beyond this, and the non-determinism in GPT-3.5-turbo can’t be explained by this at all, because GPT-3.5-turbo isn’t a mixture-of-experts model.

While we can’t verify this claim any more because the model isn’t being served, this behaviour wasn’t seen with GPT-3, according to user _j on the OpenAI forum:

It is a symptom that was not seen on prior GPT-3 AI models where across hundreds of trials to investigate sampling, you never had to doubt that logprobs would be the same. Even if you found a top-2 answer that returned exactly the same logprob value via the API, you would never see them switch position or return different values.

This suggests that whatever is causing this randomness first emerged in either GPT-3.5 or GPT-3.5-turbo.

But regardless of when it emerged, this non-determinism is a serious obstacle to understanding these models. If you want to study a model—how it generalizes, how it biases responses, how it assigns probabilities to different tokens—you need consistency. But as we’ve seen, even when we lock down every knob OpenAI lets us touch, we still can’t get an answer to the simplest possible question: “what is the probability that GPT-4o says a coin lands heads?”

Worse, while mixture-of-experts explains some of this non-determinism, there are clearly other, hidden sources of randomness that we can’t see, control, or understand. In an ideal world, the API would provide more transparency by telling us which expert processed our request or by offering additional parameters to control this routing process. Without such visibility, we’re left guessing at the true nature of the variability.

References

Bar-Hillel, M., Peer, E., & Acquisti, A. (2014). “Heads or tails?” – A reachability bias in binary choice. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40(6), 1656–1663. https://doi.org/10.1037/xlm0000005.

Peeperkorn, M., Kouwenhoven, T., Brown, D., & Jordanous, A. (2024). Is temperature the creativity parameter of Large Language Models?. In The 15th International Conference on Computational Creativity (ICCC’24). arXiv:2405.00492.

Puigcerver, J., Riquelme, C., Mustafa, B., & Houlsby, N. (2024). From sparse to soft mixtures of experts. In The Twelfth International Conference on Learning Representations (ICLR 2024). https://openreview.net/forum?id=jxpsAj7ltE. arXiv:2308.00951.

Van Koevering, K., & Kleinberg, J. (2024). How random is random? Evaluating the randomness and humanness of LLMs’ coin flips. arXiv:2406.00092.

The post Avoidable and Unavoidable Randomness in GPT-4o appeared first on Towards Data Science.

]]>
Unraveling Large Language Model Hallucinations https://towardsdatascience.com/unraveling-large-language-model-hallucinations/ Fri, 28 Feb 2025 20:15:37 +0000 https://towardsdatascience.com/?p=598569 Understanding hallucinations as emergent cognitive effects of the training pipeline

The post Unraveling Large Language Model Hallucinations appeared first on Towards Data Science.

]]>
Introduction

In a YouTube video titled Deep Dive into LLMs like ChatGPT, Andrej Karpathy, former Senior Director of AI at Tesla, discusses the psychology of Large Language Models (LLMs) as emergent cognitive effects of the training pipeline. This article is inspired by his explanation of LLM hallucinations and the information presented in the video.

You might have seen model hallucinations. They are the instances where LLMs generate incorrect, misleading, or entirely fabricated information that appears plausible. These hallucinations happen because LLMs do not “know” facts in the way humans do; instead, they predict words based on patterns in their training data. Early models released a few years ago struggled significantly with hallucinations. Over time, mitigation strategies have improved the situation, though hallucinations haven’t been fully eliminated.

An illustrative example of LLM hallucinations (Image by Author)

Zyler Vance is a completely fictitious name I came up with. When I input the prompt “Who is Zyler Vance?” into the falcon-7b-instruct model, it generates fabricated information. Zyler Vance is not a character in The Cloverfield Paradox (2018) movie. This model, being an older version, is prone to hallucinations.

LLM Training Pipeline

To understand where these hallucinations originate from, you have to be familiar with the training pipeline. Training LLMs typically involves three major stages.

  1. Pretraining
  2. Post-training: Supervised Fine-Tuning (SFT)
  3. Post-training: Reinforcement Learning with Human Feedback (RLHF)

Pretraining

This is the initial stage of the training for LLMs. During pretraining the model is exposed to a huge quantity of very high-quality and diverse text crawled from the internet. Pretraining helps the model learn general language patterns, grammar, and facts. The output of this training phase is called the base model. It is a token simulator that predicts the next word in a sequence.

To get a sense of what a pretraining dataset might look like, you can see the FineWeb dataset. The FineWeb dataset is fairly representative of what you might see in an enterprise-grade language model. All the major LLM providers like OpenAI, Google, or Meta will have some equivalent internal dataset like FineWeb.

Post-Training: Supervised Fine-Tuning

As I mentioned before, the base model is a token simulator. It simply samples internet text documents. We need to turn this base model into an assistant that can answer questions. Therefore, the pretrained model is further refined using a dataset of conversations. These conversation datasets have hundreds of thousands of conversations that are multi-turn and very long, covering a diverse breadth of topics.

Illustrative human assistant conversations from InstructGPT distribution

These conversations come from human labelers. Given a conversational context, human labelers write out the ideal response for an assistant in any situation. Later, we take the base model that was trained on internet documents, swap its training data for this dataset of conversations, and then continue training on it. This way, the model adjusts rapidly and learns the statistics of how an assistant responds to queries. At the end of training, the model is able to imitate human-like responses.

OpenAssistant/oasst1 is one of the open-source conversation datasets available on Hugging Face. It is a human-generated and human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages.

Post-training: Reinforcement Learning with Human Feedback

Supervised Fine-Tuning makes the model capable. However, even a well-trained model can generate misleading, biased, or unhelpful responses. Therefore, Reinforcement Learning with Human Feedback is required to align it with human expectations.

We start with the assistant model, trained by SFT. For a given prompt we generate multiple model outputs. Human labelers rank or score multiple model outputs based on quality, safety, and alignment with human preferences. We use these data to train a whole separate neural network that we call a reward model.

The reward model imitates human scores. It is a simulator of human preferences. It is a completely separate neural network, probably with a transformer architecture, but it is not a language model in the sense that it generates diverse language. It’s just a scoring model.

Now the LLM is fine-tuned using reinforcement learning, where the reward model provides feedback on the quality of the generated outputs. So instead of asking a real human, we’re asking a simulated human for their score of an output. The goal is to maximize the reward signal, which reflects human preferences.

Why Hallucinations?

Now that we have a clearer understanding of the training process of large language models, we can continue with our discussion on hallucinations.

Hallucinations originate from the Supervised Fine-Tuning stage of the training pipeline. The following is a specific example of three potential conversations you might have on your training set.

Examples of human-assistant conversations (Image by Author)

As I have shown earlier, this is what human-assistant conversations look like at training time. These conversations are created by human labelers under strict guidelines. When writing the correct answer for the assistant in each of these cases, the labeler either knows the person in question or researches them on the internet. They then write an assistant response in the confident tone of a definitive answer.

At test time, if the model is asked about an individual it has not seen during training, it does not simply respond with an acknowledgment of ignorance. Simply put, it does not reply with “Oh, I don’t know”. Instead, the model statistically imitates the training set.

In the training set, questions of the form “Who is X?” are confidently answered with the correct answer. Therefore, at test time, the model replies in the same style and gives its statistically most likely guess. It just makes stuff up that is statistically consistent with the style of answers in its training set.

Model Interrogation

Our question now is how to mitigate the hallucinations. It is evident that our dataset should include examples where the correct answer for the assistant is that the model does not know about some particular fact. However, these answers must be produced only in instances where the model actually does not know. So the key question is how do we know what the model knows and what it does not? We need to probe the model to figure that out empirically.

The task is to figure out the boundary of the model’s knowledge. Therefore, we need to interrogate the model to figure out what it knows and doesn’t know. Then we can add examples to the training set for the things that the model doesn’t know. The correct response, in such cases, is that the model does not know them.

An example of a training instance where the model doesn’t know the answer to a particular question

Let’s take a look at how Meta dealt with hallucinations using this concept for the Llama 3 series of models.

In their 2024 paper titled “The Llama 3 Herd of Models”, the Llama team at Meta describes how they developed a knowledge-probing technique to achieve this. Their primary approach involves generating data that aligns model generations with subsets of factual data present in the pre-training data. They describe the following procedure for the data generation process:

  1. Extract a data snippet from the pre-training data.
  2. Generate a factual question about these snippets (context) by prompting Llama 3.
  3. Sample responses from Llama 3 to the question.
  4. Score the correctness of the generations using the original context as a reference and Llama 3 as a judge.
  5. Score the informativeness of the generations using Llama 3 as a judge.
  6. Generate a refusal for responses which are consistently informative and incorrect across the generations, using Llama 3. (p. 27)

After that, the data generated from the knowledge probe is used to encourage the model to answer only the questions it knows about, and to refrain from answering questions it is unsure about. Implementing this technique has improved the hallucination issue over time.
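
To make the procedure concrete, here is a rough sketch of one iteration of the probe. It is my own simplification (it collapses judging into a single correctness check and skips the informativeness score), and llm() is a hypothetical helper that sends a prompt to the model and returns its text response:

def probe_snippet(snippet, llm, n_samples=4):
    # Generate a factual question about the pre-training snippet.
    question = llm('Write one factual question answerable from this text:\n' + snippet)
    # Sample several answers from the model being probed.
    answers = [llm(question) for _ in range(n_samples)]

    def is_correct(answer):
        # Judge the answer against the original snippet.
        verdict = llm(
            'Context: ' + snippet + '\nQuestion: ' + question + '\nAnswer: ' + answer +
            '\nIs the answer correct given the context? Reply Yes or No.'
        )
        return verdict.strip().lower().startswith('yes')

    # If the model is consistently wrong, add a refusal example to the training set.
    if not any(is_correct(a) for a in answers):
        return {'prompt': question, 'response': "I don't know."}
    return None  # the model already knows this fact; no refusal example needed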

Using Web Search

We have better mitigation strategies than just saying we do not know. We can provide the LLM with an opportunity to generate factual responses and accurately address the question. What would you do, in a case where I ask you a factual question that you don’t have an answer to? How do you answer the question? You could do some research and search the internet to figure out the answer to the question. Then tell me the answer to the question. We can do the same thing with LLMs.

You can think of the knowledge inside the parameters of the trained neural network as a vague recollection of things that the model saw during pretraining a long time ago. Knowledge in the model parameters is analogous to something in your memory that you read a month ago. You remember things you read repeatedly over time better than something you read only rarely. If you don’t have a good recollection of information that you read, you go and look it up. When you look up information, you are essentially refreshing your working memory with it, allowing you to retrieve and discuss it.

We need some equivalent mechanism that allows the model to refresh its memory or recollection of information. We can achieve this by introducing tools for the model. Instead of just replying with “I am sorry, I don’t know the answer”, the model can use a web search tool. To achieve this, we introduce special tokens, such as <SEARCH_START> and <SEARCH_END>, along with a protocol that defines how the model is allowed to use these tokens. When the model doesn’t know the answer, it has the option to emit the special token <SEARCH_START>, followed by the search query and then <SEARCH_END>.

Here, when the program that is sampling from the model encounters the special token <SEARCH_START> during inference, it pauses generation instead of sampling the next token in the sequence. It initiates a session with the search engine, submits the search query, and retrieves the extracted text from the results. It then inserts that text into the context window.

The extracted text from the web search is now within the context window that will be fed into the neural network. Think of the context window as the working memory of the model. The data inside the context window is directly accessible by the model. It is directly fed into the neural network. Therefore it is no longer a vague recollection of information. Now, when sampling new tokens, it can very easily reference the data that has been copy-pasted there. Thus, this is a general overview of how these web search tools function.
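
Below is a rough sketch of that inference loop. The helpers sample_next_token() and web_search() are hypothetical stand-ins for the model server and the search engine; this illustrates the protocol described above, not any provider’s actual serving code:

def generate_with_search(context, sample_next_token, web_search, max_tokens=512):
    for _ in range(max_tokens):
        token = sample_next_token(context)
        context += token
        if token == '<SEARCH_START>':
            # Keep sampling until the model closes the search query.
            query = ''
            while (next_tok := sample_next_token(context)) != '<SEARCH_END>':
                query += next_tok
                context += next_tok
            context += '<SEARCH_END>'
            # Paste the retrieved text into the context window,
            # i.e. the model's working memory.
            context += '[' + web_search(query) + ']'
    return context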

An example of a training instance with special tokens. The […] notation indicates the placeholder for the extracted content

How can we teach the model to use tools like web search correctly? Again, we accomplish this through training data. We need numerous conversations that demonstrate, by example, how the model should use web search: in what settings to search, what a search looks like, and how to start one. Because of the pretraining stage, the model already possesses a native understanding of what a web search is and what constitutes a good search query. Therefore, if your training set contains several thousand examples, the model will understand clearly how the tool works.

Conclusion

Large language model hallucinations are inherent consequences of the training pipeline, particularly arising from the supervised fine-tuning stage. Since language models are designed to generate statistically probable text, they often produce responses that appear plausible but lack a factual basis.

Early models were significantly prone to hallucinations. However, the problem has improved with the implementation of various mitigation strategies. Knowledge-probing techniques and training the model to use web search tools have proven effective in mitigating the problem. Despite these improvements, completely eliminating hallucinations remains an ongoing challenge. As LLMs continue to evolve, mitigating hallucinations to a large extent is crucial to ensuring their reliability as a trustworthy knowledge base.

If you enjoyed this article, connect with me on X (formerly Twitter) for more insights.

The post Unraveling Large Language Model Hallucinations appeared first on Towards Data Science.

]]>
How to Use an LLM-Powered Boilerplate for Building Your Own Node.js API https://towardsdatascience.com/how-to-use-an-llm-powered-boilerplate-for-building-your-own-node-js-api/ Fri, 21 Feb 2025 00:15:23 +0000 https://towardsdatascience.com/?p=598248 For a long time, one of the common ways to start new Node.js projects was using boilerplate templates. These templates help developers reuse familiar code structures and implement standard features, such as access to cloud file storage. With the latest developments in LLM, project boilerplates appear to be more useful than ever. Building on this […]

The post How to Use an LLM-Powered Boilerplate for Building Your Own Node.js API appeared first on Towards Data Science.

]]>
For a long time, one of the common ways to start new Node.js projects was using boilerplate templates. These templates help developers reuse familiar code structures and implement standard features, such as access to cloud file storage. With the latest developments in LLM, project boilerplates appear to be more useful than ever.

Building on this progress, I’ve extended my existing Node.js API boilerplate with a new tool, LLM Codegen. This standalone feature enables the boilerplate to automatically generate module code for any purpose based on text descriptions. The generated module comes complete with E2E tests, database migrations, seed data, and necessary business logic.

History

I initially created a GitHub repository for a Node.js API boilerplate to consolidate the best practices I’ve developed over the years. Much of the implementation is based on code from a real Node.js API running in production on AWS.

I am passionate about vertical slicing architecture and Clean Code principles to keep the codebase maintainable and clean. With recent advancements in LLMs, particularly their support for large contexts and their ability to generate high-quality code, I decided to experiment with generating clean TypeScript code based on my boilerplate. This boilerplate follows specific structures and patterns that I believe are of high quality. The key question was whether the generated code would follow the same patterns and structure. Based on my findings, it does.

To recap, here’s a quick highlight of the Node.js API boilerplate’s key features:

  • Vertical slicing architecture based on DDD & MVC principles
  • Service input validation using Zod
  • Decoupling application components with dependency injection (InversifyJS)
  • Integration and E2E testing with Supertest
  • Multi-service setup using Docker Compose

Over the past month, I’ve spent my weekends formalizing the solution and implementing the necessary code-generation logic. Below, I’ll share the details.

Implementation Overview

Let’s explore the specifics of the implementation. All Code Generation logic is organized at the project root level, inside the llm-codegen folder, ensuring easy navigation. The Node.js boilerplate code has no dependency on llm-codegen, so it can be used as a regular template without modification.

LLM-Codegen folder structure

It covers the following use cases:

  • Generating clean, well-structured code for a new module based on an input description. The generated module becomes part of the Node.js REST API application.
  • Creating database migrations and extending seed scripts with basic data for the new module.
  • Generating and fixing E2E tests for the new code and ensuring all tests pass.

The generated code after the first stage is clean and adheres to vertical slicing architecture principles. It includes only the necessary business logic for CRUD operations. Compared to other code generation approaches, it produces clean, maintainable, and compilable code with valid E2E tests.

The second use case involves generating DB migration with the appropriate schema and updating the seed script with the necessary data. This task is particularly well-suited for LLM, which handles it exceptionally well.

The final use case is generating E2E tests, which help confirm that the generated code works correctly. During the running of E2E tests, an SQLite3 database is used for migrations and seeds.

Mainly supported LLM clients are OpenAI and Claude.

How to Use It

To get started, navigate to the root folder llm-codegen and install all dependencies by running:

npm i

llm-codegen does not rely on Docker or any other heavy third-party dependencies, making setup and execution easy and straightforward. Before running the tool, ensure that you set at least one *_API_KEY environment variable in the .env file with the appropriate API key for your chosen LLM provider. All supported environment variables are listed in the .env.sample file (OPENAI_API_KEY, CLAUDE_API_KEY, etc.). You can use OpenAI, Anthropic Claude, or OpenRouter LLaMA. As of mid-December, OpenRouter LLaMA is surprisingly free to use. It’s possible to register here and obtain a token for free usage. However, the output quality of this free LLaMA model could be improved, as most of the generated code fails to pass the compilation stage.

To start llm-codegen, run the following command:

npm run start

Next, you’ll be asked to input the module description and name. In the module description, you can specify all necessary requirements, such as entity attributes and required operations. The core remaining work is performed by micro-agents: Developer, Troubleshooter, and TestsFixer.

Here is an example of a successful code generation:

Successful code generation

Below is another example demonstrating how a compilation error was fixed:

The following is an example of a generated orders module code:

A key detail is that you can generate code step by step, starting with one module and adding others until all required APIs are complete. This approach allows you to generate code for all required modules in just a few command runs.

How It Works

As mentioned earlier, all work is performed by these micro-agents: Developer, Troubleshooter, and TestsFixer, controlled by the Orchestrator. They run in the listed order, with the Developer generating most of the codebase. After each code generation step, a check is performed for missing files based on their roles (e.g., routes, controllers, services). If any files are missing, a new code generation attempt is made, including instructions in the prompt about the missing files and examples for each role. Once the Developer completes its work, TypeScript compilation begins. If any errors are found, the Troubleshooter takes over, passing the errors to the prompt and waiting for the corrected code. Finally, when the compilation succeeds, E2E tests are run. Whenever a test fails, the TestsFixer steps in with specific prompt instructions, ensuring all tests pass and the code stays clean.
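
To summarize that control flow, here is a simplified sketch. It is written in Python for brevity even though the actual llm-codegen implementation is TypeScript, every function name here is illustrative rather than the real API, and the missing-file check after generation is omitted:

def orchestrate(description, developer, troubleshooter, tests_fixer,
                compile_ts, run_e2e_tests, max_attempts=3):
    # Developer generates the module code from the text description.
    code = developer.generate(description)
    # Troubleshooter repairs TypeScript compilation errors, if any.
    for _ in range(max_attempts):
        errors = compile_ts(code)
        if not errors:
            break
        code = troubleshooter.fix(code, errors)
    # TestsFixer repairs failing E2E tests until they pass.
    for _ in range(max_attempts):
        failures = run_e2e_tests(code)
        if not failures:
            break
        code = tests_fixer.fix(code, failures)
    return code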

All micro-agents are derived from the BaseAgent class and actively reuse its base method implementations. Here is the Developer implementation for reference:

Each agent utilizes its specific prompt. Check out this GitHub link for the prompt used by the Developer.

After dedicating significant effort to research and testing, I refined the prompts for all micro-agents, resulting in clean, well-structured code with very few issues.

During the development and testing, it was used with various module descriptions, ranging from simple to highly detailed. Here are a few examples:

- The module responsible for library book management must handle endpoints for CRUD operations on books.
- The module responsible for the orders management. It must provide CRUD operations for handling customer orders. Users can create new orders, read order details, update order statuses or information, and delete orders that are canceled or completed. Order must have next attributes: name, status, placed source, description, image url
- Asset Management System with an "Assets" module offering CRUD operations for company assets. Users can add new assets to the inventory, read asset details, update information such as maintenance schedules or asset locations, and delete records of disposed or sold assets.

Testing with gpt-4o-mini and claude-3-5-sonnet-20241022 showed comparable output code quality, although Sonnet is more expensive. Claude Haiku (claude-3-5-haiku-20241022), while cheaper and similar in price to gpt-4o-mini, often produces non-compilable code. Overall, with gpt-4o-mini, a single code generation session consumes an average of around 11k input tokens and 15k output tokens. This amounts to a cost of approximately 2 cents per session, based on token pricing of 15 cents per 1M input tokens and 60 cents per 1M output tokens (as of December 2024).

Below are Anthropic usage logs showing token consumption:

Based on my experimentation over the past few weeks, I conclude that while there may still be some issues with passing generated tests, 95% of the time the generated code is compilable and runnable.

I hope you found some inspiration here and that it serves as a starting point for your next Node.js API or an upgrade to your current project. Should you have suggestions for improvements, feel free to contribute by submitting a PR for code or prompt updates.

If you enjoyed this article, feel free to clap or share your thoughts in the comments, whether ideas or questions. Thank you for reading, and happy experimenting!

UPDATE [February 9, 2025]: The LLM-Codegen GitHub repository was updated with DeepSeek API support. It’s cheaper than gpt-4o-mini and offers nearly the same output quality, but it has a longer response time and sometimes struggles with API request errors.

Unless otherwise noted, all images are by the author

The post How to Use an LLM-Powered Boilerplate for Building Your Own Node.js API appeared first on Towards Data Science.

]]>
A Comprehensive Guide to LLM Temperature 🔥🌡️ https://towardsdatascience.com/a-comprehensive-guide-to-llm-temperature/ Fri, 07 Feb 2025 17:18:11 +0000 https://towardsdatascience.com/?p=597544 While building my own LLM-based application, I found many prompt engineering guides, but few equivalent guides for determining the temperature setting. Of course, temperature is a simple numerical value while prompts can get mindblowingly complex, so it may feel trivial as a product decision. Still, choosing the right temperature can dramatically change the nature of […]

The post A Comprehensive Guide to LLM Temperature 🔥🌡️ appeared first on Towards Data Science.

]]>
While building my own LLM-based application, I found many prompt engineering guides, but few equivalent guides for determining the temperature setting.

Of course, temperature is a simple numerical value while prompts can get mindblowingly complex, so it may feel trivial as a product decision. Still, choosing the right temperature can dramatically change the nature of your outputs, and anyone building a production-quality LLM application should choose temperature values with intention.

In this post, we’ll explore what temperature is and the math behind it, potential product implications, and how to choose the right temperature for your LLM application and evaluate it. At the end, I hope that you’ll have a clear course of action to find the right temperature for every LLM use case.

What is temperature?

Temperature is a number that controls the randomness of an LLM’s outputs. Most APIs limit the value to be from 0 to 1 or some similar range to keep the outputs in semantically coherent bounds.

From OpenAI’s documentation:

“Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.”

Intuitively, it’s like a dial that can adjust how “explorative” or “conservative” the model is when it spits out an answer.
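
In practice, temperature is just a request parameter. For example, with the OpenAI Python SDK (a minimal sketch; the model name and prompt here are placeholders):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model='gpt-4o-mini',
    temperature=0.2,  # lower values make outputs more focused and deterministic
    messages=[{'role': 'user', 'content': 'Suggest a creative gift idea in one sentence.'}],
)
print(response.choices[0].message.content)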

What do these temperature values mean?

Personally, I find the math behind the temperature field very interesting, so I’ll dive into it. But if you’re already familiar with the innards of LLMs or you’re not interested in them, feel free to skip this section.

You probably know that an LLM generates text by predicting the next token after a given sequence of tokens. In its prediction process, it assigns probabilities to all possible tokens that could come next. For example, if the sequence passed to the LLM is “The giraffe ran over to the…”, it might assign high probabilities to words like “tree” or “fence” and lower probabilities to words like “apartment” or “book”.

But let’s back up a bit. How do these probabilities come to be?

These probabilities usually come from raw scores, known as logits, that are the results of many, many neural network calculations and other Machine Learning techniques. These logits are gold; they contain all the valuable information about what tokens could be selected next. But the problem with these logits is that they don’t fit the definition of a probability: they can be any number, positive or negative, like 2, or -3.65, or 20. They’re not necessarily between 0 and 1, and they don’t necessarily all add up to 1 like a nice probability distribution.

So, to make these logits usable, we need to use a function to transform them into a clean probability distribution. The function typically used here is called the softmax, and it’s essentially an elegant equation that does two important things:

  1. It turns all the logits into positive numbers.
  2. It scales the logits so they add up to 1.
Softmax formula
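
Spelled out, the standard softmax over logits z_1, …, z_n is:

\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}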

The softmax function works by taking each logit, raising e (around 2.718) to the power of that logit, and then dividing by the sum of all these exponentials. So the highest logit will still get the highest numerator, which means it gets the highest probability. But other tokens, even with negative logit values, will still get a chance.

Now here’s where Temperature comes in: temperature modifies the logits before applying softmax. The formula for softmax with temperature is:

Softmax with temperature
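
Written out, with T as the temperature, each logit is divided by T before the same exponentiation and normalization:

\text{softmax}_T(z_i) = \frac{e^{z_i / T}}{\sum_{j=1}^{n} e^{z_j / T}}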

When the temperature is low, dividing the logits by T makes the values larger/more spread out. Then the exponentiation would make the highest value much larger than the others, making the probability distribution more uneven. The model would have a higher chance of picking the most probable token, resulting in a more deterministic output.

When the temperature is high, dividing the logits by T makes all the values smaller/closer together, spreading out the probability distribution more evenly. This means the model is more likely to pick less probable tokens, increasing randomness.

How to choose temperature

Of course, the best way to choose a temperature is to play around with it. I believe any temperature, like any prompt, should be substantiated with example runs and evaluated against other possibilities. We’ll discuss that in the next section.

But before we dive into that, I want to highlight that temperature is a crucial product decision, one that can significantly influence user behavior. It may seem rather straightforward to choose: lower for more accuracy-based applications, higher for more creative applications. But there are tradeoffs in both directions with downstream consequences for user trust and usage patterns. Here are some subtleties that come to mind:

  • Low temperatures can make the product feel authoritative. More deterministic outputs can create the illusion of expertise and foster user trust. However, this can also lead to gullible users. If responses are always confident, users might stop critically evaluating the AI’s outputs and just blindly trust them, even if they’re wrong.
  • Low temperatures can reduce decision fatigue. If you see one strong answer instead of many options, you’re more likely to take action without overthinking. This might lead to easier onboarding or lower cognitive load while using the product. Inversely, high temperatures could create more decision fatigue and lead to churn.
  • High temperatures can encourage user engagement. The unpredictability of high temperatures can keep users curious (like variable rewards), leading to longer sessions or increased interactions. Inversely, low temperatures might create stagnant user experiences that bore users.
  • Temperature can affect the way users refine their prompts. When answers are unexpected with high temperatures, users might be driven to clarify their prompts. But with low temperatures, users may be forced to add more detail or expand on their prompts in order to get new answers.

These are broad generalizations, and of course there are many more nuances with every specific application. But in most applications, the temperature can be a powerful variable to adjust in A/B testing, something to consider alongside your prompts.

Evaluating different temperatures

As developers, we’re used to unit testing: defining a set of inputs, running those inputs through a function, and getting a set of expected outputs. We sleep soundly at night when we ensure that our code is doing what we expect it to do and that our logic is satisfying some clear-cut constraints.

The promptfoo package lets you perform the LLM-prompt equivalent of unit testing, but there’s some additional nuance. Because LLM outputs are non-deterministic and often designed to do more creative tasks than strictly logical ones, it can be hard to define what an “expected output” looks like.

Defining your “expected output”

The simplest evaluation tactic is to have a human rate how good they think some output is, according to some rubric. For outputs where you’re looking for a certain “vibe” that you can’t express in words, this will probably be the most effective method.

Another simple evaluation tactic is to use deterministic metrics — these are things like “does the output contain a certain string?” or “is the output valid json?” or “does the output satisfy this javascript expression?”. If your expected output can be expressed in these ways, promptfoo has your back.

A more interesting, AI-age evaluation tactic is to use LLM-graded checks. These essentially use LLMs to evaluate your LLM-generated outputs, and can be quite effective if used properly. Promptfoo offers these model-graded metrics in multiple forms. The whole list is here, and it contains assertions from “is the output relevant to the original query?” to “compare the different test cases and tell me which one is best!” to “where does this output rank on this rubric I defined?”.

Example

Let’s say I’m creating a consumer-facing application that comes up with creative gift ideas and I want to empirically determine what temperature I should use with my main prompt.

I might want to evaluate metrics like relevance, originality, and feasibility within a certain budget and make sure that I’m picking the right temperature to optimize those factors. If I’m comparing GPT-4o-mini’s performance with temperatures of 0 vs. 1, my test file might start like this:

providers:
  - id: openai:gpt-4o-mini
    label: openai-gpt-4o-mini-lowtemp
    config:
      temperature: 0
  - id: openai:gpt-4o-mini
    label: openai-gpt-4o-mini-hightemp
    config:
      temperature: 1
prompts:
  - "Come up with a one-sentence creative gift idea for a person who is {{persona}}. It should cost under {{budget}}."

tests:
  - description: "Mary - attainable, under budget, original"
    vars:
      persona: "a 40 year old woman who loves natural wine and plays pickleball"
      budget: "$100"
    assert:
      - type: g-eval
        value:
          - "Check if the gift is easily attainable and reasonable"
          - "Check if the gift is likely under $100"
          - "Check if the gift would be considered original by the average American adult"
  - description: "Sean - answer relevance"
    vars:
      persona: "a 25 year old man who rock climbs, goes to raves, and lives in Hayes Valley"
      budget: "$50"
    assert:
      - type: answer-relevance
        threshold: 0.7

I’ll probably want to run the test cases repeatedly to test the effects of temperature changes across multiple same-input runs. In that case, I would use the repeat param like:

promptfoo eval --repeat 3

Conclusion

Temperature is a simple numerical parameter, but don’t be deceived by its simplicity: it can have far-reaching implications for any LLM application.

Tuning it just right is key to getting the behavior you want — too low, and your model plays it too safe; too high, and it starts spouting unpredictable responses. With tools like promptfoo, you can systematically test different settings and find your Goldilocks zone — not too cold, not too hot, but just right.

The post A Comprehensive Guide to LLM Temperature 🔥🌡️ appeared first on Towards Data Science.

]]>
DeepSeek-V3 Explained 1: Multi-head Latent Attention https://towardsdatascience.com/deepseek-v3-explained-1-multi-head-latent-attention-ed6bee2a67c4/ Fri, 31 Jan 2025 10:02:05 +0000 https://towardsdatascience.com/deepseek-v3-explained-1-multi-head-latent-attention-ed6bee2a67c4/ Key architecture innovation behind DeepSeek-V2 and DeepSeek-V3 for faster inference

The post DeepSeek-V3 Explained 1: Multi-head Latent Attention appeared first on Towards Data Science.

]]>
Image created by author using ChatGPT.

This is the first article of our new series "DeepSeek-V3 Explained", where we will try to demystify DeepSeek-V3 [1, 2], the latest model open-sourced by DeepSeek.

In this series, we aim to cover two major topics:

  • Major architecture innovations in DeepSeek-V3, including MLA (Multi-head Latent Attention) [3], DeepSeekMoE [4], auxiliary-loss-free load balancing [5], and multi-token prediction training.
  • Training of DeepSeek-V3, including pre-training, finetuning and RL alignment phases.

This article mainly focuses on Multi-head Latent Attention, which was first proposed in the development of DeepSeek-V2 and then used in DeepSeek-V3 as well.

Table of contents:

  • Background: we start from the standard MHA, explain why we need the Key-Value cache at the inference stage, how MQA and GQA try to optimize it, and how RoPE works.
  • Multi-head Latent Attention: An in-depth introduction to MLA, including its motivations, why decoupled RoPE is needed, and its performance.
  • References.

Background

To better understand MLA and also make this article self-contained, we will revisit several related concepts in this section before diving into the details of MLA.

MHA in Decoder-only Transformers

Note that MLA was developed to speed up inference in autoregressive text generation, so the MHA we discuss in this context is for decoder-only Transformers.

The figure below compares three Transformer architectures used for decoding, where (a) shows both the encoder and decoder proposed in the original "Attention is All You Need" paper. Its decoder part was then simplified by [7], leading to the decoder-only Transformer model shown in (b), which is later used in many generation models like GPT [8].

Nowadays, LLMs more commonly adopt the structure shown in (c) for more stable training, with normalization applied to the input rather than the output, and LayerNorm upgraded to RMS Norm. This will serve as the baseline architecture we discuss in this article.

Figure 1. Transformer architectures. (a) encoder-decoder proposed in [6]. (b) Decoder-only Transformer proposed in [7] and used in GPT [8]. (c) An optimized version of (b) with RMS Norm before attention. [3]

Within this context, MHA calculation largely follows the process in [6], as shown in the figure below:

Figure 2. Scaled dot-product attention vs. Multi-Head Attention. Image from [6].

Assume we have n_h attention heads, and the dimension of each attention head is d_h, so that the concatenated dimension is (n_h · d_h).

Given a model with l layers, if we denote the input for the t-th token at a given layer as h_t with dimension d, we need to map the dimension of h_t from d to (n_h · d_h) using the linear mapping matrices.

More formally, we have (equations from [3]):

where W^Q, W^K and W^V are the linear mapping matrices:

After such mapping, q_t, k_t and v_t will be split into n_h heads to calculate the scaled dot-product attention:

where W^O is another projection matrix that maps the dimension back from (n_h · d_h) to d:
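Putting these mappings together, the standard MHA computation can be sketched as follows (my own rendering in common notation; the exact symbols and equation numbering in [3] may differ slightly):

q_t = W^Q h_t, \quad k_t = W^K h_t, \quad v_t = W^V h_t

[q_{t,1}; q_{t,2}; \dots; q_{t,n_h}] = q_t, \quad [k_{t,1}; \dots; k_{t,n_h}] = k_t, \quad [v_{t,1}; \dots; v_{t,n_h}] = v_t

o_{t,i} = \sum_{j=1}^{t} \mathrm{Softmax}_j\!\left( \frac{q_{t,i}^{\top} k_{j,i}}{\sqrt{d_h}} \right) v_{j,i}

u_t = W^O [o_{t,1}; o_{t,2}; \dots; o_{t,n_h}]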

Note that the process described above (Eqn. (1) to (8) in [3]) is just for a single token. During inference, we need to repeat this process for each newly generated token, which involves a lot of repeated calculation. This leads to a technique called Key-Value cache.

Key-Value Cache

As suggested by its name, Key-Value cache is a technique designed to speed up the autoregressive process by caching and reusing the previous keys and values, rather than re-computing them at each decoding step.

Note that KV cache is typically used only during the inference stage, since in training we still need to process the entire input sequence in parallel.

KV cache is commonly implemented as a rolling buffer. At each decoding step, only the new query Q is computed, while the K and V stored in the cache will be reused, so that the attention will be computed using the new Q and reused K, V. Meanwhile, the new token’s K and V will also be appended to the cache for later use.
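To make the mechanics concrete, here is a minimal sketch of a per-layer KV cache at decoding time; the class and function names are hypothetical illustrations, not any particular library's API:

import torch

class KVCache:
    """Minimal per-layer KV cache: append the new token's K/V and reuse the rest."""
    def __init__(self):
        self.k, self.v = None, None

    def update(self, k_new, v_new):  # shapes: (batch, n_heads, 1, d_head)
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        return self.k, self.v

def decode_step(q_new, k_new, v_new, cache):
    """One decoding step: the new query attends over all cached keys and values."""
    k, v = cache.update(k_new, v_new)
    scores = q_new @ k.transpose(-2, -1) / (q_new.size(-1) ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

Only the new token's q, k and v are computed at each step; everything else is read from the cache.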

However, the speedup achieved by the KV cache comes at a memory cost: the cache grows with batch size × sequence length × number of layers × number of heads × head dimension (for both keys and values), leading to a memory bottleneck when we have larger batch sizes or longer sequences.

That further leads to two techniques aiming at addressing this limitation: Multi-Query Attention and Grouped-Query Attention.

Multi-Query Attention (MQA) vs Grouped-Query Attention (GQA)

The figure below shows the comparison between the original MHA, Grouped-Query Attention (GQA) [10] and Multi-Query Attention (MQA) [9].

Figure 3. MHA [6], GQA [10] AND MQA [9]. Image from [10].

The basic idea of MQA is to share a single key and a single value head across all query heads, which can significantly reduce memory usage but will also impact the accuracy of attention.

GQA can be seen as an interpolation between MHA and MQA, where a single pair of key and value heads is shared by a group of query heads rather than by all of them. Even so, this still leads to inferior results compared to MHA.
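To make the memory difference tangible, here is a rough back-of-the-envelope comparison of per-token KV-cache size; the hyperparameters below are illustrative, not taken from any specific model:

def kv_cache_per_token(n_layers, n_kv_heads, d_head):
    # elements cached per token, counting both keys and values across all layers
    return 2 * n_layers * n_kv_heads * d_head

n_layers, n_heads, d_head = 32, 32, 128
print(kv_cache_per_token(n_layers, n_heads, d_head))  # MHA: one K/V pair per query head
print(kv_cache_per_token(n_layers, 8, d_head))        # GQA: 8 shared K/V groups
print(kv_cache_per_token(n_layers, 1, d_head))        # MQA: a single shared K/V head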

In the later sections, we will see how MLA manages to seek a balance between memory efficiency and modeling accuracy.

RoPE (Rotary Positional Embeddings)

One last piece of background we need to mention is RoPE [11], which encodes positional information directly into the attention mechanism by rotating the query and key vectors in multi-head attention using sinusoidal functions.

More specifically, RoPE applies a position-dependent rotation matrix to the query and key vectors of each token; it uses sine and cosine functions as its basis, but applies them as a rotation rather than as an additive embedding.

To see what makes it position-dependent, consider a toy embedding vector with only 4 elements, i.e., (x_1, x_2, x_3, x_4).

To apply RoPE, we first group consecutive dimensions into pairs:

  • (x_1, x_2) -> position 1
  • (x_3, x_4) -> position 2

Then, we apply a rotation matrix to rotate each pair:

Figure 4. Illustration of the rotation matrix applied to a pair of tokens. Image by author.

where θ = θ(p) = p ⋅ θ_0​, and θ_0​ is a base frequency. In our 4-d toy example, this means that (x_1, x_2) will be rotated by θ_0​, and (x_3, x_4) will be rotated by 2 ⋅ θ_0.
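Written out, the rotation of a single pair by angle θ is just the familiar 2-D rotation (a sketch of the operation shown in Figure 4, not the paper's exact notation):

\begin{pmatrix} x'_1 \\ x'_2 \end{pmatrix} =
\begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}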

This is why we call this rotation matrix position-dependent: at each position (or each pair), we apply a different rotation matrix whose angle is determined by the position.

RoPE is widely used in modern LLMs due to its efficiency in encoding long sequences, but as we can see from the formula above, it is applied to both Q and K in a position-dependent way, which creates a conflict with MLA, as we will see below.


Multi-head Latent Attention

Finally we can move on to MLA itself. In this section we will first lay out the high-level idea of MLA, then dive deeper into why it needs to modify RoPE, and finally present the detailed algorithm of MLA as well as its performance.

MLA: High-level Idea

The basic idea of MLA is to compress the attention input h_t into a low-dimensional latent vector with dimension d_c, where d_c is much lower than the original (n_h · d_h). Later, when we need to calculate attention, we can map this latent vector back to the high-dimensional space to recover the keys and values. As a result, only the latent vector needs to be stored, leading to significant memory reduction.

This process can be more formally described with the following equations, where c^{KV}_t is the latent vector, W^{DKV} is the compression matrix that maps h_t's dimension from (n_h · d_h) down to d_c (here D in the superscript stands for "down-projection", meaning compressing the dimension), while W^{UK} and W^{UV} are both up-projection matrices that map the shared latent vector back to the high-dimensional space.

Similarly, we can also map the queries into a latent, low-dimensional vector and then map it back to the original, high-dimensional space:
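In compact form, the compression and up-projection steps for keys, values and queries can be sketched as follows (my rendering of the equations referenced above; the key-value superscripts follow the naming in the text, and the query-side matrices are labeled analogously):

c^{KV}_t = W^{DKV} h_t, \quad k^{C}_t = W^{UK} c^{KV}_t, \quad v^{C}_t = W^{UV} c^{KV}_t

c^{Q}_t = W^{DQ} h_t, \quad q^{C}_t = W^{UQ} c^{Q}_t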

Why Decoupled RoPE is Needed

As we mentioned before, RoPE is a common choice for training generation models to handle long sequences. If we directly apply the above MLA strategy, that will be incompatible with RoPE.

To see this more clearly, consider what happens when we calculate attention using Eqn. (7): when we multiply the transposed q with k, the matrices W^Q and W^{UK} appear in the middle, and their product is equivalent to a single mapping from d_c to d.

In the original paper [3], the authors describe this as W^{UK} being "absorbed" into W^Q; as a result, we do not need to store W^{UK} in the cache, further reducing memory usage.

However, this will not be the case once we take the rotation matrix in Figure 4 into consideration, as RoPE will apply a rotation matrix on the left of W^{UK}, and this rotation matrix will end up in between the transposed W^Q and W^{UK}.

As we have explained in the background part, this rotation matrix is position-dependent, meaning the rotation matrix for each position is different. As a result, W^{UK} can no longer be absorbed by W^Q.
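The two situations can be written side by side (a sketch, where R_t denotes the RoPE rotation matrix at position t). Without RoPE, the up-projection can be merged into the query mapping:

q_t^{\top} k_j = (W^{Q} h_t)^{\top} (W^{UK} c^{KV}_j) = h_t^{\top} \left[ (W^{Q})^{\top} W^{UK} \right] c^{KV}_j

so the bracketed product can be precomputed once. With RoPE applied to queries and keys, however,

q_t^{\top} k_j = h_t^{\top} (W^{Q})^{\top} R_t^{\top} R_j W^{UK} c^{KV}_j

and since R_t^{\top} R_j depends on the positions t and j, the matrices in the middle can no longer be collapsed into a single fixed mapping.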

To address this conflict, the authors propose what they call "decoupled RoPE": they introduce additional query vectors along with a shared key vector, use these additional vectors only in the RoPE process, and keep the original keys isolated from the rotation matrix.

The entire process of MLA can be summarized as below (equation numbers reused from Appendix C in [3]):

Figure 5. MLA process. Image edited by author based on equations in [3].

where

  • Eqn. (37) to (40) describe how to process query tokens.
  • Eqn. (41) and (42) describe how to process key tokens.
  • Eqn. (43) and (44) describe how to use the additional shared key for RoPE, be aware that the output of (42) is not involved in RoPE.
  • Eqn. (45) describes how to process value tokens.

During this process, only the blue variables with boxes need to be cached. This process can be illustrated more clearly with the flowchart below:

Figure 6. Flowchart of MLA. Image from [3].

Performance of MLA

The table below compares the number of elements needed for the KV cache (per token) as well as the modeling capacity of MHA, GQA, MQA and MLA, demonstrating that MLA can indeed achieve a better balance between memory efficiency and modeling capacity.

Interestingly, the modeling capacity of MLA even surpasses that of the original MHA.

Table 1 from [3].

More specifically, the table below shows the performance of MHA, GQA and MQA on 7B models, where MHA significantly outperforms both MQA and GQA.

Table 8 from [3].

The authors of [3] also compare MHA and MLA directly, with results summarized in the table below, where MLA achieves better results overall.

Table 9 in [3].

References

The post DeepSeek-V3 Explained 1: Multi-head Latent Attention appeared first on Towards Data Science.

]]>
Prompting Vision Language Models https://towardsdatascience.com/prompting-with-vision-language-models-bdabe00452b7/ Wed, 29 Jan 2025 14:02:01 +0000 https://towardsdatascience.com/prompting-with-vision-language-models-bdabe00452b7/ Exploring techniques to prompt VLMs

The post Prompting Vision Language Models appeared first on Towards Data Science.

]]>
Vision Language Models (VLMs) represent a significant advancement in processing and understanding multimodal data by combining text and visual inputs.

High Level Overview of VLMs. The picture of the cute dog is from Josh Frenette on Unsplash. This image is inspired by the representation of VLMs provided in this blog from HuggingFace (https://huggingface.co/blog/vlms) (Overall Image By Author)

Unlike Large Language Models (LLMs), which handle only text, VLMs are multimodal, enabling users to address tasks requiring both visual and textual comprehension. This capability opens up a wide range of applications, such as Visual Question Answering (VQA), where models answer questions based on images, and Image Captioning, which involves generating descriptive text for images. In this blog post, I explain how we can prompt VLMs for tasks requiring visual understanding, and the different ways in which we can do so.

Table of Contents

  1. Introduction
  2. Prompting with VLMs
  3. Zero-Shot Prompting
  4. Few-Shot Prompting
  5. Chain of Thought Prompting
  6. Object Detection Guided Prompting
  7. Conclusion
  8. References

Introduction

VLMs are extensions of existing LLMs – particularly in that they process vision as an additional modality. VLMs are typically trained with mechanisms that align image and text representations in the same representation or vector space, using techniques like cross-attention [1] [2] [3] [4]. The advantage of such systems is that you can interact with or "query" images through text as a convenient interface. Because of their multi-modal capacity, VLMs are crucial for bridging the gap between textual and visual data, opening up a large range of use cases that text-only models can’t address. For a more in-depth understanding of how VLMs work, I recommend Sebastian Raschka’s excellent article on multi-modal LLMs.

Prompting with VLMs

In an earlier blog, I touched upon techniques for prompting LLMs. Similar to LLMs, VLMs can also be prompted using similar techniques – with the added advantage of incorporating images to help models better understand the task at hand. In this blog, I talk about the common prompting paradigms that apply to VLMs, covering zero-shot, few-shot and chain-of-thought prompting. I also explore how other deep learning methods can help guide in prompting VLMs – such as integrating object detection into prompting strategies. For my experiments, I use OpenAI’s GPT-4o-mini model which is a VLM.

All code and resources used in this blog post are available at this link on github.

Data Used

For this blogpost, I use 5 images downloaded from Unsplash that are permissively licensed for usage. Specifically, I use the following images:

  1. Photo by Mathias Reding on Unsplash
  2. Photo by Josh Frenette on Unsplash
  3. Photo by sander traa on Unsplash
  4. Photo by NEOM on Unsplash
  5. Photo by Alexander Zaytsev on Unsplash

The captions for these images are derived from the image urls, as Unsplash images come with the corresponding captions for each image.

Zero-Shot Prompting

Zero-Shot Prompting Representation (Image by Author)

Zero-Shot Prompting represents a scenario where the user only provides a description of the task through the system and user prompt, along with the image(s) to be processed. In this setup, the VLM relies only on the task description to generate an output. If we were to categorize prompting methods on a spectrum based on the amount of information provided to the VLM, zero-shot prompting presents the least amount of information to the model.

The advantage for users is that, for most tasks, a well-crafted zero-shot prompt can produce decent outputs simply by describing the task in text. Consider the implications of this: just a few years ago, tasks like image classification or image captioning required training a CNN or deep learning model on a large dataset before they could be used. Now, you can perform these tasks by utilizing text to describe them. You can create an off-the-shelf image classifier by specifying and describing what you want to analyze. Before VLMs, achieving this would have required gathering a large, task-specific dataset, training a model, and then using it for inference.

So how do we prompt them? OpenAI supports sending images as Base64-encoded URLs to the VLM [6]. The general request structure looks like this:

{
  "role": "system",
  "content": "You are a helpful assistant that can analyze images and provide captions."
},
{
  "role": "user",
  "content": [
    {
      "type": "text",
      "text": "Please analyze the following image:"
    },
    {
      "type": "image_url",
      "image_url": {
        "url": "data:image/jpeg;base64,{base64_image}",
        "detail": "detail"
      }
    }
  ]
}

The structure is largely the same as how you would prompt a normal LLM with OpenAI. However, the key difference is in adding an image to the request, which needs to be encoded as a Base64 string. This is where it gets a bit interesting – though I’m using only one image as an input here, there is nothing stopping me from attaching multiple images to the same request.
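For instance, a user message carrying two images might look like the sketch below (the file paths are placeholders, and encode_image is the helper defined in the next snippet):

multi_image_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Please compare the following two images:"},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encode_image('images/photo_a.jpg')}", "detail": "auto"}
        },
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encode_image('images/photo_b.jpg')}", "detail": "auto"}
        },
    ]
}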

Let’s implement helper functions to do zero-shot prompting. To make the experiments faster, I parallelize the requests sent to the OpenAI API. I implement helpers for constructing the prompts and calling the model for getting the outputs. This includes a function to encode an image as a base64 string, and functions for constructing the prompt and calling the VLM through OpenAI’s API.

def encode_image(image_path):
    """
    Encodes an image to a base64 string.

    Args:
        image_path (str): The file path of the image to encode.

    Returns:
        str: Base64-encoded string of the image, or None if an error occurs.
    """    
    try:
        with open(image_path, "rb") as image_file:
            return base64.b64encode(image_file.read()).decode("utf-8")
    except Exception as e:
        print(f"Error encoding image: {e}")
        return None

def llm_chat_completion(messages, model="gpt-4o-mini", max_tokens=300, temperature=0.0):
    """
    Calls OpenAI's ChatCompletion API with the specified parameters.

    Args:
        messages (list): A list of message dictionaries for the conversation.
        model (str): The model to use for the chat completion.
        max_tokens (int, optional): Maximum tokens for the response. Defaults to 300.
        temperature (float, optional): Sampling temperature for randomness. Defaults to 0.0.

    Returns:
        str or None: The response content from the API, or None if an error occurs.
    """
    try:
        response =  client.chat.completions.create(
            model=model,
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error calling LLM: {e}")
        return None

def build_few_shot_messages(few_shot_prompt, user_prompt = "Please analyze the following image:", detail="auto"):
    """
    Generates few-shot example messages from image-caption pairs.

    Args:
        few_shot_prompt (dict): A dictionary mapping image paths to their captions.
        detail (str, optional): Level of image detail to include. Defaults to "auto".

    Returns:
        list: A list of few-shot example messages.
    """
    few_shot_messages = []
    for path, data in few_shot_prompt.items():
        base64_image = encode_image(path)
        if not base64_image:
            continue  # skip if failed to encode
        caption = data

        few_shot_messages.append(
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": user_prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}",
                            "detail": detail
                        }
                    },
                ]
            }
        )
        few_shot_messages.append({"role": "assistant", "content": caption})
    return few_shot_messages

def build_user_message(image_path, user_prompt="Please analyze the following image:", detail="auto"):
    """
    Creates a user message for analyzing a single image.

    Args:
        image_path (str): Path to the image file.
        detail (str, optional): Level of image detail to include. Defaults to "auto".

    Returns:
        dict or None: The user message dictionary, or None if image encoding fails.
    """
    base64_image = encode_image(image_path)
    if not base64_image:
        return None

    return {
        "role": "user",
        "content": [
            {"type": "text", "text": user_prompt},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{base64_image}",
                    "detail": detail
                }
            },
        ]
    }

I now tie all of this together and define a function that takes as input the image path, along with the necessary parameters (system prompt, user prompt and other hyperparameters), invokes the VLM and returns the caption generated for the image.

def get_image_caption(
    image_path,
    few_shot_prompt=None,
    system_prompt="You are a helpful assistant that can analyze images and provide captions.",
    user_prompt="Please analyze the following image:",
    model="gpt-4o-mini",
    max_tokens=300,
    detail="auto",
    llm_chat_func=llm_chat_completion,
    temperature=0.0
):
    """
    Gets a caption for an image using a LLM.

    Args:
        image_path (str): File path of the image to be analyzed.
        few_shot_prompt (dict, optional): Maps image paths to their captions, used as few-shot examples.
        system_prompt (str, optional): Initial system prompt for the LLM.
        user_prompt (str, optional): User prompt for the LLM.
        model (str, optional): LLM model name (default "gpt-4o-mini").
        max_tokens (int, optional): Max tokens in the response (default 300).
        detail (str, optional): Level of detail for the image analysis (default "auto").
        llm_chat_func (callable, optional): Function to call the LLM. Defaults to `llm_chat_completion`.
        temperature (float, optional): Sampling temperature (default 0.0).

    Returns:
        str or None: The generated caption, or None on error.
    """
    try:
        user_message = build_user_message(image_path, user_prompt=user_prompt, detail=detail)
        if not user_message:
            return None

        # Build message sequence
        messages = [{"role": "system", "content": system_prompt}]

        # Include few-shot examples if provided
        if few_shot_prompt:
            few_shot_messages = build_few_shot_messages(few_shot_prompt, detail=detail)
            messages.extend(few_shot_messages)

        messages.append(user_message)

        # Call the LLM
        response_text = llm_chat_func(
            model=model,
            messages=messages,
            max_tokens=max_tokens,
            temperature=temperature
        )
        return response_text

    except Exception as e:
        print(f"Error getting caption: {e}")
        return None

def process_images_in_parallel(
    image_paths, 
    model="gpt-4o-mini", 
    system_prompt="You are a helpful assistant that can analyze images and provide captions.", 
    user_prompt="Please analyze the following image:", 
    few_shot_prompt = None, 
    max_tokens=300, 
    detail="auto", 
    max_workers=5):
    """
    Processes a list of images in parallel to generate captions using a specified model.

    Args:
        image_paths (list): List of file paths to the images to be processed.
        model (str): The model to use for generating captions (default is "gpt-4o-mini").
        system_prompt (str): System prompt passed to the model.
        user_prompt (str): User prompt accompanying each image.
        few_shot_prompt (dict, optional): Maps image paths to their captions for few-shot examples.
        max_tokens (int): Maximum number of tokens in the generated captions (default is 300).
        detail (str): Level of detail for the image analysis (default is "auto").
        max_workers (int): Number of threads to use for parallel processing (default is 5).

    Returns:
        dict: A dictionary where keys are image paths and values are their corresponding captions.
    """    
    captions = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Pass additional arguments using a lambda or partial
        future_to_image = {
            executor.submit(
                get_image_caption, 
                image_path, 
                few_shot_prompt, 
                system_prompt, 
                user_prompt, 
                model, 
                max_tokens, 
                detail): image_path
            for image_path in image_paths
        }

        # Use tqdm to track progress
        for future in tqdm(as_completed(future_to_image), total=len(image_paths), desc="Processing images"):
            image_path = future_to_image[future]
            try:
                caption = future.result()
                captions[image_path] = caption
            except Exception as e:
                print(f"Error processing {image_path}: {e}")
                captions[image_path] = None
    return captions

We now run the zero-shot prompting setup and look at the captions generated for two of the images.

from tqdm import tqdm
import os

IMAGE_QUALITY = "high"
PATH_TO_SAMPLES = "images/"

system_prompt = """You are an AI assistant that provides captions of images. 
You will be provided with an image. Analyze the content, context, and notable features of the images.
Provide a concise caption that covers the important aspects of the image."""

user_prompt = "Please analyze the following image:"

image_paths = [os.path.join(PATH_TO_SAMPLES, x) for x in os.listdir(PATH_TO_SAMPLES)]

zero_shot_high_quality_captions = process_images_in_parallel(image_paths, model = "gpt-4o-mini", system_prompt=system_prompt, user_prompt = user_prompt, few_shot_prompt= None, detail=IMAGE_QUALITY, max_workers=5)

Outputs from the Zero-Shot Setup. Images in this picture are taken by Josh Frenette on [Unsplash](https://unsplash.com/photos/a-building-with-a-lot-of-plants-growing-on-the-side-of-it-E70gGcHNHj0?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash) and by Alexander Zaytsev on Unsplash (Overall Image by Author).

We observe that the model produces detailed captions for the provided images, covering all the important aspects.

Few-Shot Prompting

Few-Shot Prompting involves providing examples or demonstrations of the task as context to the VLM, giving the model additional reference and context about the task to be performed.

Representation of Few-Shot Prompting (Image by Author)

The advantage of few-shot prompting is that it provides additional grounding for the task compared to zero-shot prompting. Let’s look at how this plays out in our task. First, consider our existing implementation:

def build_few_shot_messages(few_shot_prompt, user_prompt = "Please analyze the following image:", detail="auto"):
    """
    Generates few-shot example messages from image-caption pairs.

    Args:
        few_shot_prompt (dict): A dictionary mapping image paths to their captions.
        detail (str, optional): Level of image detail to include. Defaults to "auto".

    Returns:
        list: A list of few-shot example messages.
    """
    few_shot_messages = []
    for path, data in few_shot_prompt.items():
        base64_image = encode_image(path)
        if not base64_image:
            continue  # skip if failed to encode
        caption = data

        few_shot_messages.append(
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": user_prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/jpeg;base64,{base64_image}",
                            "detail": detail
                        }
                    },
                ]
            }
        )
        few_shot_messages.append({"role": "assistant", "content": caption})
    return few_shot_messages

I use 3 images as few-shot examples for the VLM, and then run our system.

image_captions = json.load(open("image_captions.json"))
FEW_SHOT_EXAMPLES_PATH = "few_shot_examples/"
few_shot_samples = os.listdir(FEW_SHOT_EXAMPLES_PATH)
few_shot_captions = {os.path.join(FEW_SHOT_EXAMPLES_PATH,k):v for k,v in  image_captions.items() if k in few_shot_samples}

IMAGE_QUALITY = "high"
few_shot_high_quality_captions = process_images_in_parallel(image_paths, model = "gpt-4o-mini", few_shot_prompt= few_shot_captions, detail=IMAGE_QUALITY, max_workers=5)

Few-Shot Prompting outputs. Images in this picture are taken by Josh Frenette on [Unsplash](https://unsplash.com/photos/a-building-with-a-lot-of-plants-growing-on-the-side-of-it-E70gGcHNHj0?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash) and by Alexander Zaytsev on Unsplash (Overall Image by Author).

Notice the captions generated by the VLM when using the few-shot examples. These captions are much shorter and more concise than the captions generated in the zero-shot prompting setting. Let’s look at what the captions for these few-shot examples look like:

The few-shot images used and their corresponding captions. Images in this picture are by Mathias Reding on [[Unsplash](https://unsplash.com/photos/a-man-standing-next-to-a-tent-in-the-desert-ckfXPMb2_BI?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash)](https://unsplash.com/photos/a-white-car-driving-on-a-desert-road-nOhLRqb5AV0?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash), sander traa on Unsplash, NEOM on Unsplash. (Image by author)

This difference is primarily due to the influence of the few-shot examples. The captions in the few-shot examples clearly influence the output by the VLM for the new images we use, leading to more concise and shorter captions. This illustrates how powerful the selection of few-shot examples can be in shaping the behavior of the VLM. Users can direct the model’s output to more closely match their preferred style, tone, or amount of detail by carefully choosing few-shot samples.

Chain of Thought Prompting

Chain of Thought (CoT) prompting [9], initially developed for LLMs, enables them to tackle complex problems by breaking them into simpler, intermediate steps and encouraging the model to "think" before answering. This technique can be applied to VLMs without any adjustments. The primary difference is that the VLM can use both image and text as inputs when reasoning under CoT to answer the question.

To implement this, I take the 3 examples I picked for few-shot prompting and construct a CoT trace for each of them by first using OpenAI’s o1 model (a multimodal reasoning model). I then use these CoT reasoning traces as my few-shot examples for prompting the VLM. For example, the CoT trace for one of the images looks like this:

CoT trace for an image. Image in this picture is from Mathias Reding on Unsplash (Image by Author)

The few-shot CoT traces I created for the images contain detailed reasoning for each aspect of the image that is important for creating the final caption.

few_shot_samples_1_cot = """Observations:
Visible Blur: The foreground and parts of the image are out of focus or blurred, indicating either camera movement or subject motion during the shot.
Tall, Ornate Buildings: The structures have multiple floors, detailed balconies, and decorative facades, suggesting older or classic urban architecture.
Street-Level View: Parked cars line both sides of a road or narrow street, confirming an urban environment with typical city traffic and infrastructure.
Soft, Warm Light: The sunlight appears to be hitting the buildings at an angle, creating a warm glow on the façade and enhancing the sense of a real city scene rather than a staged setup.

Final Caption: A blurry photo of a city street with buildings."""

few_shot_samples_2_cot = """Observations:  
Elevated Desert Dunes: The landscape is made up of large, rolling sand dunes in a dry, arid environment.  
Off-Road Vehicle: The white SUV appears equipped for travel across uneven terrain, indicated by its size and ground clearance.  
Tire Tracks in the Sand: Visible tracks show recent movement across the dunes, reinforcing that the vehicle is in motion and navigating a desert path.  
View from Inside Another Car: The dashboard and windshield framing in the foreground suggest the photo is taken from a passenger's or driver's perspective, following behind or alongside the SUV.  

Final Caption: A white car driving on a desert road."""

few_shot_samples_3_cot = """Observations:
Towering Rock Formations: The steep canyon walls suggest a rugged desert landscape, with sandstone cliffs rising on both sides.
Illuminated Tents: Two futuristic-looking tents emit a soft glow, indicating a nighttime scene with lights or lanterns inside.
Starry Night Sky: The visible stars overhead reinforce that this is an outdoor camping scenario after dark.
Single Male Figure: A man, seen from behind, stands near one of the tents, indicating he is likely part of the camping group.

Final Caption: A man standing next to a tent in the desert."""

We run the CoT prompt setup for the VLM and obtain the results for these images:

import copy

few_shot_samples_cot = copy.deepcopy(few_shot_captions)

few_shot_samples_cot["few_shot_examples/photo_1.jpg"] = few_shot_samples_1_cot
few_shot_samples_cot["few_shot_examples/photo_3.jpg"] = few_shot_samples_2_cot
few_shot_samples_cot["few_shot_examples/photo_4.jpg"] = few_shot_samples_3_cot

IMAGE_QUALITY = "high"
cot_high_quality_captions = process_images_in_parallel(image_paths, model = "gpt-4o-mini", few_shot_prompt= few_shot_samples_cot, detail=IMAGE_QUALITY, max_workers=5)

Outputs obtained from Chain-of-Thought prompting. Images in this picture are taken by Josh Frenette on [Unsplash](https://unsplash.com/photos/a-building-with-a-lot-of-plants-growing-on-the-side-of-it-E70gGcHNHj0?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash) and by Alexander Zaytsev on Unsplash (Image by Author)

The outputs from the few-shot CoT setup show that the VLM is now capable of breaking down the image into intermediate observations before arriving at the final caption. This approach highlights the power of CoT prompting in influencing the behaviour of the VLM for a given task.

Object Detection Guided Prompting

An interesting implication of having models capable of processing images directly and recognizing various aspects is that it opens up numerous opportunities to guide VLMs in performing tasks more efficiently and implementing "feature engineering". If you think about it, you’ll quickly realize that you could provide images with embedded metadata that is relevant to the VLM’s task.

An example of this is to use object detection as an additional component in prompting VLMs. Such techniques are also explored in detail in this blog from AWS [10]. Here, I leverage an object detection model as an additional component in the prompting pipeline for the VLM.

An overview of the Object Detection Guided Prompting with VLMs. Image in this picture is taken by Josh Frenette on Unsplash (Image by Author)

Generally, an object detection model is trained with a fixed vocabulary, meaning it can only recognize a predefined set of object categories. However, in our pipeline, since we can’t predict in advance which objects will appear in the image, we need an object detection model that is versatile and capable of recognizing a wide range of object classes. To achieve this, I use the OWL-ViT model [11], an open-vocabulary object detection model. This model requires text prompts that specify the objects to be detected.

Another challenge that needs to be addressed is obtaining a high-level idea of the objects present in the image before utilizing the OWL-ViT model, as it requires a text prompt describing the objects. This is where VLMs come to the rescue! First, we pass the image to the VLM with a prompt to identify the high-level objects in the image. These detected objects are then used as text prompts, along with the image, for the OWL-ViT model to generate detections. Next, we plot the detections as bounding boxes on the same image and pass this updated image to the VLM, prompting it to generate a caption. The code for inference is partially adapted from [12].

# Load model directly
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

processor = AutoProcessor.from_pretrained("google/owlvit-base-patch32")
model = AutoModelForZeroShotObjectDetection.from_pretrained("google/owlvit-base-patch32")

I detect the objects present in each image using the VLM:

IMAGE_QUALITY = "high"
system_prompt_object_detection = """You are provided with an image. You must identify all important objects in the image, and provide a standardized list of objects in the image.
Return your output as follows:
Output: object_1, object_2"""

user_prompt = "Extract the objects from the provided image:"

detected_objects = process_images_in_parallel(image_paths, system_prompt=system_prompt_object_detection, user_prompt=user_prompt, model = "gpt-4o-mini", few_shot_prompt= None, detail=IMAGE_QUALITY, max_workers=5)
detected_objects_cleaned = {}

for key, value in detected_objects.items():
    detected_objects_cleaned[key] = list(set([x.strip() for x in value.replace("Output: ", "").split(",")]))

The detected objects are now passed as text prompts to the OWL-ViT model to obtain the predictions for the images. I implement a helper function that predicts the bounding boxes for the images, and then plots the bounding box on the original image.

from PIL import Image, ImageDraw, ImageFont
import numpy as np
import torch

def detect_and_draw_bounding_boxes(
    image_path,
    text_queries,
    model,
    processor,
    output_path,
    score_threshold=0.2
):
    """
    Detect objects in an image and draw bounding boxes over the original image using PIL.

    Parameters:
    - image_path (str): Path to the image file.
    - text_queries (list of str): List of text queries to process.
    - model: Pretrained model to use for detection.
    - processor: Processor to preprocess image and text queries.
    - output_path (str): Path to save the output image with bounding boxes.
    - score_threshold (float): Threshold to filter out low-confidence predictions.

    Returns:
    - output_image_pil: A PIL Image object with bounding boxes and labels drawn.
    """
    img = Image.open(image_path).convert("RGB")
    orig_w, orig_h = img.size  # original width, height

    inputs = processor(
        text=text_queries,
        images=img,
        return_tensors="pt",
        padding=True,
        truncation=True
    ).to("cpu")

    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)

    logits = torch.max(outputs["logits"][0], dim=-1)       # shape (num_boxes,)
    scores = torch.sigmoid(logits.values).cpu().numpy()    # convert to probabilities
    labels = logits.indices.cpu().numpy()                  # class indices
    boxes_norm = outputs["pred_boxes"][0].cpu().numpy()    # shape (num_boxes, 4)

    converted_boxes = []
    for box in boxes_norm:
        cx, cy, w, h = box
        cx_abs = cx * orig_w
        cy_abs = cy * orig_h
        w_abs  = w  * orig_w
        h_abs  = h  * orig_h
        x1 = cx_abs - w_abs / 2.0
        y1 = cy_abs - h_abs / 2.0
        x2 = cx_abs + w_abs / 2.0
        y2 = cy_abs + h_abs / 2.0
        converted_boxes.append((x1, y1, x2, y2))

    draw = ImageDraw.Draw(img)

    for score, (x1, y1, x2, y2), label_idx in zip(scores, converted_boxes, labels):
        if score < score_threshold:
            continue

        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)

        label_text = text_queries[label_idx].replace("An image of ", "")

        text_str = f"{label_text}: {score:.2f}"
        # textbbox is available in both older and newer Pillow versions (textsize was removed in Pillow 10)
        bbox = draw.textbbox((0, 0), text_str)
        text_w, text_h = bbox[2] - bbox[0], bbox[3] - bbox[1]
        text_x, text_y = x1, max(0, y1 - text_h)  # place text slightly above the box

        draw.rectangle(
            [text_x, text_y, text_x + text_w, text_y + text_h],
            fill="white"
        )
        draw.text((text_x, text_y), text_str, fill="red")

    img.save(output_path, "JPEG")

    return img

for key, value in tqdm(detected_objects_cleaned.items()):
    value = ["An image of " + x for x in value]
    detect_and_draw_bounding_boxes(key, value, model, processor, "images_with_bounding_boxes/" + key.split("/")[-1], score_threshold=0.15)

The images with the detected objects plotted are now passed to the VLM for captioning:

IMAGE_QUALITY = "high"
image_paths_obj_detected_guided = [os.path.join("images_with_bounding_boxes", os.path.basename(x)) for x in image_paths]

system_prompt="""You are a helpful assistant that can analyze images and provide captions. You are provided with images that also contain bounding box annotations of the important objects in them, along with their labels.
Analyze the overall image and the provided bounding box information and provide an appropriate caption for the image.""",

user_prompt="Please analyze the following image:",

obj_det_zero_shot_high_quality_captions = process_images_in_parallel(image_paths_obj_detected_guided, model = "gpt-4o-mini", few_shot_prompt= None, detail=IMAGE_QUALITY, max_workers=5)

Outputs obtained from Object Detection Guided Prompting. Images in this picture are taken by Josh Frenette on [Unsplash](https://unsplash.com/photos/a-building-with-a-lot-of-plants-growing-on-the-side-of-it-E70gGcHNHj0?utm_content=creditCopyText&utm_medium=referral&utm_source=unsplash) and by Alexander Zaytsev on Unsplash (Image by Author)

In this task, given the simple nature of the images we use, the location of the objects does not add any significant information to the VLM. However, Object Detection Guided Prompting can be a powerful tool for more complex tasks, such as Document Understanding, where layout information can be effectively provided through object detection to the VLM for further processing. Additionally, Semantic Segmentation can be employed as a method to guide prompting by providing segmentation masks to the VLM.

Conclusion

VLMs are a powerful tool in the arsenal of AI engineers and scientists for solving a variety of problems that require a combination of vision and text skills. In this article, I explore prompting strategies in the context of VLMs to effectively use these models for tasks such as image captioning. This is by no means an exhaustive or comprehensive list of prompting strategies. One thing that has become increasingly clear with the advancements in GenAI is the limitless potential for creative and innovative approaches to prompt and guide LLMs and VLMs in solving tasks. For a complementary resource on prompting with VLMs, I would also recommend this material from Azure [13].

References

[1] J. Chen, H. Guo, K. Yi, B. Li and M. Elhoseiny, "VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning," 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 18009–18019, doi: 10.1109/CVPR52688.2022.01750.

[2] Luo, Z., Xi, Y., Zhang, R., & Ma, J. (2022). A Frustratingly Simple Approach for End-to-End Image Captioning.

[3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millicah, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. 2022. Flamingo: a visual language model for few-shot learning. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS ’22). Curran Associates Inc., Red Hook, NY, USA, Article 1723, 23716–23736.

[4] https://huggingface.co/blog/vision_language_pretraining

[5] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia. Association for Computational Linguistics.

[6] https://platform.openai.com/docs/guides/vision

[7] Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

[8] https://docs.ragas.io/en/stable/concepts/metrics/available_metrics/traditional/#rouge-score

[9] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., … & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824–24837.

[10] https://aws.amazon.com/blogs/machine-learning/foundational-vision-models-and-visual-prompt-engineering-for-autonomous-driving-applications/

[11] Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. 2022. Simple Open-Vocabulary Object Detection. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part X. Springer-Verlag, Berlin, Heidelberg, 728–755. https://doi.org/10.1007/978-3-031-20080-9_42

[12] https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/zeroshot_object_detection_with_owlvit.ipynb

[13] https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/gpt-4-v-prompt-engineering

The post Prompting Vision Language Models appeared first on Towards Data Science.

]]>
Beyond Causal Language Modeling https://towardsdatascience.com/beyond-causal-language-modeling-a-deep-dive-into-not-all-tokens-are-what-you-need-for-781166bc35b3/ Mon, 27 Jan 2025 12:51:37 +0000 https://towardsdatascience.com/beyond-causal-language-modeling-a-deep-dive-into-not-all-tokens-are-what-you-need-for-781166bc35b3/ A deep dive into "Not All Tokens Are What You Need for Pretraining"

The post Beyond Causal Language Modeling appeared first on Towards Data Science.

]]>
Introduction

A few days ago, I had the chance to present at a local reading group that focused on some of the most exciting and insightful papers from Neurips 2024. As a presenter, I selected a paper titled "Not All Tokens Are What You Need for Pretraining". It addresses a super simple but reasonable question: do we really need to apply the next-token prediction loss to all tokens during language model pretraining?

Most of us have gotten used to the standard approach: feed in a huge web-scraping corpus, apply causal language modeling (CLM) across every token, and trust that bigger is better. The authors of this paper push back on that assumption, instead arguing that some tokens might actually be detrimental to the learning process. From their analysis, it becomes clear that focusing your training budget on particularly "useful" tokens can yield significant efficiency and performance gains in data efficiency and downstream tasks.

In this post, I’ll summarize the paper’s core ideas of the proposed method called Selective Language Modeling (SLM) and share some insights from the experiments that impressed me the most.


Background

Token-Level Noise in Large Web Corpora

When you scrape huge amounts of texts from the web, it’s no surprise to find a fair amount of noise. Researchers have tried refining their corpora by applying document-level filters – removing entire documents that look suspicious or low quality. However, the authors note that noise can exist inside documents as well: a good article might still contain a handful of nonsense or extremely unpredictable tokens. If your model is forced to learn from everything, these noisy tokens can waste computation or even confuse the model.

Token-Level Learning Dynamics

By examining training checkpoints at multiple stages, the authors categorized tokens based on whether their cross-entropy loss was high or low over time:

  • L→L (low to low): These tokens are learned early and remain "easy" for the model, giving no further significant gradient updates later on.
  • H→L (high to low): A subset that starts off hard and eventually gets learned. This means that those tokens still have significant room for improvement in learning.
  • H→H (high to high): Tokens that remain difficult and fluctuate heavily, often due to their inherent unpredictability (i.e., aleatoric uncertainty)
  • L→H (low to high): Tokens that were initially learned but become confusing later, possibly due to context shifts or noise.

The big takeaway is that only a fraction of tokens truly contributes meaningful learning signals. Many tokens get mastered early (L→L) and then stop being beneficial. Meanwhile tokens that stay consistently hard (H→H) can be extremely noisy and unproductive throughout the entire training run.


The Proposed Approach: Selective Language Modeling (SLM)

Figure 1. Created by the author based on the figure presented in the original paper, with additional explanations and interpretations

The authors propose Selective Language Modeling (SLM) as a more nuanced approach. Here’s how it works:

Step 1: Train a Reference Model (RM)

They first take a small but high-quality dataset that reflects their "ideal" data distribution. Using an already partially pre-trained base model, they fine-tune this model to create the RM. This model essentially becomes the judge that decides which tokens are worth training on later.

Step 2: Score Tokens with Excess Loss

For each token in the large-scale corpus, they compute the RM loss (how well the RM predicts that token) and compare it with the current training model’s loss. The difference, called the excess loss, indicates how much improvement is still possible on a token that "should" be predictable under the reference distribution.
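In symbols, the selection score for a token x_i is simply the gap between the two losses (a sketch; the paper's exact notation may differ):

\mathrm{ExcessLoss}(x_i) = \mathcal{L}_{\theta}(x_i) - \mathcal{L}_{\mathrm{RM}}(x_i)

where \mathcal{L}_{\theta} is the current training model's cross-entropy loss on x_i and \mathcal{L}_{\mathrm{RM}} is the reference model's.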

Step 3: Select Top-k% Tokens for Back-Propagation

During the main pretraining, they still run the full forward pass across all tokens, but only back-propagate the loss on the top k% of tokens (those with the highest excess loss). This means that the model devotes most of its capacity to the tokens that are both learnable and relevant, while ignoring tokens deemed less helpful. This dynamic selection happens at each step or batch, so it adapts as the training model itself changes.
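A minimal PyTorch-style sketch of this selection step might look like the following; it is a hypothetical illustration of the idea rather than the authors' implementation, and it assumes you already have logits from the training model and from the frozen reference model:

import torch
import torch.nn.functional as F

def selective_lm_loss(train_logits, ref_logits, labels, k_percent=0.6):  # example ratio, not the paper's tuned value
    """Back-propagate only on the top-k% of tokens ranked by excess loss."""
    vocab = train_logits.size(-1)
    # per-token cross-entropy for the training model and the frozen reference model
    train_loss = F.cross_entropy(train_logits.view(-1, vocab), labels.view(-1), reduction="none")
    with torch.no_grad():
        ref_loss = F.cross_entropy(ref_logits.view(-1, vocab), labels.view(-1), reduction="none")
    excess = (train_loss - ref_loss).detach()      # how much room for improvement remains
    k = max(1, int(k_percent * excess.numel()))
    selected = torch.topk(excess, k).indices       # indices of the most "useful" tokens
    mask = torch.zeros_like(train_loss)
    mask[selected] = 1.0
    return (train_loss * mask).sum() / mask.sum()  # average loss over the selected tokens only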


Experiments

This paper demonstrates SLM’s benefits across several experimental setups:

Figure 2. Created by the author to summarize the experimental setup

Math Domain Results

When they continued pretraining smaller 1B models on OpenWebMath, the improvements were striking – up to 10% or even more gains on GSM8K and MATH over the same model trained with CLM. The authors highlight that SLM can reach baseline performance 5–10 times faster, requiring fewer tokens and less computation.

In one remarkable case, their 7B model matched the accuracy of a prior SOTA approach while consuming only 3% of the training tokens that the prior approach had used.

On top of that, a simple fine-tuning step further boosted MATH scores to over 40% for a 1B model – a level that smaller open-source models usually struggle to reach without huge training budgets.

General Domain Results

What if you already have a model that has seen a ton of general text? The authors show that SLM still helps.

Even a strong base model improved by around 5.8 percentage points on average across 15 benchmarks, especially in tougher domains like code and math. Therefore, even after large-scale training, there is still a subset of tokens whose selection can further improve performance.

Self-Referencing

A question arises: what if you don’t have a curated dataset to train your reference model in the first place? The authors show a creative workaround: you can train a quick-and-dirty reference model on the same raw corpus for just a few epochs.

Even though that reference model might not be perfect, it still does an okay job identifying the noisier tokens that hamper training. The result is a 2–3% downstream accuracy boost and a 30–40% reduction in tokens used.


Conclusion and Future Directions

Contributions of This Work

This paper provides both an illuminating analysis of token-level training dynamics and a new technique called SLM:

Token Loss Analysis: They demonstrate that a majority of tokens contribute little beyond the initial training phase, while a small subset maintains a persistently high loss.

SLM for Focused Learning: By leveraging a reference model to gauge how "useful" each token is, they manage to reduce training tokens drastically without sacrificing quality – in many cases even boosting downstream performance.

Broad Demonstration of Effectiveness: SLM works not only on math-specific tasks but also in more general domains, with either a meticulously curated reference dataset or a reference model drawn from the same large corpus.

Where Could This Go Next?

SLM encompasses various potential directions for future research. For example:

Scaling Up Further: Though the paper primarily focuses on models around 1B to 7B parameters, there remains the open question of how SLM performs at the 30B, 70B, or 100B+ scale. If the token-level approach generalizes well, the cost savings could be enormous for truly massive LLMs.

Reference Models via API: If you can’t gather curated data, maybe you could use an API-based language model as your reference. That might make SLM more practical for smaller research teams who lack the resources for selective reference training.

Reinforcement Learning Extensions: Imagine coupling SLM with reinforcement learning. The reference model could act as a "reward model," and token selection might then be optimized through something akin to policy gradients.

Multiple Reference Models: Instead of a single RM, you could train or gather several, each focusing on a different domain or style. Then, combine their token scores to produce a more robust multi-domain filtering system.

Alignment and Safety: There’s a growing trend toward factoring in alignment or truthfulness. One might train a reference model to give higher scores to well-supported statements and zero out tokens that look factually incorrect or harmful.


Thanks for reading, and I hope this breakdown helps you understand this NeurIPS 2024 best-paper work.

The post Beyond Causal Language Modeling appeared first on Towards Data Science.

]]>