This tutorial will teach you how to use Braintrust to generate better titles for GitHub issues based on their content. It's a great way to learn how to work with text and evaluate subjective criteria, like summarization quality.
We'll use a technique called model-graded evaluation to automatically evaluate the newly generated titles against the original titles, and then improve our prompt based on what we find.
Before starting, please make sure that you have a Braintrust account. If you do not, please sign up. After this tutorial, feel free to dig deeper by visiting the docs.
To see a list of dependencies, you can view the accompanying package.json file. Feel free to copy/paste snippets of this code to run in your environment, or use tslab to run the tutorial in a Jupyter notebook.
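The tutorial works from a small dataset of GitHub issues. Below is a minimal loading sketch; the file name `github-issues.json` and the exact loading code are assumptions for illustration — the rest of the code only relies on `ISSUE_DATA` being an array of issues with `title` and `body` fields.

```typescript
// A minimal loading sketch (the file name and shape are assumptions).
// All that matters downstream is that ISSUE_DATA is an array of objects
// with `title` and `body` string fields.
import { readFileSync } from "fs";

interface Issue {
  title: string;
  body: string;
}

const ISSUE_DATA: Issue[] = JSON.parse(
  readFileSync("github-issues.json", "utf-8"),
);

// Print the first issue to get a feel for the raw content we'll summarize.
console.log(ISSUE_DATA[0].title);
console.log("-".repeat(ISSUE_DATA[0].title.length));
console.log(ISSUE_DATA[0].body.slice(0, 500) + "...");
```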
````
The instrumentation hook is only called after visiting a route
--------------------------------------------------------------

### Link to the code that reproduces this issue

https://github.com/daveyjones/nextjs-instrumentation-bug

### To Reproduce

```shell
git clone git@github.com:daveyjones/nextjs-instrumentation-bug.git
cd nextjs-instrumentation-bug
npm install
npm run dev # The register function IS called
npm run build && npm start # The register function IS NOT called until you visit http://localhost:3000
```

### Current vs. Expected behavior

The `register` function should be called automatically after running `npm ...
````
Let's try to generate better titles using a simple prompt. We'll use OpenAI, although you could try this out with any model that supports text generation.
We'll start by initializing an OpenAI client and wrapping it with some Braintrust instrumentation. `wrapOpenAI` is initially a no-op, but once we start using Braintrust, it will help us capture helpful debugging information about the model's performance.
```typescript
import { wrapOpenAI } from "braintrust";
import { OpenAI } from "openai";

const client = wrapOpenAI(
  new OpenAI({
    apiKey: process.env.OPENAI_API_KEY || "Your OpenAI API Key",
  }),
);
```
```typescript
import { ChatCompletionMessageParam } from "openai/resources";

function titleGeneratorMessages(content: string): ChatCompletionMessageParam[] {
  return [
    {
      role: "system",
      content:
        "Generate a new title based on the github issue. Return just the title.",
    },
    {
      role: "user",
      content: "Github issue: " + content,
    },
  ];
}

async function generateTitle(input: string) {
  const messages = titleGeneratorMessages(input);
  const response = await client.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages,
    seed: 123,
  });
  return response.choices[0].message.content || "";
}

const generatedTitle = await generateTitle(ISSUE_DATA[0].body);
console.log("Original title: ", ISSUE_DATA[0].title);
console.log("Generated title:", generatedTitle);
```
```
Original title:  The instrumentation hook is only called after visiting a route
Generated title: Next.js: `register` function not automatically called after build and start
```
Ok cool! The new title looks pretty good. But how do we consistently and automatically evaluate whether the new titles are better than the old ones?
With subjective problems like summarization, one great technique is to use an LLM to grade the outputs. This is known as model-graded evaluation. Below, we'll use a summarization prompt from Braintrust's open-source autoevals library. We encourage you to use these prompts, but also to copy/paste them, modify them, and create your own!

The prompt uses Chain of Thought reasoning, which dramatically improves a model's performance on grading tasks. Later, we'll see how it helps us debug the model's outputs.
Let's try running it on our new title and see how it performs.
```typescript
import { Summary } from "autoevals";

await Summary({
  output: generatedTitle,
  expected: ISSUE_DATA[0].title,
  input: ISSUE_DATA[0].body,
  // In practice we've found gpt-4 class models work best for subjective tasks, because
  // they are great at following criteria laid out in the grading prompts.
  model: "gpt-4-1106-preview",
});
```
```
{
  name: 'Summary',
  score: 1,
  metadata: {
    rationale: "Summary A ('The instrumentation hook is only called after visiting a route') is a partial and somewhat ambiguous statement. It does not specify the context of the 'instrumentation hook' or the technology involved.\n" +
      "Summary B ('Next.js: `register` function not automatically called after build and start') provides a clearer and more complete description. It specifies the technology ('Next.js') and the exact issue ('`register` function not automatically called after build and start').\n" +
      'The original text discusses an issue with the `register` function in a Next.js application not being called as expected, which is directly reflected in Summary B.\n' +
      "Summary B also aligns with the section 'Current vs. Expected behavior' from the original text, which states that the `register` function should be called automatically but is not until a route is visited.\n" +
      "Summary A lacks the detail that the issue is with the Next.js framework and does not mention the expectation of the `register` function's behavior, which is a key point in the original text.",
    choice: 'B'
  },
  error: undefined
}
```
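Scoring one example by hand is useful, but we really want to grade the whole dataset. The sketch below (a rough outline, with "Github Issue Titles" as a placeholder project name) runs a Braintrust `Eval` that generates a title for every issue and grades each one with the `Summary` scorer:

```typescript
import { Eval } from "braintrust";
import { Summary } from "autoevals";

// Sketch: generate a title for every issue and grade it against the original.
// "Github Issue Titles" is a placeholder project name.
await Eval("Github Issue Titles", {
  data: () =>
    ISSUE_DATA.map((issue) => ({
      input: issue.body,
      expected: issue.title,
    })),
  task: generateTitle,
  scores: [
    (args) =>
      Summary({
        ...args,
        model: "gpt-4-1106-preview",
      }),
  ],
});
```

Each row of the resulting experiment records the generated title, its score, and, thanks to the Chain of Thought prompt, the grader's rationale.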
Let's dig into a couple examples to see what's going on. Thanks to the instrumentation we added earlier, we can see the model's reasoning for its scores.
Hmm, it looks like the model sometimes leaves out key identifying details. Let's see if we can improve our prompt to encourage it to include more details, without becoming too verbose.
```typescript
function titleGeneratorMessages(content: string): ChatCompletionMessageParam[] {
  return [
    {
      role: "system",
      content: `Generate a new title based on the github issue. The title should include all of the key
identifying details of the issue, without being longer than one line. Return just the title.`,
    },
    {
      role: "user",
      content: "Github issue: " + content,
    },
  ];
}

async function generateTitle(input: string) {
  const messages = titleGeneratorMessages(input);
  const response = await client.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages,
    seed: 123,
  });
  return response.choices[0].message.content || "";
}
```
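As a quick sanity check before re-running a full eval, we can regenerate a title for the same issue with the updated prompt and grade it again, reusing the functions defined above:

```typescript
// Re-generate and re-grade the first issue with the updated prompt.
const improvedTitle = await generateTitle(ISSUE_DATA[0].body);
console.log("Improved title:", improvedTitle);

await Summary({
  output: improvedTitle,
  expected: ISSUE_DATA[0].title,
  input: ISSUE_DATA[0].body,
  model: "gpt-4-1106-preview",
});
```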
This is just the start of evaluating and improving this AI application. From here, you should dig into
individual examples, verify whether they legitimately improved, and test on more data. You can even use
logging to capture real-user examples and incorporate
them into your evals.
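For a rough idea of what that logging setup can look like: once a Braintrust logger is initialized, requests made through the `wrapOpenAI`-wrapped client are recorded, and those logs can later be curated into eval datasets. This is a sketch; the project name below is a placeholder.

```typescript
import { initLogger, wrapOpenAI } from "braintrust";
import { OpenAI } from "openai";

// Sketch: with a logger initialized, calls made through the wrapped client
// are captured in Braintrust ("Github Issue Titles" is a placeholder name).
initLogger({ projectName: "Github Issue Titles" });

const prodClient = wrapOpenAI(
  new OpenAI({ apiKey: process.env.OPENAI_API_KEY }),
);
```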