Product Update | December 2023

Henry Scott-Green

Dec 29, 2023 — 2 min read

December has been all about evals at Context.ai! New evals features include test set run comparisons, custom global LLM evaluators, per-query evaluators, creation of eval cases from production transcripts, a new evals onboarding, creation of test cases in the UI, and many UX improvements.

Test Set Run Comparisons

Completed Test Set runs can now be compared directly from the Test Set Version table. Multi-select any number of Test Set Versions and click compare, and you’ll see the executed Test Sets side by side, with the evaluator results and the generated responses for each.
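To make the idea concrete, here is a minimal sketch of what a run comparison computes. It is not Context.ai's implementation or API: the `CaseResult` shape, the `compare_runs` helper, and the sample data are all hypothetical, made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    """One test case's outcome in a single run (hypothetical shape)."""
    response: str
    evaluator_results: dict[str, bool]  # evaluator name -> passed?


def compare_runs(runs: dict[str, dict[str, CaseResult]]) -> list[dict]:
    """Build one row per (test case, evaluator) with a pass/fail column per run."""
    rows = []
    case_ids = sorted({cid for results in runs.values() for cid in results})
    for cid in case_ids:
        evaluators = sorted({
            name
            for results in runs.values()
            if cid in results
            for name in results[cid].evaluator_results
        })
        for name in evaluators:
            row = {"test_case": cid, "evaluator": name}
            for run_name, results in runs.items():
                case = results.get(cid)
                # None marks a run where the case or evaluator was absent.
                row[run_name] = case.evaluator_results.get(name) if case else None
            rows.append(row)
    return rows


# Made-up data: two versions of the same Test Set, one test case each.
runs = {
    "v1": {"refund-policy": CaseResult("We never refund.", {"hallucination": False})},
    "v2": {"refund-policy": CaseResult("Refunds within 30 days.", {"hallucination": True})},
}
comparison = compare_runs(runs)
for row in comparison:
    print(row)
```

Each row lines up the same evaluator across versions, which is the kind of view that makes a regression or fix between two versions easy to spot.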

Custom Global LLM Evaluators

When we launched evals, we supported only a set of default global evaluators that run over every query and assess common problems such as hallucination and response refusal. You can now define additional custom global LLM evaluators from the Evaluators page by selecting Create New Evaluator and adding an LLM prompt.
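If you haven't written an LLM evaluator before, the sketch below shows the general pattern: a prompt template that asks a model to grade a response, plus a small wrapper that turns the reply into pass or fail. Everything here is an assumption for illustration, including the prompt wording, the PASS/FAIL convention, and the `run_llm_evaluator` and `fake_llm` names. It does not describe how Context.ai runs evaluators internally.

```python
from typing import Callable

# Hypothetical evaluator prompt; the wording and PASS/FAIL convention are assumptions.
TONE_EVALUATOR_PROMPT = (
    "You are grading a support assistant.\n"
    "Query: {query}\n"
    "Response: {response}\n"
    "Answer PASS if the response is polite and professional, otherwise FAIL."
)


def run_llm_evaluator(prompt_template: str, query: str, response: str,
                      call_llm: Callable[[str], str]) -> bool:
    """Fill the template, ask the model, and map its reply to pass/fail."""
    prompt = prompt_template.format(query=query, response=response)
    verdict = call_llm(prompt).strip().upper()
    return verdict.startswith("PASS")


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "FAIL" if "stupid" in prompt.lower() else "PASS"


passed = run_llm_evaluator(TONE_EVALUATOR_PROMPT, "Where is my order?",
                           "It ships tomorrow.", fake_llm)
print(passed)
```

In practice you would swap `fake_llm` for a call to your model provider. The evaluator logic stays the same because the model is passed in as a plain function.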

Per-Query Evaluators

Evaluators can now be enabled at the Test Case level in addition to the Test Set level. This allows you to define evaluators that should only run over one or a subset of the Test Cases within your Test Set, giving you more granular control of pass criteria for your evaluators.
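One way to picture the two levels is as a simple merge: every Test Case gets the Test Set's evaluators, plus any enabled on that case alone. The sketch below assumes this merge behavior and uses hypothetical `EvalSet` and `EvalCase` types and made-up evaluator names. It is not the product's data model.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """A single test case (hypothetical shape)."""
    name: str
    query: str
    evaluators: list[str] = field(default_factory=list)  # case-level only


@dataclass
class EvalSet:
    """A test set whose evaluators apply to every case (hypothetical shape)."""
    name: str
    evaluators: list[str]
    cases: list[EvalCase]


def evaluators_for(test_set: EvalSet, case: EvalCase) -> list[str]:
    """Set-level evaluators first, then case-level ones, without duplicates."""
    return list(dict.fromkeys(test_set.evaluators + case.evaluators))


support = EvalSet(
    name="support",
    evaluators=["hallucination", "refusal"],
    cases=[
        EvalCase("greeting", "Hi"),
        EvalCase("refund", "Can I get a refund?",
                 ["mentions_refund_window", "hallucination"]),
    ],
)
plan = {case.name: evaluators_for(support, case) for case in support.cases}
print(plan)
```

Here only the refund case runs the extra `mentions_refund_window` check, while both cases still get the set-wide evaluators.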

Create Test Cases From User Transcripts

Defining Test Cases is often time-consuming and frustrating, so you can now use production transcripts to make this much quicker and easier! If you have real user transcripts in your analytics, you can now copy queries from any point in a conversation into new Test Cases.
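Conceptually, this takes a user turn from a transcript and carries it, along with the conversation that led up to it, into a new Test Case. The sketch below is a rough illustration under that assumption. The `Message` type, the `case_from_transcript` helper, and the output fields are hypothetical, and the transcript is made up.

```python
from dataclasses import dataclass


@dataclass
class Message:
    """One turn in a transcript (hypothetical shape)."""
    role: str  # "user" or "assistant"
    content: str


def case_from_transcript(transcript: list[Message], turn_index: int) -> dict:
    """Turn the user message at turn_index into a test case, keeping earlier turns as history."""
    message = transcript[turn_index]
    if message.role != "user":
        raise ValueError("turn_index must point at a user message")
    return {
        "history": [(m.role, m.content) for m in transcript[:turn_index]],
        "query": message.content,
    }


# Made-up transcript for illustration.
transcript = [
    Message("user", "Hi, I need help with my order."),
    Message("assistant", "Sure, what is the order number?"),
    Message("user", "It is 1234. Where is it?"),
]
new_case = case_from_transcript(transcript, 2)
print(new_case)
```

Keeping the earlier turns matters for mid-conversation queries, since a message like "Where is it?" only makes sense with the context before it.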

Create New Test Cases

New Test Cases can now be created in the UI in addition to the API, allowing you to easily define additional tests within the product.

Fork Test Set Versions

Existing Test Sets can now be forked in the UI to create a new version that you can then modify. This lets you vary system prompts or queries and assess the performance impact of each change.

New Onboarding

Onboarding to evals no longer requires the ingestion of analytics transcripts - anyone can get started with evals before using our analytics products.

Hiring!

We’re hiring! We’re looking for a Founding GTM Lead to join our team in London. This is a generalist position and will be our first non-engineering hire. We’re looking for someone with experience in GTM at early-stage startups and for technical B2B products. More information here.
