Evaluating Language Models
Language models can perform complex tasks. Evals help measure a model's ability to perform a specific task. Evals are defined as Spicepod components and can evaluate any Spicepod model's performance.
Refer to the Cookbook for related examples.
Overview
In Spice, an eval consists of the following core components:
- Eval: A defined task for a model to perform and a method to measure its performance.
- Eval Run: A single evaluation of a specific model.
- Eval Result: The model output and score for a single input task within an eval run.
- Eval Scorer: A method to score the model's performance on an eval result.
Eval Components
An eval component is defined as follows:
evals:
  - name: australia
    description: Make sure the model understands Aussies, and importantly Cricket.
    dataset: cricket_questions
    scorers:
      - match
datasets:
  - name: cricket_questions
    from: https://github.com/openai/evals/raw/refs/heads/main/evals/registry/data/cricket_situations/samples.jsonl
Where:
- `name` is a unique identifier for this eval (as with `models`, `datasets`, and other components).
- `dataset` references the dataset component that provides the eval's inputs and expected outputs.
- `scorers` is a list of scoring methods applied to each result.
For complete details on the evals component, see the Spicepod reference.
Running an Eval
To run an eval:
- Define an `eval` component (and its associated `dataset`).
- Add a language model to the Spicepod (this is the model that will be evaluated).
An eval can be started via the HTTP API:
curl -XPOST http://localhost:8090/v1/evals/australia \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "my_model",
  }'
Depending on the dataset and model, the eval run can take some time to complete. On completion, results will be available in two tables:
- `eval.runs`: Summarises the status and scores from the eval run.
- `eval.results`: Contains the input, expected output, and actual output for each task in an eval run, along with the score from each scorer.
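After the run completes, both tables can be queried with SQL. A minimal sketch (see the Spicepod reference for the full column list):
-- Summary of each eval run, including status and aggregate scores
SELECT * FROM eval.runs;

-- Per-task detail for a run: input, expected output, actual output, and scores
SELECT * FROM eval.results;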
Dataset Formats
Datasets are used to define the input and expected output for an eval. Evals expect the dataset to be in a particular format:
- For the `input` column:
  - A plain string (e.g., "Hello, how are you?") is interpreted as a single user message.
  - A JSON array is interpreted as multiple OpenAI-compatible messages (e.g., [{"role":"system","content":"You are a helpful assistant."}, ...]).
- For the `ideal` column:
  - A plain string (e.g., "I'm doing well, thanks!") is interpreted as a single assistant response.
  - A JSON array is interpreted as multiple OpenAI-compatible choices (e.g., [{"index":0,"message":{"role":"assistant","content":"Sure!"}}, ...]).
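For example, each line of the dataset's JSONL file pairs an `input` with its `ideal` response. An illustrative row (hypothetical, not taken from the linked cricket dataset):
{"input": [{"role": "user", "content": "The batsman hits the ball over the boundary without it bouncing. How many runs are scored?"}], "ideal": "Six"}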
To use a dataset with a different format, use a view. For example:
views:
  # This view defines an eval dataset containing previous AI completion tasks from the `runtime.task_history` table.
  - name: user_queries
    sql: |
      SELECT
        json_get_json(input, 'messages') AS input,
        json_get_str((captured_output -> 0), 'content') AS ideal
      FROM runtime.task_history
      WHERE task='ai_completion'
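The view can then be referenced from an eval in place of a dataset. A sketch, assuming the simple `match` scorer suits this data (the eval name and description here are illustrative):
evals:
  - name: completion_regression
    description: Replay past AI completion tasks against a candidate model.
    dataset: user_queries
    scorers:
      - match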
