Evaluating Language Models
Language models can perform complex tasks. Evals help measure a model's ability to perform a specific task. Evals are defined as Spicepod components and can evaluate any Spicepod model's performance.
Refer to the Cookbook for related examples.
Overview
In Spice, an eval consists of the following core components:
- Eval: A defined task for a model to perform and a method to measure its performance.
- Eval Run: A single evaluation of a specific model.
- Eval Result: The model output and score for a single input task within an eval run.
- Eval Scorer: A method to score the model's performance on an eval result.
Eval Components
An eval component is defined as follows:
evals:
  - name: australia
    description: Make sure the model understands Aussies, and importantly Cricket.
    dataset: cricket_questions
    scorers:
      - match
datasets:
  - name: cricket_questions
    from: https://github.com/openai/evals/raw/refs/heads/main/evals/registry/data/cricket_situations/samples.jsonl
Where:
- `name` is a unique identifier for this eval (as with `models`, `datasets`, and other components).
- `dataset` references the dataset component that provides the eval's inputs and expected outputs.
- `scorers` is a list of scoring methods applied to each result.
For complete details on the evals component, see the Spicepod reference.
Running an Eval
To run an eval:
- Define an `eval` component (and its associated `dataset`).
- Add a language model to the Spicepod (this is the model that will be evaluated).
An eval can be started via the HTTP API:
curl -XPOST http://localhost:8090/v1/evals/australia \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "my_model",
  }'
Depending on the dataset and model, the eval run can take some time to complete. On completion, results will be available in two tables:
- `eval.runs`: Summarises the status and scores from the eval run.
- `eval.results`: Contains the input, expected output, and actual output for each task in an eval run, along with the score from each scorer.
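After the run completes, both tables can be queried with SQL. A minimal sketch (see the Spicepod reference for the full column list):
-- Summary of each eval run, including status and aggregate scores
SELECT * FROM eval.runs;

-- Per-task detail for a run: input, expected output, actual output, and scores
SELECT * FROM eval.results;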
Dataset Formats
Datasets are used to define the input and expected output for an eval. Evals expect the dataset to be in a particular format:
- For the `input` column:
  - A plain string (e.g., "Hello, how are you?") is interpreted as a single user message.
  - A JSON array is interpreted as multiple OpenAI-compatible messages (e.g., [{"role":"system","content":"You are a helpful assistant."}, ...]).
- For the `ideal` column:
  - A plain string (e.g., "I'm doing well, thanks!") is interpreted as a single assistant response.
  - A JSON array is interpreted as multiple OpenAI-compatible choices (e.g., [{"index":0,"message":{"role":"assistant","content":"Sure!"}}, ...]).
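For example, each line of the dataset's JSONL file pairs an `input` with its `ideal` response. An illustrative row (hypothetical, not taken from the linked cricket dataset):
{"input": [{"role": "user", "content": "The batsman hits the ball over the boundary without it bouncing. How many runs are scored?"}], "ideal": "Six"}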
To use a dataset with a different format, use a view. For example:
views:
  # This view defines an eval dataset containing previous AI completion tasks from the `runtime.task_history` table.
  - name: user_queries
    sql: |
      SELECT
        json_get_json(input, 'messages') AS input,
        json_get_str((captured_output -> 0), 'content') AS ideal
      FROM runtime.task_history
      WHERE task='ai_completion'
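The view can then be referenced from an eval in place of a dataset. A sketch, assuming the simple `match` scorer suits this data (the eval name and description here are illustrative):
evals:
  - name: completion_regression
    description: Replay past AI completion tasks against a candidate model.
    dataset: user_queries
    scorers:
      - match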
