# Filesystem Hosted Models
To use a model hosted on a filesystem, specify the path to the model file or folder in the `from` field:
```yaml
models:
  - from: file://models/llms/llama3.2-1b-instruct/
    name: llama3
    params:
      model_type: llama
```
Supported formats include GGUF, GGML, and SafeTensor for large language models (LLMs), and ONNX for traditional machine learning (ML) models.
## Configuration
### from
An absolute or relative path to the model file or folder:
```yaml
from: file://absolute/path/models/llms/llama3.2-1b-instruct/  # using the file:// prefix
from: file:models/llms/llama3.2-1b-instruct/                  # using the file: prefix
```
### params (optional)
| Param | Description |
| --- | --- |
| `model_type` | The architecture to load the model as. Supported values: `mistral`, `gemma`, `mixtral`, `llama`, `phi2`, `phi3`, `qwen2`, `gemma2`, `starcoder2`, `phi3.5moe`, `deepseekv2`, `deepseek`. |
| `tools` | Which tools should be made available to the model. Set to `auto` to use all available tools. |
| `system_prompt` | An additional system prompt used for all chat completions to this model. |
| `chat_template` | Customizes the transformation of OpenAI chat messages into a character stream for the model. See Overriding the Chat Template below. |
See Large Language Models for additional configuration options.
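As a sketch of how these parameters combine, the entry below sets the architecture, enables all available tools, and adds a system prompt. The model name and system prompt text are illustrative, not required values:

```yaml
models:
  - name: llama3-assistant              # illustrative name
    from: file://models/llms/llama3.2-1b-instruct/
    params:
      model_type: llama
      tools: auto                       # expose all available tools
      system_prompt: |
        You are a concise, helpful assistant.
```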
### files (optional)
The `files` field specifies additional files required by the model, such as tokenizer and configuration files:
```yaml
- name: local-model
  from: file://models/llms/llama3.2-1b-instruct/model.safetensors
  files:
    - path: models/llms/llama3.2-1b-instruct/tokenizer.json
    - path: models/llms/llama3.2-1b-instruct/tokenizer_config.json
    - path: models/llms/llama3.2-1b-instruct/config.json
```
## Examples
### Loading a GGML Model
```yaml
models:
  - from: file://absolute/path/to/my/model.ggml
    name: local_ggml_model
    files:
      - path: models/llms/ggml/tokenizer.json
      - path: models/llms/ggml/tokenizer_config.json
      - path: models/llms/ggml/config.json
```
### Loading a SafeTensor Model
```yaml
models:
  - name: safety
    from: file:models/llms/llama3.2-1b-instruct/model.safetensors
    files:
      - path: models/llms/llama3.2-1b-instruct/tokenizer.json
      - path: models/llms/llama3.2-1b-instruct/tokenizer_config.json
      - path: models/llms/llama3.2-1b-instruct/config.json
```
### Loading an LLM from a Directory
```yaml
models:
  - name: llama3
    from: file:models/llms/llama3.2-1b-instruct/
```
Note: The folder provided should contain all the expected files (see examples above).
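For example, a directory for the model used above would typically contain the following (layout assembled from the file lists in the earlier examples):

```
models/llms/llama3.2-1b-instruct/
├── model.safetensors
├── tokenizer.json
├── tokenizer_config.json
└── config.json
```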
### Loading an ONNX Model
```yaml
models:
  - from: file://absolute/path/to/my/model.onnx
    name: local_fs_model
```
### Loading a GGUF Model
```yaml
models:
  - from: file://absolute/path/to/my/model.gguf
    name: local_gguf_model
```
## Overriding the Chat Template
Chat templates convert OpenAI-compatible chat messages (see format) and other components of a request into a stream of characters for the language model. Templates follow Jinja templating syntax. Further details on chat templates can be found here.
```yaml
models:
  - name: local_model
    from: file:path/to/my/model.gguf
    params:
      chat_template: |
        {% set loop_messages = messages %}
        {% for message in loop_messages %}
          {% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'] | trim + '<|eot_id|>' %}
          {{ content }}
        {% endfor %}
        {% if add_generation_prompt %}
          {{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}
        {% endif %}
```
### Templating Variables
- `messages`: the list of chat messages, in the OpenAI format.
- `add_generation_prompt`: a Boolean flag indicating whether to add a generation prompt.
- `tools`: the list of callable tools, in the OpenAI format.
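As a hedged sketch, a custom template might use all three variables together. The control tokens below are illustrative, not tied to any particular model, and the `tool['function']['name']` access assumes the OpenAI tool schema:

```yaml
params:
  chat_template: |
    {# Render each message with illustrative role tags #}
    {% for message in messages %}
      {{ '<|' + message['role'] + '|>\n' + message['content'] | trim }}
    {% endfor %}
    {# List callable tool names, if any were provided #}
    {% if tools %}
      {{ 'Available tools:' }}
      {% for tool in tools %}
        {{ tool['function']['name'] }}
      {% endfor %}
    {% endif %}
    {# Cue the model to respond #}
    {% if add_generation_prompt %}
      {{ '<|assistant|>\n' }}
    {% endif %}
```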
Note: The throughput, concurrency, and latency of a locally hosted model will vary based on the underlying hardware and model size. Spice supports Apple Metal and CUDA for accelerated inference.
