Going back to llama.cpp

Sep 16, 2025

Since I originally posted about using llama.cpp, managing your own models has gotten easier. I want to remove Ollama from my workflow and work with llama.cpp directly, which drops an extra dependency and a potential source of problems.

Install llama.cpp

brew install llama.cpp

Using Qwen/Qwen3-0.6B-GGUF

I will be using a smaller model to test the theory: Qwen/Qwen3-0.6B-GGUF

llama-server -hf Qwen/Qwen3-0.6B-GGUF
-hf <user>/<model>
Hugging Face integration: downloads the model directly from the Hub

Basic usage

curl http://localhost:8080/v1/models | jq '.models[0].model | split("/")[-1][:-5]'
curl http://localhost:8080/v1/chat/completions -d @- << JSON | jq
{
    "messages": [
        {
            "role": "user",
            "content": "Why is the sky blue?"
        }
    ]
}
JSON
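
The server speaks the OpenAI chat completions API, so any OpenAI client works too. Here is a minimal sketch using the official openai Python package (assumes pip install openai; the api_key is a placeholder since the local server does not check it by default):

# Minimal sketch: point the OpenAI Python client at the local llama-server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.chat.completions.create(
    model="Qwen3-0.6B",  # name is informational; the server hosts the model you launched it with
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response.choices[0].message.content)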

Comparing with unsloth’s version

Unsloth is a team that specializes in fine-tuning and quantizing models for performance. Here is unsloth’s version of the same model: unsloth/Qwen3-0.6B-GGUF

llama-server -hf unsloth/Qwen3-0.6B-GGUF
curl http://localhost:8080/v1/chat/completions -d @- << JSON | jq
{
    "messages": [
        {
            "role": "user",
            "content": "Why is the sky blue?"
        }
    ]
}
JSON

To compare response times on a simple prompt, I ran the following script against both models:

# zsh script: relies on {1..$count} expansion and floating-point arithmetic
total=0; count=20; 
for n in $(echo {1..$count}); do 
    time=$(curl http://localhost:8080/v1/chat/completions -d @- -w "%{time_total}" -o /dev/null << JSON
    {
        "messages": [
            {
                "role": "user",
                "content": "Why is the sky blue?"
            }
        ]
    }
JSON
)
    ((total+=time))
    echo "${n} => ${time}"
done
echo "$(curl http://localhost:8080/v1/models | jq -r '.models[0].name | split("/")[-1][:-5]') Average = $((total/count))"

The script runs 20 requests, recording each response time to compute the average:

unsloth_Qwen3-0.6B-GGUF_Qwen3-0.6B-Q4_K_M Average = 2.819 seconds
Qwen_Qwen3-0.6B-GGUF_Qwen3-0.6B-Q8_0 Average = 3.781 seconds

Unsloth’s model is faster for this query, though note that the default downloads differ in quantization: unsloth’s is Q4_K_M while Qwen’s is Q8_0, so the smaller quant likely accounts for much of the gap.
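
If you prefer Python for this kind of measurement, here is an equivalent sketch (my own, not from the shell script above; assumes pip install requests and llama-server running on localhost:8080):

# Python version of the timing loop above.
import time
import requests

PAYLOAD = {"messages": [{"role": "user", "content": "Why is the sky blue?"}]}
COUNT = 20
total = 0.0

for n in range(1, COUNT + 1):
    start = time.perf_counter()
    requests.post("http://localhost:8080/v1/chat/completions", json=PAYLOAD)
    elapsed = time.perf_counter() - start
    total += elapsed
    print(f"{n} => {elapsed:.3f}")

print(f"Average = {total / COUNT:.3f}")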

Tool calling

Let’s run through a simple tool-calling agent loop manually to make sure we understand the entire workflow. Tool calling lets models interact with external functions and is part of the core loop of agent workflows. OpenAI published the tool-calling convention that most models support: Function Calling (aka Tool Calling)

1. The initial request

Restart llama-server with the --jinja flag, which enables the model’s chat template so the server can parse and emit tool calls:

llama-server -hf unsloth/Qwen3-0.6B-GGUF --jinja
curl http://localhost:8080/v1/chat/completions -d @- << JSON | jq
{
  "messages": [
    { "role": "system", "content": "user = {date_of_birth = 1981-05-20}" }, 
    { "role": "user", "content": "How many days until my birthday?" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "days_until_date",
        "description": "Count the number of days until a date.",
        "parameters": {
          "type": "object",
          "properties": {
            "month": { "type": "number", "description": "The target month" },
            "day": { "type": "number", "description": "The target day" }
          },
          "required": ["month", "day"]
        }
      }
    }
  ]
}
JSON

The LLM identifies the function to call and its arguments:

{
  "choices": [
    {
      "finish_reason": "tool_calls",
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "Okay, the user is asking how many days until their birthday. They provided the date_of_birth as 1981-05-20. Let me check if I can use the days_until_date function here.\n\nThe function requires the month and day parameters. The user's birthday is May 20th, so month is 5 and day is 20. I need to make sure those values are correctly passed into the function. Since the tool requires both, I'll construct the JSON object with those arguments. There's no need to do anything else here because the function is straightforward. Just call it with the provided month and day.",
        "content": null,
        "tool_calls": [
          {
            "type": "function",
            "function": {
              "name": "days_until_date",
              "arguments": "{\"month\":5,\"day\":20}"
            },
            "id": "ZPMxzd5LRZNr8avoiHZ8Id29WQAYOk2p"
          }
        ]
      }
    }
  ],
  "created": 1758051395,
  "model": "gpt-3.5-turbo",
  "system_fingerprint": "b6440-33daece8",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 165,
    "prompt_tokens": 204,
    "total_tokens": 369
  },
  "id": "chatcmpl-Vn4UMHatSSebXk0Z6JU0BF0d3AZxCLQ9",
  "timings": {
    "cache_n": 0,
    "prompt_n": 204,
    "prompt_ms": 165.767,
    "prompt_per_token_ms": 0.8125833333333333,
    "prompt_per_second": 1230.6430109732335,
    "predicted_n": 165,
    "predicted_ms": 1133.406,
    "predicted_per_token_ms": 6.8691272727272725,
    "predicted_per_second": 145.57890111751658
  }
}

2. The agent calls the tool

A Python script that implements the tool:

import datetime
import json
import sys

# Read the tool-call arguments, e.g. {"month": 5, "day": 20}, from stdin.
args = json.load(sys.stdin)

today = datetime.date.today()
future = datetime.date(today.year, args["month"], args["day"])

# If the date has already passed this year, count to next year's occurrence.
if future <= today:
    future = datetime.date(today.year + 1, args["month"], args["day"])

print((future - today).days)

The agent calls the tool with the LLM’s values:

echo '{"month":5, "day":20}' | python3 days_until_date.py => 246

3. The results of the tool call are fed back into the LLM:

curl http://localhost:8080/v1/chat/completions -d @- << JSON | jq
{
  "messages": [
    { "role": "system",                               "content": "user = {date_of_birth = 1981-05-20}" }, 
    { "role": "user",                                 "content": "How many days until my birthday?" },
    { "role": "tool", "tool_name": "days_until_date", "content": "246"  }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "days_until_date",
        "description": "Count the number of days until a date.",
        "parameters": {
          "type": "object",
          "properties": {
            "month": { "type": "number", "description": "The target month" },
            "day": { "type": "number", "description": "The target day" }
          },
          "required": ["month", "day"]
        }
      }
    }
  ]
}
JSON

The LLM responds to the user: 246 days until your birthday.

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": "Okay, the user asked how many days until their birthday, which is May 20, 1981. They provided the function \"days_until_date\" which requires the month and day parameters. I need to check if the user has provided both the month and day correctly. The user's birthday is given as 1981-05-20, so the month is 5 and the day is 20. I should call the function with these values. The response from the tool says 246 days, so I'll present that as the answer.",
        "content": "246 days until your birthday."
      }
    }
  ],
  "created": 1758053504,
  "model": "gpt-3.5-turbo",
  "system_fingerprint": "b6440-33daece8",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 134,
    "prompt_tokens": 216,
    "total_tokens": 350
  },
  "id": "chatcmpl-XxYDqYOrXSufN0b8L5Y8ypIKOeBoLGC3",
  "timings": {
    "cache_n": 202,
    "prompt_n": 14,
    "prompt_ms": 38.262,
    "prompt_per_token_ms": 2.733,
    "prompt_per_second": 365.8982802780827,
    "predicted_n": 134,
    "predicted_ms": 1071.797,
    "predicted_per_token_ms": 7.998485074626866,
    "predicted_per_second": 125.02367519222389
  }
}
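
Putting the three steps together, the whole loop fits in a short sketch. This is my own glue code, not part of llama.cpp; it assumes pip install requests and the days_until_date.py script above, and, like the step 3 request, it feeds only the tool result back rather than the full assistant message:

# Hypothetical end-to-end agent loop combining steps 1-3.
import subprocess
import requests

URL = "http://localhost:8080/v1/chat/completions"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "days_until_date",
        "description": "Count the number of days until a date.",
        "parameters": {
            "type": "object",
            "properties": {
                "month": {"type": "number", "description": "The target month"},
                "day": {"type": "number", "description": "The target day"},
            },
            "required": ["month", "day"],
        },
    },
}]

messages = [
    {"role": "system", "content": "user = {date_of_birth = 1981-05-20}"},
    {"role": "user", "content": "How many days until my birthday?"},
]

while True:
    resp = requests.post(URL, json={"messages": messages, "tools": TOOLS}).json()
    msg = resp["choices"][0]["message"]
    if not msg.get("tool_calls"):
        print(msg["content"])  # e.g. "246 days until your birthday."
        break
    for call in msg["tool_calls"]:
        out = subprocess.run(
            ["python3", "days_until_date.py"],
            input=call["function"]["arguments"],  # JSON string of arguments
            capture_output=True,
            text=True,
        ).stdout.strip()
        # Mirror the step 3 payload: feed the result back as a tool message.
        messages.append({"role": "tool", "tool_name": call["function"]["name"], "content": out})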

Conclusion

llama.cpp is now easy to use on its own, negating the need for the Ollama middle layer. I will continue working with it and explore unsloth’s optimized models when possible.
