Going back to llama.cpp
Since I originally posted about using llama.cpp, managing your own models has gotten easier. I want to remove Ollama from my workflow and work with llama.cpp directly. This removes an extra dependency and a potential source of problems.
Install llama.cpp
brew install llama.cpp
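To confirm the install, the binaries should be able to print their version and build info:
llama-server --version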
Using Qwen/Qwen3-0.6B-GGUF
I will be using a smaller model to test the theory: Qwen/Qwen3-0.6B-GGUF
llama-server -hf Qwen/Qwen3-0.6B-GGUF
-hf <model> - Hugging Face integration to download models directly
Basic usage
curl http://localhost:8080/v1/models | jq '.models[0].model | split("/")[-1][:-5]'
curl http://localhost:8080/v1/chat/completions -d @- << JSON | jq
{
"messages": [
{
"role": "user",
"content": "Why is the sky blue?"
}
]
}
JSON
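To grab just the reply text, the same request can be sent with an inline body and a narrower jq filter, for example:
curl http://localhost:8080/v1/chat/completions -d '{"messages":[{"role":"user","content":"Why is the sky blue?"}]}' | jq -r '.choices[0].message.content'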
Comparing with unsloth’s version
Unsloth is a group that specializes in fine-tuning models for performance. Here is unsloth’s model: unsloth/Qwen3-0.6B-GGUF
llama-server -hf unsloth/Qwen3-0.6B-GGUF
curl http://localhost:8080/v1/chat/completions -d @- << JSON | jq
{
"messages": [
{
"role": "user",
"content": "Why is the sky blue?"
}
]
}
JSON
To compare a simple prompt, I ran the following script against both models:
# note: this relies on zsh (the macOS default shell); bash will not expand {1..$count} or handle the floating-point arithmetic in ((...))
total=0; count=20;
for n in $(echo {1..$count}); do
time=$(curl http://localhost:8080/v1/chat/completions -d @- -w "%{time_total}" -o /dev/null << JSON
{
"messages": [
{
"role": "user",
"content": "Why is the sky blue?"
}
]
}
JSON
)
((total+=time))
echo "${n} => ${time}"
done
echo "$(curl http://localhost:8080/v1/models | jq -r '.models[0].name | split("/")[-1][:-5]') Average = $((total/count))"
The script runs 20 requests and records the total time to calculate the average response time:
- unsloth_Qwen3-0.6B-GGUF_Qwen3-0.6B-Q4_K_M: 2.819 seconds
- Qwen_Qwen3-0.6B-GGUF_Qwen3-0.6B-Q8_0: 3.781 seconds
Unsloth’s model responds faster on this query, though note that the two default downloads use different quantizations (Q4_K_M vs Q8_0), which likely accounts for some of the gap.
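If I recall correctly, the -hf flag also accepts a :<quant> suffix, so a like-for-like comparison should be possible by pinning both servers to the same quantization, for example:
llama-server -hf unsloth/Qwen3-0.6B-GGUF:Q8_0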
Tool calling
Let’s run through a simple tool-calling agent loop manually to make sure the entire workflow is well understood. Tool calling lets models interact with external functions and is part of the core loop of agent workflows. OpenAI published the tool-calling format that most models now support: Function Calling (aka Tool Calling)
1. The initial request
llama-server -hf unsloth/Qwen3-0.6B-GGUF --jinja
--jinja - enables the model’s chat template, which llama-server needs to format and parse tool calls
curl http://localhost:8080/v1/chat/completions -d @- << JSON | jq
{
"messages": [
{ "role": "system", "content": "user = {date_of_birth = 1981-05-20}" },
{ "role": "user", "content": "How many days until my birthday?" }
],
"tools": [
{
"type": "function",
"function": {
"name": "days_until_date",
"description": "Count the number of days until a date.",
"parameters": {
"type": "object",
"properties": {
"month": { "type": "number", "description": "The target month" },
"day": { "type": "number", "description": "The target day" }
},
"required": ["month", "day"]
}
}
}
]
}
JSON
The LLM identifies the function to call and the parameters to pass:
{
"choices": [
{
"finish_reason": "tool_calls",
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": "Okay, the user is asking how many days until their birthday. They provided the date_of_birth as 1981-05-20. Let me check if I can use the days_until_date function here.\n\nThe function requires the month and day parameters. The user's birthday is May 20th, so month is 5 and day is 20. I need to make sure those values are correctly passed into the function. Since the tool requires both, I'll construct the JSON object with those arguments. There's no need to do anything else here because the function is straightforward. Just call it with the provided month and day.",
"content": null,
"tool_calls": [
{
"type": "function",
"function": {
"name": "days_until_date",
"arguments": "{\"month\":5,\"day\":20}"
},
"id": "ZPMxzd5LRZNr8avoiHZ8Id29WQAYOk2p"
}
]
}
}
],
"created": 1758051395,
"model": "gpt-3.5-turbo",
"system_fingerprint": "b6440-33daece8",
"object": "chat.completion",
"usage": {
"completion_tokens": 165,
"prompt_tokens": 204,
"total_tokens": 369
},
"id": "chatcmpl-Vn4UMHatSSebXk0Z6JU0BF0d3AZxCLQ9",
"timings": {
"cache_n": 0,
"prompt_n": 204,
"prompt_ms": 165.767,
"prompt_per_token_ms": 0.8125833333333333,
"prompt_per_second": 1230.6430109732335,
"predicted_n": 165,
"predicted_ms": 1133.406,
"predicted_per_token_ms": 6.8691272727272725,
"predicted_per_second": 145.57890111751658
}
}
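For the agent, only the function name and its arguments matter; swapping the plain jq at the end of the request above for a narrower filter pulls just those out:
jq '.choices[0].message.tool_calls[0].function'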
2. Agent calls the tool
A Python script (days_until_date.py) that implements the tool:
import datetime
import json
import sys

# Read the tool-call arguments ({"month": ..., "day": ...}) from stdin.
args = json.load(sys.stdin)

today = datetime.date.today()
future = datetime.date(today.year, args["month"], args["day"])
if future > today:
    # The date is still ahead of us this year.
    diff = future - today
    print(diff.days)
else:
    # Already passed this year, so count to next year's occurrence.
    future = datetime.date(today.year + 1, args["month"], args["day"])
    diff = future - today
    print(diff.days)
The agent calls the tool with the LLM’s values:
echo '{"month":5, "day":20}' | python3 days_until_date.py => 246
3. The results of the tool call are fed back into the LLM:
curl http://localhost:8080/v1/chat/completions -d @- << JSON | jq
{
"messages": [
{ "role": "system", "content": "user = {date_of_birth = 1981-05-20}" },
{ "role": "user", "content": "How many days until my birthday?" },
{ "role": "tool", "tool_name": "days_until_date", "content": "246" }
],
"tools": [
{
"type": "function",
"function": {
"name": "days_until_date",
"description": "Count the number of days until a date.",
"parameters": {
"type": "object",
"properties": {
"month": { "type": "number", "description": "The target month" },
"day": { "type": "number", "description": "The target day" }
},
"required": ["month", "day"]
}
}
}
]
}
JSON
The LLM responds to the user: 246 days until your birthday.
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"message": {
"role": "assistant",
"reasoning_content": "Okay, the user asked how many days until their birthday, which is May 20, 1981. They provided the function \"days_until_date\" which requires the month and day parameters. I need to check if the user has provided both the month and day correctly. The user's birthday is given as 1981-05-20, so the month is 5 and the day is 20. I should call the function with these values. The response from the tool says 246 days, so I'll present that as the answer.",
"content": "246 days until your birthday."
}
}
],
"created": 1758053504,
"model": "gpt-3.5-turbo",
"system_fingerprint": "b6440-33daece8",
"object": "chat.completion",
"usage": {
"completion_tokens": 134,
"prompt_tokens": 216,
"total_tokens": 350
},
"id": "chatcmpl-XxYDqYOrXSufN0b8L5Y8ypIKOeBoLGC3",
"timings": {
"cache_n": 202,
"prompt_n": 14,
"prompt_ms": 38.262,
"prompt_per_token_ms": 2.733,
"prompt_per_second": 365.8982802780827,
"predicted_n": 134,
"predicted_ms": 1071.797,
"predicted_per_token_ms": 7.998485074626866,
"predicted_per_second": 125.02367519222389
}
}
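Putting the three steps together, here is a minimal sketch of the whole loop as a single Python script. It is only an illustration under a few assumptions: llama-server is still running on localhost:8080 with --jinja, the tool script above is saved as days_until_date.py in the current directory, and the tool-result message uses the same shape as in step 3.
import json
import subprocess
import urllib.request

URL = "http://localhost:8080/v1/chat/completions"

# Same tool definition as in the requests above.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "days_until_date",
        "description": "Count the number of days until a date.",
        "parameters": {
            "type": "object",
            "properties": {
                "month": {"type": "number", "description": "The target month"},
                "day": {"type": "number", "description": "The target day"},
            },
            "required": ["month", "day"],
        },
    },
}]


def chat(messages):
    # POST the messages plus tool definitions, return the assistant message.
    body = json.dumps({"messages": messages, "tools": TOOLS}).encode()
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]


messages = [
    {"role": "system", "content": "user = {date_of_birth = 1981-05-20}"},
    {"role": "user", "content": "How many days until my birthday?"},
]

# Step 1: the initial request.
reply = chat(messages)

# Step 2: run any tool the model asked for and append the result.
for call in reply.get("tool_calls") or []:
    if call["function"]["name"] == "days_until_date":
        result = subprocess.run(
            ["python3", "days_until_date.py"],
            input=call["function"]["arguments"],
            capture_output=True,
            text=True,
        ).stdout.strip()
        messages.append(
            {"role": "tool", "tool_name": "days_until_date", "content": result}
        )
        # Step 3: feed the tool result back for the final answer.
        reply = chat(messages)

print(reply["content"])
A real agent would keep looping until the model stops requesting tools, but a single pass is enough for this example.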
Conclusion
llama.cpp is now easy to use directly, removing the need for Ollama as a middle layer. I will keep working with it and explore unsloth’s optimized models where possible.