This is a quick review of Ollama.
Tested on a MacBook Pro (M3, 36GB memory).
Table of Contents
- 1. installation
- 2. run ollama server
- 3. ollama cli
- 4. ollama server options
- 5. REST API
- 6. Python library
- 7. Web UI
1. installation
- Linux: https://ollama.com/download/linux
$ curl -fsSL https://ollama.com/install.sh | sh
- macOS: https://ollama.com/download/mac
- download the zip file and install it
- Requires macOS 11 Big Sur or later
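After installation, you can confirm the CLI is available; the --version flag should print the installed version:
$ ollama --version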
2. run ollama server
- Linux
$ ollama serve
- macOS
- run ollama.app from Spotlight, or from the Applications folder in Finder
- alternatively, run ollama serve from a Terminal
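Once the server is running, a quick way to check it is to query the root endpoint (11434 is Ollama's default port), which should reply with a short status message:
$ curl http://localhost:11434
Ollama is running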
3. ollama cli
The ollama command provides the following subcommands:
$ ollama
Usage:
ollama [flags]
ollama [command]
Available Commands:
serve Start ollama
create Create a model from a Modelfile
show Show information for a model
run Run a model
pull Pull a model from a registry
push Push a model to a registry
list List models
ps List running models
cp Copy a model
rm Remove a model
help Help about any command
3-1) list models
$ ollama list
Initially, no models are shown
3-2) load models
Let’s load the Llama3 8B model:
$ ollama run llama3:8b
You can find more models at https://ollama.com/library
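If you only want to download a model without starting an interactive session, the pull subcommand from the command list above does that:
$ ollama pull llama3:8b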
After loading the llama3:8b model, the ollama list command outputs:
$ ollama list
NAME         ID              SIZE     MODIFIED
llama3:8b    365c0bd3c000    4.7 GB   4 weeks ago
By default, Ollama pulls a 4-bit quantized model, so the size is only 4.7 GB.
Let’s ask Llama3 some questions:
$ ollama run llama3:8b
>>> Tell me about LLM
ollama prints:
LLM!
LLM stands for Large Language Model. It's a type of artificial intelligence
(AI) model that is specifically designed to process and generate human-like
language.
...
3-3) ollama options
Type /help to see the available commands:
>>> /help
Available Commands:
/set Set session variables
/show Show model information
/load <model> Load a session or model
/save <model> Save your current session
/clear Clear session context
/bye Exit
/?, /help Help for a command
/? shortcuts Help for keyboard shortcuts
Use """ to begin a multi-line message.
3-4) /set option
/set is one of the most powerful commands in ollama:
>>> /set
Available Commands:
/set parameter ... Set a parameter
/set system <string> Set system message
/set template <string> Set prompt template
/set history Enable history
/set nohistory Disable history
/set wordwrap Enable wordwrap
/set nowordwrap Disable wordwrap
/set format json Enable JSON mode
/set noformat Disable formatting
/set verbose Show LLM stats
/set quiet Disable LLM stats
/set system allows you to set the system prompt.
/set verbose prints execution stats, providing information like tokens/sec:
>>> /set verbose
Set 'verbose' mode.
>>> hi
Hi! Thanks for the compliment! I'm just an AI, but I try my best to be a good programmer too. I code in Python and Java, and I love solving problems and creating new projects. What about you? Do you have any programming experience or interests?
total duration:       2.354836125s
load duration:        3.509833ms
prompt eval duration: 292.179ms
prompt eval rate:     0.00 tokens/s
eval count:           56 token(s)
eval duration:        2.056319s
eval rate:            27.23 tokens/s
- Llama3 8B on an M3 with 36GB memory runs at about 27 tokens/sec, which isn’t bad at all.
/set nowordwrap is also useful.
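As another example, /set system takes a free-form string that stays in effect for the rest of the session (the prompt below is just an illustration):
>>> /set system "Answer every question in one short sentence."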
3-5) multi-line strings
User input is submitted as soon as you press Enter, which can be cumbersome for multi-line strings.
Use """ to begin a multi-line message.
>>> """
... Translate below to Korean
...
... Hi
... """
안녕하세요! (annyeonghaseyo)
4. ollama server options
There are several environment variables for the ollama server.
OLLAMA_KEEP_ALIVE
- default: 5m
- how long a loaded model stays in GPU memory; after this period, the model is automatically unloaded
- set to -1 if you want to disable auto-unloading
OLLAMA_MAX_LOADED_MODELS
- default: 1
- theoretically, we can load as many models as GPU memory allows
- but when OLLAMA_MAX_LOADED_MODELS is set to 1, only one model is loaded at a time (the previously loaded model is off-loaded from GPU); increase this value if you want to keep more models in GPU memory
OLLAMA_NUM_PARALLEL
- default: 1
- the number of requests handled in parallel for a single model
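For example, on Linux you can set these variables when starting the server (the values below are only an illustration):
$ OLLAMA_KEEP_ALIVE=-1 OLLAMA_MAX_LOADED_MODELS=2 ollama serve
On macOS, the Ollama FAQ suggests setting them via launchctl (e.g. launchctl setenv OLLAMA_KEEP_ALIVE -1) and then restarting ollama.app.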
5. REST API
The ollama CLI is powerful, but I don’t use it that often; I prefer a web UI, and these web UIs are built on top of the API.
5-1) Ollama's own API
You can find all available APIs on https://github.com/ollama/ollama/blob/main/docs/api.md
For example, /api/chat is used for chat completion.
Let’s call it via curl:
- request
curl http://localhost:11434/api/chat -d '{
  "model": "llama3:8b",
  "messages": [
    { "role": "user", "content": "hi" }
  ],
  "stream": false
}'
- response
{ "model": "llama3:8b", "created_at": "2024-07-07T04:15:19.393Z", "message": { "role": "assistant", "content": "Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?" }, "done_reason": "stop", "done": true, "total_duration": 3428811750, "load_duration": 2304526042, "prompt_eval_count": 11, "prompt_eval_duration": 196279000, "eval_count": 26, "eval_duration": 923565000 }
Streaming mode is also supported.
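For example, with "stream": true (which is actually the default), the server returns newline-delimited JSON chunks instead of a single object:
curl http://localhost:11434/api/chat -d '{
  "model": "llama3:8b",
  "messages": [
    { "role": "user", "content": "hi" }
  ],
  "stream": true
}'
Each line carries a partial assistant message, and the final line has "done": true along with the timing stats.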
5-2) OpenAI compatible API
Refer to https://github.com/ollama/ollama/blob/main/docs/openai.md
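For a quick check from the command line, the OpenAI-compatible chat endpoint described there can be called directly:
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b",
    "messages": [
      { "role": "user", "content": "hi" }
    ]
  }'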
6. Python library
6-1) Ollama library
https://github.com/ollama/ollama-python
- install
$ pip install ollama
- usage
import ollama

response = ollama.chat(model='llama3', messages=[
    {
        'role': 'user',
        'content': 'Why is the sky blue?',
    },
])
print(response['message']['content'])
- streaming mode
import ollama

stream = ollama.chat(
    model='llama3',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True,
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
6-2) OpenAI Library
https://github.com/ollama/ollama/blob/main/docs/openai.md
from openai import OpenAI
client = OpenAI(
    base_url='http://localhost:11434/v1/',
    # required but ignored
    api_key='ollama',
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            'role': 'user',
            'content': 'Say this is a test',
        }
    ],
    model='llama3',
)
7. Web UI
I’ve tested two web UIs.
- Open WebUI: https://github.com/open-webui/open-webui
- GitHub stars: 30k, forks: 3.3k (2024.07)
- pros: powerful features
- cons: (relatively) not easy to install
- I would recommend using Docker (see the example command after this list)
- Chatbot Ollama: https://github.com/ivanfioravanti/chatbot-ollama
- GitHub stars: 1.3k, forks: 209
- pros: easy to install
- cons: not actively developed
- supports basic features
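For reference, this is roughly the Docker command from the Open WebUI README at the time of writing (check the repository for the current version); it assumes Ollama is already running on the same host:
$ docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
    -v open-webui:/app/backend/data --name open-webui --restart always \
    ghcr.io/open-webui/open-webui:main
The UI is then available at http://localhost:3000.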