For `/api/show`, a fully missing `model_info` field trips up various
integrators (including a recent Android Studio integration).
The missing info mostly comes from remote models that are also missing
other metadata. It seems better to me to return an empty `model_info`
than to make up fields within `model_info` (like saying the architecture
is `remote` or something like that). So this does slightly change
`/api/show`'s behavior, which someone may be relying on, but it seems
more important to ensure the field is always there (from a quick
sampling, integrations seem to be robust to missing fields _within_ it).
Fixes: https://github.com/ollama/ollama/issues/13783
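A minimal sketch of the intended behavior, assuming the response carries `model_info` as a generic map; the `ShowResponse` struct and `newShowResponse` helper here are illustrative stand-ins, not the exact types in the server package:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ShowResponse is a stand-in for the /api/show response type; only the
// field relevant to this change is shown.
type ShowResponse struct {
	ModelInfo map[string]any `json:"model_info"`
}

// newShowResponse always populates model_info, returning an empty object
// rather than omitting the field or inventing placeholder values
// (e.g. architecture: "remote") when no metadata is available.
func newShowResponse(info map[string]any) ShowResponse {
	if info == nil {
		info = map[string]any{}
	}
	return ShowResponse{ModelInfo: info}
}

func main() {
	out, _ := json.Marshal(newShowResponse(nil))
	fmt.Println(string(out)) // {"model_info":{}}
}
```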
Move the unload check (empty prompt + KeepAlive=0) before the image
generation model dispatch in GenerateHandler. This prevents models like
flux from being loaded into memory just to be immediately unloaded when
running `ollama rm`.
Also fix a bug in DeleteHandler where `args[0]` was used instead of
`arg` in the delete loop, causing only the first model to be unloaded
when deleting multiple models.
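A minimal sketch of the delete-loop fix, with `unloadModel` and `deleteModels` as hypothetical stand-ins for the real unload call and handler loop:

```go
package main

import "fmt"

// unloadModel is a hypothetical stand-in for the server's unload request.
func unloadModel(name string) { fmt.Println("unloading", name) }

// deleteModels unloads every model named on the command line. The original
// loop passed args[0] on each iteration, so only the first model was
// unloaded when several were deleted at once.
func deleteModels(args []string) {
	for _, arg := range args {
		unloadModel(arg) // previously: unloadModel(args[0])
	}
}

func main() {
	deleteModels([]string{"flux", "llama3.2", "qwen3"})
}
```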
The loadImageGen function was not setting Options on the runnerRef,
causing needsReload() to always return true (since it checks if
runner.Options == nil). This resulted in the image generation
subprocess being killed and restarted for every request.
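A rough sketch of why the missing field forced a reload on every request; the types and the extra `NumCtx` comparison are simplified stand-ins for the scheduler's actual checks:

```go
package main

import "fmt"

// Options and runnerRef are simplified stand-ins for the scheduler types.
type Options struct{ NumCtx int }

type runnerRef struct {
	Options *Options
}

// needsReload reports whether the runner must be restarted. If Options is
// never set when the image generation runner is created, this always
// returns true and the subprocess is killed and respawned per request.
func (r *runnerRef) needsReload(opts *Options) bool {
	return r.Options == nil || r.Options.NumCtx != opts.NumCtx
}

func main() {
	opts := &Options{NumCtx: 4096}
	broken := &runnerRef{}             // Options left nil: reloads every time
	fixed := &runnerRef{Options: opts} // Options set at load time
	fmt.Println(broken.needsReload(opts), fixed.needsReload(opts)) // true false
}
```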
* x: make `ollama create --experimental` import from safetensors
This change allows importing safetensors models into the new experimental model format, and also
fixes the `ollama show` command so it correctly displays the model information.
* gofumpt the linter
* gofumpt the linter again
* validate the model name
Added validation to ensure auth redirects stay on the same host as the original request. The fix is a single check in getAuthorizationToken comparing the realm URL's host against the request host. Added tests for the auth flow.
Co-Authored-By: Gecko Security <188164982+geckosecurity@users.noreply.github.com>
* gofmt
---------
Co-authored-by: Gecko Security <188164982+geckosecurity@users.noreply.github.com>
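A sketch of the check described above, assuming the realm comes from the registry's auth challenge; the `validateRealmHost` name and error message are illustrative rather than the exact code in getAuthorizationToken:

```go
package main

import (
	"fmt"
	"net/url"
)

// validateRealmHost rejects auth realms that would redirect the token
// request to a different host than the one the original request targeted.
func validateRealmHost(requestHost, realm string) error {
	u, err := url.Parse(realm)
	if err != nil {
		return err
	}
	if u.Host != requestHost {
		return fmt.Errorf("auth realm host %q does not match request host %q", u.Host, requestHost)
	}
	return nil
}

func main() {
	fmt.Println(validateRealmHost("registry.ollama.ai", "https://attacker.example/token"))
	fmt.Println(validateRealmHost("registry.ollama.ai", "https://registry.ollama.ai/token"))
}
```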
TeaCache:
- Timestep embedding similarity caching for diffusion models
- Polynomial rescaling with configurable thresholds
- Reduces transformer forward passes by ~30-50% (sketched below)
FP8 quantization:
- Support for FP8 quantized models (8-bit weights with scales)
- QuantizedMatmul on Metal, Dequantize on CUDA
- Client-side quantization via `ollama create --quantize fp8`
Other bug fixes:
- Fix `/api/show` API for image generation models
- Server properly returns model info (architecture, parameters, quantization)
- Memory allocation optimizations
- CLI improvements for image generation
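A rough sketch of the TeaCache skip decision listed above, assuming the relative change between successive timestep embeddings is rescaled by a fitted polynomial and accumulated until it crosses a threshold; the names, coefficients, and threshold here are illustrative, not the values shipped in the runtime:

```go
package main

import (
	"fmt"
	"math"
)

// teaCache accumulates a rescaled distance between successive timestep
// embeddings and skips the transformer forward pass while the accumulated
// value stays under a threshold, reusing the previously cached output.
type teaCache struct {
	threshold   float64
	coeffs      []float64 // polynomial rescaling coefficients (illustrative)
	accumulated float64
	prevEmb     []float64
}

// relativeL1 is the mean absolute change normalized by the previous magnitude.
func relativeL1(prev, cur []float64) float64 {
	var num, den float64
	for i := range prev {
		num += math.Abs(cur[i] - prev[i])
		den += math.Abs(prev[i])
	}
	if den == 0 {
		return 0
	}
	return num / den
}

// poly evaluates the rescaling polynomial at x using Horner's method.
func (t *teaCache) poly(x float64) float64 {
	y := 0.0
	for _, c := range t.coeffs {
		y = y*x + c
	}
	return y
}

// shouldSkip reports whether the cached output can be reused for this step.
func (t *teaCache) shouldSkip(emb []float64) bool {
	if t.prevEmb == nil {
		t.prevEmb = emb
		return false // always run the first step
	}
	t.accumulated += t.poly(relativeL1(t.prevEmb, emb))
	t.prevEmb = emb
	if t.accumulated < t.threshold {
		return true // similar enough: reuse cached output
	}
	t.accumulated = 0
	return false
}

func main() {
	tc := &teaCache{threshold: 0.25, coeffs: []float64{1, 0}} // identity rescaling
	fmt.Println(tc.shouldSkip([]float64{1, 1}))    // false (first step)
	fmt.Println(tc.shouldSkip([]float64{1.01, 1})) // true (tiny change, skip)
	fmt.Println(tc.shouldSkip([]float64{2, 0.2}))  // false (large change, recompute)
}
```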
RemoveLayers was calling Manifests() for each layer to check if it was
shared with other models. For models with many blobs (e.g., tensor
models), this caused O(N*M) manifest reads.
Now it loads the manifests once and builds a set of in-use digests.
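A sketch of the new shape of the check, with simplified stand-ins for the manifest and layer types:

```go
package main

import "fmt"

// Layer and Manifest are simplified stand-ins for the server's types.
type Layer struct{ Digest string }
type Manifest struct{ Layers []Layer }

// inUseDigests collects the digests referenced by any manifest in one pass,
// instead of re-reading every manifest for every candidate layer.
func inUseDigests(manifests []Manifest) map[string]struct{} {
	set := make(map[string]struct{})
	for _, m := range manifests {
		for _, l := range m.Layers {
			set[l.Digest] = struct{}{}
		}
	}
	return set
}

// removableLayers returns the candidate layers not shared with other models.
func removableLayers(candidates []Layer, manifests []Manifest) []Layer {
	inUse := inUseDigests(manifests)
	var out []Layer
	for _, l := range candidates {
		if _, ok := inUse[l.Digest]; !ok {
			out = append(out, l)
		}
	}
	return out
}

func main() {
	others := []Manifest{{Layers: []Layer{{Digest: "sha256:aaa"}}}}
	mine := []Layer{{Digest: "sha256:aaa"}, {Digest: "sha256:bbb"}}
	fmt.Println(removableLayers(mine, others)) // only sha256:bbb is removable
}
```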
* api: add Anthropic Messages API compatibility layer
Add middleware to support the Anthropic Messages API format at /v1/messages.
This enables tools like Claude Code to work with local and cloud Ollama models through the
Anthropic API interface.
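An illustrative client request against the new endpoint, assuming the default local server on port 11434 and a placeholder model name; the body follows the Anthropic Messages format:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Anthropic Messages-style request sent to Ollama's compatibility endpoint.
	body := []byte(`{
	  "model": "qwen3",
	  "max_tokens": 256,
	  "messages": [{"role": "user", "content": "Why is the sky blue?"}]
	}`)

	resp, err := http.Post("http://localhost:11434/v1/messages", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```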
* WIP - MLX backend with gemma3
* MLX: add cmake and go tag build toggles
To build the new MLX backend code:
cmake --preset MLX
cmake --build --preset MLX --parallel
cmake --install build --component MLX
go build -tags mlx .
Note: the main.go entrypoint for the MLX engine will change in a follow up commit.
* add experimental image generation runtime
* MLX: wire up cuda build for linux
* MLX: get dependencies correct and dedup
This is still too large for a unified github artifact, but is now "correct" for the mlx_cuda_v13
directory.
* fix relative link bug in dedup
* Add darwin build and readme
* add go build tag for mlx dependent code and wire up build_darwin.sh
* lint cleanup
* macos: build mlx for x86
This will be CPU only.
* cuda build instructions and fix drift from mlx bump
* stale comment
* Delete agent helper doc
* Clean up readme.md
* Revise README for tokenizer clarity and details
Updated README to clarify tokenizer functionality and removed correctness section.
---------
Co-authored-by: jmorganca <jmorganca@gmail.com>
* preserve tool definition and call JSON ordering
This is another iteration of
<https://github.com/ollama/ollama/pull/12518>, but this time we've
simplified things by relaxing the competing requirements of being
compatible AND order-preserving with templates (vs. renderers). We
maintain backwards compatibility at the cost of not guaranteeing order
for templates. We plan on moving more and more models to renderers,
which have been updated to use these new data types, and we could
additionally add an opt-in way for templates to get an order-preserved
list (e.g., via sibling template vars). A rough sketch of such an
order-preserving map is included below.
* orderedmap_test: remove testify
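A minimal sketch of an insertion-order-preserving map with custom JSON marshaling, assuming the goal is to keep tool definition and tool call keys in the order the client sent them; `OrderedMap` here is illustrative, not the actual type in the repo:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// OrderedMap keeps keys in insertion order and marshals them in that order,
// unlike a plain map[string]any, which encoding/json sorts alphabetically.
type OrderedMap struct {
	keys   []string
	values map[string]any
}

// Set inserts or updates a key, remembering first-insertion order.
func (m *OrderedMap) Set(k string, v any) {
	if m.values == nil {
		m.values = map[string]any{}
	}
	if _, ok := m.values[k]; !ok {
		m.keys = append(m.keys, k)
	}
	m.values[k] = v
}

// MarshalJSON writes the entries in insertion order.
func (m OrderedMap) MarshalJSON() ([]byte, error) {
	var buf bytes.Buffer
	buf.WriteByte('{')
	for i, k := range m.keys {
		if i > 0 {
			buf.WriteByte(',')
		}
		key, err := json.Marshal(k)
		if err != nil {
			return nil, err
		}
		buf.Write(key)
		buf.WriteByte(':')
		val, err := json.Marshal(m.values[k])
		if err != nil {
			return nil, err
		}
		buf.Write(val)
	}
	buf.WriteByte('}')
	return buf.Bytes(), nil
}

func main() {
	var m OrderedMap
	m.Set("type", "function")
	m.Set("name", "get_weather")
	m.Set("arguments", map[string]any{"city": "Toronto"})
	out, _ := json.Marshal(m)
	fmt.Println(string(out)) // keys stay in insertion order
}
```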
The normalize function now checks for NaN and Inf values in the
embedding vector before processing. This prevents JSON encoding
failures when models produce invalid floating-point values.
Fixes #13572
Signed-off-by: majiayu000 <1835304752@qq.com>
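One way to guard the normalization step, sketched below; the actual fix may handle invalid values differently, and the `normalize` signature here is illustrative:

```go
package main

import (
	"fmt"
	"math"
)

// normalize L2-normalizes an embedding, rejecting NaN/Inf components so the
// result can always be JSON-encoded.
func normalize(vec []float32) ([]float32, error) {
	var sum float64
	for _, v := range vec {
		f := float64(v)
		if math.IsNaN(f) || math.IsInf(f, 0) {
			return nil, fmt.Errorf("embedding contains invalid value %v", v)
		}
		sum += f * f
	}
	norm := math.Sqrt(sum)
	if norm == 0 {
		return vec, nil
	}
	out := make([]float32, len(vec))
	for i, v := range vec {
		out[i] = float32(float64(v) / norm)
	}
	return out, nil
}

func main() {
	fmt.Println(normalize([]float32{3, 4}))               // [0.6 0.8] <nil>
	fmt.Println(normalize([]float32{float32(math.NaN())})) // [] error
}
```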
Moved the ConfigV2 and RootFS types from server/images.go to a new types/model/config.go file under the model package. Updated all references to use model.ConfigV2 and model.RootFS. This allows the types to be used in other projects without having to compile the C code in the llama package.
Currently for both the old and new engines, there is code to
calculate how much memory is required for a model and lay out
the layers onto GPUs. This reuses the new engine's layout code
for the old engine as well, bringing them closer together. The
old engine continues to use its current method of estimating
required memory.
This reduces maintenance effort and improves consistency, as new
features only need to be implemented in one place. The newer code
is also more accurate, especially with multiple GPUs.
Adds logprobs support to Ollama's API, including Ollama's
OpenAI-compatible API. By specifying the new 'logprobs' boolean parameter
in the API, Ollama will return the log probabilities for each generated
token. An integer 'top_logprobs' parameter can also be specified, up to a
value of 20; when set, the API also returns that many of the most likely
tokens at each token position.
Co-authored-by: Baptiste Jamin <baptiste@crisp.chat>
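An illustrative request using the new parameters, assuming the default local endpoint and a placeholder model name; the field placement follows the description above:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Generate request asking for per-token log probabilities and the
	// top 5 alternatives at each position.
	body := []byte(`{
	  "model": "llama3.2",
	  "prompt": "Why is the sky blue?",
	  "stream": false,
	  "logprobs": true,
	  "top_logprobs": 5
	}`)

	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```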
* app: add code for macOS and Windows apps under 'app'
* app: add readme
* app: windows and linux only for now
* ci: fix ui CI validation
---------
Co-authored-by: jmorganca <jmorganca@gmail.com>
On main, the `RENDERER` and `PARSER` fields from the `Modelfile` don't
get propagated to a new model created with a `req.From` parameter. This
is easily triggered via `ollama run qwen3-coder`, then running some save
command like `/save qwen3-coder-custom`.
Added a regression test for this, and the fix opens the config for the
"from" model and uses its renderer/parser as defaults for the new model.
This fixes both CLI and API-based creates.
Fixes: https://github.com/ollama/ollama/issues/12792
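A minimal sketch of the defaulting step, using illustrative config field and function names:

```go
package main

import "fmt"

// ModelConfig is a simplified stand-in for the model configuration.
type ModelConfig struct {
	Renderer string
	Parser   string
}

// applyFromDefaults fills in renderer/parser from the "from" model's config
// when the new model doesn't set them explicitly.
func applyFromDefaults(cfg, from ModelConfig) ModelConfig {
	if cfg.Renderer == "" {
		cfg.Renderer = from.Renderer
	}
	if cfg.Parser == "" {
		cfg.Parser = from.Parser
	}
	return cfg
}

func main() {
	from := ModelConfig{Renderer: "qwen3-coder", Parser: "qwen3-coder"}
	fmt.Println(applyFromDefaults(ModelConfig{}, from))
}
```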
Currently, checking the length of prompts for embeddings to ensure
they fit in the context window (and truncating them if necessary) occurs
in two places: the Ollama server and the runner. This can lead to
inconsistencies in both the checks and the reported number of tokens
processed. Since we have to do this processing in the runner anyway, this
consolidates all of the logic there.
* DRY out the runner lifecycle code
Now that discovery uses the runners as well, this unifies the runner spawning code
into a single place. This also unifies GPU discovery types with the newer ml.DeviceInfo.
* win: make incremental builds better
Place build artifacts in discrete directories so incremental builds don't have to start fresh
* Adjust sort order to consider iGPUs
* handle cpu inference oom scenarios
* review comments
* test: harden scheduler tests
This removes reschedDelay, which was stale code, and adds a
configurable timeout for waitForVRAMRecovery so tests can set the
timeout to be very short, avoiding the scheduler getting stuck and
hitting a test timeout.
* test: tune tests for partial loads
Give stress tests more time when the model is split between CPU/GPU