I tested whether LLM-jp-4, released by NII, could run locally on a MacBook Pro M4 Max 128GB and be used from an existing batch pipeline through Ollama’s OpenAI-compatible API.

The short version is:

  • 8B is practical on this machine.
  • 32B-A3B also runs locally on this class of hardware.
  • However, with the GGUF + Ollama setup I used here, 32B-A3B was less stable in extraction quality than 8B.

What I wanted to do

The goal was not just chatting locally. I wanted to:

  • run LLM-jp-4 as a local server
  • call it from external client software
  • integrate it into an existing extraction batch through an OpenAI-compatible API

The target task was extracting facial-expression and emotion descriptions from Japanese OCR text.

Conclusion

At this stage, my practical conclusions were:

  • LLM-jp-4 8B runs comfortably on a MacBook Pro M4 Max 128GB
  • LLM-jp-4 32B-A3B also runs locally in quantized form
  • Ollama is the fastest route if you want an OpenAI-compatible API
  • I did not use the original Hugging Face weights directly here; I used GGUF conversions
  • For this extraction task, 8B was more stable than 32B-A3B in my local setup

That last point matters. The official LLM-jp-4 release is intended for transformers, while my local setup used GGUF conversions through Ollama. So this was not a “pure official implementation” benchmark.

Environment

  • Machine: MacBook Pro M4 Max 128GB
  • Runtime: Ollama 0.20.3
  • API: http://localhost:11434/v1/
  • Model: llm-jp-4-8b-thinking-Q4_K_M.gguf

The 8B Q4_K_M file is about 5.3GB, which makes it quite manageable for local testing.

Installing the model

I installed Ollama via Homebrew:

brew install ollama
brew services start ollama

Then I downloaded a GGUF file:

curl -L --fail -o ~/git/llm/ollama/models/llm-jp-4-8b-thinking-Q4_K_M.gguf \
  'https://huggingface.co/mmnga-o/llm-jp-4-8b-thinking-gguf/resolve/main/llm-jp-4-8b-thinking-Q4_K_M.gguf?download=true'

Next, I wrote a Modelfile (llm-jp-4-8b-q4.Modelfile):

FROM /Users/yourname/git/llm/ollama/models/llm-jp-4-8b-thinking-Q4_K_M.gguf

PARAMETER num_ctx 8192
PARAMETER temperature 0.7

Finally, I registered the model with Ollama:

ollama create llm-jp-4-8b-q4 -f ~/git/llm/ollama/modelfiles/llm-jp-4-8b-q4.Modelfile

At that point the model was available through Ollama as llm-jp-4-8b-q4.

Calling it through an OpenAI-compatible API

The endpoint was:

base_url = "http://localhost:11434/v1/"
api_key = "ollama"

Using the openai SDK looked like this:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="llm-jp-4-8b-q4",
    messages=[
        {"role": "user", "content": "This is a test. Return JSON only."}
    ],
)

print(response.choices[0].message.content)

A test script for the extraction batch

Instead of modifying the production code directly, I created a separate test script.

  • Input: /Users/nakamura/git/ndl/face/data/886000
  • Output: JSON
  • Backend: local Ollama API

At first I only instructed the model to “return a JSON array,” but in practice it often returned extra text, code fences, or invalid placeholders such as a literal "...".

So I tightened the setup with:

  • response_format using JSON Schema
  • temperature=0
  • per-chunk streaming logs
  • per-chunk timing
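As a sketch, the tightened request can be expressed as a plain dict of keyword arguments for the SDK. The schema fields below are illustrative, not the exact production schema, and this assumes the Ollama endpoint accepts response_format with a JSON Schema (supported in recent versions):

```python
# Illustrative extraction schema; field names are examples, not the
# exact production schema.
EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "surface": {"type": "string"},
                    "emotion": {"type": "string"},
                    "pages": {"type": "array", "items": {"type": "integer"}},
                },
                "required": ["surface", "emotion", "pages"],
            },
        }
    },
    "required": ["items"],
}

def build_request(chunk_text: str) -> dict:
    """Build kwargs for client.chat.completions.create with constrained output."""
    return {
        "model": "llm-jp-4-8b-q4",
        "temperature": 0,  # deterministic decoding for extraction
        "messages": [
            {"role": "user", "content": chunk_text},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "extraction", "schema": EXTRACTION_SCHEMA},
        },
    }
```

The actual call is then client.chat.completions.create(**build_request(chunk)), with stream=True added for the per-chunk streaming logs.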

Why structured output was necessary

For this kind of extraction task, prompt-only JSON constraints are not enough. Local models often:

  • prepend explanations
  • wrap output in code fences
  • omit required fields
  • generate unexpected field values

Using response_format with a JSON Schema did not make the output perfect, but it made it much less fragile.
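Even with the schema in place, a defensive parser on the client side helps. This is my own fallback logic as a minimal sketch, not part of Ollama or the SDK:

```python
import json
import re

def parse_model_json(text: str):
    """Best-effort parse of a model response that should be a JSON array.

    Strips markdown code fences and surrounding prose before parsing.
    Returns None if no valid JSON array can be recovered.
    """
    # Prefer the contents of a ```json ... ``` fence if one is present.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    # Fall back to the outermost [...] span.
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end == -1 or end < start:
        return None
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, list) else None
```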

Measurements

I first tested a single page:

chunk 1/1 pages=8-8 chars=1116
  first_token=1.2s
  streaming=5.0s chars=287
  completed=5.1s response_chars=290
  extracted=1

Under these conditions, one page took about 5s.

Then I tested 5 pages. I eventually reduced chunk_size to 1200 so I could see progress more clearly.

chunk 1/5 pages=8-8 chars=1116
  first_token=1.2s
  streaming=5.0s chars=287
  completed=5.1s response_chars=290
  extracted=1
chunk 2/5 pages=9-9 chars=836
  first_token=1.0s
  completed=3.6s response_chars=302
  extracted=1
chunk 3/5 pages=10-10 chars=1113
  first_token=2.1s
  streaming=5.0s chars=256
  completed=7.1s response_chars=356
  extracted=1
chunk 4/5 pages=11-11 chars=869
  first_token=1.7s
  streaming=5.0s chars=365
  streaming=10.0s chars=867
  streaming=15.0s chars=1402
  completed=15.8s response_chars=1457
  extracted=5
chunk 5/5 pages=12-12 chars=777
  first_token=1.6s
  completed=4.3s response_chars=262
  extracted=1
saved=/Users/nakamura/git/llm/outputs/886000_extraction_test_5pages.json
total=9

This made one thing obvious: processing 5 pages does not simply take 5 × 5 s.

The main reasons are:

  • complexity depends on the page content, not just the page count
  • structured output generation tends to be slower
  • if a chunk yields more extracted items, the response becomes longer

In my run, the chunk for page 11 alone took 15.8s.
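For reference, the page-based chunking used above can be sketched like this. This is a hypothetical reimplementation, not the production code:

```python
def chunk_pages(pages, chunk_size=1200):
    """Group (page_no, text) pairs into chunks of at most chunk_size chars.

    Pages are never split mid-text; a single page longer than chunk_size
    becomes its own chunk. With chunk_size=1200, the pages in the logs
    above (777-1116 chars each) end up one per chunk.
    """
    chunks, current, current_len = [], [], 0
    for page_no, text in pages:
        if current and current_len + len(text) > chunk_size:
            chunks.append(current)
            current, current_len = [], 0
        current.append((page_no, text))
        current_len += len(text)
    if current:
        chunks.append(current)
    return chunks
```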

Output quality

The output was valid JSON often enough to save it, but the content quality was still rough.

Typical problems included:

  • surface or context being replaced with ...
  • empty pages arrays
  • unexpected body_part values
  • character values being overly long or noisy

So yes, 8B was usable as an API. But no, it was not production-ready for extraction quality as-is.
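One cheap mitigation for these quality problems is to post-filter records against the source text before saving them. A sketch, where the allowed body-part vocabulary is illustrative, not the actual label set used in the pipeline:

```python
# Illustrative vocabulary; the real pipeline's label set may differ.
ALLOWED_BODY_PARTS = {"目", "口", "眉", "頬", "涙", "顔全体"}

def looks_grounded(item: dict, source_text: str) -> bool:
    """Reject records whose surface string is a placeholder, uses an
    unknown body_part, or does not occur in the OCR source text."""
    surface = item.get("surface", "")
    if not surface or surface == "...":
        return False
    if item.get("body_part") not in ALLOWED_BODY_PARTS:
        return False
    return surface in source_text
```

This does not fix the model's output, but it keeps placeholder and hallucinated records out of the saved JSON.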

Is 32B-A3B realistic on this machine?

Yes. On this hardware, it is realistic.

But “it runs” and “it is practical” are different questions. Even 8B spins up the fans under continuous inference. 32B-A3B Q4_K_M is noticeably heavier.

I also tested 32B-A3B Q4_K_M locally on a single page:

chunk 1/1 pages=8-8 chars=1116
  first_token=7.9s
  streaming=12.9s chars=246
  completed=13.0s response_chars=251
  extracted=1

That log alone suggests the model is slow but usable. The real problem was content quality. The response was JSON, but not well grounded in the source text. In some cases:

  • surface and context were hallucinated strings
  • pages came back empty
  • values drifted away from the actual OCR text

One example looked like this:

[
  {
    "surface": "あやま後から〓〓さいづちあたぶお才槌頭ハちいてほ%",
    "context": "あやま後から〓〓さいづちあたぶお才槌頭ハちいてほ%",
    "emotion": "その他",
    "body_part": "涙",
    "valence": "ambiguous",
    "intensity": "low",
    "character": "あやま後から〓〓さいづちあたぶお才槌頭ハちいてほ%",
    "pages": []
  }
]

When I tightened the constraints even further, I got outputs like this:

[
  {
    "surface": "/c/co/n/t/e/x/t/",
    "context": "/c/co/n/t/e/x/t/",
    "emotion": "悲しみ・哀れ",
    "body_part": "口",
    "valence": "ambiguous",
    "intensity": "low",
    "character": "ださいづちあたぶお才槌頭ハちいてほ%",
    "pages": [8]
  }
]

So in this GGUF + Ollama setup, 32B-A3B definitely ran, but it did not give me better extraction quality than 8B.

Practical takeaway

At least for this task and this setup:

  • use 8B for API checks, prompt iteration, and small batch runs
  • evaluate 32B-A3B separately if you care about quality
  • do not judge 32B-A3B only through GGUF + Ollama

If I wanted to evaluate 32B-A3B seriously, I would do it again with the official transformers implementation instead of stopping at the local GGUF route.

Summary

Running LLM-jp-4 locally on a MacBook Pro M4 Max 128GB and exposing it through Ollama’s OpenAI-compatible API is entirely feasible.

My current summary is:

  • 8B is practical as a local API server
  • without structured output, extraction is too unstable
  • even a small five-page test can contain chunks that take around 15s
  • 32B-A3B also runs on this class of Mac
  • but in this GGUF + Ollama setup, 8B was more stable than 32B-A3B

So for now, my practical local answer is 8B, while 32B-A3B remains worth re-checking through the official transformers path.

References