I tested whether LLM-jp-4, released by NII, could run locally on a MacBook Pro M4 Max 128GB and be used from an existing batch pipeline through Ollama’s OpenAI-compatible API.

The short version is:

  • 8B is practical on this machine.
  • 32B-A3B also runs locally on this class of hardware.
  • However, with the GGUF + Ollama setup I used here, 32B-A3B was less stable in extraction quality than 8B.

What I wanted to do

The goal was not just chatting locally. I wanted to:

  • run LLM-jp-4 as a local server
  • call it from external client software
  • integrate it into an existing extraction batch through an OpenAI-compatible API

The target task was extracting facial-expression and emotion descriptions from Japanese OCR text.

Conclusion

At this stage, my practical conclusions were:

  • LLM-jp-4 8B runs comfortably on a MacBook Pro M4 Max 128GB
  • LLM-jp-4 32B-A3B also runs locally in quantized form
  • Ollama is the fastest route if you want an OpenAI-compatible API
  • I did not use the original Hugging Face weights directly here; I used GGUF conversions
  • For this extraction task, 8B was more stable than 32B-A3B in my local setup

That last point matters. The official LLM-jp-4 release is intended for transformers, while my local setup used GGUF conversions through Ollama. So this was not a “pure official implementation” benchmark.

Environment

  • Machine: MacBook Pro M4 Max 128GB
  • Runtime: Ollama 0.20.3
  • API: http://localhost:11434/v1/
  • Model: llm-jp-4-8b-thinking-Q4_K_M.gguf

The 8B Q4_K_M file is about 5.3GB, which makes it quite manageable for local testing.

Installing the model

I installed Ollama via Homebrew:

brew install ollama
brew services start ollama

Then I downloaded a GGUF file:

curl -L --fail -o ~/git/llm/ollama/models/llm-jp-4-8b-thinking-Q4_K_M.gguf \
  'https://huggingface.co/mmnga-o/llm-jp-4-8b-thinking-gguf/resolve/main/llm-jp-4-8b-thinking-Q4_K_M.gguf?download=true'

Next, I wrote a Modelfile (llm-jp-4-8b-q4.Modelfile):

FROM /Users/yourname/git/llm/ollama/models/llm-jp-4-8b-thinking-Q4_K_M.gguf

PARAMETER num_ctx 8192
PARAMETER temperature 0.7

Finally, I registered the model with Ollama:

ollama create llm-jp-4-8b-q4 -f ~/git/llm/ollama/modelfiles/llm-jp-4-8b-q4.Modelfile

At that point the model was available through Ollama as llm-jp-4-8b-q4.

Calling it through an OpenAI-compatible API

The endpoint was:

base_url = "http://localhost:11434/v1/"
api_key = "ollama"

Using the openai SDK looked like this:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1/",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="llm-jp-4-8b-q4",
    messages=[
        {"role": "user", "content": "This is a test. Return JSON only."}
    ],
)

print(response.choices[0].message.content)

A test script for the extraction batch

Instead of modifying the production code directly, I created a separate test script.

  • Input: /Users/nakamura/git/ndl/face/data/886000
  • Output: JSON
  • Backend: local Ollama API

At first I only instructed the model to “return a JSON array,” but in practice it often returned extra text, code fences, or invalid placeholders such as a literal "...".

So I tightened the setup with:

  • response_format using JSON Schema
  • temperature=0
  • per-chunk streaming logs
  • per-chunk timing
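As a sketch, the tightened request can be expressed as a plain dict of keyword arguments for the SDK. The schema fields below are illustrative, not the exact production schema, and this assumes the Ollama endpoint accepts response_format with a JSON Schema (supported in recent versions):

```python
# Illustrative extraction schema; field names are examples, not the
# exact production schema.
EXTRACTION_SCHEMA = {
    "type": "object",
    "properties": {
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "surface": {"type": "string"},
                    "emotion": {"type": "string"},
                    "pages": {"type": "array", "items": {"type": "integer"}},
                },
                "required": ["surface", "emotion", "pages"],
            },
        }
    },
    "required": ["items"],
}

def build_request(chunk_text: str) -> dict:
    """Build kwargs for client.chat.completions.create with constrained output."""
    return {
        "model": "llm-jp-4-8b-q4",
        "temperature": 0,  # deterministic decoding for extraction
        "messages": [
            {"role": "user", "content": chunk_text},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "extraction", "schema": EXTRACTION_SCHEMA},
        },
    }
```

The actual call is then client.chat.completions.create(**build_request(chunk)), with stream=True added for the per-chunk streaming logs.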

Why structured output was necessary

For this kind of extraction task, prompt-only JSON constraints are not enough. Local models often:

  • prepend explanations
  • wrap output in code fences
  • omit required fields
  • generate unexpected field values

Using response_format with a JSON Schema did not make the output perfect, but it made it much less fragile.
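Even with the schema in place, a defensive parser on the client side helps. This is my own fallback logic as a minimal sketch, not part of Ollama or the SDK:

```python
import json
import re

def parse_model_json(text: str):
    """Best-effort parse of a model response that should be a JSON array.

    Strips markdown code fences and surrounding prose before parsing.
    Returns None if no valid JSON array can be recovered.
    """
    # Prefer the contents of a ```json ... ``` fence if one is present.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    # Fall back to the outermost [...] span.
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end == -1 or end < start:
        return None
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, list) else None
```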

Measurements

I first tested a single page:

chunk 1/1 pages=8-8 chars=1116
  first_token=1.2s
  streaming=5.0s chars=287
  completed=5.1s response_chars=290
  extracted=1

Under these conditions, one page took about 5s.

Then I tested 5 pages. I eventually reduced chunk_size to 1200 so I could see progress more clearly.

chunk 1/5 pages=8-8 chars=1116
  first_token=1.2s
  streaming=5.0s chars=287
  completed=5.1s response_chars=290
  extracted=1
chunk 2/5 pages=9-9 chars=836
  first_token=1.0s
  completed=3.6s response_chars=302
  extracted=1
chunk 3/5 pages=10-10 chars=1113
  first_token=2.1s
  streaming=5.0s chars=256
  completed=7.1s response_chars=356
  extracted=1
chunk 4/5 pages=11-11 chars=869
  first_token=1.7s
  streaming=5.0s chars=365
  streaming=10.0s chars=867
  streaming=15.0s chars=1402
  completed=15.8s response_chars=1457
  extracted=5
chunk 5/5 pages=12-12 chars=777
  first_token=1.6s
  completed=4.3s response_chars=262
  extracted=1
saved=/Users/nakamura/git/llm/outputs/886000_extraction_test_5pages.json
total=9

This made one thing obvious: processing 5 pages does not simply take 5 × 5 s.

The main reasons are:

  • complexity depends on the page content, not just the page count
  • structured output generation tends to be slower
  • if a chunk yields more extracted items, the response becomes longer

In my run, the chunk for page 11 alone took 15.8s.
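For reference, the page-based chunking used above can be sketched like this. This is a hypothetical reimplementation, not the production code:

```python
def chunk_pages(pages, chunk_size=1200):
    """Group (page_no, text) pairs into chunks of at most chunk_size chars.

    Pages are never split mid-text; a single page longer than chunk_size
    becomes its own chunk. With chunk_size=1200, the pages in the logs
    above (777-1116 chars each) end up one per chunk.
    """
    chunks, current, current_len = [], [], 0
    for page_no, text in pages:
        if current and current_len + len(text) > chunk_size:
            chunks.append(current)
            current, current_len = [], 0
        current.append((page_no, text))
        current_len += len(text)
    if current:
        chunks.append(current)
    return chunks
```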

Output quality

The output was valid JSON often enough to save it, but the content quality was still rough.

Typical problems included:

  • surface or context being replaced with ...
  • empty pages arrays
  • unexpected body_part values
  • character values being overly long or noisy

So yes, 8B was usable as an API. But no, it was not production-ready for extraction quality as-is.
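One cheap mitigation for these quality problems is to post-filter records against the source text before saving them. A sketch, where the allowed body-part vocabulary is illustrative, not the actual label set used in the pipeline:

```python
# Illustrative vocabulary; the real pipeline's label set may differ.
ALLOWED_BODY_PARTS = {"目", "口", "眉", "頬", "涙", "顔全体"}

def looks_grounded(item: dict, source_text: str) -> bool:
    """Reject records whose surface string is a placeholder, uses an
    unknown body_part, or does not occur in the OCR source text."""
    surface = item.get("surface", "")
    if not surface or surface == "...":
        return False
    if item.get("body_part") not in ALLOWED_BODY_PARTS:
        return False
    return surface in source_text
```

This does not fix the model's output, but it keeps placeholder and hallucinated records out of the saved JSON.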

Is 32B-A3B realistic on this machine?

Yes. On this hardware, it is realistic.

But “it runs” and “it is practical” are different questions. Even 8B spins up the fans under continuous inference. 32B-A3B Q4_K_M is noticeably heavier.

I also tested 32B-A3B Q4_K_M locally on a single page:

chunk 1/1 pages=8-8 chars=1116
  first_token=7.9s
  streaming=12.9s chars=246
  completed=13.0s response_chars=251
  extracted=1

That log alone suggests the model is slow but usable. The real problem was content quality. The response was JSON, but not well grounded in the source text. In some cases:

  • surface and context were hallucinated strings
  • pages came back empty
  • values drifted away from the actual OCR text

One example looked like this:

[
  {
    "surface": "あやま後から〓〓さいづちあたぶお才槌頭ハちいてほ%",
    "context": "あやま後から〓〓さいづちあたぶお才槌頭ハちいてほ%",
    "emotion": "その他",
    "body_part": "涙",
    "valence": "ambiguous",
    "intensity": "low",
    "character": "あやま後から〓〓さいづちあたぶお才槌頭ハちいてほ%",
    "pages": []
  }
]

When I tightened the constraints even further, I got outputs like this:

[
  {
    "surface": "/c/co/n/t/e/x/t/",
    "context": "/c/co/n/t/e/x/t/",
    "emotion": "悲しみ・哀れ",
    "body_part": "口",
    "valence": "ambiguous",
    "intensity": "low",
    "character": "ださいづちあたぶお才槌頭ハちいてほ%",
    "pages": [8]
  }
]

So in this GGUF + Ollama setup, 32B-A3B definitely ran, but it did not give me better extraction quality than 8B.

Practical takeaway

At least for this task and this setup:

  • use 8B for API checks, prompt iteration, and small batch runs
  • evaluate 32B-A3B separately if you care about quality
  • do not judge 32B-A3B only through GGUF + Ollama

If I wanted to evaluate 32B-A3B seriously, I would do it again with the official transformers implementation instead of stopping at the local GGUF route.

Summary

Running LLM-jp-4 locally on a MacBook Pro M4 Max 128GB and exposing it through Ollama’s OpenAI-compatible API is entirely feasible.

My current summary is:

  • 8B is practical as a local API server
  • without structured output, extraction is too unstable
  • even a small five-page test can contain chunks that take around 15s
  • 32B-A3B also runs on this class of Mac
  • but in this GGUF + Ollama setup, 8B was more stable than 32B-A3B

So for now, my practical local answer is 8B, while 32B-A3B remains worth re-checking through the official transformers path.

References