I tested whether LLM-jp-4, released by NII, could run locally on a MacBook Pro M4 Max 128GB and be used from an existing batch pipeline through Ollama’s OpenAI-compatible API.
The short version is:
- 8B is practical on this machine.
- 32B-A3B also runs locally on this class of hardware.
- However, with the GGUF + Ollama setup I used here, 32B-A3B was less stable in extraction quality than 8B.
What I wanted to do
The goal was not just chatting locally. I wanted to:
- run LLM-jp-4 as a local server
- call it from external client software
- integrate it into an existing extraction batch through an OpenAI-compatible API
The target task was extracting facial-expression and emotion descriptions from Japanese OCR text.
Conclusion
At this stage, my practical conclusions were:
- LLM-jp-4 8B runs comfortably on a MacBook Pro M4 Max 128GB
- LLM-jp-4 32B-A3B also runs locally in quantized form
- Ollama is the fastest route if you want an OpenAI-compatible API
- I did not use the original Hugging Face weights directly here; I used GGUF conversions
- For this extraction task, 8B was more stable than 32B-A3B in my local setup
That last point matters. The official LLM-jp-4 release is intended for transformers, while my local setup used GGUF conversions through Ollama. So this was not a “pure official implementation” benchmark.
Environment
- Machine: MacBook Pro M4 Max 128GB
- Runtime: Ollama 0.20.3
- API: http://localhost:11434/v1/
- Model: llm-jp-4-8b-thinking-Q4_K_M.gguf
The 8B Q4_K_M file is about 5.3GB, which makes it quite manageable for local testing.
Installing the model
I installed Ollama via Homebrew:
brew install ollama
brew services start ollama
Then I downloaded a GGUF file and created a Modelfile.
curl -L --fail -o ~/git/llm/ollama/models/llm-jp-4-8b-thinking-Q4_K_M.gguf \
'https://huggingface.co/mmnga-o/llm-jp-4-8b-thinking-gguf/resolve/main/llm-jp-4-8b-thinking-Q4_K_M.gguf?download=true'
The Modelfile:

FROM /Users/yourname/git/llm/ollama/models/llm-jp-4-8b-thinking-Q4_K_M.gguf
PARAMETER num_ctx 8192
PARAMETER temperature 0.7
ollama create llm-jp-4-8b-q4 -f ~/git/llm/ollama/modelfiles/llm-jp-4-8b-q4.Modelfile
At that point the model was available through Ollama as llm-jp-4-8b-q4.
Calling it through an OpenAI-compatible API
The endpoint was:
base_url = "http://localhost:11434/v1/"
api_key = "ollama"
Using the openai SDK looked like this:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1/",
api_key="ollama",
)
response = client.chat.completions.create(
model="llm-jp-4-8b-q4",
messages=[
{"role": "user", "content": "This is a test. Return JSON only."}
],
)
print(response.choices[0].message.content)
A test script for the extraction batch
Instead of modifying the production code directly, I created a separate test script.
- Input: /Users/nakamura/git/ndl/face/data/886000
- Output: JSON
- Backend: local Ollama API
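The script chunks the OCR text before sending it to the model. Here is a minimal sketch of that chunking, assuming input arrives as (page_number, text) pairs; the actual script's internals differ, and the character budget is just a parameter:

```python
def chunk_pages(pages, chunk_size=1200):
    """Group consecutive (page_number, text) pairs into chunks whose combined
    text length stays within chunk_size; an oversized page becomes its own chunk."""
    chunks, current, current_len = [], [], 0
    for page_no, text in pages:
        if current and current_len + len(text) > chunk_size:
            chunks.append(current)
            current, current_len = [], 0
        current.append((page_no, text))
        current_len += len(text)
    if current:
        chunks.append(current)
    return chunks
```

With page texts around 800–1100 characters, a budget of 1200 effectively produces one page per chunk, which is what the logs below show.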
At first I only instructed the model to “return a JSON array,” but in practice it often returned extra text, code fences, or invalid placeholders like "...".
So I tightened the setup with:
- response_format using a JSON Schema
- temperature=0
- per-chunk streaming logs
- per-chunk timing
Why structured output was necessary
For this kind of extraction task, prompt-only JSON constraints are not enough. Local models often:
- prepend explanations
- wrap output in code fences
- omit required fields
- generate unexpected field values
Using response_format with a JSON Schema did not make the output perfect, but it made it much less fragile.
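Concretely, the request can carry the schema like this. The field names and enum values below are inferred from the sample outputs later in this post, so treat them as illustrative rather than my exact schema; also note that whether the OpenAI-compatible endpoint honors json_schema-style response_format can depend on the Ollama version:

```python
# Schema for one extracted record; names and enums are assumptions
# reconstructed from the sample outputs in this post.
RECORD_SCHEMA = {
    "type": "object",
    "properties": {
        "surface": {"type": "string"},
        "context": {"type": "string"},
        "emotion": {"type": "string"},
        "body_part": {"type": "string"},
        "valence": {"type": "string", "enum": ["positive", "negative", "ambiguous"]},
        "intensity": {"type": "string", "enum": ["low", "medium", "high"]},
        "character": {"type": "string"},
        "pages": {"type": "array", "items": {"type": "integer"}},
    },
    "required": ["surface", "context", "emotion", "pages"],
}


def build_request(model, prompt):
    """Build a /v1/chat/completions payload that pins temperature to 0
    and asks for schema-constrained JSON via response_format."""
    return {
        "model": model,
        "temperature": 0,
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "emotion_extraction",
                "schema": {"type": "array", "items": RECORD_SCHEMA},
            },
        },
    }
```

The payload is then POSTed to http://localhost:11434/v1/chat/completions, or passed to client.chat.completions.create(**payload) with the openai SDK. One caveat: OpenAI's strict json_schema mode expects an object at the root, so an array root may need to be wrapped in an object depending on the backend.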
Measurements
I first tested a single page:
chunk 1/1 pages=8-8 chars=1116
first_token=1.2s
streaming=5.0s chars=287
completed=5.1s response_chars=290
extracted=1
Under these conditions, one page took about 5s.
Then I tested 5 pages. I eventually reduced chunk_size to 1200 so I could see progress more clearly.
chunk 1/5 pages=8-8 chars=1116
first_token=1.2s
streaming=5.0s chars=287
completed=5.1s response_chars=290
extracted=1
chunk 2/5 pages=9-9 chars=836
first_token=1.0s
completed=3.6s response_chars=302
extracted=1
chunk 3/5 pages=10-10 chars=1113
first_token=2.1s
streaming=5.0s chars=256
completed=7.1s response_chars=356
extracted=1
chunk 4/5 pages=11-11 chars=869
first_token=1.7s
streaming=5.0s chars=365
streaming=10.0s chars=867
streaming=15.0s chars=1402
completed=15.8s response_chars=1457
extracted=5
chunk 5/5 pages=12-12 chars=777
first_token=1.6s
completed=4.3s response_chars=262
extracted=1
saved=/Users/nakamura/git/llm/outputs/886000_extraction_test_5pages.json
total=9
This made one thing obvious: 5 pages is not simply 5 x 5s.
The main reasons are:
- complexity depends on the page content, not just the page count
- structured output generation tends to be slower
- if a chunk yields more extracted items, the response becomes longer
In my run, the chunk for page 11 alone took 15.8s.
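The first_token / streaming / completed lines in these logs come from instrumentation wrapped around the streaming response. This is a minimal sketch of that kind of wrapper, not my exact script; it consumes any iterator of text pieces (e.g. delta contents from a streaming chat completion):

```python
import time


def stream_with_timing(stream, log_every=5.0):
    """Consume an iterator of text pieces, printing first-token latency,
    periodic progress, and total time, mirroring the log lines above."""
    start = time.monotonic()
    last_log = start
    first = True
    chars = 0
    pieces = []
    for piece in stream:
        now = time.monotonic()
        if first:
            print(f"first_token={now - start:.1f}s")
            first = False
        chars += len(piece)
        pieces.append(piece)
        if now - last_log >= log_every:
            print(f"streaming={now - start:.1f}s chars={chars}")
            last_log = now
    print(f"completed={time.monotonic() - start:.1f}s response_chars={chars}")
    return "".join(pieces)
```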
Output quality
The output was valid JSON often enough to save it, but the content quality was still rough.
Typical problems included:
- surface or context being replaced with ...
- empty pages arrays
- unexpected body_part values
- character values being overly long or noisy
So yes, 8B was usable as an API. But no, it was not production-ready for extraction quality as-is.
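One lightweight mitigation, separate from the script described above, is a post-hoc filter that rejects records showing exactly these failure modes. A sketch, with the field names taken from the sample outputs in this post:

```python
def is_valid_record(rec):
    """Reject records with the failure modes listed above: missing fields,
    '...' placeholders in surface/context, and empty pages arrays."""
    required = ("surface", "context", "emotion", "pages")
    if any(key not in rec for key in required):
        return False
    for key in ("surface", "context"):
        value = str(rec[key]).strip()
        if not value or value in ("...", "…"):
            return False
    return bool(rec["pages"])
```

This does not fix noisy values, but it keeps obviously broken records out of the saved JSON.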
Is 32B-A3B realistic on this machine?
Yes. On this hardware, it is realistic.
But “it runs” and “it is practical” are different questions. Even 8B spins up the fans under continuous inference. 32B-A3B Q4_K_M is noticeably heavier.
I also tested 32B-A3B Q4_K_M locally on a single page:
chunk 1/1 pages=8-8 chars=1116
first_token=7.9s
streaming=12.9s chars=246
completed=13.0s response_chars=251
extracted=1
That log by itself suggests: slow, but usable. The real problem was content quality. The response was JSON, but not well grounded in the source text. In some cases:
- surface and context were hallucinated strings
- pages came back empty
- values drifted away from the actual OCR text
One example looked like this:
[
{
"surface": "あやま後から〓〓さいづちあたぶお才槌頭ハちいてほ%",
"context": "あやま後から〓〓さいづちあたぶお才槌頭ハちいてほ%",
"emotion": "その他",
"body_part": "涙",
"valence": "ambiguous",
"intensity": "low",
"character": "あやま後から〓〓さいづちあたぶお才槌頭ハちいてほ%",
"pages": []
}
]
When I tightened the constraints even further, I got outputs like this:
[
{
"surface": "/c/co/n/t/e/x/t/",
"context": "/c/co/n/t/e/x/t/",
"emotion": "悲しみ・哀れ",
"body_part": "口",
"valence": "ambiguous",
"intensity": "low",
"character": "ださいづちあたぶお才槌頭ハちいてほ%",
"pages": [8]
}
]
So in this GGUF + Ollama setup, 32B-A3B definitely ran, but it did not give me better extraction quality than 8B.
Practical takeaway
At least for this task and this setup:
- use 8B for API checks, prompt iteration, and small batch runs
- evaluate 32B-A3B separately if you care about quality
- do not judge 32B-A3B only through GGUF + Ollama
If I wanted to evaluate 32B-A3B seriously, I would do it again with the official transformers implementation instead of stopping at the local GGUF route.
Summary
Running LLM-jp-4 locally on a MacBook Pro M4 Max 128GB and exposing it through Ollama’s OpenAI-compatible API is entirely feasible.
My current summary is:
- 8B is practical as a local API server
- without structured output, extraction is too unstable
- even a small five-page test can contain chunks that take around 15s
- 32B-A3B also runs on this class of Mac
- but in this GGUF + Ollama setup, 8B was more stable than 32B-A3B
So for now, my practical local answer is 8B, while 32B-A3B remains worth re-checking through the official transformers path.
References
- NII release: https://www.nii.ac.jp/news/release/2026/0403.html
- Official LLM-jp-4 8B thinking: https://huggingface.co/llm-jp/llm-jp-4-8b-thinking
- GGUF-converted 8B: https://huggingface.co/mmnga-o/llm-jp-4-8b-thinking-gguf
- GGUF-converted 32B-A3B: https://huggingface.co/mmnga-o/llm-jp-4-32b-a3b-thinking-gguf
- Ollama OpenAI compatibility: https://docs.ollama.com/api/openai-compatibility
- Ollama Structured Outputs: https://docs.ollama.com/capabilities/structured-outputs