I tested the official LLM-jp-4-32b-a3b-thinking model on an mdx.jp server with A100 40GB x2, and got it working as a vLLM OpenAI-compatible API server.
The short version is:
- Transformers + `device_map="auto"` hit an OOM during generation
- switching to vLLM + `tensor_parallel_size=2` worked
- however, the thinking model exposed internal `analysis` text through the OpenAI-compatible API, so I had to extract `assistant final` on the client side
Goal
The goal was to:
- run the official `LLM-jp-4-32b-a3b-thinking`
- do it on a stable A100 server
- expose it through an OpenAI-compatible API
- connect it to a Japanese OCR extraction workflow
I had already tested local GGUF + Ollama setups on a Mac, but I wanted to evaluate the official implementation on a proper GPU server as well.
Environment
- Platform: `mdx.jp`
- OS: `Ubuntu 22.04 LTS`
- GPU: `NVIDIA A100-SXM4-40GB x 2`
- CUDA Driver: `590.48.01`
- CUDA Version: `13.1`
- Virtual disk: `360GB`
- Model: `llm-jp/llm-jp-4-32b-a3b-thinking`
nvidia-smi showed both GPUs correctly:
| 0 NVIDIA A100-SXM4-40GB On | ... | 0MiB / 40960MiB |
| 1 NVIDIA A100-SXM4-40GB On | ... | 0MiB / 40960MiB |
The 360GB virtual disk turned out to be very helpful for the model download and Hugging Face cache.
First attempt: direct Transformers inference
I first tried a direct Transformers path close to the official model-card example:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODEL_ID = "llm-jp/llm-jp-4-32b-a3b-thinking"

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

# Load bf16 weights, letting Accelerate shard them across both GPUs
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
The download and weight loading succeeded:
Download complete: 100%|...| 64.3G/64.3G
Loading weights: 100%|...| 355/355
But generation failed with an out-of-memory error:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.79 GiB.
GPU 0 has a total capacity of 39.49 GiB of which 6.11 GiB is free.
So this was not “the server is insufficient,” but rather “this loading strategy is insufficient.” The model fit well enough to load, but not to generate under this layout.
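A quick back-of-the-envelope check with the numbers above shows how thin the margin was. This averages over both GPUs and ignores CUDA context and activation overhead, so it is only a rough sanity check:

```python
# Back-of-the-envelope memory check, using the figures reported above.
WEIGHTS_GIB = 64.3   # bf16 checkpoint size from the download log
PER_GPU_GIB = 39.49  # capacity per A100 40GB, as reported in the error
NUM_GPUS = 2

total = PER_GPU_GIB * NUM_GPUS          # 78.98 GiB across both cards
headroom = total - WEIGHTS_GIB          # ~14.7 GiB left for everything else
per_gpu_headroom = headroom / NUM_GPUS  # ~7.3 GiB per GPU on average

print(f"total: {total:.2f} GiB, headroom: {headroom:.2f} GiB, "
      f"per GPU: {per_gpu_headroom:.2f} GiB")
# A single 8.79 GiB allocation on one GPU cannot fit in ~7.3 GiB of headroom.
```

Even with a perfectly even split, roughly 7.3 GiB per GPU remains after the weights, so the 8.79 GiB allocation from the traceback has nowhere to go.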
At that point the practical conclusion was simple:
- the hardware was not the issue
- naive Transformers inference was not the right serving path
- if I wanted an API server, I should switch to vLLM
Switching to vLLM
I used Docker + vLLM. The command was:
docker run --runtime nvidia --gpus all \
--ipc=host \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
vllm/vllm-openai:latest \
--model llm-jp/llm-jp-4-32b-a3b-thinking \
--trust-remote-code \
--tensor-parallel-size 2
The key flags were:
- `--trust-remote-code`
- `--tensor-parallel-size 2`
- a mounted Hugging Face cache
vLLM startup
The important parts of the startup log looked like this:
Resolved architecture: Qwen3MoeForCausalLM
Using max model len 65536
tensor_parallel_size=2
After loading, the server reported roughly:
Model loading took 29.99 GiB memory
GPU KV cache size: 149,168 tokens
Maximum concurrency for 65,536 tokens per request: 2.28x
That was the point where the setup became practical.
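That concurrency figure is easy to verify: it is simply the reported KV cache capacity divided by the maximum model length.

```python
kv_cache_tokens = 149_168  # GPU KV cache size reported by vLLM
max_model_len = 65_536     # max model len resolved at startup

concurrency = kv_cache_tokens / max_model_len
print(f"{concurrency:.2f}x")  # → 2.28x
```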
Verifying the OpenAI-compatible API
First I checked the model list:
curl http://localhost:8000/v1/models
The response included:
{
"object": "list",
"data": [
{
"id": "llm-jp/llm-jp-4-32b-a3b-thinking",
"object": "model",
"max_model_len": 65536
}
]
}
So the vLLM server itself was clearly up and working.
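Beyond `curl`, a small Python client is convenient for scripting. The sketch below is my own illustrative helper (the function names are invented), using only the standard library and the same request shape as the `curl` examples:

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000/v1"  # local vLLM server
MODEL_ID = "llm-jp/llm-jp-4-32b-a3b-thinking"

def build_chat_payload(system: str, user: str, max_tokens: int = 512) -> dict:
    """Assemble an OpenAI-compatible /chat/completions request body."""
    return {
        "model": MODEL_ID,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "max_tokens": max_tokens,
        "temperature": 0,
    }

def chat(system: str, user: str) -> str:
    """POST the payload and return the raw assistant message content."""
    body = json.dumps(build_chat_payload(system, user)).encode("utf-8")
    req = request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```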
The main issue: thinking output is awkward over OpenAI-compatible chat
When I called /v1/chat/completions, I did not get a clean final answer only. Instead, I got a long response that contained internal analysis text before the final answer.
For example:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llm-jp/llm-jp-4-32b-a3b-thinking",
"messages": [
{"role": "system", "content": "You are a research assistant for Japanese text analysis."},
{"role": "user", "content": "What emotion is the expression ず笑ふて closest to? Answer briefly in Japanese."}
],
"max_tokens": 512,
"temperature": 0
}'
The output looked roughly like:
analysis ...
assistant final 悲しみや失望に近い感情です。
So the server worked, but the response shape was not application-friendly.
tokenizer.parse_response did not help
My first idea was to use the official tokenizer’s parse_response():
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "llm-jp/llm-jp-4-32b-a3b-thinking",
    trust_remote_code=True,
)
print(tokenizer.parse_response(raw_text))
But in practice it returned almost nothing useful:
{'role': 'assistant'}
That means the OpenAI-compatible text returned by vLLM did not match the response format the official tokenizer expected.
Practical workaround: extract assistant final
So for now, the simplest workaround is to parse out the assistant final section manually:
import re

# Grab everything after the "assistant final" marker; fall back to the raw text
match = re.search(r"assistant final\s*(.*)$", raw, flags=re.DOTALL)
if match:
    final_text = match.group(1).strip()
else:
    final_text = raw.strip()
That was enough to recover the final answer text reliably.
For example, from the earlier output, this gave:
悲しみや失望に近い感情です。
(Roughly: "It is an emotion close to sadness or disappointment.")
Applying this to OCR extraction
I then built a small extraction client on top of this setup:
- read a `.txt` file
- send it to vLLM through `/chat/completions`
- extract `assistant final`
- find the JSON array inside it
- fill in `pages` if needed
- save the result
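Assuming the raw response looks like the earlier example (an `analysis` section followed by an `assistant final` marker, with a JSON array somewhere in the final text), the post-processing steps can be sketched as one function. `postprocess`, `default_page`, and the record shape are hypothetical names of mine, not part of the model's output contract:

```python
import json
import re

def postprocess(raw: str, default_page: int = 1) -> list:
    """Turn a raw thinking-model response into a clean list of records."""
    # 1. Keep only the text after the "assistant final" marker, if present.
    match = re.search(r"assistant final\s*(.*)$", raw, flags=re.DOTALL)
    text = match.group(1).strip() if match else raw.strip()

    # 2. Decode only the first JSON array, ignoring any trailing explanation.
    #    (raises ValueError if no "[" is present at all)
    start = text.index("[")
    records, _end = json.JSONDecoder().raw_decode(text, start)

    # 3. Fill in a missing "pages" field on each record.
    for rec in records:
        rec.setdefault("pages", default_page)
    return records
```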
One practical problem was that the model sometimes appended extra explanation text after the JSON array. My first attempt used a greedy regex like `\[.*\]`, which caused:
json.decoder.JSONDecodeError: Extra data
The fix was to use json.JSONDecoder().raw_decode() and extract only the first valid array.
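To make that failure mode concrete, here is a minimal reproduction (the sample string is invented): a greedy `\[.*\]` match runs to the last `]` in the text, while `raw_decode` parses exactly one value starting at the first `[` and stops there.

```python
import json
import re

raw = '[{"page": 1}] See note [A] below.'

# Greedy match runs to the LAST "]", dragging trailing text into the match.
greedy = re.search(r"\[.*\]", raw, flags=re.DOTALL).group(0)
try:
    json.loads(greedy)
except json.JSONDecodeError as e:
    print(e)  # "Extra data" error, as in the traceback above

# raw_decode parses one JSON value and reports where it stopped.
start = raw.index("[")
array, end = json.JSONDecoder().raw_decode(raw, start)
print(array)  # → [{'page': 1}]
```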
Current takeaway
At this point my conclusions are:
- the official `LLM-jp-4-32b-a3b-thinking` does run on A100 40GB x2
- direct Transformers + `device_map="auto"` was not sufficient for generation
- vLLM + `tensor_parallel_size=2` worked
- the OpenAI-compatible API itself works
- but the thinking model exposes internal `analysis` text
- `tokenizer.parse_response()` did not match the vLLM output format in my setup
- for now, client-side `assistant final` extraction is needed
So the infrastructure side is in good shape. The remaining work is mostly on response handling and extraction prompting.
Next steps
The next things I want to test are:
- multiple OCR pages instead of just one
- prompt tuning for extraction quality
- reducing reliance on `assistant final` extraction
- comparing against the base model
- tuning stop conditions or templates on the vLLM side
The main point is: serving the model is solved. Making the thinking output clean enough for downstream batch processing still needs some work.
References
- LLM-jp-4 32B A3B Thinking: https://huggingface.co/llm-jp/llm-jp-4-32b-a3b-thinking
- vLLM OpenAI Compatible Server: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html
- vLLM Docker: https://docs.vllm.ai/en/latest/deployment/docker.html