I tested the official LLM-jp-4-32b-a3b-thinking model on an mdx.jp server with A100 40GB x2, and got it working as a vLLM OpenAI-compatible API server.

The short version is:

  • Transformers + device_map="auto" hit an OOM during generation
  • switching to vLLM + tensor_parallel_size=2 worked
  • however, the thinking model exposed internal analysis text through the OpenAI-compatible API
  • I had to extract assistant final on the client side

Goal

The goal was to:

  • run the official LLM-jp-4-32b-a3b-thinking
  • do it on a stable A100 server
  • expose it through an OpenAI-compatible API
  • connect it to a Japanese OCR extraction workflow

I had already tested local GGUF + Ollama setups on a Mac, but I wanted to evaluate the official implementation on a proper GPU server as well.

Environment

  • Platform: mdx.jp
  • OS: Ubuntu 22.04 LTS
  • GPU: NVIDIA A100-SXM4-40GB x 2
  • CUDA Driver: 590.48.01
  • CUDA Version: 13.1
  • Virtual disk: 360GB
  • Model: llm-jp/llm-jp-4-32b-a3b-thinking

nvidia-smi showed both GPUs correctly:

|   0  NVIDIA A100-SXM4-40GB          On  | ... | 0MiB / 40960MiB |
|   1  NVIDIA A100-SXM4-40GB          On  | ... | 0MiB / 40960MiB |

The 360GB virtual disk turned out to be very helpful for the model download and Hugging Face cache.

First attempt: direct Transformers inference

I first tried a direct Transformers path close to the official model-card example:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODEL_ID = "llm-jp/llm-jp-4-32b-a3b-thinking"

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

The download and weight loading succeeded:

Download complete: 100%|...| 64.3G/64.3G
Loading weights: 100%|...| 355/355

But generation failed with an out-of-memory error:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.79 GiB.
GPU 0 has a total capacity of 39.49 GiB of which 6.11 GiB is free.

So this was not “the server is insufficient,” but rather “this loading strategy is insufficient.” The model fit well enough to load, but not to generate under this layout.
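As a rough sanity check (my own back-of-envelope arithmetic, not from any log), the bf16 weights alone leave very little headroom per GPU under an even device_map="auto" split:

```python
# Back-of-envelope memory arithmetic for the OOM (all numbers approximate).
weights_gib = 64.3        # size of the downloaded bf16 checkpoint
capacity_gib = 39.49      # per-GPU capacity reported in the error message
num_gpus = 2

per_gpu_weights = weights_gib / num_gpus     # ~32.2 GiB of weights per GPU
headroom = capacity_gib - per_gpu_weights    # ~7.3 GiB left for everything else

print(f"headroom per GPU: {headroom:.2f} GiB")
```

That is less than the 8.79 GiB allocation that failed, which is consistent with the error message.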

At that point the practical conclusion was simple:

  • the hardware was not the issue
  • naive Transformers inference was not the right serving path
  • if I wanted an API server, I should switch to vLLM

Switching to vLLM

I used Docker + vLLM. The command was:

docker run --runtime nvidia --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --model llm-jp/llm-jp-4-32b-a3b-thinking \
  --trust-remote-code \
  --tensor-parallel-size 2

The key flags were:

  • --trust-remote-code
  • --tensor-parallel-size 2
  • a mounted Hugging Face cache

vLLM startup

The important parts of the startup log looked like this:

Resolved architecture: Qwen3MoeForCausalLM
Using max model len 65536
tensor_parallel_size=2

After loading, the server reported roughly:

Model loading took 29.99 GiB memory
GPU KV cache size: 149,168 tokens
Maximum concurrency for 65,536 tokens per request: 2.28x

That was the point where the setup became practical.
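The concurrency figure is just the KV-cache token budget divided by the context length, which checks out against the log:

```python
# Sanity-check the reported "2.28x" concurrency (my own arithmetic).
kv_cache_tokens = 149_168   # "GPU KV cache size" from the startup log
max_model_len = 65_536      # context length per request

concurrency = kv_cache_tokens / max_model_len
print(f"{concurrency:.2f}x")   # matches the reported 2.28x
```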

Verifying the OpenAI-compatible API

First I checked the model list:

curl http://localhost:8000/v1/models

The response included:

{
  "object": "list",
  "data": [
    {
      "id": "llm-jp/llm-jp-4-32b-a3b-thinking",
      "object": "model",
      "max_model_len": 65536
    }
  ]
}

So the vLLM server itself was clearly up and working.
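The same check is easy to script. A stdlib-only sketch (the helper names are mine) that pulls the model id and context length out of that response shape:

```python
import json
from urllib.request import urlopen

def parse_models(payload):
    # Extract (id, max_model_len) pairs from a /v1/models response body.
    return [(m["id"], m.get("max_model_len")) for m in payload["data"]]

def list_models(base_url="http://localhost:8000"):
    # Query the OpenAI-compatible model list endpoint.
    with urlopen(base_url + "/v1/models") as resp:
        return parse_models(json.load(resp))
```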

The main issue: thinking output is awkward over OpenAI-compatible chat

When I called /v1/chat/completions, I did not get only a clean final answer. Instead, the response contained a long run of internal analysis text before the final answer.

For example:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llm-jp/llm-jp-4-32b-a3b-thinking",
    "messages": [
      {"role": "system", "content": "You are a research assistant for Japanese text analysis."},
      {"role": "user", "content": "What emotion is the expression ず笑ふて closest to? Answer briefly in Japanese."}
    ],
    "max_tokens": 512,
    "temperature": 0
  }'

The output looked roughly like:

analysis ...
assistant final 悲しみや失望に近い感情です。

So the server worked, but the response shape was not application-friendly.
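For scripting against the server, the curl call can be mirrored with only the standard library; a minimal client sketch (function names are mine; the raw content, analysis and all, comes back in the usual message field):

```python
import json
from urllib.request import Request, urlopen

def build_payload(model, system, user, max_tokens=512, temperature=0):
    # Same request body as the curl example above.
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(base_url, **kwargs):
    req = Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_payload(**kwargs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        body = json.load(resp)
    # raw text, internal analysis included
    return body["choices"][0]["message"]["content"]
```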

tokenizer.parse_response did not help

My first idea was to use the official tokenizer’s parse_response():

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "llm-jp/llm-jp-4-32b-a3b-thinking",
    trust_remote_code=True,
)

print(tokenizer.parse_response(raw_text))

But in practice it returned almost nothing useful:

{'role': 'assistant'}

That means the OpenAI-compatible text returned by vLLM did not match the response format the official tokenizer expected.

Practical workaround: extract assistant final

So for now, the simplest workaround is to parse out the assistant final section manually:

import re

match = re.search(r"assistant final\s*(.*)$", raw, flags=re.DOTALL)
if match:
    final_text = match.group(1).strip()
else:
    final_text = raw.strip()

That was enough to recover the final answer text reliably.

For example, from the earlier output, this gave:

悲しみや失望に近い感情です。

Applying this to OCR extraction

I then built a small extraction client on top of this setup:

  • read a .txt file
  • send it to vLLM through /chat/completions
  • extract assistant final
  • find the JSON array inside it
  • fill in pages if needed
  • save the result
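The steps above, sketched as one function. The record schema, function names, and the chat_fn hook are my own assumptions, not part of the official setup; chat_fn(text) is any callable returning the raw model output, e.g. a wrapper around /v1/chat/completions:

```python
import json
import re

def extract_records(page_text, chat_fn, default_page=1):
    # One OCR page -> list of records, via the thinking model.
    raw = chat_fn(page_text)

    # keep only the text after the "assistant final" marker
    m = re.search(r"assistant final\s*(.*)$", raw, flags=re.DOTALL)
    final = m.group(1).strip() if m else raw.strip()

    # decode only the first JSON array; the model may append prose after it
    start = final.index("[")
    records, _ = json.JSONDecoder().raw_decode(final[start:])

    # fill in pages if the model omitted them
    for rec in records:
        rec.setdefault("page", default_page)
    return records
```

With a stub chat_fn this is easy to unit-test before pointing it at the real server.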

One practical problem was that the model sometimes appended extra explanation text after the JSON array. My first attempt used a greedy regex like \[.*\], which caused:

json.decoder.JSONDecodeError: Extra data

The fix was to use json.JSONDecoder().raw_decode() and extract only the first valid array.
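The difference shows up as soon as the trailing explanation itself contains a bracket (made-up output for illustration):

```python
import json
import re

raw = '[{"page": 1}]\nNote: the output is an array of [page, text] records.'

# the greedy match runs to the LAST ']', swallowing part of the trailing note
greedy = re.search(r"\[.*\]", raw, flags=re.DOTALL).group(0)
try:
    json.loads(greedy)
except json.JSONDecodeError as e:
    print("greedy fails:", e.msg)   # "Extra data"

# raw_decode stops after the first complete JSON value
start = raw.index("[")
records, _ = json.JSONDecoder().raw_decode(raw[start:])
print(records)   # [{'page': 1}]
```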

Current takeaway

At this point my conclusions are:

  • the official LLM-jp-4-32b-a3b-thinking does run on A100 40GB x2
  • direct Transformers + device_map="auto" was not sufficient for generation
  • vLLM + tensor_parallel_size=2 worked
  • the OpenAI-compatible API itself works
  • but the thinking model exposes internal analysis
  • tokenizer.parse_response() did not match the vLLM output format in my setup
  • for now, client-side assistant final extraction is needed

So the infrastructure side is in good shape. The remaining work is mostly on response handling and extraction prompting.

Next steps

The next things I want to test are:

  • multiple OCR pages instead of just one
  • prompt tuning for extraction quality
  • reducing reliance on assistant final
  • comparing against the base model
  • tuning stop conditions or templates on the vLLM side

The main point is: serving the model is solved. Making the thinking output clean enough for downstream batch processing still needs some work.

References