I set up vLLM on an mdx.jp GPU server and exposed it externally through Cloudflare Tunnel so that it could be used as an OpenAI-compatible API from a local machine or another server.

This article focuses on the publishing and connectivity side rather than the model-serving side itself. For the separate note on running LLM-jp-4-32b-a3b-thinking on vLLM, see:

What I wanted

The target setup was:

  • run vLLM in Docker on mdx.jp
  • keep inbound ports closed on the server
  • reach the API through Cloudflare Tunnel at a hostname such as https://llm.example.jp/v1/...
  • use Cloudflare Zero Trust for SSH as well

Conceptually, it looked like this:

Local machine
  ├── https://llm.example.jp/v1/chat/completions
  └── ssh mdx-llm-cf
          │
      Cloudflare
          │
  cloudflared on mdx.jp
      ├── host.docker.internal:8000 -> vLLM API
      └── host.docker.internal:22   -> SSH

The key point is that vLLM and cloudflared run as separate containers, and cloudflared forwards traffic to the host-side vLLM endpoint.

Environment

  • Platform: mdx.jp
  • OS: Ubuntu 22.04
  • GPU: A100 40GB x2
  • docker installed
  • a Cloudflare account with a managed domain
  • a Hugging Face token already configured

I confirmed the GPU layout with:

nvidia-smi --query-gpu=name,memory.total --format=csv,noheader

Example output:

NVIDIA A100-SXM4-40GB, 40960 MiB
NVIDIA A100-SXM4-40GB, 40960 MiB
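As a side note, the CSV form of that output is easy to consume programmatically. A minimal sketch (my own helper, not part of the setup) that sums the reported memory across GPUs:

```python
def total_vram_mib(csv_text: str) -> int:
    """Sum the memory.total values (in MiB) from
    `nvidia-smi --query-gpu=name,memory.total --format=csv,noheader` output."""
    total = 0
    for line in csv_text.strip().splitlines():
        # Each line looks like: "NVIDIA A100-SXM4-40GB, 40960 MiB"
        _name, mem = line.rsplit(",", 1)
        total += int(mem.strip().split()[0])  # number before "MiB"
    return total

sample = """NVIDIA A100-SXM4-40GB, 40960 MiB
NVIDIA A100-SXM4-40GB, 40960 MiB"""
print(total_vram_mib(sample))  # → 81920
```

With two 40 GB cards, that 81920 MiB total is what has to hold the 32B model under tensor parallelism.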

Starting vLLM

I first started vLLM on the host:

HF_TOKEN=$(cat ~/.cache/huggingface/token)

docker run -d \
  --name vllm \
  --restart unless-stopped \
  --runtime nvidia \
  --gpus all \
  --ipc=host \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  vllm/vllm-openai:latest \
  --host 0.0.0.0 \
  --port 8000 \
  --model llm-jp/llm-jp-4-32b-a3b-thinking \
  --trust-remote-code \
  --tensor-parallel-size 2

The important part here is --host 0.0.0.0.

At first, I had the server effectively bound to 127.0.0.1:8000, and Cloudflare returned 502. That was not a vLLM failure. The issue was that cloudflared was running in another container, so a localhost-only bind was not reachable from there. Switching to 0.0.0.0 fixed that.

I verified the local server with:

curl http://localhost:8000/v1/models
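Since the model load can take minutes after docker run, a polling variant of that check is convenient. This is a standard-library sketch of my own; only the endpoint comes from the setup above, and the opener is injectable purely so the loop can be exercised without a live server:

```python
import json
import time
import urllib.error
import urllib.request

def wait_for_vllm(url="http://localhost:8000/v1/models",
                  retries=30, delay=2.0,
                  opener=urllib.request.urlopen):
    """Poll the vLLM models endpoint until it answers, then return its JSON.

    Raises RuntimeError if the server never becomes reachable.
    """
    for _ in range(retries):
        try:
            with opener(url) as resp:
                return json.loads(resp.read())
        except (urllib.error.URLError, ConnectionError):
            time.sleep(delay)
    raise RuntimeError(f"vLLM not ready after {retries} attempts")
```

On the host, `wait_for_vllm()` with the defaults blocks until the container has finished loading the weights.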

Creating the Cloudflare Tunnel

Here I assume a remote-managed tunnel, because the runtime example below uses run --token. In practice, that means creating the tunnel first in the Zero Trust dashboard (or via the API), assigning hostnames for the API and for SSH, and then copying the tunnel token (TUNNEL_TOKEN).

Note that cloudflared tunnel create belongs to the locally-managed flow, which is driven by a credentials file and a local config file rather than a token, so I deliberately do not mix it with the run --token example below.

Running cloudflared

On the server, I started cloudflared like this:

docker run -d \
  --name cloudflared \
  --restart unless-stopped \
  --add-host host.docker.internal:host-gateway \
  cloudflare/cloudflared:latest \
  tunnel --protocol http2 run --token <TUNNEL_TOKEN>

The --add-host host.docker.internal:host-gateway flag matters. Without it, the name host.docker.internal does not resolve inside the container on Linux, so the tunnel has no route to ports 8000 and 22 on the host.

I also explicitly used --protocol http2. cloudflared defaults to QUIC, which needs outbound UDP to Cloudflare; on networks where that UDP traffic is blocked or unreliable, forcing HTTP/2 over TCP is the safer choice, and it was in this environment.

Ingress configuration

The tunnel ingress was configured conceptually like this:

{
  "config": {
    "ingress": [
      {
        "hostname": "llm.example.jp",
        "service": "http://host.docker.internal:8000"
      },
      {
        "hostname": "ssh-llm.example.jp",
        "service": "ssh://host.docker.internal:22"
      },
      {
        "service": "http_status:404"
      }
    ]
  }
}

The API hostname points to the vLLM OpenAI-compatible server, and the SSH hostname points to the host SSH daemon.

The final http_status:404 entry is the required catch-all rule.
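Because a missing or misplaced catch-all is an easy mistake to make when editing the ingress list, a small sanity check helps. This is my own illustrative helper mirroring cloudflared's ordering rule, not a cloudflared feature:

```python
def check_ingress(rules):
    """Validate cloudflared-style ingress ordering: every rule except the
    last must match on a hostname; the last rule is the hostname-less
    catch-all default."""
    if not rules or "hostname" in rules[-1]:
        raise ValueError("last ingress rule must be a catch-all without a hostname")
    for rule in rules[:-1]:
        if "hostname" not in rule:
            raise ValueError(f"non-final rule has no hostname: {rule}")

ingress = [
    {"hostname": "llm.example.jp", "service": "http://host.docker.internal:8000"},
    {"hostname": "ssh-llm.example.jp", "service": "ssh://host.docker.internal:22"},
    {"service": "http_status:404"},
]
check_ingress(ingress)  # passes silently
```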

Testing from outside

Once the tunnel was configured, I could check the model list from outside:

curl https://llm.example.jp/v1/models

In this setup, the response looked like:

{
  "object": "list",
  "data": [
    {
      "id": "llm-jp/llm-jp-4-32b-a3b-thinking"
    }
  ]
}
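On the client side, usually only the id values matter, for example to fill the model field of later requests. Parsing the response above takes one line:

```python
import json

# The /v1/models response body shown above.
response_text = '''
{
  "object": "list",
  "data": [
    {"id": "llm-jp/llm-jp-4-32b-a3b-thinking"}
  ]
}
'''

model_ids = [m["id"] for m in json.loads(response_text)["data"]]
print(model_ids)  # → ['llm-jp/llm-jp-4-32b-a3b-thinking']
```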

Chat completions also worked:

curl https://llm.example.jp/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llm-jp/llm-jp-4-32b-a3b-thinking",
    "messages": [
      {"role": "system", "content": "You are a Japanese assistant."},
      {"role": "user", "content": "What is the capital of Japan? Answer in one sentence."}
    ],
    "temperature": 0,
    "max_tokens": 256
  }'

Because this was a thinking model, the response could include something like:

analysis ...
assistant final Tokyo is the capital of Japan.

That is not really a vLLM issue. It is more a property of the model behavior. If the downstream client expects a clean final answer only, some post-processing is needed.
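One minimal form of that post-processing is to cut everything before the final-answer marker. The sketch below assumes the "assistant final" marker seen in the example output; the actual marker depends on the model and chat template, so treat it as a placeholder:

```python
def final_answer(text: str, marker: str = "assistant final") -> str:
    """Return the text after the last final-answer marker, or the whole
    text if the marker is absent. The marker string is model-dependent."""
    idx = text.rfind(marker)
    if idx == -1:
        return text.strip()
    return text[idx + len(marker):].strip()

raw = "analysis The user asks for the capital...\nassistant final Tokyo is the capital of Japan."
print(final_answer(raw))  # → Tokyo is the capital of Japan.
```

Falling back to the full text when the marker is missing keeps the function safe for models (or prompts) that answer directly.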

Is using a thinking model with vLLM a natural choice?

Yes, using a thinking model with vLLM is natural in the sense that vLLM is a reasonable serving layer for large GPU-hosted models. If the goal is fast serving plus an OpenAI-compatible API, vLLM is one of the most straightforward choices.

However, that does not mean the output shape is ideal for every API use case. A thinking model may expose intermediate text such as analysis, which makes it less convenient for applications that want clean structured output or final-answer-only behavior.

So the practical fit depends on the use case:

  • if you want stronger reasoning and can tolerate output cleanup, thinking + vLLM is fine
  • if you want a clean production-style API, a non-thinking or instruct-style model is usually simpler

More straightforward alternatives

If the real goal is to build a stable API service, there are simpler options.

1. Use a non-thinking model

This is the most straightforward option.

  • less output cleanup
  • less client-side parsing
  • fewer surprises in downstream pipelines

For production-style APIs, that is often the better default.

2. Keep the thinking model for internal batch jobs

Another reasonable pattern is to use the thinking model only for internal workloads such as:

  • OCR text extraction
  • labeling
  • candidate generation
  • pre-processing before human review

In those cases, output quirks are easier to absorb in client-side post-processing.

3. Split production and experimentation

This may be the most practical arrangement:

  • public or shared API: non-thinking model
  • internal testing or harder reasoning tasks: thinking model

That keeps the external API simpler while still preserving the option to use a reasoning-heavy model where it actually helps.

Zero Trust SSH

I also routed SSH through Cloudflare.

One important detail is that if you want the browser-based Zero Trust SSH flow, you also need a Cloudflare Access SSH application and policy for ssh-llm.example.jp. The local ~/.ssh/config entry alone does not create the authentication step.

With that in place, my local ~/.ssh/config looked like this:

Host mdx-llm-cf
  HostName ssh-llm.example.jp
  User mdx-user01
  IdentityFile ~/.ssh/mdx/id_rsa
  ProxyCommand cloudflared access ssh --hostname %h

Then I could connect with:

ssh mdx-llm-cf

On first use, this opens a browser-based Cloudflare Access authentication flow.

What actually caused trouble

The three main practical issues were these.

1. A localhost bind caused 502

This was the easiest trap to fall into.

  • vLLM was running
  • curl http://localhost:8000/v1/models worked on the host
  • but https://llm.example.jp/v1/models returned 502

The reason was simply that cloudflared was running in another container and could not reach a localhost-only bind. Changing vLLM to --host 0.0.0.0 fixed it.
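The failure mode is easy to reproduce without Docker: a listener bound to 127.0.0.1 only accepts connections addressed to exactly that address. On Linux the whole 127.0.0.0/8 range is loopback, so 127.0.0.2 can stand in for "a different route to the host", which is roughly what the cloudflared container is:

```python
import errno
import socket

# Bind a listener to 127.0.0.1 only -- the situation vLLM was in
# before switching to --host 0.0.0.0.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
port = srv.getsockname()[1]

# A connection addressed to 127.0.0.1 succeeds (connect_ex returns 0) ...
assert socket.socket().connect_ex(("127.0.0.1", port)) == 0

# ... but the same port reached via 127.0.0.2 is refused by the kernel,
# just like the tunnel container's requests were.
assert socket.socket().connect_ex(("127.0.0.2", port)) == errno.ECONNREFUSED

srv.close()
```

Binding to 0.0.0.0 (all interfaces) makes both connections succeed, which is exactly what fixed the 502.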

2. host.docker.internal was required

Since the tunnel container had to forward to host services, I needed:

--add-host host.docker.internal:host-gateway

Without that, service: http://host.docker.internal:8000 did not work.

3. The API becomes publicly reachable unless protected

https://llm.example.jp/v1/... is convenient, but it can easily become an unauthenticated public endpoint. That is a poor default for an LLM API because requests are expensive and abuse is easy.

The practical options are:

  • protect it with Cloudflare Access
  • do not expose the API hostname publicly and use SSH port forwarding instead

For small-team or personal use, either of those is safer than leaving the endpoint open.
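If the hostname is put behind Cloudflare Access, browsers go through the interactive login, while scripts can authenticate with an Access service token sent as two request headers. The header names are Cloudflare's; the token values below are placeholders for the credentials issued in the Zero Trust dashboard:

```python
import urllib.request

def access_request(url, client_id, client_secret):
    """Build a request carrying Cloudflare Access service-token headers.

    CF-Access-Client-Id / CF-Access-Client-Secret are the headers that
    Cloudflare Access checks for non-interactive authentication.
    """
    return urllib.request.Request(url, headers={
        "CF-Access-Client-Id": client_id,
        "CF-Access-Client-Secret": client_secret,
    })

# Placeholder credentials -- real values come from the Zero Trust dashboard.
req = access_request("https://llm.example.jp/v1/models",
                     "<service-token-id>", "<service-token-secret>")
```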

Conclusion

Exposing vLLM on mdx.jp through Cloudflare Tunnel worked well once the networking details were correct. The important points were making vLLM listen on 0.0.0.0, giving cloudflared access to host.docker.internal, and deciding early whether the API should be public at all.

If the goal is simply to make a GPU-hosted LLM reachable through an OpenAI-compatible API, this approach is practical. But if the model is a thinking model, the cleaner design may be to reserve it for internal jobs and expose a more conventional instruct-style model for external API use.