I migrated a set of production web services from a configuration where DNS pointed directly at the origin (Docker + Traefik on a VPS) to one where CloudFront + AWS WAF sit in front of the origin. This article summarises the patterns I used and the pitfalls I did not expect, in a general form.
The goal is to help anyone migrating a similar setup avoid the same mistakes.
Before
```
Browser ──► DNS ──► Origin IP (reverse proxy: Traefik on VPS)
                      ├── service-a (equivalent to cultural.jp)
                      ├── service-b (equivalent to api.cultural.jp)
                      └── service-c (equivalent to webcatplus.jp)
```
- Each service is a Docker container.
- Traefik routes by the Host header and terminates TLS with Let’s Encrypt (HTTP-01).
- The CrowdSec bouncer plugin handles attack detection.
After
```
Browser ──► DNS ──► CloudFront ──► origin domain ──► Traefik
                        │          (origin.example.com)
                        └── WAF (OWASP / known bad inputs / IP reputation / rate limit)
```
Three key points:
- Set up a separate subdomain for the origin (origin.<service>.example.com).
- Inject a “shared secret” as a custom header between CloudFront and the origin.
- Reuse the existing wildcard ACM certificate on CloudFront.
1. Why set up an origin-only subdomain
When this pattern is needed
This pattern tends to be necessary under one of the following conditions:
- You are retrofitting CloudFront in front of a domain that currently receives direct traffic (migration).
- The origin is a VPS / non-AWS host and only has its own TLS certificate.
- The origin is an AWS ALB, but the ALB certificate is not an *.elb.amazonaws.com-style cert (e.g. a custom *.example.com cert).
For a greenfield build on S3, or on an ALB with a native DNS name whose cert is sufficient, you can point CloudFront directly at the origin without a separate subdomain.
Why a separate name is needed
If you set example.com as CloudFront’s Alternate Domain Name (CNAME), you cannot also specify example.com as the origin (DNS loop).
The origin side has to be reachable under a different name.
When using an ALB as its native DNS name (dualstack.*.elb.amazonaws.com) as the origin, the certificate the ALB’s HTTPS listener returns (in most cases *.example.com) does not match the hostname CloudFront validates against, and TLS validation fails. Since ACM cannot issue *.elb.amazonaws.com certificates, setting up an origin-only subdomain is the standard solution.
Recommended: add a new origin.<service>.<domain> name
```
example.com         ALIAS → CloudFront
origin.example.com  A     → existing origin IP
```
CloudFront hits origin.example.com as its origin, and Traefik accepts traffic with a router matching that hostname.
Why this approach works well:
- You can run both in parallel with the existing domain (you can verify behaviour before the DNS cutover).
- A wildcard cert *.example.com is usable on the origin side too (either Let’s Encrypt or ACM).
- If you later want to move the origin to a different server, you only swap the CNAME; CloudFront configuration stays untouched.
Approaches to avoid:
- Hardcoding the origin IP as CloudFront’s origin → CloudFront cannot send SNI and TLS validation fails.
- Using the ALB’s dualstack.*.elb.amazonaws.com as the origin name → the ELB cert and SNI do not line up and validation fails (the cert is usually issued for *.example.com).
Watch the subdomain depth
A wildcard cert *.example.com only covers one level.
- origin.example.com → ✅ covered
- origin.api.example.com → ❌ not covered (two levels)
So for an origin corresponding to api.example.com, keeping it to a single level as origin-api.example.com (instead of origin.api.example.com) lets the existing wildcard cert keep working.
In my case, I tried to cover the API domain with origin.api.example.com relying on the wildcard, hit a TLS error, and ended up switching to origin-api.example.com to be consistent (wasted time).
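The one-level rule can be sanity-checked in a few lines. This sketch mirrors the wildcard-matching behaviour described above (the function name is mine, not from any library):

```python
def wildcard_covers(wildcard: str, hostname: str) -> bool:
    """True if a one-level wildcard cert name covers hostname.

    Per RFC 6125, "*.example.com" matches exactly one DNS label,
    so "origin.example.com" matches but "origin.api.example.com" does not.
    """
    if not wildcard.startswith("*."):
        return wildcard == hostname
    suffix = wildcard[1:]              # ".example.com"
    if not hostname.endswith(suffix):
        return False
    label = hostname[: -len(suffix)]   # the part the "*" must cover
    return bool(label) and "." not in label
```

Running this against the two candidate names above makes the trap obvious before you hit the TLS error in production.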
2. Origin protection: secret header
For origin protection in front of CloudFront, a custom header plus header validation at the reverse proxy is simple and effective.
On the CloudFront side
```hcl
custom_header {
  name  = "X-Origin-Secret"
  value = var.origin_secret # openssl rand -hex 32
}
```
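The secret itself only needs to be generated once and stored (e.g. in your secret manager or Terraform variables). Python’s secrets module produces the same shape of value as the openssl rand -hex 32 in the comment above:

```python
import secrets

# 32 random bytes rendered as 64 hex characters,
# equivalent to `openssl rand -hex 32`.
origin_secret = secrets.token_hex(32)
```

Any cryptographically random value of similar length works; the important part is that it never appears in responses or logs.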
On the Traefik side (just add a Headers match to the router rule)
```yaml
labels:
  traefik.http.routers.svc-origin.rule: >-
    Host(`origin.example.com`) &&
    Headers(`X-Origin-Secret`, `${ORIGIN_SECRET}`)
```
Requests without a matching header get a 404 (the router itself does not fire). So even if an attacker identifies the origin IP and hits it directly, they only see an unrelated 404.
Nginx version
```nginx
server {
    server_name origin.example.com;
    if ($http_x_origin_secret != "${ORIGIN_SECRET}") {
        return 404;
    }
    ...
}
```
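If the origin validates the header in application code rather than in Traefik or nginx, a constant-time comparison avoids leaking the secret byte-by-byte through response timing. A minimal sketch (the helper name is mine):

```python
import hmac

def origin_request_allowed(headers: dict, secret: str) -> bool:
    # hmac.compare_digest runs in constant time, so an attacker
    # cannot probe the secret via timing differences.
    supplied = headers.get("X-Origin-Secret", "")
    return hmac.compare_digest(supplied, secret)
```

Returning the same 404 as the proxy-level variants keeps the behaviour indistinguishable from a non-existent route.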
If you want to harden further
- On an ALB, you can restrict the ALB SG to allow only the com.amazonaws.global.cloudfront.origin-facing managed prefix list (see below).
- On a VPS you can also allow-list CloudFront’s IP ranges at the firewall, but the ranges are large and change frequently, so the secret header alone tends to be enough in practice.
3. Splitting the ALB ⇄ EC2 SG (pitfall)
This was the biggest incident in this migration.
The ALB and the EC2 instances behind it shared the same Security Group. When I tried to narrow the ALB inbound down to the CloudFront prefix list by revoking 0.0.0.0/0 for 80/443, the internal ALB → EC2 traffic was collateral damage and got cut off as well.
Recommended SG layout
```
[ALB SG]                        [EC2 SG]
inbound:                        inbound:
  443 from CF prefix list         80 from ALB SG (SG-to-SG reference)
                                  22 from admin IPs
```
Create the ALB SG and EC2 SG as separate groups. When they are shared, narrowing one narrows both, which is exactly what caused the incident.
Splitting an existing environment where the SG is shared (with zero downtime)
1. Create a new SG for EC2 (allow inbound 80 from the ALB SG).
2. Attach both SGs to the EC2 ENI.
3. Verify behaviour.
4. Detach the old SG from the ENI.
5. Reorganise the old (shared) SG into ALB-only rules.
As long as you go through “attach both → verify → detach one” in the middle, you can cut over without downtime.
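The attach-both → verify → detach sequence can be sketched as operations on the ENI’s security-group list; in practice each intermediate state is applied with ec2.modify_network_interface_attribute. The SG IDs below are hypothetical:

```python
def attach(groups, sg):
    # Step 2: add the new SG alongside the old one (idempotent).
    return groups if sg in groups else groups + [sg]

def detach(groups, sg):
    # Step 4: remove one SG, keeping the rest attached.
    return [g for g in groups if g != sg]

# Each state would be pushed with
# ec2.modify_network_interface_attribute(NetworkInterfaceId=eni, Groups=state).
states = [["sg-shared"]]
states.append(attach(states[-1], "sg-ec2-new"))   # both attached -> verify here
states.append(detach(states[-1], "sg-shared"))    # old SG removed
```

The invariant worth checking at every step is that the group list is never empty, which is exactly why the order matters.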
4. Cache key design
Baseline policy
| Item | Value | Reason |
|---|---|---|
| Path | ✅ | Obvious |
| Query string (all) | ✅ | Keep separate cache entries per API/SPARQL parameter |
| Cookie | ❌ | Hit rate drops dramatically |
| Authorization | ❌ | Same as above; split into a separate policy for authenticated traffic |
| Accept-Encoding (gzip/br) | ✅ (CF automatic) | Separate cache per compression format |
| Accept-Language | ❓ | Not needed if i18n is in the URL as /ja/, /en/ |
If i18n is URL-path-based (/ja/foo, /en/foo), you can leave Accept-Language out of the cache key. That makes a big difference — including it fragments the cache per bot/browser and hurts efficiency.
Choosing TTLs
- Default TTL = 24h
- Max TTL = 7 days
- Min TTL = 0 (respect the origin’s Cache-Control)
Even for an API where the origin often does not return Cache-Control, CloudFront will cache based on the Default TTL.
For low-update-frequency sites, 1–7 days causes little practical harm, and invalidation on deploy makes propagation instant.
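CloudFront’s documented rules for combining these settings with the origin’s Cache-Control can be sketched as a small function (defaults match the TTLs above):

```python
def effective_ttl(origin_max_age, min_ttl=0, default_ttl=86_400, max_ttl=604_800):
    """TTL CloudFront applies, per its documented Min/Default/Max TTL rules."""
    if origin_max_age is None:            # origin sent no Cache-Control
        return default_ttl                # -> Default TTL (24h here)
    # Otherwise the origin's max-age is clamped into [Min TTL, Max TTL].
    return min(max_ttl, max(min_ttl, origin_max_age))
```

So an API that sends nothing gets the 24h default, while an origin that sends max-age=60 is respected because Min TTL is 0.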
An extra policy for static assets
Hash-bearing paths such as /_next/static/* or /img/* use a separate policy:
- Min/Default/Max = 1 year
- Permanent cache (invalidation unnecessary because the filename includes a hash)
Effect on SPARQL and search APIs
GET requests with query strings hit for 24h as long as the URL is byte-identical.
- Heavy SPARQL DESCRIBE query: first 1.1 s → warm 50 ms (~20× speedup)
- Faceted search API: first 480 ms → warm 45 ms (~10×)
- The effect is largest for UIs that repeatedly issue the same query (infinite scroll, facet spamming).
Caveats:
- POST requests are not cached (CloudFront forwards every POST to the origin).
- SPARQL endpoints often switch between “short queries via GET, long queries via POST” on the client side, so the benefit grows if you can bias towards GET.
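Because hits require byte-identical URLs, clients that emit query parameters in a deterministic order get the most out of the cache. A small canonicalisation sketch (the helper name is mine):

```python
from urllib.parse import urlencode

def canonical_url(path: str, params: dict) -> str:
    # Sort parameters so semantically equal requests
    # collapse onto a single cache entry.
    return path + "?" + urlencode(sorted(params.items()))
```

Two requests that differ only in parameter order then produce the same URL and share one warm entry instead of two cold misses.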
5. Always start WAF in COUNT mode
One of the reasons I brought in WAF was that “CrowdSec was producing frequent false bans”. To avoid making the same mistake, I also ran WAF with all rules in COUNT mode for the first week.
What COUNT mode means
Matching rules do not block, they only log. CloudWatch Metrics and Sampled Requests let you see “how many requests would have been blocked if the rule were enforcing”.
Rules I enabled
| Rule | Purpose |
|---|---|
| AWSManagedRulesCommonRuleSet | XSS / SQLi / common attacks |
| AWSManagedRulesKnownBadInputsRuleSet | Known bad inputs (Log4Shell, etc.) |
| AWSManagedRulesAmazonIpReputationList | Malicious IPs (IP reputation) |
| Rate-based rule | 1000 req/IP per 5 minutes (DoS mitigation) |

Rules I skipped:
- AWSManagedRulesBotControlRuleSet: separately priced and fairly prone to false positives, so I left it off at the start.
Switching modes in Terraform
Writing override_action so it can be toggled via a variable makes it a one-command job to move from the initial count to production none (block enabled).
```hcl
variable "waf_rule_action" {
  type    = string
  default = "count"

  validation {
    condition     = contains(["count", "block"], var.waf_rule_action)
    error_message = "Must be 'count' or 'block'."
  }
}

locals {
  managed_override = var.waf_rule_action == "block" ? "none" : "count"
}

resource "aws_wafv2_web_acl" "cf" {
  rule {
    name = "AWSManagedRulesCommonRuleSet"

    override_action {
      dynamic "count" {
        for_each = local.managed_override == "count" ? [1] : []
        content {}
      }
      dynamic "none" {
        for_each = local.managed_override == "none" ? [1] : []
        content {}
      }
    }
    ...
  }
}
```
A week of COUNT operation is enough to identify rules that produce false positives, so you can flip only the safe rules to block.
Why I dropped CrowdSec in favour of WAF
CrowdSec is behaviour-analysis based, and scenarios like http-crawl-non_statics ban “many dynamic-resource requests in a short time”. That means:
- Static site / low traffic → ◎ works well
- Interactive web app + search API → △ normal users get banned
My stack is mostly the latter (faceted search, infinite scroll), which does not play well with CrowdSec. WAF Managed Rules judge on request patterns, so they mostly do not react to request volume itself, and the structure tends to produce fewer false positives.
6. A gradual cutover sequence
To cut over to production with no downtime, this order works well:
1. Add origin DNS (origin.example.com)
2. Add an "origin router" on the reverse proxy (running in parallel with the existing one)
3. Build CloudFront (do not touch DNS yet)
4. Verify via the CF dist-domain (xxxx.cloudfront.net) using curl --resolve or /etc/hosts
5. Verify in the browser using /etc/hosts (only your environment goes via CF)
6. DNS cutover (point the ALIAS at CloudFront)
7. Remove the old router (after the DNS TTL expires)
The important thing is that there is room to test between 5 and 6. The CloudFront distribution domain (xxxx.cloudfront.net) is issued as soon as the distribution is created, so you can verify production-equivalent behaviour with curl and /etc/hosts before switching real DNS.
```
# /etc/hosts
18.65.x.x example.com
18.64.y.y api.example.com
```
At this step, watch out for DNS resolver behaviour. In my environment, the local resolver answered REFUSED for cloudfront.net queries, so I had to run dig @8.8.8.8 ... explicitly to get CloudFront IPs.
7. Wire invalidation into CI
You need one extra step to clear cache after a deploy.
Minimal IAM policy
```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "cloudfront:CreateInvalidation",
    "Resource": "arn:aws:cloudfront::<account>:distribution/*"
  }]
}
```
GitHub Actions step
```yaml
- name: Invalidate CloudFront
  env:
    AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
    AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
    AWS_DEFAULT_REGION: us-east-1
  run: |
    aws cloudfront create-invalidation \
      --distribution-id ${{ secrets.CF_DIST_ID }} \
      --paths "/*"
```
For services where articles are updated via a CMS, you can do the equivalent from the CMS webhook → Lambda → invalidation.
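For the CMS webhook → Lambda route, a minimal handler sketch using boto3 might look like this (the "DIST_ID" placeholder and the helper name are mine):

```python
import time

def build_batch(paths):
    # Pure helper so the payload shape is testable;
    # CallerReference must be unique per invalidation request.
    return {
        "Paths": {"Quantity": len(paths), "Items": list(paths)},
        "CallerReference": str(time.time()),
    }

def handler(event, context):
    import boto3  # imported lazily; needs AWS credentials at runtime
    # "DIST_ID" is a hypothetical placeholder for the distribution ID.
    boto3.client("cloudfront").create_invalidation(
        DistributionId="DIST_ID",
        InvalidationBatch=build_batch(["/*"]),
    )
```

The IAM policy above is sufficient for this Lambda as well, since it only ever calls cloudfront:CreateInvalidation.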
8. Measured speed improvement
Measured from a Tokyo client against a Tokyo origin (VPS):
| Path | TTFB |
|---|---|
| Direct to origin | ~70 ms |
| CloudFront cold (first-time cache miss) | ~125 ms (first request only) |
| CloudFront warm (cache hit) | ~45 ms |
Even Tokyo-to-Tokyo gets 35–60% faster on cache hit. Contributing factors:
- Edge-to-origin connections are kept alive, so TLS handshakes to the origin are reused.
- TCP/TLS round trips are shorter.
- Compression (brotli) happens at the edge.
The effect is even larger for overseas users (the trans-Pacific hop to the origin disappears).
Cold miss is genuinely slow, but with a 24h TTL, the distribution is “only the first request is slow; everything in the next 24h is fast”.
9. Monthly cost
| Item | Rough cost |
|---|---|
| WAF Web ACL | $5/month |
| WAF Managed Rules (3) | $3/month |
| WAF request-based charges | $0.60/million requests |
| CloudFront data transfer (first 1 TB/month is free) | ~$0–5/month |
| ACM certificate | Free |
| Route53 ALIAS | Free (Hosted Zone $0.50/month only) |
All in, around $10–15/month. One WAF ACL can be shared across distributions for multiple domains, so it is surprisingly cheap.
10. Side benefits I did not expect
- I was able to stop exposing Elasticsearch externally. A good chance to reduce the attack surface.
- Reorganising Traefik labels in docker-compose.yml clarified each router’s responsibilities.
- By managing things as IaC with Terraform, adding a new domain is a matter of adding one entry to the sites = { ... } map.
- I got out from under CrowdSec’s operational cost (no more false-positive handling or allow-list maintenance).
11. Things that went wrong, lessons learned
| Situation | Mistake | Lesson |
|---|---|---|
| ALB SG | Removing 0.0.0.0/0 from the shared SG also cut off internal traffic | Separate ALB SG and EC2 SG from the start |
| Origin name | Used origin.api.example.com (two levels), outside wildcard cert coverage | Keep it to one level, e.g. origin-api.example.com |
| Let’s Encrypt | HTTP-01 requires port 80; a later CF-only cutover can cause renewal to stall | Consider switching to DNS-01 alongside the migration |
| DNS resolver | Local DNS REFUSED cloudfront.net | Use dig @8.8.8.8 explicitly |
| OGP rendering | The first entry of the API’s description array happened to be “none” / “unknown” | Filter empty values and join multiple values |
| CrowdSec | Frequent false bans against API-heavy users | WAF Managed Rules are a safer choice than behaviour-based filters |
Summary
The pattern of retrofitting CloudFront + WAF in front of an existing origin comes down to:
- Introduce a separate origin-only domain (runs in parallel, and makes later migrations easy).
- Effectively hide the origin using a secret header.
- Run WAF in COUNT mode for a week, then switch to block.
- Separate ALB and EC2 SGs from the start.
- Model sites as a for_each’d map in Terraform (easy to extend).
- Automate invalidation from CI on deploy.
Following this, you can raise both security and speed at the same time, with no downtime.
In particular, “separating the ALB and EC2 SGs” is a point to get right at the initial design stage. I am sharing this article so that others can avoid the same incident.