This article is co-authored with a generative AI. Facts have been cross-checked against official documentation where possible, but errors may remain. Please verify against primary sources before making any important decisions.

TL;DR

  • A cultural-archive site I run started seeing bot scraping
  • The main source started in AWS Singapore (ap-southeast-1), then pivoted to the US and on to CN after each round of mitigation
  • Rate-limiting didn’t catch it, so I responded in stages: IP block → Geo block → JA3 fingerprint block → UA block
  • This post lays out a reproducible response procedure and the cost considerations for this kind of attack

Trigger: spotted through GA4 anomalies

I run several public cultural-archive sites. While taking stock of GA4 properties, I noticed unusual numbers on the main site.

The usual PV > Sessions > Users “staircase” had flattened completely into PV ≒ Sessions ≒ Users. A single human visitor normally browses multiple pages, so the observed pattern was hard to explain as human traffic.

Multi-axis profiling with a custom script

I wrote a script using the GA4 Data API to analyse the traffic across multiple axes. Key excerpts:

PV/session = around 1.0   (humans: 2-5; values near 1.0 are bot-like)
sessions/user = around 1.0 (humans: 1.3-2+; values near 1.0 imply no cookie persistence)
avg session seconds = single digits  (< 5s is bot-suspicious)
bounce rate = extremely high (high 90s%)

By country:

Singapore: dominant share  (over 90% of total)
Vietnam, China, US: small
Japan:    very few  ← this is where legitimate users sit

By browser / OS / device / referrer:

Chrome on Windows: nearly 100%
desktop:           nearly 100%
(direct)/(none):   nearly 100%   ← no referrer
time of day:       flat 24h (no day/night cycle)

Landing pages were skewed toward book detail paths, suggesting the book DB was being crawled exhaustively.

In sum:

  • Spoofing Chrome on Windows (most likely Puppeteer / Playwright headless)
  • No cookies retained, so each visit counts as a new user
  • 24-hour continuous crawling
  • Concentrated in Singapore → cloud-hosted origin

The behavior was consistent enough to conclude that this was automated, cloud-originated scraping rather than human traffic.
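For reference, the kind of call the script makes can be reproduced with plain curl against the GA4 Data API REST endpoint. A minimal sketch, assuming $ACCESS_TOKEN holds an OAuth token with the analytics.readonly scope and <PROPERTY_ID> is the GA4 property ID (metric and dimension names are from the GA4 Data API schema):

curl -s -X POST \
  -H "Authorization: Bearer $ACCESS_TOKEN" \
  -H "Content-Type: application/json" \
  "https://analyticsdata.googleapis.com/v1beta/properties/<PROPERTY_ID>:runReport" \
  -d '{
    "dateRanges": [{"startDate": "7daysAgo", "endDate": "today"}],
    "dimensions": [{"name": "country"}],
    "metrics": [
      {"name": "screenPageViews"},
      {"name": "sessions"},
      {"name": "totalUsers"},
      {"name": "averageSessionDuration"},
      {"name": "bounceRate"}
    ]
  }'

Swapping the dimension for browser, operatingSystem, deviceCategory, or sessionSource reproduces the per-axis breakdowns above.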

Initial check: the AWS-side state

I started by inspecting the CloudFront and WAF configuration:

aws cloudfront list-distributions \
  --query 'DistributionList.Items[].{Id:Id,Domain:DomainName,WebACL:WebACLId}' \
  --output json

A shared Web ACL is applied across multiple distributions, with Managed Rules and rate-limit running as a baseline defense. WAF logs (CloudWatch Logs) and CloudFront standard access logs (S3) are also already enabled, so the attack history can be traced back through logs:

Rule 1: AWSManagedRulesCommonRuleSet
Rule 2: AWSManagedRulesKnownBadInputsRuleSet
Rule 3: AWSManagedRulesAmazonIpReputationList
Rule 4: rate-limit-per-ip
DefaultAction: Allow

That said, this baseline configuration alone wasn’t catching the distributed scraping pattern, so I started analysing the attack profile from the logs.

Identifying the attack profile from WAF logs

A single WAF log entry already includes the JA3 fingerprint:

{
  "action": "ALLOW",
  "terminatingRuleId": "Default_Action",
  "rateBasedRuleList": [
    {"rateBasedRuleName":"rate-limit-per-ip","limitKey":"IP",
     "limitValue":"<attacker-ip>"}
  ],
  "httpRequest": {
    "clientIp": "<attacker-ip>",
    "country": "SG",
    "uri": "/_nuxt/builds/meta/...",
    "host": "site-a.example"
  },
  "ja3Fingerprint": "<ja3-hash>",
  "ja4Fingerprint": "<ja4-hash>"
}

The key points are that country: "SG" is attached by default, and that JA3/JA4 are recorded as standard fields.

Aggregation in CloudWatch Logs Insights:

fields httpRequest.country as country, action
| stats count() as n by country, action
| sort n desc

Over the past 30 minutes, SG stood out by a wide margin, and its requests were uniformly ALLOW. Counting unique SG IPs gave several thousand IPs per 30 minutes, with each individual IP at only a few percent of the rate-limit threshold: extremely distributed scraping.
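The unique-IP count came from a query along these lines (count_distinct is approximate at high cardinality, so treat the figure as an estimate):

fields httpRequest.clientIp as ip
| filter httpRequest.country = "SG"
| stats count_distinct(ip) as unique_sg_ips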

Identifying the SG attacker IPs

Cross-referencing the top IP ranges against AWS’ published IP ranges (https://ip-ranges.amazonaws.com/ip-ranges.json) showed they fell within AWS ap-southeast-1 EC2 ranges. The attacker appears to have stood up many EC2 instances in AWS Singapore for distributed scraping.
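Extracting the relevant ranges from the published JSON is a one-liner with jq:

curl -s https://ip-ranges.amazonaws.com/ip-ranges.json \
  | jq -r '.prefixes[] | select(.region == "ap-southeast-1" and .service == "EC2") | .ip_prefix'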

Step 1: rate-limit doesn’t catch it

The first thing to verify is whether the active rate-limit-per-ip is catching this traffic. Aggregating WAF logs showed each IP was sending only a few percent of the threshold or less, so the rate-limit didn’t trigger. Lowering the threshold aggressively is an option, but this risks false positives against legitimate SPARQL clients or corporate NAT users, so it isn’t realistic. I accepted the rate-limit as an inherent limit and looked for another approach.

Step 2: IP range block

I registered the AWS SG range in an IPset and added a Block rule:

aws wafv2 create-ip-set --name block-aws-sg-range --scope CLOUDFRONT \
  --region us-east-1 --ip-address-version IPV4 --addresses <aws-sg-cidr>
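
The accompanying Block rule just references the IPset by ARN. A sketch of its shape, with VisibilityConfig omitted to match the other rule excerpts in this post (the priority value is illustrative):

{
  "Name": "block-aws-sg-range",
  "Priority": 0,
  "Statement": {
    "IPSetReferenceStatement": {"ARN": "<ip-set-arn>"}
  },
  "Action": {"Block": {}}
}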

Effect after a few minutes:

  • Blocks started appearing
  • But SG ALLOWs were still flowing through in large volume

Re-analysing the top IPs showed the attacker had already pivoted to a neighbouring /16. Adding those to the IPset only led to yet another /16 surfacing. The attacker was sliding through consecutive /16 blocks, so I concluded that IP-based whack-a-mole couldn’t keep up.

Step 3: Geo block as root-cause stop

The site is a Japanese-language cultural archive, and legitimate users in Singapore are statistically very few. If side effects are small, a Geo block is the cleanest option.

Rule: geo-block-sg
Priority: 1
Action: Block
Statement:
  GeoMatchStatement:
    CountryCodes: [SG]

After propagation, SG traffic was largely blocked. There was no observable impact on legitimate JP users, and the whack-a-mole state subsided.

At the same time, the IPset-based block had fulfilled its purpose, and since it was now redundant with the Geo block, I removed it to simplify the configuration.

Analysing the attack profile

With the bleeding stopped, I sat down to analyse the attack profile properly. Top URI aggregation in CloudWatch Logs Insights, over the past 60 minutes:

fields httpRequest.country as country, httpRequest.uri as uri
| filter country = "SG"
| stats count() as n by uri
| sort n desc
| limit 20

The top SG URIs were:

  • /sparql (Linked Data SPARQL endpoint)
  • /snorql/ (SPARQL UI)
  • /app/prx/...sparql (proxy relaying to other SPARQL endpoints)
  • Systematic crawling across multilingual content

The target was concentrated on the public SPARQL endpoint and Linked Data resources.

  • HTTP method: only GET and HEAD (read-only)
  • JA3 fingerprint: a single specific hash dominates (= same automation tool)
  • Managed Rule (CommonRuleSet, KnownBadInputs) matches: 0

So this isn’t exploiting a vulnerability. It’s systematic scraping of the public SPARQL endpoint and Linked Data. From the AWS WAF perspective, it’s a grey zone — “high-volume access to legitimate public resources rather than an attack” — but it does cause real harm in terms of server load and bulk data exfiltration.

Pivot: US traffic analysis

After the response, while watching the logs, I noticed Allowed Requests from the US had risen to dozens of times the JP volume. That ratio is unnatural for Japanese-language content, so I broke it down:

  • US ALLOW is dozens of times JP volume
  • /sparql is still being hit heavily
  • The same JA3 appeared (exactly matching the SG attack)

It looked as if the same attack tool had pivoted from SG to US.
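A JA3-by-country aggregation in Logs Insights makes this kind of pivot straightforward to confirm (field names as in the WAF log excerpt earlier):

fields ja3Fingerprint as ja3, httpRequest.country as country
| stats count() as n by ja3, country
| sort n desc
| limit 10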

That said, the US case differs from SG in that legitimate bots are mixed in:

IP / Range                      Identity
40.77.x.x                       Bingbot (Microsoft)
52.167.x.x                      Microsoft Azure (Bingbot)
66.249.x.x                      Googlebot
meta-externalagent UA           Meta crawler
Amazonbot UA                    Amazon crawler
(single IPs in PSINet/Cogent)   abnormal access concentration
(WebMeUp/BLEXBot)               same as above

The design needs to stop the attack without sweeping up legitimate bots.
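One way to separate genuine search-engine bots from imposters is the standard double lookup: reverse-resolve the IP, check the domain, then forward-confirm it. The IP and hostname here are illustrative:

# Reverse lookup: genuine Googlebot IPs resolve under googlebot.com
host 66.249.66.1
# Forward-confirm: the returned hostname must resolve back to the same IP
host crawl-66-249-66-1.googlebot.com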

Step 4: Individual IP + JA3 fingerprint block

Individual IP block (only the noticeably concentrated ones)

I registered just the few single IPs with conspicuously concentrated access into an IPset:

aws wafv2 create-ip-set --name block-aggressive-scrapers \
  --scope CLOUDFRONT --region us-east-1 --ip-address-version IPV4 \
  --addresses <ip1>/32 <ip2>/32 <ip3>/32

JA3 fingerprint block

JA3 is a fingerprint of the TLS Client Hello (an MD5 of TLS version / cipher suites / extensions / elliptic curves / EC point formats). As long as the attacker doesn’t change their TLS stack, the JA3 stays the same even when User-Agent or IP changes.

Block the JA3 that matched between the SG and US attacks:

{
  "Name": "block-attacker-ja3",
  "Priority": 6,
  "Statement": {
    "ByteMatchStatement": {
      "SearchString": "<ja3-hash>",
      "FieldToMatch": {
        "JA3Fingerprint": {"FallbackBehavior": "NO_MATCH"}
      },
      "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
      "PositionalConstraint": "EXACTLY"
    }
  },
  "Action": {"Block": {}}
}

This is a relatively stable signature block that keeps working even as IPs change, as long as the TLS stack stays the same. The cost is just one rule’s monthly fee.

Step 5: UA block rule (sweeping out LLM crawlers)

Block AI-training crawlers and SEO scrapers in bulk. First, create a RegexPatternSet (case-insensitive, matching multiple UAs):

aws wafv2 create-regex-pattern-set --name bot-ua-patterns --scope CLOUDFRONT \
  --region us-east-1 \
  --regular-expression-list \
    'RegexString=(?i)GPTBot' \
    'RegexString=(?i)CCBot' \
    'RegexString=(?i)ClaudeBot' \
    'RegexString=(?i)anthropic-ai' \
    'RegexString=(?i)Bytespider' \
    'RegexString=(?i)(PerplexityBot|OAI-SearchBot|ImagesiftBot|Diffbot|DataForSeoBot)' \
    'RegexString=(?i)(SemrushBot|AhrefsBot|MJ12bot|DotBot|BLEXBot)'

Then add a rule that blocks requests whose User-Agent header contains any of these patterns.
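A sketch of that rule, referencing the pattern set by ARN (VisibilityConfig again omitted):

{
  "Name": "block-bot-uas",
  "Priority": 7,
  "Statement": {
    "RegexPatternSetReferenceStatement": {
      "ARN": "<regex-pattern-set-arn>",
      "FieldToMatch": {"SingleHeader": {"Name": "user-agent"}},
      "TextTransformations": [{"Priority": 0, "Type": "NONE"}]
    }
  },
  "Action": {"Block": {}}
}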

The point is that this is scoped to AI-training and SEO-analysis bots only. Googlebot / Bingbot / Applebot (Siri search) are excluded, so search visibility is unaffected. Meta-ExternalAgent and Amazonbot serve dual roles (Facebook OGP previews / Alexa), so they were deliberately left out as well.

This swept out declared bots at minimum cost without bringing in Bot Control.

robots.txt placement (application layer)

In parallel with WAF, I placed a robots.txt on the SPARQL endpoint server (nginx).

# This is a SPARQL endpoint server, not intended for crawling.

# --- AI training crawlers: total denial ---
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Bytespider
Disallow: /
... (anthropic-ai, OAI-SearchBot, etc.)

# --- SEO scanners ---
User-agent: SemrushBot
Disallow: /
... (AhrefsBot, MJ12bot, DotBot, BLEXBot, DataForSeoBot)

# --- All other bots: avoid SPARQL endpoint and URI-deref paths ---
User-agent: *
Disallow: /sparql
Disallow: /snorql/
Disallow: /data/
Disallow: /entity/

robots.txt has zero enforcement, but:

  • Well-behaved crawlers read it and back off → never reach WAF (= no charge)
  • Crawlers that ignore robots.txt are stopped by the WAF UA block
  • Crawlers that spoof the UA are caught by JA3 / IPset / Geo block

I treat it as the upper layer of a three-tiered defense.

Recurrence detection via CloudWatch alarms

To run a “respond when it rings” pattern, I set up three alarms plus an SNS topic:

Alarm                             Metric                                                      Threshold (example)
waf-blocked-attack-sources-spike  BlockedRequests (sum across major attack-source countries)  spike-detection level
waf-sg-allowed-spike              AllowedRequests (Geo-block target country)                  bypass-detection level
waf-non-jp-allowed-spike          AllowedRequests (sum of non-JP)                             pivot-detection level

Notifications go through an SNS topic to email.

Metric Math pitfall

I first tried to aggregate dynamically with SEARCH('{AWS/WAFV2,Rule,WebACL,Country} ...'), but CloudWatch alarms don’t support SEARCH expressions in metric math:

ValidationError: SEARCH is not supported on Metric Alarms.

As a workaround I enumerated target countries explicitly with Metric Math like FILL(m_sg,0) + FILL(m_cn,0) + FILL(m_vn,0) + FILL(m_hk,0). FILL(m_xx, 0) treats missing data points as zero, which keeps the alarm calculation correct even when traffic from a given country is near zero.
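A sketch of one such alarm via the CLI. The WebACL / Rule / Country dimension set follows the SEARCH expression above; verify the dimensions actually published for your web ACL with list-metrics before relying on this:

aws cloudwatch put-metric-alarm \
  --alarm-name waf-non-jp-allowed-spike \
  --comparison-operator GreaterThanThreshold \
  --threshold <pivot-detection-level> \
  --evaluation-periods 1 \
  --alarm-actions <sns-topic-arn> \
  --metrics '[
    {"Id": "total", "Label": "non-jp-allowed",
     "Expression": "FILL(m_sg,0)+FILL(m_cn,0)+FILL(m_vn,0)+FILL(m_hk,0)"},
    {"Id": "m_sg", "ReturnData": false,
     "MetricStat": {"Stat": "Sum", "Period": 300,
       "Metric": {"Namespace": "AWS/WAFV2", "MetricName": "AllowedRequests",
         "Dimensions": [{"Name": "WebACL", "Value": "shared-cloudfront-acl"},
                        {"Name": "Rule", "Value": "ALL"},
                        {"Name": "Country", "Value": "SG"}]}}},
    {"Id": "m_cn", "ReturnData": false,
     "MetricStat": {"Stat": "Sum", "Period": 300,
       "Metric": {"Namespace": "AWS/WAFV2", "MetricName": "AllowedRequests",
         "Dimensions": [{"Name": "WebACL", "Value": "shared-cloudfront-acl"},
                        {"Name": "Rule", "Value": "ALL"},
                        {"Name": "Country", "Value": "CN"}]}}},
    {"Id": "m_vn", "ReturnData": false,
     "MetricStat": {"Stat": "Sum", "Period": 300,
       "Metric": {"Namespace": "AWS/WAFV2", "MetricName": "AllowedRequests",
         "Dimensions": [{"Name": "WebACL", "Value": "shared-cloudfront-acl"},
                        {"Name": "Rule", "Value": "ALL"},
                        {"Name": "Country", "Value": "VN"}]}}},
    {"Id": "m_hk", "ReturnData": false,
     "MetricStat": {"Stat": "Sum", "Period": 300,
       "Metric": {"Namespace": "AWS/WAFV2", "MetricName": "AllowedRequests",
         "Dimensions": [{"Name": "WebACL", "Value": "shared-cloudfront-acl"},
                        {"Name": "Rule", "Value": "ALL"},
                        {"Name": "Country", "Value": "HK"}]}}}
  ]'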

Alarm threshold and baseline

For notification-based ops, the threshold has to be meaningful. If the baseline already sits close to the threshold, new anomalies are hard to detect.

Mid-way through the US traffic investigation I hit a state where non-JP Allow was already at several tens of percent of the threshold as a baseline. Raising the threshold would cut false alarms but also hide the ongoing attack — a clear dilemma. I ended up bringing the baseline down with the IP + JA3 + UA blocks and keeping the threshold as is.

Revisiting the rate-limit threshold

Finally, I revisited the rate-limit-per-ip threshold:

  • The original threshold could false-trigger for API use (especially SPARQL clients) and corporate NAT
  • With main defense moved to Geo block / JA3 / IPset, it makes sense to relax rate-limit and treat it as a final safety net for runaway IPs

I raised the threshold 5x, to a level that humans and normal API use don’t reach but runaway clients do. SPARQL power users and corporate NAT users are no longer at risk of being falsely blocked.
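For reference, the shape of the rule, with the limit left as a placeholder in keeping with the rest of this post:

{
  "Name": "rate-limit-per-ip",
  "Priority": 100,
  "Statement": {
    "RateBasedStatement": {
      "Limit": <new-limit>,
      "AggregateKeyType": "IP"
    }
  },
  "Action": {"Block": {}}
}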

Re-tuning the CommonRuleSet override config

Together with that, I revisited the override configuration of the AWS Managed Rules CommonRuleSet. It’s an important defense layer that broadly inspects general web attacks (SQLi / XSS / LFI / RCE etc.), and it needs continuous tuning to fit the site’s specifics.

Override design

CommonRuleSet contains dozens of sub-rules, some of which are prone to false positives depending on content nature. I aggregated actual matches from the WAF logs and decided to fix only the FP-prone individual rules to Count.

Rule: AWSManagedRulesCommonRuleSet
OverrideAction: None  # ← group-level Block
Statement:
  ManagedRuleGroupStatement:
    VendorName: AWS
    Name: AWSManagedRulesCommonRuleSet
    RuleActionOverrides:
      - Name: SizeRestrictions_BODY        # POST body over 8KB
        ActionToUse: Count
      - Name: NoUserAgent_HEADER           # backend monitoring without UA, etc.
        ActionToUse: Count
      - Name: CrossSiteScripting_BODY      # legitimate <script> in content
        ActionToUse: Count
      - Name: GenericLFI_QueryArguments    # legitimate paths containing '..'
        ActionToUse: Count
      - Name: EC2MetaDataSSRF_BODY         # API handling IP-like strings
        ActionToUse: Count

RuleActionOverrides lets you keep the group on Block while pinning specific sub-rules to Count. The result:

  • UserAgent_BadBots_HEADER (declared SEO scrapers), which had many matches in WAF logs → Block (double defense alongside the UA block)
  • RestrictedExtensions_URIPATH (probes for .env.bak etc.) → Block
  • Other SQL / XSS / LFI / RCE sub-rules → Block
  • 5 high-FP-risk rules → Count (continuous observation)

From here on, the cycle is to keep watching nonTerminatingMatchingRules for the rules pinned to Count and promote them to Block individually as evidence allows.
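A query along these lines keeps an eye on those Count matches (it reads only the first matching rule per request, which is usually enough to see the trend):

fields nonTerminatingMatchingRules.0.ruleId as counted_rule
| filter ispresent(counted_rule)
| stats count() as n by counted_rule
| sort n desc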

The next day: pivot to CN

Just as I was thinking that the IP + JA3 + UA blocks, the rate-limit revision, and the CommonRuleSet re-tuning had wrapped things up, CN traffic spiked the next day while I was watching the logs. The total volume was about 3x the SG/US peaks.

CN attack profile

It looked qualitatively different from before:

Observation          SG/US era                                         CN era
Target               /sparql + multilingual content                    Nuxt frontend at large + /sparql
Top URIs             /sparql, /snorql/, /en/*                          /favicon.ico, /_nuxt/*.js (all chunks), /sparql
IP distribution      Concentrated on consecutive /16s (AWS SG range)   No concentration even at /16 level; broadly distributed
JA3                  Single dominant JA3                               No dominant JA3 (many fingerprints)
Estimated behavior   Lightweight HTTP client (curl/Python)             Headless browser-driven (real Chrome via Puppeteer/Playwright)

The big differences are that all the Nuxt JS chunks are being downloaded on each request, and that the JA3s are diverse. This can be read as:

  • Headless browser-driven = TLS stack is browser-native
  • Many browser instances spun up = mimics real Chrome behavior
  • = JA3 block is inefficient against this

Why Geo block CN is the right answer

The candidates and my evaluation:

Option                    Evaluation
Geo block CN              ◎ Immediate stop; side effect only on legitimate CN users
/sparql-only CN block     ○ A middle path, but with the Nuxt frontend also under attack the limited scope buys little
Bot Control               ○ Could handle JA3 diversity, but the cost is high
IP / JA3-based additions  ✕ Not realistic given the breadth of distribution

The site is a database of Japanese books and cultural materials, and CN-resident legitimate users are statistically very few. By the same logic as the SG case, I judged Geo block CN to be the best fit.

Implementation

Add geo-block-cn to the WebACL at Priority 2 (keeping the lineup with SG):

Rule: geo-block-cn
Priority: 2
Action: Block
Statement:
  GeoMatchStatement:
    CountryCodes: [CN]

After propagation, CN traffic was largely blocked (over 99% BLOCK, near-zero ALLOW). Total traffic dropped to less than a sixth of its peak, and the attack subsided much as in the SG case.

Observations from the pivot response

  • The same playbook gets repeated from a different country (SG → US → CN pivot)
  • JA3 fingerprint block is effective when the attack tool is fixed; less effective against headless real browsers
  • Geo block has high generality — it can blanket-block regardless of tool differences
  • The “Geo block when side effects are small” threshold judgment landed on the same conclusion in both SG and CN cases

Final configuration

Web ACL shared-cloudfront-acl

Pri  Rule                                         Action
1    geo-block-sg                                 Block
2    geo-block-cn                                 Block
5    block-aggressive-scrapers (individual IPs)   Block
6    block-attacker-ja3                           Block
7    block-bot-uas (RegexPatternSet)              Block
10   AWSManagedRulesCommonRuleSet                 Block (some FP-prone sub-rules pinned to Count via Override)
20   AWSManagedRulesKnownBadInputsRuleSet         Block
30   AWSManagedRulesAmazonIpReputationList        Block
100  rate-limit-per-ip                            Block

Monthly cost

AWS WAFv2 pricing (public):

  • Web ACL: $5.00 / month (base, independent of rule count)
  • Rule / rule group: $1.00 / month (per rule)
  • Request processing: $0.60 / 1M requests

On top of that, WAF log ingestion ($0.50/GB) and CloudWatch alarms / SNS are added.

The base of this configuration is in the low-double-digit dollars per month, and request processing scales with volume. I considered Bot Control but chose to defer: Common Bot Control ($10/month + $1/1M reqs) is mostly header/UA-based, with limited coverage of JA3 diversity, and stepping into JA3 diversity requires Targeted Bot Control ($10/1M reqs) which is an order of magnitude more expensive. Currently declared bots are caught by UA and unknown bots by JA3, so the trade-off didn’t favor Bot Control. Whether it becomes necessary in the medium-to-long term is something I’ll keep watching.
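For concreteness, with the final lineup above (9 rules), the fixed portion works out roughly as:

Web ACL base:       $5.00
9 rules × $1.00:    $9.00
Fixed total:       $14.00 / month  (+ $0.60 per 1M requests, + log and alarm costs)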

Lessons

1. GA4 is the first window into “unnaturalness”

Looking at the WAF dashboard alone makes it easy to conclude that “this is just how the traffic is.” Periodically checking GA4’s PV / Session / User ratios catches bot contamination in many cases; PV / Session ≒ 1.0 was the most recognizable signal here.

2. Log operations are the key to attack analysis

Run CloudFront standard access logs (S3) and WAF logs (CloudWatch Logs) at a granularity and retention that lets CloudWatch Logs Insights aggregate the past 60 minutes / 24 hours instantly. Without multi-axis aggregation across country / JA3 / URI / IP, identifying the attack and detecting a pivot tend to lag.

3. IP-range blocks are weak against distributed attacks

It worked momentarily here because the attack was concentrated on consecutive /16s, but you can’t keep up with attacks that go through residential proxies. IP-based approaches are best used:

  • For emergency stop-the-bleeding (immediacy)
  • With permanent defense relegated to Geo block / JA3 / UA / Bot Control

That separation worked well in practice. Manually updating an IP list every week would have been hard to sustain.

4. JA3 fingerprint is a more stable signature than IP

JA3 (and JA4) is a fingerprint of the TLS Client Hello. As long as the attacker doesn’t change their tooling, the JA3 stays the same regardless of IP/UA. It works as a relatively stable block signal, and the fact that WAF records JA3/JA4 by default is quietly convenient.

5. Geo block is justified by “small side effects”

Geo block is powerful but its side effect (collateral damage to legitimate users) is also large, so it has to be judged in context against the site’s nature and geographic user distribution. In this case, non-Japan access against Japanese-language content was almost entirely bot traffic, so the side effects were within an acceptable range.

6. Alarm thresholds get their meaning from the baseline

“Respond when it rings” only works if the baseline sits well below the threshold. If attack traffic has already blended into the baseline, it’s more practical to first lower the baseline with blocks, and then set a threshold. Going the other way (raising the threshold) tends to increase the risk of missed detection.

7. SPARQL endpoints aren’t designed with perimeter defense in mind

A public SPARQL endpoint is built for export-style use and isn’t designed to be defended behind a WAF. Based on this experience, when standing up new ones the following would be a reasonable minimum set:

  • Application-layer rate-limit on /sparql (e.g. nginx limit_req_zone; a minimal sketch follows this list)
  • Server-side limits on query complexity / time / result-size
  • A separate, authenticated bulk-download channel
  • robots.txt + WAF UA block configured from day one
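
A minimal nginx sketch of the first item, assuming the endpoint sits behind nginx; the zone name, rate, and upstream name are illustrative:

# http context: one shared-memory zone keyed by client IP
limit_req_zone $binary_remote_addr zone=sparql_zone:10m rate=2r/s;

server {
    location /sparql {
        # allow short bursts, then reject with 429 instead of queueing
        limit_req zone=sparql_zone burst=5 nodelay;
        limit_req_status 429;
        proxy_pass http://backend;
    }
}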

Next steps

  • Additional alarms for AllowedRequests spikes from non-JP regions (DE / FR / IN, etc.)
  • Attaching the Web ACL to remaining distributions where WAF isn’t applied yet

Attack and defense have an arms-race quality to them, but with the cycle of observation (logs) → hypothesis (aggregation) → verification (block) → monitoring (alarms) running as a framework, subsequent waves can be caught and addressed earlier. Currently the same framework keeps the trailing traffic under continuous observation.