How to Check If Bots Can Crawl Your Website
The most reliable way to test bot accessibility is to simulate requests using each crawler's real User-Agent string. This guide walks you through checking all 5 blocking layers — robots.txt rules, HTTP status codes, X-Robots-Tag headers, meta robots tags, and WAF challenge detection — using curl and free online tools.
Websites accidentally block search engine crawlers more often than most developers realize. A single misconfigured Cloudflare WAF rule can prevent Googlebot, BingBot, and every AI crawler from indexing your site. The result: zero organic traffic, no AI citations, and broken social media previews. This guide shows you exactly how bot blocking works, what each layer checks, and how to fix every common issue.
Why Bot Accessibility Matters for SEO and AI Visibility
Search engines can only rank pages they can access. If Googlebot receives a 403 Forbidden or a Cloudflare JavaScript challenge instead of your page content, that URL will never appear in search results.
- Search traffic depends on crawlability — Google, Bing, Yandex, and other search engines index only pages their bots can successfully fetch
- AI citations require bot access — GPTBot, ClaudeBot, and PerplexityBot must crawl your content before AI search engines can cite it
- Social sharing needs bot access — Twitterbot, Facebookbot, and LinkedInBot fetch Open Graph meta tags to generate link preview cards
- Silent failures go undetected — A site can be blocked for months without any visible error or warning from search engines
- Revenue loss is immediate — Zero crawlability means zero organic traffic, zero AI mentions, and zero social previews
According to Bing's Webmaster Guidelines (2026), content that cannot be reliably rendered may not be indexed or selected for grounding results. This applies equally to Bing Search and Microsoft Copilot.
The 5 Blocking Layers You Need to Check
When diagnosing bot access issues, test each of these 5 independent layers. A bot is effectively "blocked" if ANY layer blocks it.
Layer 1: robots.txt Rules
The robots.txt file is the first thing a crawler checks before accessing any URL. Fetch yours and look for bot-specific rules:
curl https://your-site.com/robots.txt
A common mistake: allowing all bots with User-agent: * / Allow: / while simultaneously blocking specific bots with their own directives:
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
This configuration allows search engines but blocks AI crawlers. Whether that's intentional depends on your content strategy.
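To see this rule interaction at a glance, here is a minimal sketch that reports which user-agents a robots.txt fully disallows. It assumes a simplified file with one User-agent line per group (a real parser must also handle grouped agents and Allow overrides); the sample ROBOTS content mirrors the example above.

```shell
# Sketch: list user-agents whose group contains a full-site "Disallow: /".
# Assumes one User-agent per group; real robots.txt parsing is more involved.
ROBOTS='User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /'

blocked_agents() {
  printf '%s\n' "$1" | awk '
    tolower($1) == "user-agent:" { agent = $2 }
    tolower($1) == "disallow:" && $2 == "/" { print agent }
  '
}

blocked_agents "$ROBOTS"
# Prints: GPTBot and CCBot — the AI crawlers blocked by this file
```

In practice, feed it your live file: `blocked_agents "$(curl -s https://your-site.com/robots.txt)"`.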
Layer 2: HTTP Response Codes
Send an actual HTTP request using each bot's real User-Agent header string. This reveals whether your server, CDN, or WAF returns different responses to different bots:
# Test as Googlebot
curl -A "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" \
  -I https://your-site.com

# Test as BingBot
curl -A "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)" \
  -I https://your-site.com

# Test as GPTBot
curl -A "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)" \
  -I https://your-site.com
| Status Code | Meaning | Impact |
|---|---|---|
| 200 | OK | Bot can access the page |
| 301/302 | Redirect | Followed automatically, usually fine |
| 403 | Forbidden | Bot is blocked by WAF or firewall |
| 405 | Method Not Allowed | Server rejects the request method |
| 429 | Too Many Requests | Bot is being rate-limited |
| 503 | Service Unavailable | May indicate a Cloudflare JS challenge |
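The table above can be wrapped into a small helper that turns a status code into a verdict, so audit scripts print something readable instead of bare numbers. This is a sketch; the commented curl line shows how it would combine with a real request against a hypothetical URL.

```shell
# Sketch: map an HTTP status code to a crawlability verdict (per the table).
verdict() {
  case "$1" in
    200)     echo "OK: bot can access the page" ;;
    301|302) echo "Redirect: followed automatically, usually fine" ;;
    403)     echo "Blocked: WAF or firewall is rejecting the bot" ;;
    405)     echo "Blocked: server rejects the request method" ;;
    429)     echo "Rate-limited: too many requests" ;;
    503)     echo "Unavailable: possible Cloudflare JS challenge" ;;
    *)       echo "Unexpected status $1: inspect manually" ;;
  esac
}

# Combine with a live request (hypothetical URL):
# CODE=$(curl -s -o /dev/null -w "%{http_code}" -A "$GOOGLEBOT" "https://your-site.com")
# verdict "$CODE"
verdict 403
```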
Layer 3: X-Robots-Tag HTTP Headers
Some servers send indexing directives via HTTP response headers instead of HTML meta tags. Check for noindex and nofollow in the X-Robots-Tag header:
curl -I https://your-site.com | grep -i "x-robots-tag"
Example header that blocks indexing:

X-Robots-Tag: noindex, nofollow
This is particularly dangerous because it's invisible in the HTML source. You'd only find it by inspecting HTTP response headers.
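A small sketch of that header check, written so it can run against captured headers. The HEADERS variable here is a sample response; in practice you would pipe in `curl -sI "$URL"` instead.

```shell
# Sketch: detect a blocking X-Robots-Tag in captured response headers.
# HEADERS is a sample; in practice use: curl -sI "$URL"
HEADERS='HTTP/2 200
content-type: text/html
x-robots-tag: noindex, nofollow'

has_noindex_header() {
  printf '%s\n' "$1" | grep -i '^x-robots-tag:' | grep -qi 'noindex'
}

if has_noindex_header "$HEADERS"; then
  echo "Indexing blocked via X-Robots-Tag header"
fi
```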
Layer 4: Meta Robots Tags
Check the HTML for <meta name="robots" content="..."> tags. A noindex directive here prevents the page from appearing in search results even if the bot can crawl it:
curl -s https://your-site.com | grep -i "noindex"
Common mistake: setting noindex during development and forgetting to remove it before going to production.
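A slightly more targeted check than a bare grep for "noindex" (which can false-positive on page copy) is to look only inside meta robots tags. This sketch runs on a sample HTML string; swap in `curl -s "$URL"` for a live page.

```shell
# Sketch: detect a meta robots noindex in fetched HTML.
# HTML is a sample; in practice use: curl -s "$URL"
HTML='<html><head>
<meta name="robots" content="noindex, nofollow">
</head><body>Hello</body></html>'

has_meta_noindex() {
  printf '%s\n' "$1" | grep -i '<meta[^>]*name="robots"' | grep -qi 'noindex'
}

has_meta_noindex "$HTML" && echo "Page carries a meta robots noindex"
```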
Layer 5: WAF Challenge Detection
Cloudflare and other WAFs can serve JavaScript challenges, CAPTCHAs, and interstitial pages that bots cannot solve. Look for these patterns in the response body:
curl -s -A "Mozilla/5.0 (compatible; Googlebot/2.1)" https://your-site.com | grep -i "challenge\|captcha\|just a moment"
Cloudflare challenge pages typically contain challenge-platform script references and a "Just a moment..." title.
Step-by-Step: Running a Full Bot Crawl Audit
Step 1: Prepare Your Bot User-Agent List
Here are the most important User-Agent strings to test:
# Search Engines
GOOGLEBOT="Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BINGBOT="Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"
YANDEXBOT="Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)"

# AI Crawlers
GPTBOT="Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)"
CLAUDEBOT="ClaudeBot/1.0; +https://www.anthropic.com/claude-bot"
PERPLEXITYBOT="PerplexityBot/1.0"

# Social Media
TWITTERBOT="Twitterbot/1.0"
FACEBOOKBOT="facebookexternalhit/1.1"
Step 2: Test Each Bot Against Your URL
Run a HEAD request (-I) for each bot and note the status code:
URL="https://your-site.com"
for BOT in "$GOOGLEBOT" "$BINGBOT" "$GPTBOT" "$CLAUDEBOT" "$TWITTERBOT"; do
echo "--- $BOT ---"
curl -s -o /dev/null -w "%{http_code}" -A "$BOT" -I "$URL"
echo ""
done

Step 3: Check for X-Robots-Tag and Meta noindex
For any bot that returns HTTP 200, verify the response doesn't contain noindex directives:
# Check headers
curl -s -A "$GOOGLEBOT" -I "$URL" | grep -i "x-robots"

# Check HTML meta tags
curl -s -A "$GOOGLEBOT" "$URL" | grep -i "noindex"
Step 4: Verify robots.txt
curl -s "$URL/robots.txt"
Look for Disallow: / under any bot-specific User-agent sections.
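To isolate the section that applies to one bot, a short awk sketch can print just that bot's group. As with the earlier robots.txt example, this assumes a simplified file with one User-agent line per group; the ROBOTS sample is illustrative.

```shell
# Sketch: print only the directives in a named bot's User-agent group.
# Simplified: assumes one User-agent line per group.
ROBOTS='User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /'

rules_for() {
  printf '%s\n' "$2" | awk -v bot="$1" '
    tolower($1) == "user-agent:" { show = ($2 == bot) }
    show { print }
  '
}

rules_for "GPTBot" "$ROBOTS"
```

Run it against your live file with `rules_for "GPTBot" "$(curl -s "$URL/robots.txt")"`.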
Step 5: Review Search Console and Webmaster Tools
- Google Search Console: Coverage > Excluded for "Blocked by robots.txt" or "Noindex"
- Bing Webmaster Tools: URL Inspection for "Discovered but not crawled" or "Blocked"
All Important Bots to Test
Search Engine Crawlers
| Bot | Operator | Purpose |
|---|---|---|
| Googlebot | Google | Main Google Search indexing |
| BingBot | Microsoft | Bing Search + Copilot indexing |
| YandexBot | Yandex | Russian search engine |
| Baiduspider | Baidu | Chinese search engine |
| DuckDuckBot | DuckDuckGo | Privacy-focused search |
| Applebot | Apple | Siri and Spotlight suggestions |
| SeznamBot | Seznam | Czech search engine |
| Yeti | Naver | Korean search engine |
AI Crawlers
| Bot | Operator | Purpose |
|---|---|---|
| GPTBot | OpenAI | AI model training |
| ChatGPT-User | OpenAI | Real-time web browsing in ChatGPT |
| ClaudeBot | Anthropic | Claude AI content access |
| PerplexityBot | Perplexity | AI-powered search answers |
| Google-Extended | Google | Gemini AI training |
| CCBot | Common Crawl | Open web crawl dataset for AI |
| cohere-ai | Cohere | Enterprise AI models |
Social Media Bots
| Bot | Operator | Purpose |
|---|---|---|
| Twitterbot | X (Twitter) | Link preview cards |
| Facebookbot | Meta | Link preview cards |
| LinkedInBot | LinkedIn | Link preview cards |
How to Fix Common Bot Blocking Issues
Fix: Blocked by robots.txt
Edit your robots.txt file to add explicit Allow rules for the blocked bot:
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /
Use the FindUtils Robots.txt Generator to build a properly formatted robots.txt file.
Fix: HTTP 403 (WAF/Firewall Blocking)
If your site uses Cloudflare:
- Go to Security > Bots and check if Bot Fight Mode is enabled
- If enabled, create a WAF custom rule with expression (cf.client.bot) and action Skip to allow verified bots through
- Check Security > WAF for any custom rules that block or challenge crawler user agents
Fix: Cloudflare JavaScript Challenge
Bots cannot solve JavaScript challenges. Lower your Security Level from "High" or "I'm Under Attack" to "Medium" or create a firewall rule that skips challenges for known bot user agents.
Fix: Meta noindex Tag
Search your HTML templates for <meta name="robots" content="noindex"> and remove or conditionally render it. This tag is sometimes added during development and forgotten in production.
Fix: X-Robots-Tag Header
Check your server configuration (nginx, Apache) or CDN settings for X-Robots-Tag headers. Use the FindUtils Security Headers Analyzer to inspect all HTTP response headers, or run curl -I against your URLs.
Manual Testing vs Automated Tools
| Feature | Manual curl Testing | Automated Crawl Checkers |
|---|---|---|
| Price | Free | Free to paid |
| Bots tested | One at a time | Multiple simultaneously |
| robots.txt parsing | Manual interpretation | Automatic per-bot analysis |
| WAF detection | Must read response body | Automatic challenge detection |
| Meta robots check | Must grep HTML source | Automatic HTML parsing |
| Report format | Raw terminal output | Structured report |
| Time required | 30+ minutes | 2-5 minutes |
| Technical skill | Requires CLI knowledge | Varies by tool |
Real-World Scenarios
Scenario 1: Cloudflare WAF Blocking All Crawlers
A developer launches a new site on Cloudflare and enables a custom WAF rule that challenges all non-browser traffic. Three months later, they notice zero organic search traffic. Testing with curl -A "Googlebot" reveals all bots receive 403 or Cloudflare challenges. Fix: add a WAF exception for verified bots using (cf.client.bot).
Scenario 2: Accidentally Blocking AI Crawlers
A company adds Disallow: / rules for GPTBot, ClaudeBot, and PerplexityBot to its robots.txt after reading about AI copyright concerns. It doesn't realize this also prevents its content from appearing in ChatGPT web search results and Perplexity answers. Testing reveals all three AI crawlers blocked by robots.txt while search engine bots pass fine.
Scenario 3: Meta noindex Left from Development
A staging site configuration includes <meta name="robots" content="noindex, nofollow"> which gets deployed to production. Google Search Console shows "noindex" warnings, but the developer doesn't check for 6 months. A quick curl -s | grep noindex catches this instantly.
Related Tools
- Robots.txt Generator — Create properly formatted robots.txt files
- Security Headers Analyzer — Inspect HTTP response headers including X-Robots-Tag
- DNS Security Scanner — Check DNS configuration and security records
- SSL Certificate Checker — Verify SSL/TLS certificate validity
FAQ
Q1: What is a bot crawl checker?
A: A bot crawl checker tests whether well-known web crawlers (like Googlebot, BingBot, GPTBot) can access your website. It simulates requests using each bot's real User-Agent header and checks multiple blocking layers including robots.txt, HTTP status codes, meta tags, response headers, and firewall challenges. You can do this manually with curl or use an automated tool.
Q2: Why is my site blocked by Googlebot?
A: Common causes include a restrictive robots.txt file, Cloudflare Bot Fight Mode or aggressive WAF rules, a meta robots tag set to "noindex", X-Robots-Tag HTTP headers, or the server returning 403/429/503 errors to bot user agents.
Q3: Should I block AI crawlers like GPTBot?
A: It depends on your goals. Allowing AI crawlers means your content can appear in AI-powered search experiences (ChatGPT, Claude, Perplexity), driving traffic and citations. Many sites selectively allow AI search bots (ChatGPT-User, PerplexityBot) while blocking training-only bots (GPTBot, CCBot).
Q4: How often should I check bot accessibility?
A: Check after any infrastructure change: updating CDN settings, modifying robots.txt, deploying new server configurations, or changing WAF rules. Also check quarterly as a routine audit. Bot blocking can happen silently without any visible symptoms.
Q5: Can Cloudflare Bot Fight Mode block legitimate crawlers?
A: Yes. Cloudflare Bot Fight Mode challenges traffic it identifies as automated, which can include legitimate search engine crawlers. If enabled, create a WAF custom rule with expression (cf.client.bot) and action Skip to allow Cloudflare's verified bot list through without challenges.
Next Steps
- Generate a robots.txt — Use the Robots.txt Generator to create properly formatted directives
- Audit your security headers — Run the Security Headers Analyzer to check for X-Robots-Tag and other header issues
- Check your DNS security — Use the DNS Security Scanner to verify your DNS configuration
- Verify your SSL certificate — Run the SSL Certificate Checker to ensure HTTPS is properly configured