Website Crawlability AI Audit

Producing high-quality, well-structured content is only useful for GEO if AI crawlers can actually reach and render that content. A single misplaced robots.txt directive or a JavaScript-heavy rendering stack can silently exclude your pages from the training and retrieval pipelines of AI engines, and a missing llms.txt file leaves those engines with less guidance about what your site contains. The tool above audits a domain you provide and checks whether the main AI crawlers, including GPTBot, OAI-SearchBot, PerplexityBot, Google-Extended, and ClaudeBot, can access your pages and process them correctly.

What the audit checks

The tool above evaluates four main categories of crawlability:

Robots.txt directives: the audit reads your robots.txt file and identifies which AI crawler user-agents are explicitly blocked, accidentally blocked by wildcard rules, or missing from any allow list. It also checks whether the file itself is accessible, properly formatted, and within the 500 KiB size limit that some crawlers, including Google's, enforce.
Meta robots and X-Robots-Tag headers: a robots.txt that allows crawling is insufficient if individual pages carry a noindex or noarchive meta tag, or if server response headers instruct bots to skip the page. The audit inspects both sources.
JavaScript rendering dependency: pages that deliver critical content exclusively through JavaScript are invisible to crawlers that do not execute scripts. The audit detects whether the main content on your pages is available in the raw HTML or only after client-side rendering.
Sitemaps and llms.txt: a well-maintained sitemap.xml helps AI crawlers discover pages efficiently. The newer llms.txt proposal, which sits at the site root like robots.txt but is written in Markdown for LLMs, lets you summarise your content and point to the pages most suitable for AI consumption. The audit checks whether both files exist and are properly formatted.
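The page-level checks in the list above (meta robots tags, X-Robots-Tag headers, and content present in the raw HTML) can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the names PageScan and audit_page are hypothetical, the 200-word threshold is an assumed heuristic, and the header check ignores user-agent-scoped values such as "googlebot: noindex".

```python
from html.parser import HTMLParser

# "none" is shorthand for "noindex, nofollow".
BLOCKING = {"noindex", "noarchive", "none"}


class PageScan(HTMLParser):
    """Collect robots meta directives and text found outside <script>/<style>."""

    def __init__(self):
        super().__init__()
        self.directives = set()
        self.words = []
        self._skip = 0  # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives |= {d.strip().lower() for d in content.split(",") if d.strip()}
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.words.extend(data.split())


def audit_page(raw_html, headers, min_words=200):
    """Report blocking directives and whether the raw HTML looks JavaScript-dependent."""
    scan = PageScan()
    scan.feed(raw_html)
    scan.close()
    header_value = next((v for k, v in headers.items() if k.lower() == "x-robots-tag"), "")
    header_directives = {d.strip().lower() for d in header_value.split(",") if d.strip()}
    found = scan.directives | header_directives
    return {
        "blocking_directives": sorted(found & BLOCKING),
        "raw_word_count": len(scan.words),
        # Assumed heuristic: very little text before scripts run suggests client-side rendering.
        "js_dependent": len(scan.words) < min_words,
    }
```

Pass in the unrendered response body and the response headers as a dict. A page that is an empty application shell with a script bundle will report a very low raw_word_count, which is the signal that crawlers not executing JavaScript see almost nothing.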

How to interpret and act on the results

The tool above flags each issue with a severity level. Here is how to prioritise your remediation:

Blocked AI crawlers in robots.txt: remove or narrow the directive that blocks the relevant user-agent. If you intentionally block all AI crawlers for licensing reasons, confirm this is a deliberate policy decision rather than an accidental wildcard block inherited from a CMS template.
Noindex on key pages: review each flagged page. If a page contains valuable content you want cited, remove the noindex directive. If the page is meant to stay excluded, confirm that the directive reflects a current decision and is not a staging-environment setting left in place after launch.
JavaScript-only content: implement server-side rendering (SSR) or static site generation (SSG) for content you want AI crawlers to index. At minimum, ensure that page titles, headings, and the first 200 words of body text are available in the server-rendered HTML before JavaScript executes.
Missing or outdated sitemap: generate a fresh sitemap.xml that includes all canonical URLs, excludes redirected or noindex pages, and is referenced in robots.txt. Update it automatically whenever new content is published.
No llms.txt file: create an llms.txt file at the root of your domain. At minimum, include a brief description of your site, the primary topics covered, and links to your most important pages. This is a low-effort signal that may improve how AI systems categorise your site.
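As an illustration of the sitemap step above, the following Python sketch builds a sitemap.xml that keeps only pages returning HTTP 200 without a noindex directive. The Page record and build_sitemap function are hypothetical names; in practice the page list would come from your CMS or a crawl of your own site.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Iterable, Optional

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


@dataclass
class Page:
    """Hypothetical record describing one canonical URL."""
    url: str
    status: int = 200
    noindex: bool = False
    lastmod: Optional[str] = None  # date in YYYY-MM-DD form


def build_sitemap(pages: Iterable[Page]) -> str:
    """Return sitemap XML containing only indexable, non-redirected pages."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        # Skip redirects, errors, and pages marked noindex.
        if page.status != 200 or page.noindex:
            continue
        node = ET.SubElement(urlset, "url")
        ET.SubElement(node, "loc").text = page.url
        if page.lastmod:
            ET.SubElement(node, "lastmod").text = page.lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body
```

To reference the result from robots.txt, add a line of the form "Sitemap: https://example.com/sitemap.xml" (with your own domain). Running the generator from the publishing pipeline is one way to keep the file current.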

A benchmark on AI crawl access

AI Overviews now appear on approximately 31% of Google queries, and position-1 pages behind an AI Overview lose up to 58% of expected clicks (Ahrefs, 2025). The pages that capture that displaced traffic are those cited inside the AI answer. Crawlability is the prerequisite: if an AI bot cannot access your content, no amount of on-page optimisation will earn you a citation. Fixing crawl barriers is therefore the highest-leverage starting point for any GEO strategy.

For ongoing monitoring of your AI crawlability and citation performance across all major AI engines, Sorank tracks your GEO visibility and alerts you when access changes.

Frequently asked questions

Which AI crawler user-agents should I allow in robots.txt?

The main AI crawler user-agents to be aware of are: GPTBot (OpenAI training), OAI-SearchBot (ChatGPT search retrieval), PerplexityBot (Perplexity), Google-Extended (Google's robots.txt token controlling use of content for Gemini), ClaudeBot (Anthropic), and Meta-ExternalAgent (Meta AI). If you have no specific licensing reason to block them, allowing all of them maximises your potential AI visibility.

What is llms.txt and is it required?

llms.txt is an emerging convention, placed at the site root like robots.txt, that provides a Markdown summary of a site's content and structure specifically for LLMs. It is not a required standard, and AI systems are not obliged to read it, but it is a low-cost signal that can help them understand your site's purpose and identify your most important pages. Creating one is recommended for any site serious about GEO.
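A minimal sketch of generating such a file in Python follows. It assumes the layout described in the llms.txt proposal (a title heading, a one-line blockquote summary, then sections of Markdown links); build_llms_txt and the example site details are hypothetical.

```python
def build_llms_txt(name, summary, sections):
    """Render llms.txt content.

    sections maps a section heading to a list of (title, url, note) tuples;
    note may be an empty string.
    """
    lines = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            entry = f"- [{title}]({url})"
            if note:
                entry += f": {note}"
            lines.append(entry)
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    print(build_llms_txt(
        "Example Co",
        "Guides and tools for auditing AI crawlability.",
        {"Guides": [("Crawlability audit", "https://example.com/audit", "What the audit checks")]},
    ))
```

The output would be saved as llms.txt at the root of the domain, alongside robots.txt.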

Does blocking Googlebot also block Google's AI crawlers?

No. Google-Extended, which governs whether content Google crawls may be used for Gemini, is a separate robots.txt token from Googlebot. It does not send its own HTTP user-agent string; it only acts as a control in robots.txt. You can block Google-Extended without affecting your standard Google Search indexing, and rules written for Googlebot do not apply to Google-Extended. Always specify user-agents explicitly in robots.txt rather than relying on wildcard rules that may unintentionally catch multiple crawlers.
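To see how explicit groups and wildcard rules interact, a robots.txt file can be tested locally with Python's standard urllib.robotparser. The sketch below is illustrative only: audit_robots and named_agents are hypothetical helpers, the agent list is taken from this article, and the standard-library parser may not interpret every rule exactly as each crawler does.

```python
from urllib.robotparser import RobotFileParser

AI_AGENTS = ["GPTBot", "OAI-SearchBot", "PerplexityBot",
             "Google-Extended", "ClaudeBot", "Meta-ExternalAgent"]


def named_agents(robots_txt):
    """Return the lower-cased user-agent tokens named in the file."""
    names = set()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        key, _, value = line.partition(":")
        if key.strip().lower() == "user-agent":
            names.add(value.strip().lower())
    return names


def audit_robots(robots_txt, url, agents=AI_AGENTS):
    """Classify each agent as allowed, blocked explicitly, or blocked by wildcard."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    explicit = named_agents(robots_txt)
    report = {}
    for agent in agents:
        if parser.can_fetch(agent, url):
            report[agent] = "allowed"
        elif agent.lower() in explicit:
            report[agent] = "blocked explicitly"
        else:
            report[agent] = "blocked by wildcard"
    return report


if __name__ == "__main__":
    sample = "User-agent: Google-Extended\nDisallow: /\n\nUser-agent: *\nDisallow: /private/\n"
    for agent, verdict in audit_robots(
        sample, "https://example.com/blog/", ["Googlebot", "Google-Extended", "GPTBot"]
    ).items():
        print(agent, "->", verdict)
```

With the sample file, Google-Extended is reported as blocked explicitly while Googlebot and GPTBot remain allowed for the blog URL, because only the wildcard group applies to them and it restricts /private/ alone.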