Robots.txt: control crawling by Googlebot, block directories, crawl-rate directives, and SEO best practices.

Robots.txt is a simple but powerful file that controls how search engines crawl your website. It sits at the root of your site (`https://example.com/robots.txt`) and contains rules specifying which pages and directories Google, Bing, and other bots can crawl. Robots.txt helps you manage crawl budget (roughly, how many URLs Google can and wants to crawl on your site in a given period) and avoid wasting crawls on unimportant pages. It does not reliably keep content out of search results: a blocked URL can still be indexed without its content if other pages link to it.
Most websites have a robots.txt file, but many are misconfigured. A misconfigured robots.txt can accidentally block important pages, wasting ranking potential. A well-configured robots.txt improves crawl efficiency and keeps bots away from areas that are not meant for search. This guide covers robots.txt syntax, best practices, and real-world examples.
Robots.txt is a standardized text file that communicates crawl instructions to search engine bots. When a bot first visits your site, it requests `/robots.txt` before crawling anything else. The robot reads the rules and follows them (assuming the bot is well-behaved).
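Because the file always lives at the root of the host, a crawler can derive its location from any page URL. A minimal Python sketch of that step, where the helper name `robots_txt_url` is an illustrative assumption rather than part of any library:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs the given page URL."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); replace path, query, fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/blog/post?id=7"))
```

Note that each scheme, host, and port combination gets its own file, so a subdomain is not covered by the main domain's robots.txt.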
The robots.txt format was created in 1994 and was formalized as the Robots Exclusion Protocol in RFC 9309 in 2022. Google's robots.txt documentation is the authoritative reference for how Googlebot interprets the file. All major search engines (Google, Bing, Baidu) respect robots.txt.
Important: Robots.txt is a guideline, not a firewall. Well-behaved bots (Google, Bing) respect robots.txt rules. Malicious bots and scrapers ignore robots.txt. Use robots.txt to manage search engine crawling, not to block hackers or scrapers. For security, use server-level tools.
Robots.txt uses simple text syntax. Rules are organized into groups. Each group has two parts: a User-agent line (which bot the group applies to) and one or more Disallow paths (which pages to block).
Basic example:
User-agent: *
Disallow: /admin/
Disallow: /staging/
Sitemap: https://example.com/sitemap.xml
This tells all bots (`*` means all) to not crawl the `/admin/` and `/staging/` directories. The Sitemap line tells bots where your sitemap is located.
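To see how a well-behaved client might interpret the basic example, here is a small sketch using Python's standard `urllib.robotparser`. The sample paths are made up for illustration, and the standard-library parser is only an approximation of how Googlebot evaluates rules.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /staging/
Sitemap: https://example.com/sitemap.xml
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
# Hypothetical paths, chosen only to exercise the rules above.
for path in ("/admin/settings", "/staging/home", "/blog/first-post"):
    allowed = rules.can_fetch("*", "https://example.com" + path)
    print(path, "allowed" if allowed else "blocked")
```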
`User-agent: *` means all bots. You can also specify individual bots:
`User-agent: Googlebot` applies only to Google's bot. `User-agent: Bingbot` applies only to Bing's bot. You can have multiple User-agent sections with different rules.
`Disallow: /path/` tells bots not to crawl any URL whose path starts with `/path/`. `Disallow: /` blocks the entire site. An empty `Disallow:` allows everything. You can list multiple Disallow rules per User-agent.
`Allow: /path/` allows crawling a specific path even if a parent directory is disallowed. Example: `Disallow: /temp/` combined with `Allow: /temp/important/` blocks everything under `/temp/` except the `/temp/important/` subdirectory. When both rules match a URL, the more specific (longer) rule wins.
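The Allow/Disallow interaction can be sketched as a longest-match lookup, which is how RFC 9309 and Google describe rule precedence. The function below is a simplified illustration that assumes plain path prefixes without wildcards; the name `is_allowed` and the rule list are hypothetical. Python's built-in `urllib.robotparser` applies rules in file order instead, so it can disagree with this logic.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Decide whether `path` may be crawled.

    `rules` holds (directive, path_prefix) pairs from one User-agent group.
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty values and non-matching prefixes are ignored
        is_allow = directive.lower() == "allow"
        # The longest matching prefix wins; on a tie, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


GROUP = [("Disallow", "/temp/"), ("Allow", "/temp/important/")]

print(is_allowed("/temp/important/report.html", GROUP))
print(is_allowed("/temp/scratch.html", GROUP))
```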
Pattern 1: Block admin pages
User-agent: *
Disallow: /admin/
Disallow: /user/
Disallow: /account/
This blocks administrative, user, and account pages from crawling. These pages are typically not meant for search engines.
Pattern 2: Block staging environment
User-agent: *
Disallow: /staging/
Disallow: /test/
This prevents bots from crawling test or staging versions of your site that live under these paths.
Pattern 3: Block specific file types
User-agent: *
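Disallow: /*.pdf$
Disallow: /*.zip$
This prevents bots from crawling URLs that end in `.pdf` or `.zip`. The `*` matches any sequence of characters, and the trailing `$` anchors the rule to the end of the URL; without it, `/*.pdf` matches any URL that contains `.pdf` anywhere. This is useful if you have many PDFs that bots should not spend crawl budget on.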
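The `*` and `$` wildcards described in RFC 9309 can be approximated by translating a rule path into a regular expression. The sketch below is one possible translation, with hypothetical helper names; it shows why a rule ending in `$` is narrower than the same rule without it.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start, mirroring robots.txt prefix matching.
    return robots_pattern_to_regex(pattern).match(path) is not None


for rule in ("/*.pdf", "/*.pdf$"):
    for path in ("/files/report.pdf", "/files/report.pdf?download=1"):
        print(rule, path, matches(rule, path))
```

Only the unanchored rule matches the URL with the query string, because the anchored rule requires the URL to end right after `.pdf`.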
Pattern 4: Slow bots that hammer your server
User-agent: AhrefsBot
Disallow: /
User-agent: SemrushBot
Crawl-delay: 10
This completely blocks AhrefsBot (if you do not want your site crawled by SEO tools) and asks SemrushBot to wait 10 seconds between requests. Crawl-delay is useful for aggressive bots that overload your server, but it is not part of the core standard and support varies by crawler. Googlebot ignores it.
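The per-bot behavior in Pattern 4 can be checked with Python's standard `urllib.robotparser`, which exposes both the allow/block decision and any `Crawl-delay` value. This is a sketch of how a compliant client could read the file, not a statement about how any particular bot is implemented; the `/pricing` URL is a made-up example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in ("AhrefsBot", "SemrushBot", "Googlebot"):
    allowed = parser.can_fetch(agent, "https://example.com/pricing")
    delay = parser.crawl_delay(agent)  # None when no delay is declared
    print(f"{agent}: allowed={allowed}, crawl_delay={delay}")
```

A bot with no matching group and no `*` group, such as Googlebot here, is unrestricted.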
Pattern 5: Allow all (default)
User-agent: *
Disallow:
This is the default. Empty Disallow means allow all. You can also omit robots.txt entirely if you want all content crawlable.
Robots.txt blocks crawling. Meta robots noindex blocks indexation. These serve different purposes.
Use robots.txt when: You want to save crawl budget. You have duplicate content that should not be crawled. You have admin pages that should not be touched by bots. You want to slow down aggressive bots.
Use meta robots noindex when: You want a page to be crawled but not indexed. You want to prevent indexation but still let bots follow the page's internal links. You want to eventually remove a page from search but keep it live. Note that noindex only works if the page can be crawled: if robots.txt blocks the URL, Google never sees the tag.
Example: Paginated pages like `/products?page=2` can be blocked in robots.txt to save crawl budget. But blocking them also stops Google from following the links on those pages and from reading their canonical tags. If you need either, leave the pages crawlable and use canonicals instead of robots.txt.
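One practical consequence of the crawl-versus-index distinction is that a noindex tag on a page blocked by robots.txt is never seen by the crawler. The sketch below flags that conflict. It assumes you already know which pages carry noindex (the `pages` mapping is hypothetical sample data) and uses the standard-library parser as an approximation of crawler behavior.

```python
from urllib.robotparser import RobotFileParser


def hidden_noindex_pages(robots_txt, pages, agent="Googlebot"):
    """Return URLs whose noindex tag cannot be read because crawling is blocked.

    `pages` maps URL -> True if the page carries a noindex directive.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        url
        for url, has_noindex in pages.items()
        if has_noindex and not parser.can_fetch(agent, url)
    ]


ROBOTS_TXT = "User-agent: *\nDisallow: /account/\n"
# Hypothetical audit data: URL -> does the page have a noindex tag?
PAGES = {
    "https://example.com/account/settings": True,
    "https://example.com/thank-you": True,
    "https://example.com/products/": False,
}

print(hidden_noindex_pages(ROBOTS_TXT, PAGES))
```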
Crawl budget is the number of URLs Google can and wants to crawl on your site in a given period. Large sites with millions of pages cannot have all pages crawled daily. Google allocates crawl budget based on how much crawling your server can handle and on crawl demand, which reflects your pages' popularity and change frequency. Crawl budget is finite. Wasting it on unimportant pages means important pages are crawled less frequently.
Optimize crawl budget by blocking pages that should not be crawled: duplicate content, paginated search results, user account pages, testing pages. Every page you block gives Google more budget to crawl your important content.
Common crawl budget wasters: infinite pagination (product filters create unlimited URLs), duplicate content with different parameters, session IDs appended to every URL, calendar/event pages generating endless URLs. Use robots.txt to block these patterns.
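One way to spot these patterns is to count how many distinct query-string variants of each path appear in your crawl logs. The sketch below assumes you have already extracted a list of crawled URLs; the sample list and the function name `urls_per_path` are illustrative.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def urls_per_path(crawled_urls):
    """Count distinct query-string variants seen for each path."""
    variants = defaultdict(set)
    for url in crawled_urls:
        parts = urlsplit(url)
        variants[parts.path].add(parts.query)
    # Paths with the most variants are the likeliest crawl budget wasters.
    return sorted(
        ((path, len(queries)) for path, queries in variants.items()),
        key=lambda item: item[1],
        reverse=True,
    )


# Hypothetical sample of URLs taken from server logs.
SAMPLE = [
    "https://example.com/shop?color=red",
    "https://example.com/shop?color=blue",
    "https://example.com/shop?color=red&sessionid=abc",
    "https://example.com/about",
]

for path, count in urls_per_path(SAMPLE):
    print(path, count)
```

A path that shows hundreds of variants is a candidate for a Disallow rule or for canonical tags.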
Google Search Console's Crawl Stats report shows your site's crawl statistics. Monitor crawl requests regularly. If Google crawls the same pages repeatedly without discovering new content, review your robots.txt and blocking strategy.
Include your sitemap URL in robots.txt. Add `Sitemap: https://example.com/sitemap.xml` at the end of your robots.txt file. This tells Google where to find your XML sitemap. You can list multiple sitemaps if you have multiple files.
Example:
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
Listing sitemaps in robots.txt is optional (you can submit sitemaps via Google Search Console), but it is a best practice.
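A client can read the declared sitemaps back out of the file. The following sketch uses `RobotFileParser.site_maps()` from Python's standard library (available since Python 3.8) on the example above.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns None when the file declares no sitemaps.
sitemaps = parser.site_maps() or []
for sitemap_url in sitemaps:
    print(sitemap_url)
```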
Google Search Console has a robots.txt report under Settings > Crawling > robots.txt. It shows the robots.txt files Google found for your site, when they were last fetched, and any warnings or errors. It replaced the older robots.txt tester, which Google retired in 2023. To see whether robots.txt blocks a specific URL, use the URL Inspection tool. Both are invaluable for validating your rules.
Always test before deploying robots.txt changes. A single mistake (like `Disallow: /` blocking your entire site) can tank your rankings. Verify that:
Important pages are not blocked. Admin pages are blocked. Duplicate content patterns are blocked. No critical paths are accidentally disallowed.
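The checklist above can also be automated as a pre-deployment check. The sketch below is one possible approach using Python's standard `urllib.robotparser`, which does not understand `*` or `$` wildcards and so only approximates Googlebot. The URL lists and the function name `check_rules` are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def check_rules(robots_txt, must_allow, must_block, agent="Googlebot"):
    """Return a list of problems found in a candidate robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    problems = []
    for url in must_allow:
        if not parser.can_fetch(agent, url):
            problems.append(f"blocked but should be crawlable: {url}")
    for url in must_block:
        if parser.can_fetch(agent, url):
            problems.append(f"crawlable but should be blocked: {url}")
    return problems


# Hypothetical candidate file containing the classic "Disallow: /" mistake.
CANDIDATE = "User-agent: *\nDisallow: /\n"
MUST_ALLOW = ["https://example.com/", "https://example.com/products/"]
MUST_BLOCK = ["https://example.com/admin/"]

for problem in check_rules(CANDIDATE, MUST_ALLOW, MUST_BLOCK):
    print(problem)
```

Running a check like this in a deployment pipeline catches a site-wide block before it reaches production.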
After deploying robots.txt, monitor Google Search Console's Crawl Stats report for changes. If crawl rate drops unexpectedly, you may have accidentally blocked important content.
Mistake 1: Blocking CSS and JavaScript. If you block `/css/` or `/js/` in robots.txt, Google cannot crawl your CSS and JavaScript. Without CSS, Google cannot render your pages properly. Do not block CSS or JavaScript.
Mistake 2: Blocking important content. Always test before deploying. Rules are prefix matches, so a truncated or mistyped rule like `Disallow: /p` blocks every path that starts with `/p`, including `/products/`.
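Because Disallow values are matched as path prefixes, a short rule can catch far more than intended. A tiny sketch with made-up site paths:

```python
def blocked_by(prefix, paths):
    """Return the paths that a `Disallow: <prefix>` rule would block."""
    return [path for path in paths if path.startswith(prefix)]


# Hypothetical site sections.
SITE_PATHS = ["/products/", "/pricing/", "/private/", "/blog/"]

print(blocked_by("/p", SITE_PATHS))
print(blocked_by("/private/", SITE_PATHS))
```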
Mistake 3: Using robots.txt for security. Do not rely on robots.txt to protect sensitive data. Security-sensitive pages should require authentication, not just robots.txt. Robots.txt is public and easily circumvented.
Mistake 4: Inconsistent robots.txt across domains. If you have multiple domains, maintain consistent robots.txt policies. Accidentally different rules can cause crawl efficiency problems.
Mistake 5: Blocking the sitemap itself. Never block `/sitemap.xml` in robots.txt. Google needs to crawl the sitemap to discover pages.
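Crawl-delay and Request-rate: These directives ask bots to slow down. `Crawl-delay: 10` requests a 10-second pause between requests. `Request-rate: 1/10` allows 1 request per 10 seconds. Neither is part of the core standard, and only some crawlers honor them, so use them for specific bots that overload your server. Googlebot ignores both. To slow Googlebot temporarily, Google's guidance is to return 500, 503, or 429 responses for a short period.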
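For a crawler that chooses to honor these directives, the two values can be combined into a single wait interval. The sketch below uses `crawl_delay()` and `request_rate()` from Python's standard `urllib.robotparser`. Taking the stricter of the two is a design assumption of this example, and `ExampleBot` is a made-up user-agent.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: ExampleBot
Crawl-delay: 10
Request-rate: 1/10

User-agent: *
Crawl-delay: 5
"""


def seconds_between_requests(parser, agent):
    """Return the pause a polite client should leave between requests."""
    delay = parser.crawl_delay(agent) or 0
    rate = parser.request_rate(agent)  # namedtuple (requests, seconds) or None
    per_request = rate.seconds / rate.requests if rate and rate.requests else 0
    # Assumption: when both directives are present, obey the stricter one.
    return max(delay, per_request)


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(seconds_between_requests(parser, "ExampleBot"))
print(seconds_between_requests(parser, "SomeOtherBot"))
```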
Allow directive: Allows crawling a specific path even if a parent path is disallowed. Useful for carving out exceptions. Example: `Disallow: /temp/` but `Allow: /temp/keep/` allows only the keep subdirectory.
Google's robots.txt specification documents all supported directives. Most features are rarely needed. Stick to basic User-agent, Disallow, and Sitemap for most sites.
User-agent specific rules allow different crawl rules for different bots. You can specify rules for Googlebot, Bingbot, and other user-agents separately. This is useful if you want Google to crawl your entire site but restrict Bing from accessing certain sections. Specify user-agent at the start of each rule block:
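`User-agent: Googlebot` applies rules only to Google's crawler. `User-agent: *` applies rules to all bots that have no more specific group. Rules apply to the named user-agent until the next user-agent directive. You can create multiple rule blocks for different bots. A bot follows only the most specific group that matches it, so a bot with its own group does not also inherit the `*` rules.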
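The effect of separate groups can be illustrated with Python's standard `urllib.robotparser`. In this sketch the paths `/drafts/` and `/archive/` are invented, and the parser is only an approximation of how each search engine reads the file.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: Bingbot
Disallow: /drafts/
Disallow: /archive/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in ("Googlebot", "Bingbot"):
    for path in ("/drafts/post", "/archive/2020"):
        allowed = parser.can_fetch(agent, "https://example.com" + path)
        print(agent, path, "allowed" if allowed else "blocked")
```

Here Googlebot may crawl `/archive/` while Bingbot may not, and both are kept out of `/drafts/`.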
Crawl-delay and Request-rate directives ask bots to limit how often they request pages. `Crawl-delay: 5` asks the bot to wait 5 seconds between requests, which reduces server load. `Request-rate: 1/10` asks the bot to make at most 1 request per 10 seconds. Support varies by crawler, and Googlebot does not act on either. Google's robots.txt documentation lists the directives it supports.
Sitemap location directives tell bots where to find your sitemap. `Sitemap: https://example.com/sitemap.xml` points bots to your XML sitemap. You can specify multiple sitemaps. This is recommended as it helps bots discover all your pages efficiently.
The Clean-param directive tells a crawler to ignore the listed URL parameters when deciding whether two URLs are the same page. `Clean-param: utm_source&utm_medium` asks the crawler to disregard UTM parameters, so tracked links are not treated as duplicate content. This directive is specific to Yandex. Google does not support it and handles most tracking parameters automatically, with canonical tags as the explicit signal.
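The effect of ignoring listed parameters can be pictured as URL normalization: strip the named parameters and compare what remains. The sketch below shows that idea with Python's standard library. It is an illustration of the concept, not the logic of any search engine, and `strip_params` is a hypothetical helper.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_params(url, ignored):
    """Return `url` with the ignored query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


IGNORED = {"utm_source", "utm_medium"}

tracked = "https://example.com/page?utm_source=news&id=7&utm_medium=email"
print(strip_params(tracked, IGNORED))
```

Two tracked links that normalize to the same URL would then count as one page.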
Check your robots.txt file in Google Search Console's robots.txt report, and use the URL Inspection tool to see whether a specific URL is blocked for Googlebot. This prevents accidental blocking of important pages.
Robots.txt is a simple but critical file for managing search engine crawling. A well-configured robots.txt blocks unimportant pages, saves crawl budget, and keeps bots from crawling duplicate content. A misconfigured robots.txt can accidentally block important content and tank your rankings.
Always test robots.txt changes before deploying. Use Google Search Console to validate rules. Monitor your crawl statistics monthly. Block unimportant content and manage crawl budget effectively. Use our GEO SEO audit tool to audit your robots.txt configuration and identify potential issues with crawlability and indexation across your entire site.
Blocking a page in robots.txt prevents Google from crawling it, but it does not improve rankings and does not guarantee the page stays out of the index. If you do not want a page indexed, leave it crawlable and use a noindex tag instead. Use robots.txt to save crawl budget.
Never block CSS or JavaScript in robots.txt. Google needs access to those resources to fully understand your page. Blocking them prevents Google from rendering your content correctly.
Generally, you do not need to set a crawl rate unless Google is making too many requests. Googlebot ignores the Crawl-delay directive, so if Google is exhausting your bandwidth, temporarily return 503 or 429 responses to slow it down. Most sites can handle Google's default crawl rate.