Robots.txt: Control How Googlebot Crawls

Robots.txt: control crawling by Googlebot, block directories, crawl-rate directives, and SEO best practices.

About Author

Thibault Besson-Magdelain

Founder of Sorank, 5+ years of experience in SEO, GEO enthusiast.

Summary: Robots.txt is a text file in your site root that tells search engines which parts of your site to crawl and which to skip, helping you manage crawl budget and keep crawlers away from private or low-value pages.

Robots.txt is a simple but powerful file that controls how search engines crawl your website. It sits at `https://example.com/robots.txt` and contains rules specifying which pages and directories Google, Bing, and other bots can crawl. Robots.txt helps you manage crawl budget (roughly, how many URLs Google will crawl on your site in a given period), avoids wasting crawl on unimportant pages, and keeps crawlers out of areas that are not meant for search. It does not by itself guarantee that a URL stays out of search results; that is the job of noindex.

Most websites have a robots.txt file, but many are misconfigured. A misconfigured robots.txt can accidentally block important pages, wasting ranking potential. A well-configured robots.txt improves crawl efficiency and keeps crawlers away from low-value areas. This guide covers robots.txt syntax, best practices, and real-world examples.

What Is Robots.txt and How Search Engines Use It

Robots.txt is a standardized text file that communicates crawl instructions to search engine bots. When a bot first visits your site, it requests `/robots.txt` before crawling anything else. The robot reads the rules and follows them (assuming the bot is well-behaved).
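Because the file always lives at the same path on a host, a crawler can derive its location from any page URL. A minimal Python sketch of that lookup, where the function name `robots_url` and the example URLs are hypothetical and chosen only for illustration:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (same scheme and host)."""
    parts = urlsplit(page_url)
    # Keep scheme and host (including any port); replace the path,
    # and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://shop.example.com/products/42?color=red"))
```

In this sketch `shop.example.com` and `example.com` resolve to different files, which is why each host needs its own robots.txt.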

Google's robots.txt documentation describes how Googlebot interprets the file. The robots.txt format was created in 1994, has been widely adopted, and was formalized as RFC 9309 in 2022. All major search engines (Google, Bing, Baidu) respect robots.txt.

Important: Robots.txt is a guideline, not a firewall. Well-behaved bots (Google, Bing) respect robots.txt rules. Malicious bots and scrapers ignore robots.txt. Use robots.txt to manage search engine crawling, not to block hackers or scrapers. For security, use server-level tools.

Robots.txt Syntax and Basic Rules

Robots.txt uses simple text syntax. Rules are organized in groups. Each group starts with a `User-agent` line (which bot the group applies to), followed by `Disallow` paths (which pages to block) and, optionally, `Allow` paths.

Basic example:

User-agent: *
Disallow: /admin/
Disallow: /staging/
Sitemap: https://example.com/sitemap.xml

This tells all bots (`*` means all) to not crawl the `/admin/` and `/staging/` directories. The Sitemap line tells bots where your sitemap is located.
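To see how a crawler might read these rules, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The helper name `build_parser` and the tested paths are assumptions for illustration. The standard-library parser does simple prefix matching and may not evaluate every rule exactly as Googlebot does, so treat the result as an approximation.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /staging/
Sitemap: https://example.com/sitemap.xml
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
for path in ("/admin/settings", "/staging/home", "/blog/robots-guide"):
    allowed = parser.can_fetch("Googlebot", "https://example.com" + path)
    print(path, "allowed" if allowed else "blocked")
```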

`User-agent: *` means all bots. You can also specify individual bots:

`User-agent: Googlebot` applies only to Google's bot. `User-agent: Bingbot` applies only to Bing's bot. You can have multiple User-agent sections with different rules.

Disallow: /path/ tells bots not to crawl that path. Disallow: / blocks the entire site. Disallow: (empty) allows everything. You can list multiple Disallow rules per User-agent.

Allow: /path/ allows crawling a specific path even if a parent directory is disallowed. Example: Disallow: /temp/ but Allow: /temp/important/ allows crawling only the /important/ subdirectory.
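When an Allow and a Disallow rule both match a URL, Google's documentation and RFC 9309 say the most specific (longest) matching rule wins, with Allow winning a tie. A simplified Python sketch of that precedence, assuming plain prefixes without wildcards (the function name `is_allowed` and the rule tuples are hypothetical):

```python
def is_allowed(rules, path):
    """rules: list of (directive, prefix) tuples, e.g. ("disallow", "/temp/")."""
    best_len = -1
    best_allow = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty value matches nothing
        allow = directive.lower() == "allow"
        # The longer prefix wins; on a tie, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and allow):
            best_len = len(prefix)
            best_allow = allow
    return best_allow

RULES = [("disallow", "/temp/"), ("allow", "/temp/important/")]
for path in ("/temp/important/report.html", "/temp/scratch.html", "/blog/"):
    print(path, is_allowed(RULES, path))
```

Because the decision depends on rule length rather than position, the order of the two lines in the file does not matter under this interpretation.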

Common Robots.txt Patterns

Pattern 1: Block admin pages

User-agent: *
Disallow: /admin/
Disallow: /user/
Disallow: /account/

This blocks administrative, user, and account pages from crawling. These pages are typically not meant for search engines.

Pattern 2: Block staging environment

User-agent: *
Disallow: /staging/
Disallow: /test/

This prevents well-behaved bots from crawling test or staging versions of your site.

Pattern 3: Block specific file types

User-agent: *
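Disallow: /*.pdf$
Disallow: /*.zip$

This keeps bots from crawling PDF and ZIP files. The `*` wildcard matches any sequence of characters, and the trailing `$` anchors the rule to the end of the URL; without it, `/*.pdf` also matches URLs that merely contain `.pdf`. Wildcards are supported by Google and Bing and are part of RFC 9309. This pattern is useful if you have many PDFs that should not be crawled.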
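Wildcard rules can be reasoned about by translating them into regular expressions. The sketch below, with hypothetical helper names `pattern_to_regex` and `matches`, compares `/*.pdf` with the end-anchored `/*.pdf$`. It is an illustration of the matching idea, not Google's implementation.

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # "*" matches any run of characters; everything else is literal.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def matches(pattern: str, path: str) -> bool:
    """True if the rule pattern applies to the URL path (plus query string)."""
    return pattern_to_regex(pattern).match(path) is not None

for rule in ("/*.pdf", "/*.pdf$"):
    for path in ("/docs/guide.pdf", "/docs/guide.pdf?download=1", "/docs/guide.html"):
        print(rule, path, matches(rule, path))
```

The unanchored rule also catches the URL with a query string, while the anchored rule only catches URLs that end in `.pdf`. Which behavior you want depends on how your files are linked.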

Pattern 4: Slow bots that hammer your server

User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Crawl-delay: 10

This completely blocks AhrefsBot (if you do not want your site crawled by SEO tools) and asks SemrushBot to wait 10 seconds between requests. `Crawl-delay` is useful for aggressive bots that overload your server, but only for crawlers that honor it; Googlebot ignores it.
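A polite crawler reads these per-bot rules before it makes any request. As a rough sketch, Python's `urllib.robotparser` exposes both the allow/block decision and the `Crawl-delay` value (the test URL is a made-up example):

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in ("AhrefsBot", "SemrushBot", "Googlebot"):
    allowed = parser.can_fetch(agent, "https://example.com/pricing")
    delay = parser.crawl_delay(agent)  # None when no delay is declared
    print(agent, "allowed" if allowed else "blocked", delay)
```

Bots without a matching group, such as Googlebot here, fall through to the default of allowed with no declared delay.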

Pattern 5: Allow all (default)

User-agent: *
Disallow:

This is the default. Empty Disallow means allow all. You can also omit robots.txt entirely if you want all content crawlable.

Robots.txt vs Meta Robots Noindex

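Robots.txt blocks crawling. Meta robots noindex blocks indexing. These serve different purposes: a URL blocked in robots.txt can still appear in search results if other pages link to it, and a noindex tag only works if the page can be crawled.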

Use robots.txt when: You want to save crawl budget. You have duplicate content that should not be crawled. You have admin pages that should not be touched by bots. You want to slow down aggressive bots.

Use meta robots noindex when: You want a page to be crawled but not indexed (to see errors and issues). You want to prevent indexing but still allow crawling and link discovery. You want to eventually remove a page from search but keep it live. Do not combine the two on the same URL: if robots.txt blocks the page, Google never sees the noindex tag.

Example: Paginated pages like `/products?page=2` can be blocked in robots.txt to save crawl budget if they add little value on their own. But a blocked page cannot be fetched, so Google cannot read its canonical tag or follow its links to deeper pages. If you need those signals, leave the pages crawlable and use canonicals instead of robots.txt.
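A noindex tag can only take effect if the crawler is allowed to fetch the page that carries it. The following Python sketch flags URLs where the two mechanisms conflict. The function name `hidden_noindex` and the URL list are hypothetical, and the check relies on the standard library's simplified prefix matching.

```python
from urllib.robotparser import RobotFileParser

def hidden_noindex(robots_txt, noindex_urls, agent="Googlebot"):
    """Return the noindex URLs that the agent is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in noindex_urls if not parser.can_fetch(agent, url)]

ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
"""

NOINDEX_URLS = [
    "https://example.com/account/settings",  # blocked: the tag is never seen
    "https://example.com/thank-you",         # crawlable: the tag can take effect
]

print(hidden_noindex(ROBOTS_TXT, NOINDEX_URLS))
```

Any URL this returns needs a decision: either remove the Disallow rule so the noindex can be read, or accept that the page is only blocked from crawling.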

Managing Crawl Budget With Robots.txt

Crawl budget is the number of URLs Google can and wants to crawl on your site in a given period. Large sites with millions of pages cannot have all pages crawled daily. Google's documentation describes crawl budget as a combination of crawl capacity (how much load your server can handle) and crawl demand (how popular your pages are and how often they change). Crawl budget is finite. Wasting it on unimportant pages means important pages are crawled less frequently.

Optimize crawl budget by blocking pages that should not be crawled: duplicate content, paginated search results, user account pages, testing pages. Every page you block gives Google more budget to crawl your important content.

Common crawl budget wasters: infinite pagination (product filters create unlimited URLs), duplicate content with different parameters, session IDs appended to every URL, calendar/event pages generating endless URLs. Use robots.txt to block these patterns.
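One way to spot these patterns is to count how many distinct query strings were crawled for each path. A minimal Python sketch, assuming you have exported a list of crawled URLs from your server logs (the function name `parameter_variants` and the sample URLs are hypothetical):

```python
from collections import defaultdict
from urllib.parse import urlsplit

def parameter_variants(urls):
    """Map each path to the number of distinct query strings seen for it."""
    queries_by_path = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        if parts.query:
            queries_by_path[parts.path].add(parts.query)
    return {path: len(queries) for path, queries in queries_by_path.items()}

CRAWLED_URLS = [
    "https://example.com/products?color=red",
    "https://example.com/products?color=red&size=m",
    "https://example.com/products?sessionid=abc123",
    "https://example.com/calendar?month=2031-01",
    "https://example.com/calendar?month=2031-02",
    "https://example.com/about",
]

# List the paths with the most parameter variants first.
counts = parameter_variants(CRAWLED_URLS)
for path, count in sorted(counts.items(), key=lambda item: -item[1]):
    print(path, count)
```

Paths with a large and growing number of variants are candidates for a Disallow rule or a canonical tag.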

Google Search Console shows your site's crawl statistics. Monitor crawl requests daily. If Google crawls the same pages repeatedly without discovering new content, review your robots.txt and blocking strategy.

Sitemap in Robots.txt

Include your sitemap URL in robots.txt. Add `Sitemap: https://example.com/sitemap.xml` at the end of your robots.txt file. This tells Google where to find your XML sitemap. You can list multiple sitemaps if you have multiple files.

Example:

User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml

Listing sitemaps in robots.txt is optional (you can submit sitemaps via Google Search Console), but it is a best practice.
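Any client that parses robots.txt can pick up the listed sitemaps. As a sketch, Python's `urllib.robotparser` returns them through `site_maps()`, which is available in Python 3.8 and later:

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of URLs, or None if no Sitemap line was found.
sitemaps = parser.site_maps()
for sitemap_url in sitemaps or []:
    print(sitemap_url)
```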

Testing and Validating Robots.txt

Google Search Console includes a robots.txt report under Settings. It lists the robots.txt files Google found for your site, when each was last fetched, and any warnings or errors. The older standalone robots.txt tester has been retired. To check whether a specific URL is blocked for Googlebot, run it through the URL Inspection tool.

Always test before deploying robots.txt changes. A single mistake (like `Disallow: /` blocking your entire site) can tank your rankings. Before deploying, verify that:

Important pages are not blocked. Admin pages are blocked. Duplicate content patterns are blocked. No critical paths are accidentally disallowed.
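The checklist above can be automated as a pre-deployment check. This is a minimal sketch rather than a full validator: the function name `check_rules` and the URL lists are hypothetical, and Python's standard-library parser uses simple prefix matching, so wildcard rules are not evaluated the way Googlebot would evaluate them.

```python
from urllib.robotparser import RobotFileParser

def check_rules(robots_txt, must_allow, must_block, agent="Googlebot"):
    """Return a list of problems found in a draft robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    problems = []
    for url in must_allow:
        if not parser.can_fetch(agent, url):
            problems.append(f"blocked but should be crawlable: {url}")
    for url in must_block:
        if parser.can_fetch(agent, url):
            problems.append(f"crawlable but should be blocked: {url}")
    return problems

# A draft that blocks the whole site by mistake.
DRAFT = """\
User-agent: *
Disallow: /
"""

MUST_ALLOW = ["https://example.com/products/"]
MUST_BLOCK = ["https://example.com/admin/"]

for problem in check_rules(DRAFT, MUST_ALLOW, MUST_BLOCK):
    print(problem)
```

Running a check like this in a deployment pipeline catches an accidental `Disallow: /` before it reaches production.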

After deploying robots.txt, monitor Google Search Console's Crawl Stats report for changes. If crawl requests drop unexpectedly, you may have accidentally blocked important content.

Common Robots.txt Mistakes

Mistake 1: Blocking CSS and JavaScript. If you block `/css/` or `/js/` in robots.txt, Google cannot crawl your CSS and JavaScript. Without CSS, Google cannot render your pages properly. Do not block CSS or JavaScript.

Mistake 2: Blocking important content. Always test before deploying. Disallow values are path prefixes, so a truncated rule like `Disallow: /p` (where a full directory such as `/private/` was intended) blocks every path that starts with `/p`, including `/products/`.
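Disallow values match by prefix, which is why a short value blocks far more than one directory. A small Python sketch of that effect, using a hypothetical helper `blocked_paths` and made-up site paths:

```python
def blocked_paths(disallow_prefix, site_paths):
    """Return the paths a plain (non-wildcard) Disallow prefix would block."""
    return [path for path in site_paths if path.startswith(disallow_prefix)]

SITE_PATHS = ["/products/", "/pricing", "/private/notes", "/blog/"]

print(blocked_paths("/p", SITE_PATHS))         # the too-short rule
print(blocked_paths("/private/", SITE_PATHS))  # the full directory rule
```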

Mistake 3: Using robots.txt for security. Do not rely on robots.txt to protect sensitive data. Security-sensitive pages should require authentication, not just robots.txt. Robots.txt is public and easily circumvented.

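Mistake 4: Inconsistent robots.txt across domains. Robots.txt applies per host, so `example.com`, `www.example.com`, and `shop.example.com` each need their own file. If you have multiple domains or subdomains, maintain consistent robots.txt policies. Accidentally different rules can cause crawl efficiency problems.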

Mistake 5: Blocking the sitemap itself. Never block `/sitemap.xml` in robots.txt. Google needs to crawl the sitemap to discover pages.

Advanced Robots.txt Features

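Crawl-delay and Request-rate: These non-standard directives ask bots to slow down. `Crawl-delay: 10` asks for 10 seconds between requests. `Request-rate: 1/10` asks for at most 1 request per 10 seconds. Use these for bots that overload your server, keeping in mind that support varies by crawler: Bing honors `Crawl-delay`, while Googlebot ignores both directives. To slow Googlebot temporarily, Google's documentation suggests returning 500, 503, or 429 status codes instead.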
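A crawler that honors both directives has to reconcile them into one waiting time. The Python sketch below takes the stricter of the two. The function name `min_interval` is hypothetical, and taking the maximum is an assumption about sensible behavior, not something the directives themselves define.

```python
from urllib.robotparser import RobotFileParser

def min_interval(parser, agent):
    """Seconds to wait between requests, using the stricter directive."""
    waits = [0.0]
    delay = parser.crawl_delay(agent)
    if delay is not None:
        waits.append(float(delay))
    rate = parser.request_rate(agent)  # named tuple: (requests, seconds)
    if rate is not None and rate.requests > 0:
        waits.append(rate.seconds / rate.requests)
    return max(waits)

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Request-rate: 1/10
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())
print(min_interval(parser, "ExampleBot"))
```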

Allow directive: Allows crawling a specific path even if a parent path is disallowed. Useful for carving out exceptions. Example: `Disallow: /temp/` but `Allow: /temp/keep/` allows only the keep subdirectory.

Google's robots.txt specification documents all supported directives. Most features are rarely needed. Stick to basic User-agent, Disallow, and Sitemap for most sites.

Advanced Robots.txt Directives

User-agent specific rules allow different crawl rules for different bots. You can specify rules for Googlebot, Bingbot, and other user-agents separately. This is useful if you want Google to crawl your entire site but restrict Bing from accessing certain sections. Specify user-agent at the start of each rule block:

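`User-agent: Googlebot` applies rules only to Google's crawler. `User-agent: *` applies rules to all bots that have no group of their own. Rules apply to the specific user-agent until the next user-agent directive. You can create multiple rule blocks for different bots. A crawler follows only the most specific group that matches it, so once Googlebot has its own group it ignores the rules under `User-agent: *`; repeat any shared rules in both groups.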
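A crawler that has its own group does not also inherit the rules from the `*` group. A short Python sketch with `urllib.robotparser` illustrates this (the paths `/private/` and `/drafts/` are made-up examples):

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in ("Googlebot", "Bingbot"):
    for path in ("/private/report", "/drafts/post"):
        allowed = parser.can_fetch(agent, "https://example.com" + path)
        print(agent, path, "allowed" if allowed else "blocked")
```

Here Googlebot may fetch `/private/report` because its own group only lists `/drafts/`. To block `/private/` for every bot, the rule has to appear in both groups.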

Crawl-delay and request-rate directives ask bots to crawl more slowly. `Crawl-delay: 5` asks the bot to wait 5 seconds between requests, which reduces server load. `Request-rate: 1/10` asks the bot to make at most 1 request per 10 seconds. Google's robots.txt documentation lists only `User-agent`, `Allow`, `Disallow`, and `Sitemap` as supported, so these two directives affect only crawlers that choose to honor them.

Sitemap location directives tell bots where to find your sitemap. `Sitemap: https://example.com/sitemap.xml` points bots to your XML sitemap. You can specify multiple sitemaps. This is recommended as it helps bots discover all your pages efficiently.

Clean-param directive tells a crawler to ignore specific URL parameters. It is a Yandex extension, and Google does not support it. `Clean-param: utm_source&utm_medium /` tells Yandex to treat URLs that differ only in those UTM parameters as the same page, which prevents tracked links from being crawled as duplicate content. For Google, rely on canonical tags; Google handles most tracking parameters automatically.
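The effect of ignoring tracking parameters can be sketched as URL normalization. This Python example is an illustration of the idea, not how any search engine implements it; the function name `strip_params` and the sample URL are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def strip_params(url, ignored):
    """Return url with the ignored query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )

IGNORED = {"utm_source", "utm_medium"}
print(strip_params("https://example.com/page?utm_source=news&id=5&utm_medium=email", IGNORED))
```

Two tracked links that normalize to the same URL are the same page for crawling purposes, which is the duplication this kind of directive is meant to avoid.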

Check your robots.txt file in Google Search Console's robots.txt report, and use the URL Inspection tool to see whether a given URL is blocked for Googlebot. This prevents accidental blocking of important pages.

Conclusion

Robots.txt is a simple but critical file for managing search engine crawling. A well-configured robots.txt keeps crawlers off unimportant pages, saves crawl budget, and reduces repeated crawling of duplicate content. A misconfigured robots.txt can accidentally block important content and tank your rankings.

Always test robots.txt changes before deploying. Use Google Search Console's robots.txt report and URL Inspection tool to validate rules. Monitor your crawl statistics monthly. Block unimportant content and manage crawl budget effectively. Use our GEO SEO audit tool to audit your robots.txt configuration and identify potential issues with crawlability and indexing across your entire site.

Frequently asked questions

Does blocking a page in robots.txt improve SEO?

No. Blocking a page in robots.txt prevents Google from crawling it but does not by itself help SEO. If you do not want a page indexed, use a noindex tag instead, and leave the page crawlable so Google can see the tag. Use robots.txt to save crawl budget.

Should you block CSS and JavaScript files?

No. Never block CSS or JavaScript in robots.txt. Google needs access to those resources to fully understand your page. Blocking them prevents Google from rendering your content correctly.

What crawl rate should you set?

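Generally, you do not need to set a crawl rate unless a crawler is making too many requests. Googlebot ignores the `Crawl-delay` directive, so if Google is exhausting your bandwidth, temporarily return 503 or 429 responses until the load eases. `Crawl-delay` remains useful for bots that honor it, such as Bingbot. Most sites can handle Google's default crawl rate.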
