What Is robots.txt and How Does It Work?
A robots.txt file is a plain text file that tells web crawlers, including Googlebot, Bingbot, and hundreds of others, which parts of your website they may crawl. Understanding how it works helps you make informed decisions about crawl budget, privacy, and indexation strategy. WikiPlus Robots Generator at wikiplus.co simplifies creating the file, but knowing the underlying mechanics helps you configure it correctly.
The Robots Exclusion Protocol Explained
The Robots Exclusion Protocol (REP) is the standard that defines how robots.txt works. It originated in 1994 as an informal convention and was published as an Internet standard, RFC 9309, in 2022. The protocol is voluntary: well-behaved crawlers like Googlebot comply with robots.txt instructions, but malicious bots often ignore them. A compliant crawler fetches the file before requesting anything else from your site. If the file cannot be fetched because of a server error (a 5xx response), Googlebot will typically pause crawling and retry later; it does not assume unrestricted access. If the file returns a 404, Google treats the entire site as accessible. This makes correct hosting of the robots.txt file itself an important technical SEO consideration.
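The fetch behavior described above can be summarized as a small decision function. This is a minimal Python sketch of the paragraph, not Google's actual implementation; the names CrawlPolicy and policy_for_status are hypothetical, and statuses the paragraph does not mention are deliberately left unspecified rather than guessed.

```python
from enum import Enum


class CrawlPolicy(Enum):
    USE_RULES = "parse the file and apply its rules"
    ALLOW_ALL = "treat the whole site as crawlable"
    PAUSE_AND_RETRY = "hold off crawling and fetch robots.txt again later"
    UNSPECIFIED = "not covered by the text above"


def policy_for_status(status: int) -> CrawlPolicy:
    """Map the HTTP status of a /robots.txt fetch to a crawl decision."""
    if 200 <= status < 300:
        return CrawlPolicy.USE_RULES
    if status == 404:
        # A missing file means there are no restrictions to apply.
        return CrawlPolicy.ALLOW_ALL
    if 500 <= status < 600:
        # A server error is not treated as permission to crawl everything.
        return CrawlPolicy.PAUSE_AND_RETRY
    return CrawlPolicy.UNSPECIFIED


for code in (200, 404, 503):
    print(code, "->", policy_for_status(code).value)
```

The practical takeaway is that a robots.txt endpoint that intermittently returns 5xx errors can stall crawling of the whole site, so it is worth monitoring like any other critical URL.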
The Syntax of a robots.txt File
A robots.txt file consists of one or more record blocks. Each block begins with a User-agent line identifying which crawler it applies to, followed by Disallow and Allow directives. The User-agent value is case-insensitive; Googlebot, googlebot, and GOOGLEBOT are equivalent. An asterisk (*) User-agent applies to all crawlers not covered by a more specific block. Disallow: /path/ blocks the specified path and everything below it. A Disallow: line with an empty value blocks nothing, which makes it the explicit allow-all rule. Allow: /path/ creates an exception within a broader Disallow, permitting a specific path. Paths are case-sensitive and must begin with a forward slash. The Sitemap: directive takes an absolute URL and does not belong to any user-agent block; it may appear anywhere in the file, although it is conventionally placed at the end.
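To make the syntax concrete, the following Python sketch parses a small robots.txt with the standard library's urllib.robotparser and queries it. The file contents, the example.com URLs, and the OtherBot name are assumptions invented for illustration. Note that urllib.robotparser applies the first matching rule in file order, which is not the same as Google's matching logic, so the Allow line is deliberately listed before the broader Disallow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one block for all crawlers, one for Googlebot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: Googlebot
Allow: /admin/public/
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` on the example site."""
    return parser.can_fetch(agent, "https://example.com" + path)


# User-agent matching is case-insensitive, so GOOGLEBOT uses the Googlebot block.
print(allowed("GOOGLEBOT", "/admin/public/page.html"))
print(allowed("Googlebot", "/admin/settings"))

# A crawler without its own block falls back to the * block.
print(allowed("OtherBot", "/admin/settings"))
print(allowed("OtherBot", "/blog/post"))

# Sitemap lines are collected separately from the user-agent blocks.
print(parser.site_maps())
```

A crawler with its own block follows only that block, not the * block, which is why the Googlebot block repeats the Disallow: /admin/ line.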
How Googlebot Processes robots.txt Rules
Googlebot generally caches robots.txt for up to 24 hours, so a change to the file can take about a day to be picked up. It evaluates rules by finding the most specific matching rule for each URL, where specificity is measured by the length of the rule's path. If both Allow: /page.html and Disallow: / apply, the more specific Allow takes precedence. If two rules of equal specificity conflict, Allow takes precedence over Disallow. Googlebot supports a limited subset of pattern matching: the asterisk wildcard (*) matches any sequence of characters within a path, and the dollar sign ($) anchors the pattern to the end of the URL. For example, Disallow: /*?s= blocks all URLs containing ?s= at any position, while Disallow: /*.pdf$ blocks all URLs ending in .pdf. Other crawlers may not support wildcards, so check each crawler's documentation.
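The precedence and wildcard rules above can be sketched in a few lines of Python. This is an illustrative approximation, not Googlebot's real matcher: it assumes specificity is measured by the character length of the rule path, and the function names pattern_to_regex and is_allowed are hypothetical.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern (* and trailing $) into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text and let each * match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def is_allowed(rules, url_path: str) -> bool:
    """rules is a list of (directive, pattern) pairs for one user-agent block."""
    best = None  # (pattern length, is_allow) of the winning rule so far
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty value matches nothing
        if pattern_to_regex(pattern).match(url_path):
            candidate = (len(pattern), directive.lower() == "allow")
            # Longer pattern wins; on equal length, Allow (True) beats Disallow.
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


precedence = [("Allow", "/page.html"), ("Disallow", "/")]
print(is_allowed(precedence, "/page.html"))
print(is_allowed(precedence, "/other.html"))

wildcards = [("Disallow", "/*?s="), ("Disallow", "/*.pdf$")]
print(is_allowed(wildcards, "/search?s=shoes"))
print(is_allowed(wildcards, "/files/report.pdf"))
print(is_allowed(wildcards, "/files/report.pdf?download=1"))
```

Comparing (length, is_allow) tuples keeps the tie-break explicit: a longer pattern always wins, and only when lengths are equal does Allow override Disallow.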
What robots.txt Cannot Do
Understanding robots.txt limitations prevents false confidence in crawl control. Robots.txt cannot prevent a page from being indexed; it only controls whether Googlebot visits the URL. If other sites link to a disallowed page, Google can index the URL from those links without ever crawling it. Robots.txt cannot remove a page from Google's index if the page was previously crawled. For that you need a noindex meta tag, and the page must remain crawlable so that Googlebot can see the tag. Robots.txt does not affect how a page ranks once it is indexed. Robots.txt cannot block crawlers that do not respect the protocol; to block malicious bots, use server-level IP blocking or a web application firewall. For pages containing truly sensitive information, use authentication (password protection) rather than relying solely on robots.txt.
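Because de-indexing depends on the noindex meta tag rather than on robots.txt, it can be useful to check whether a page actually carries one. The following Python sketch uses the standard library's html.parser to look for a robots meta tag containing noindex; the class and function names are hypothetical, and the HTML snippets are made up for the example. It only inspects markup and does not cover noindex sent through HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="robots"> tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        # "googlebot" is accepted here as a crawler-specific variant of "robots".
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
            if "noindex" in directives:
                self.noindex = True


def has_noindex(html: str) -> bool:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


blocked_page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
normal_page = '<html><head><meta name="robots" content="index, follow"></head></html>'
print(has_noindex(blocked_page))
print(has_noindex(normal_page))
```

A page that returns True here should not also be disallowed in robots.txt, since a crawler that never fetches the page never sees the tag.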
Frequently Asked Questions
- Does robots.txt affect SEO?
- Yes, robots.txt affects SEO indirectly through crawl budget management. Blocking low-value pages (search result pages, parameter variations, admin areas) from crawling helps Googlebot spend its crawl budget on your important content. However, robots.txt does not directly affect rankings. A common mistake is blocking CSS and JavaScript files that Googlebot needs to render and evaluate pages. This can hurt rankings by preventing Google from seeing how pages look to users.
- Can I block a specific URL in robots.txt?
- Yes. Use Disallow: /specific-page.html to block a single URL. Note that this only prevents crawling, not indexing. If the page is already indexed or linked from other sites, it may still appear in search results. For a page that must not appear in Google results at all, use a noindex meta tag and leave the page crawlable. The meta tag can only be read if the page is crawled, so do not disallow pages that carry a noindex tag; otherwise the noindex will never be seen.
- What is crawl budget and why does robots.txt help?
- Crawl budget is the number of pages Googlebot crawls on your site within a given timeframe. Google allocates crawl budget based on your site size, server response speed, and overall site quality. For small sites (under a few hundred pages), crawl budget is rarely a concern. For large sites with tens of thousands of pages, efficient crawl budget allocation is critical: using robots.txt to block faceted navigation, search result pages, and duplicate parameter URLs helps Googlebot focus on your canonical, indexable content.