FAQ: Robots.txt Questions Answered
Robots.txt generates a constant stream of questions, because a small file with large consequences deserves careful attention. What can it actually block? Does disallowing a page remove it from Google? How do I block AI crawlers? What is the right syntax for wildcard rules? This FAQ compiles the most frequently asked robots.txt questions from developers, SEOs, and site owners, with direct answers based on current Google documentation and real-world behavior.
Questions About robots.txt Basics
Q: What is the purpose of a robots.txt file?
A: robots.txt tells web crawlers which parts of your site they may and may not access. It follows the Robots Exclusion Protocol and is read by all major search engine bots (Googlebot, Bingbot, and others) before they crawl pages on your site. It helps you manage crawl budget, keep low-value pages out of crawlers' reach, and steer well-behaved bots away from admin and technical pages. It is advisory, not a security control: compliant crawlers honor it, but it does not stop a bot that chooses to ignore it.

Q: Where does robots.txt need to be located?
A: Exactly at the root of your domain: https://yourdomain.com/robots.txt. It cannot be in a subdirectory and it cannot be named something else; the filename must be exactly robots.txt, all lowercase. Each subdomain needs its own file: if your site is at https://shop.yourdomain.com, its robots.txt must be at https://shop.yourdomain.com/robots.txt, and the one at yourdomain.com does not apply to it.

Q: What should be in my robots.txt if I want Google to crawl everything?
A: Nothing, or an explicit allow-everything statement. If robots.txt is absent, crawlers treat that as 'allow all.' If you want to be explicit, you can write:

    User-agent: *
    Disallow:

A Disallow with no value means 'allow everything.'

Q: Can I use comments in robots.txt?
A: Yes. Lines starting with # are comments and are ignored by crawlers. Use them to explain your rules:

    # Block admin panel
    User-agent: *
    Disallow: /admin/

Q: Is robots.txt case-sensitive?
A: The directive names (User-agent, Disallow, Allow) are case-insensitive, but the path values in Disallow and Allow rules are case-sensitive. Disallow: /Admin/ blocks /Admin/ but not /admin/, regardless of how your server treats case. Match the exact case of your URL paths, and if your server answers on more than one spelling, add a rule for each.
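The empty-Disallow and case-sensitivity answers above can be checked locally. The following is a minimal sketch using Python's standard urllib.robotparser module; the helper name is_allowed, the bot name ExampleBot, and the example.com URLs are assumptions made for illustration. This parser is Python's own implementation, not Google's, and it does not understand * or $ wildcards inside paths, so treat it as a quick sanity check only.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="ExampleBot"):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# An explicit allow-everything file: Disallow with no value.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

# Directive names in upper case still work; the path keeps its exact case.
BLOCK_ADMIN = """\
# Block admin panel
USER-AGENT: *
DISALLOW: /Admin/
"""

print(is_allowed(ALLOW_ALL, "https://example.com/any/page"))          # True
print(is_allowed(BLOCK_ADMIN, "https://example.com/Admin/settings"))  # False
print(is_allowed(BLOCK_ADMIN, "https://example.com/admin/settings"))  # True
```

The last two lines show the point of the case-sensitivity answer: the rule written as /Admin/ has no effect on /admin/.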
Questions About Crawl Blocking and Indexation
Q: If I block a page in robots.txt, will it be removed from Google Search?
A: Not necessarily. Disallow prevents crawling, not indexing. Google can still include a URL in search results if it discovers that URL through links from other pages, even without having crawled the page itself. Such a listing typically has no description, just the URL and possibly text taken from links pointing to it. To remove a page from search results, use a noindex meta tag, which only works if the page is crawlable, so do not also Disallow it. Google Search Console's Removals tool hides a URL quickly, but only temporarily.

Q: Will blocking pages in robots.txt slow down my site's indexing?
A: No. Blocking low-value pages (admin pages, cart pages, search result pages) can actually improve indexing of important pages by freeing up crawl budget. Crawl budget is the number of pages Googlebot will crawl on your site in a given period. If crawlers spend less time on throwaway URLs, they have more capacity to crawl and re-index your valuable content.

Q: Can I use robots.txt to block a single page (not a directory)?
A: Yes, but remember that rules match by prefix:

    User-agent: *
    Disallow: /specific-page

This blocks /specific-page and every URL that begins with it, such as /specific-page-2 or /specific-page/print. To block only that exact URL, anchor the rule with $: Disallow: /specific-page$. For a directory, add a trailing slash: Disallow: /specific-directory/

Q: I added a Disallow rule but Googlebot is still crawling that URL. Why?
A: Several reasons are possible. First, robots.txt is cached: Google generally refreshes its copy within about 24 hours, so changes may not take effect immediately. Second, if the rule has a syntax error, it may be ignored. Third, Google may have crawled the URL before your rule was added, so it can still appear in Search Console as 'Crawled' from before the rule applied. Use the robots.txt report in Google Search Console, which replaced the older robots.txt Tester, to confirm that Google has fetched the current file and parsed it without errors.
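Because a noindex tag only works when the page can be crawled, a Disallow rule and a noindex tag on the same URL work against each other. The sketch below flags that conflict for one page. It is an illustration under stated assumptions: noindex_status and RobotsMetaFinder are hypothetical names, the HTML is passed in as a string rather than fetched, only the robots meta tag is inspected (not the X-Robots-Tag header), and the rule check uses Python's urllib.robotparser, which does plain prefix matching and is not Google's parser.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Records whether the page carries <meta name="robots" content="...noindex...">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots" and "noindex" in attr.get("content", "").lower():
            self.noindex = True


def noindex_status(robots_txt, url, html, user_agent="Googlebot"):
    """Say whether a crawler could actually see the page's noindex tag."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    finder = RobotsMetaFinder()
    finder.feed(html)
    if not finder.noindex:
        return "no noindex tag"
    if rules.can_fetch(user_agent, url):
        return "noindex visible to crawler"
    return "noindex hidden by Disallow"


ROBOTS = "User-agent: *\nDisallow: /private/\n"
PAGE = '<html><head><meta name="robots" content="noindex"></head></html>'

print(noindex_status(ROBOTS, "https://example.com/private/report", PAGE))
print(noindex_status(ROBOTS, "https://example.com/public/report", PAGE))
```

The first call reports the conflict: the page asks not to be indexed, but the crawler is never allowed to read that request.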
Questions About robots.txt Syntax
Q: What does Disallow: / mean?
A: It blocks access to all paths, that is, the entire site. This is the most aggressive Disallow rule. Combined with User-agent: *, it tells every compliant crawler not to access any page on your site. Use it only deliberately, for example on staging sites that should not be crawled.

Q: How do wildcards work in robots.txt?
A: The * character matches any sequence of characters in a path, including none. Disallow: /search?* blocks any URL starting with /search? followed by anything (all search result pages); because rules already match by prefix, the trailing * is optional and Disallow: /search? behaves the same. The $ character at the end anchors the match to the end of the URL. Disallow: /*.pdf$ blocks any URL that ends in .pdf.

Q: What is the difference between Disallow: /folder and Disallow: /folder/?
A: Disallow: /folder blocks every path starting with /folder: /folder itself, /folder/, /folder/anything, and also /foldername, which is usually an unintended match. Disallow: /folder/ with a trailing slash blocks /folder/ and all paths within it, but does NOT block /folder (without the trailing slash). For directories, use the trailing slash to be precise.

Q: Can I have multiple User-agent groups in one robots.txt file?
A: Yes. Each group starts with one or more User-agent: lines and is followed by its Disallow and Allow rules. Groups are conventionally separated by blank lines. A crawler follows only the most specific group that matches its name and falls back to the User-agent: * group when no named group matches. Sitemap: lines are independent of groups. Example:

    User-agent: GPTBot
    Disallow: /

    User-agent: *
    Disallow: /admin/

    Sitemap: https://yoursite.com/sitemap.xml
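The prefix, * and $ behavior described in this section can be made concrete with a small matcher. This is a sketch of the matching idea only, not Google's implementation: rule_matches is a hypothetical helper, it evaluates a single rule against a single path, and it ignores Allow/Disallow precedence and percent-encoding.

```python
import re


def rule_matches(rule, path):
    """Return True if a robots.txt path rule matches the given URL path.

    Rules match from the start of the path (prefix match). '*' matches any
    run of characters, and a trailing '$' requires the path to end there.
    """
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape literal text and turn each '*' into the regex '.*'.
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    if anchored:
        pattern += r"\Z"
    return re.match(pattern, path) is not None


# Trailing slash or not: the /folder question.
print(rule_matches("/folder", "/foldername"))      # True (unintended match)
print(rule_matches("/folder/", "/folder"))         # False
print(rule_matches("/folder/", "/folder/page"))    # True

# Wildcards and end anchors.
print(rule_matches("/*.pdf$", "/docs/guide.pdf"))      # True
print(rule_matches("/*.pdf$", "/docs/guide.pdf?v=2"))  # False
print(rule_matches("/search?*", "/search?q=robots"))   # True
```

Running candidate rules through a helper like this before deploying them is a cheap way to catch an accidental match such as /foldername.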
Questions About Specific Use Cases
Q: How do I block all AI training crawlers without blocking Google Search?
A: Create a separate User-agent group for each AI crawler with Disallow: /, and keep your standard rules under User-agent: *. The key is that Google-Extended (AI training) and Googlebot (search) are separate user-agent tokens, so blocking Google-Extended does not affect Googlebot. Add groups for GPTBot, Google-Extended, and CCBot, each with Disallow: /. Placing them before the wildcard group is a readable convention, but the order does not change which group a compliant crawler follows.

Q: My site has www and non-www versions. Do I need two robots.txt files?
A: Yes, if both versions are served as separate origins. https://www.yoursite.com and https://yoursite.com are technically different origins, and each needs its own robots.txt. However, if one redirects to the other with a 301 (non-www to www or vice versa), the primary version needs the full robots.txt and the redirecting host can have a minimal one or none at all, because crawlers follow the redirect to the primary version.

Q: Can I prevent Googlebot from indexing my images but still let it crawl my pages?
A: Yes. Use a separate User-agent: Googlebot-Image group:

    User-agent: Googlebot-Image
    Disallow: /

    User-agent: *
    Disallow: /admin/

This blocks Google's image crawler from all content while leaving standard Googlebot access unchanged.

Q: What is the maximum size for a robots.txt file?
A: Google reads the first 500KB (500 KiB) of a robots.txt file and ignores rules beyond that limit. For the vast majority of sites this is not a concern: a comprehensive robots.txt file with dozens of rules is typically under 5KB. For very large sites with extremely detailed per-bot configurations, monitor the file size and put the most important rules at the top.
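For the AI-crawler question above, one way to keep the file consistent is to generate it from a list of crawler tokens and then check the result. The sketch below is illustrative: build_robots_txt and within_size_limit are hypothetical helpers, the crawler list contains only the three tokens named in the answer, and the check uses Python's urllib.robotparser rather than any search engine's own parser. The size helper assumes the 500 KiB limit is counted in bytes of the UTF-8 encoded file.

```python
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ["GPTBot", "Google-Extended", "CCBot"]  # tokens named in the text


def build_robots_txt(ai_crawlers, disallow_for_everyone):
    """Build one Disallow: / group per AI crawler, then the wildcard group."""
    groups = [f"User-agent: {name}\nDisallow: /" for name in ai_crawlers]
    general = "\n".join(
        ["User-agent: *"] + [f"Disallow: {path}" for path in disallow_for_everyone]
    )
    return "\n\n".join(groups + [general]) + "\n"


def within_size_limit(text, limit_bytes=500 * 1024):
    """Check the encoded file size against an assumed 500 KiB limit."""
    return len(text.encode("utf-8")) <= limit_bytes


robots_txt = build_robots_txt(AI_CRAWLERS, ["/admin/"])
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/blog/"))      # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/"))   # True
print(parser.can_fetch("Googlebot", "https://example.com/admin/"))  # False
print(within_size_limit(robots_txt))                                # True
```

Generating the groups from a single list makes it harder to forget the Disallow line for one crawler when the list grows.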
Frequently Asked Questions
- Does robots.txt affect my site's SEO ranking directly?
- Not directly. robots.txt rules are not a ranking signal themselves, but they have significant indirect effects. Blocking low-value pages improves crawl budget efficiency, which means important pages get crawled more frequently. Keeping crawlers off thin or near-duplicate pages reduces the crawl effort spent on duplicate content. Keeping admin and transactional pages out of the crawl keeps Google's view of your site focused on the content that matters. These effects collectively support better SEO health, making robots.txt an important foundational element of technical SEO.
- What should I do if I accidentally blocked my site and it disappeared from Google?
- Act immediately. First, fix the robots.txt file: remove the Disallow: / rule under User-agent: *, or add Allow rules for your important pages. Then use Google Search Console's URL Inspection tool to request indexing for your homepage and key pages. It may take days to several weeks for Google to re-crawl and re-index your site, depending on its size and crawl frequency. Submit or resubmit your sitemap in Search Console to accelerate rediscovery. Monitor the Page indexing report (formerly Coverage) for recovery of the 'Indexed' count over the following weeks.
- Can I test robots.txt rules without putting a file on my live site?
- You can test robots.txt syntax offline using validator tools that accept pasted content, and you can manually verify whether rules would match specific URLs by applying the robots.txt matching algorithm yourself. However, you cannot use Google Search Console's URL Inspection tool without a deployed file on a publicly accessible URL, and you cannot see how Googlebot actually interprets your file until it is live. The safest approach is to test on a staging environment with a public URL before deploying to production.
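As one example of offline checking, a draft can be linted before it is deployed. The sketch below is a deliberately small, hypothetical linter (lint_robots_txt is not part of any library). It assumes that only User-agent, Disallow, Allow, and Sitemap are expected directives, and it also flags the accidental site-wide block discussed in this section. It does not replace checking how a real crawler reads the live file.

```python
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap"}  # assumed set


def lint_robots_txt(text):
    """Return a list of (line_number, message) warnings for a robots.txt draft."""
    warnings = []
    current_agents = []  # user-agents of the group being read
    in_rules = False     # True once the current group has at least one rule
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            warnings.append((number, "missing ':' separator"))
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN_DIRECTIVES:
            warnings.append((number, f"unknown directive '{field}'"))
        elif field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents, in_rules = [], False
            current_agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if not current_agents:
                warnings.append((number, "rule appears before any User-agent line"))
            if value and not value.startswith(("/", "*")):
                warnings.append((number, "path should start with '/'"))
            if field == "disallow" and value == "/" and "*" in current_agents:
                warnings.append((number, "blocks the entire site for all crawlers"))
    return warnings


draft = "Disallow: /tmp/\nUser-agent: *\nDisalow: /admin/\nDisallow: private/\nDisallow: /\n"
for number, message in lint_robots_txt(draft):
    print(f"line {number}: {message}")
```

For the draft shown, the linter reports the orphaned rule on line 1, the misspelled directive on line 3, the path without a leading slash on line 4, and the site-wide block on line 5.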