WikiPlus

Robots.txt Guide: Control What Google Crawls

Your robots.txt file is a direct line of communication with Google's crawlers. Used correctly, it focuses Googlebot's attention on your most valuable pages, keeps crawlers away from low-value URLs, and helps your important URLs get discovered and re-crawled more frequently. Used incorrectly, it can accidentally block your entire site from being crawled. This guide explains how to use robots.txt strategically to control what Google crawls, with specific rules for common situations and an explanation of what robots.txt cannot do.

How Googlebot Uses robots.txt

Before Googlebot crawls any URL on your site, it fetches and reads your robots.txt file. It caches the file and generally refreshes it about every 24 hours. The rules in the file tell Googlebot which paths it may access and which to skip. If a URL is disallowed, Googlebot will not fetch that page.

Blocking a crawl is not the same as blocking indexing. If other pages link to a disallowed URL, Google may still list that URL in search results as a known URL without having crawled it, usually showing only the URL or anchor text from the links, with no description snippet. To keep a URL out of search results entirely, use a noindex meta tag (or an X-Robots-Tag header) and leave the URL crawlable. Googlebot has to fetch the page to see the noindex, so combining noindex with a robots.txt Disallow on the same URL defeats it.

Googlebot has a crawl budget: a limit on how many pages it will crawl on your site in a given period. Google describes this budget as a combination of crawl demand (how popular your URLs are and how often they change) and crawl capacity (how quickly your server can handle crawler requests without errors). For very large sites, crawl budget is a meaningful SEO concern. For smaller sites (under a few thousand pages), most pages are crawled without any budget constraints. Where crawl budget does matter, robots.txt blocking is an effective way to redirect it away from low-value URLs. Blocking internal search (/search), tag archives (/tag/), deep pagination (/page/2 and beyond), and similar duplicate or low-quality URL patterns frees up budget for your product pages, service pages, and content.

Googlebot obeys the group that most specifically matches its name. If your file contains a User-agent: Googlebot group, Googlebot follows that group and ignores the wildcard (*) group; if it does not, Googlebot falls back to the wildcard group. Groups addressed to other crawlers (such as User-agent: Bingbot) are respected by those crawlers but not by Googlebot.
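
The group-selection behaviour described above can be checked locally. The following is a minimal sketch using Python's standard-library urllib.robotparser; the robots.txt content, the example.com URLs, and the OtherBot name are made up for illustration. This parser is not Google's: it uses simple prefix matching and does not understand the * and $ path wildcards, so treat the result as an approximation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one group for Googlebot, one wildcard group.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /search
"""


def make_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

# Googlebot has its own group, so only that group's rules apply to it.
googlebot_private = parser.can_fetch("Googlebot", "https://example.com/private/report")
googlebot_search = parser.can_fetch("Googlebot", "https://example.com/search")

# A crawler without its own group falls back to the wildcard group.
other_search = parser.can_fetch("OtherBot", "https://example.com/search")

print(googlebot_private, googlebot_search, other_search)
```

In this sketch Googlebot may fetch /search even though the wildcard group disallows it, because the Googlebot group replaces the wildcard group rather than adding to it. Rules you want Googlebot to follow have to be repeated inside its own group.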

What to Disallow for Better Crawl Efficiency

Not all pages on your site deserve to be crawled and indexed. Here are the categories of URLs most commonly worth disallowing.

Internal search results: URLs like /search?q=keyword or /results?query=term are thin pages that duplicate your core content with a different query parameter. They add no value to the search index and consume crawl budget. Block them with a rule that matches your own search path or parameter, for example Disallow: /search or Disallow: /*?q=

Admin and login pages: /admin/, /wp-admin/, /dashboard/, /login. These should never be in Google's index, and Disallow: /admin/ and Disallow: /wp-admin/ are standard inclusions. Blocking them in robots.txt does not make them secure; use server-side authentication for that.

Duplicate content from parameters: If your site has URLs with tracking or session parameters (e.g., /product?ref=homepage&session=abc123), these can create thousands of duplicate URLs. Block the parameter patterns (for example Disallow: /*session=) or point the duplicates at the clean URL with a canonical tag. Google retired the URL Parameters tool in Search Console in 2022, so it is no longer an alternative.

Calendar and archive pages: On WordPress and blog sites, date archives for old dates (/date/2018/09/), author archives with minimal content (/author/username/ on single-author sites), and heavily paginated tag or category archives (/tag/keyword/page/20) are low-value crawl targets.

Staging and development content: Any test paths, /staging/, /dev/, or sandbox environments that might have leaked into production.

APIs and data endpoints: /wp-json/, /api/v1/, /feed/ (RSS/Atom feeds). These return data for applications, not content for users. Crawlers can index API responses if they are not blocked, which adds noise to your index without benefit.
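
Patterns such as Disallow: /*?q= rely on the two wildcards Google documents for robots.txt paths: * matches any run of characters, and a trailing $ anchors the end of the URL. The sketch below translates a pattern into a regular expression to show which paths it would match. The helper names pattern_to_regex and matches are hypothetical, and this is an illustration of the matching rules, not Google's implementation.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regex.

    '*' matches any sequence of characters; a trailing '$' means the
    URL must end there. Without '$' the pattern is a prefix match.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + regex + ("$" if anchored else ""))


def matches(pattern: str, path: str) -> bool:
    """Return True if the path (including any query string) matches."""
    return pattern_to_regex(pattern).match(path) is not None


checks = [
    ("/search", "/search?q=shoes"),
    ("/*?q=", "/results?q=term"),
    ("/*?q=", "/results?query=term"),
    ("/*.pdf$", "/files/guide.pdf"),
    ("/*.pdf$", "/files/guide.pdf?download=1"),
]
results = [matches(pattern, path) for pattern, path in checks]

for (pattern, path), hit in zip(checks, results):
    print(f"{pattern:10} {path:32} {'blocked' if hit else 'not matched'}")
```

The third check is the one to watch: /*?q= does not match /results?query=term, because the parameter there is named query rather than q. Each search parameter your site uses needs its own pattern.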

A Practical robots.txt Template

Here is a practical robots.txt configuration for a standard website, with explanations for each rule.

User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-json/
Disallow: /search
Disallow: /cart
Disallow: /checkout
Disallow: /account/
Disallow: /login
Disallow: /*.pdf$
Allow: /wp-admin/admin-ajax.php
Sitemap: https://yourdomain.com/sitemap.xml

Explanation of each rule:
- Disallow: /admin/ and /wp-admin/ block admin areas, including the WordPress dashboard. The Allow: /wp-admin/admin-ajax.php exception is needed because WordPress uses admin-ajax.php for some front-end functionality that crawlers may need to access.
- Disallow: /wp-includes/ blocks WordPress system files. These files do not need to be in the search index, but Google fetches CSS and JavaScript to render pages, so remove this rule if your theme or plugins load front-end assets from this directory.
- Disallow: /wp-json/ blocks the WordPress REST API.
- Disallow: /search blocks internal search result pages.
- Disallow: /cart and /checkout block cart and checkout pages (for example in WooCommerce). These should never be indexed.
- Disallow: /account/ blocks user account pages.
- Disallow: /login blocks the login page.
- Disallow: /*.pdf$ blocks all PDF files from being crawled (remove this if you want PDFs indexed).
- Sitemap: points crawlers to your XML sitemap.

For non-WordPress sites, remove the WordPress-specific rules and substitute your platform's equivalent paths.
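
The Allow exception for admin-ajax.php works because Google resolves conflicts by specificity, not by the order of lines: the longest matching path wins, and on a tie the less restrictive rule (Allow) is used. Below is a minimal sketch of that precedence logic. The function name is_allowed and the rule list are hypothetical, and wildcards are left out for brevity, so only plain prefix rules are handled.

```python
def is_allowed(rules, path):
    """Decide whether a path may be crawled under one user-agent group.

    rules is a list of (directive, prefix) pairs, where directive is
    "allow" or "disallow". The longest matching prefix wins; if an
    allow and a disallow rule tie on length, allow wins.
    """
    best_length = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive == "allow"
        longer = len(prefix) > best_length
        tie_for_allow = len(prefix) == best_length and is_allow
        if longer or tie_for_allow:
            best_length = len(prefix)
            allowed = is_allow
    return allowed


# A subset of the template above, without the wildcard rule.
TEMPLATE_RULES = [
    ("disallow", "/wp-admin/"),
    ("disallow", "/cart"),
    ("allow", "/wp-admin/admin-ajax.php"),
]

ajax_ok = is_allowed(TEMPLATE_RULES, "/wp-admin/admin-ajax.php")
options_ok = is_allowed(TEMPLATE_RULES, "/wp-admin/options.php")
cart_ok = is_allowed(TEMPLATE_RULES, "/cart/view")
post_ok = is_allowed(TEMPLATE_RULES, "/blog/hello-world")

print(ajax_ok, options_ok, cart_ok, post_ok)
```

Not every crawler follows this rule. Some parsers apply the first matching line instead, so placing Allow lines before the broader Disallow they carve out of is a reasonable defensive habit.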

robots.txt Limitations: What It Cannot Do

Understanding what robots.txt cannot do is as important as knowing what it can do. Misunderstanding its limitations leads to false security and misconfigured SEO strategies.

robots.txt does not prevent pages from being indexed. It prevents crawling, but Google can still index a page it has never crawled if that page is linked to from other pages. The indexed entry will lack a description snippet, but the URL can still appear in search results. To fully prevent indexing, use a noindex directive in the page's HTTP headers or HTML head, not robots.txt, and make sure the page is not disallowed, because Googlebot must be able to fetch it to see the directive.

robots.txt does not keep content secret. Any bot, scraper, or person can read your robots.txt file; it is public. In fact, robots.txt files are sometimes used by researchers and hackers to discover paths that the site owner considers sensitive. Do not rely on a Disallow line to hide a path you want to protect; use server authentication to restrict access to sensitive content.

robots.txt does not control all bots. Only bots that respect the Robots Exclusion Protocol will obey your rules. Malicious bots, content scrapers, and link harvesters commonly ignore robots.txt entirely. It is an honor system that reputable crawlers follow.

robots.txt cannot control crawl frequency precisely. The Crawl-delay directive is ignored by Googlebot, and Google retired the crawl rate limiter setting in Search Console in early 2024. If Googlebot is overloading your server, Google's guidance is to temporarily return 500, 503, or 429 status codes, which causes it to slow down.

robots.txt cannot selectively block based on user geography or device type. The same rules apply to all requests from a given user-agent string. If you need device-specific or geo-specific crawl control, that requires server-side logic.

robots.txt does not affect canonical URL resolution. If you have duplicate content issues, use canonical tags (rel=canonical) to point to the preferred version. Blocking a duplicate URL in robots.txt does not cause Google to consolidate link equity to the canonical; Google cannot even read a canonical tag on a page it is not allowed to crawl.
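
A noindex directive sent in the HTTP headers uses the X-Robots-Tag header. The sketch below shows one way to attach it, written as a bare WSGI application so it runs without any framework. The path prefixes, the app itself, and the response_headers helper are hypothetical; a real site would set the header in its web server or framework configuration.

```python
# Hypothetical path prefixes that should stay out of search results.
NOINDEX_PREFIXES = ("/account/", "/thank-you")


def app(environ, start_response):
    """Minimal WSGI app that adds X-Robots-Tag: noindex to chosen paths."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/html; charset=utf-8")]
    if path.startswith(NOINDEX_PREFIXES):
        # Crawlers only see this header if robots.txt lets them fetch the URL.
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"<!doctype html><title>Example page</title>"]


def response_headers(path):
    """Call the app directly and return its response headers as a dict."""
    captured = {}

    def start_response(status, headers):
        captured.update(headers)

    app({"PATH_INFO": path}, start_response)
    return captured


account_headers = response_headers("/account/orders")
blog_headers = response_headers("/blog/robots-guide")

print(account_headers.get("X-Robots-Tag"), blog_headers.get("X-Robots-Tag"))
```

For this to work, the noindexed paths must not also be disallowed in robots.txt. A page that is both disallowed and noindexed behaves like a page that is only disallowed.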

Frequently Asked Questions

Will blocking pages in robots.txt hurt my SEO?
Blocking low-value pages like admin panels, login pages, search results, and checkout pages will not hurt your SEO. It helps by focusing crawl budget on your important content. However, accidentally blocking high-value pages (product pages, blog posts, category pages) with an overly broad rule will prevent them from being crawled, which directly harms their search visibility. Always verify your rules before deploying and check that your most important URLs are not matched by any Disallow pattern. After deploying, the robots.txt report and URL Inspection tool in Google Search Console show how Google reads the live file (the older standalone robots.txt tester has been retired).
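
One way to run that check before deploying is to test a list of important URLs against the draft file. This is a minimal sketch using Python's urllib.robotparser; the draft rules, the URL list, and the blocked_urls helper are all hypothetical. The standard-library parser only does prefix matching and ignores * and $ wildcards, so rules that use them need a separate check.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given robots.txt text disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical draft with an overly broad rule: "/p" also matches "/products/".
DRAFT = """\
User-agent: *
Disallow: /admin/
Disallow: /p
"""

IMPORTANT_URLS = [
    "https://example.com/",
    "https://example.com/products/blue-widget",
    "https://example.com/blog/robots-guide",
]

problems = blocked_urls(DRAFT, IMPORTANT_URLS)
for url in problems:
    print("BLOCKED:", url)
```

A check like this fits in a deployment script: if the returned list is not empty, stop the release and fix the rule.
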
How quickly does Google re-read my robots.txt after changes?
Google generally caches your robots.txt file for up to 24 hours. If you make a change today, Googlebot may continue using the old cached version for up to a day. For urgent changes, such as when you have accidentally blocked your entire site, open the robots.txt report under Settings in Google Search Console. It shows when Google last fetched your file, lets you view the fetched version, and lets you request a recrawl.
Should I create a separate robots.txt section for Googlebot vs other crawlers?
Only if you want different crawl rules for different bots. The wildcard User-agent: * group applies to every crawler that does not have a more specific group of its own, so most sites only need one user-agent block. A common case for multiple blocks is when you want to block AI training crawlers (like GPTBot) from all content while keeping your rules for search engine crawlers unchanged. In that case, add a separate User-agent: GPTBot block with Disallow: / followed by the standard User-agent: * block with your normal rules.
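
The two-block layout described in this answer can be generated rather than written by hand, which helps when the list of blocked crawlers changes often. The following sketch assumes a hypothetical render_robots_txt helper; the rules in the wildcard group are placeholders for your own.

```python
def render_robots_txt(groups, sitemap=None):
    """Build robots.txt text from (user_agent, rules) pairs.

    Each rules entry is a (directive, path) pair such as
    ("Disallow", "/admin/"). Groups are separated by a blank line.
    """
    blocks = []
    for user_agent, rules in groups:
        lines = [f"User-agent: {user_agent}"]
        lines += [f"{directive}: {path}" for directive, path in rules]
        blocks.append("\n".join(lines))
    if sitemap:
        blocks.append(f"Sitemap: {sitemap}")
    return "\n\n".join(blocks) + "\n"


robots_txt = render_robots_txt(
    [
        # Block the AI training crawler from everything.
        ("GPTBot", [("Disallow", "/")]),
        # Placeholder rules for every crawler without its own group.
        ("*", [("Disallow", "/admin/"), ("Disallow", "/search")]),
    ],
    sitemap="https://yourdomain.com/sitemap.xml",
)

print(robots_txt)
```

Because GPTBot has its own group, it ignores the wildcard group entirely. That is harmless here, since Disallow: / already covers every path.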