The Complete Guide to robots.txt [2026]
The complete guide to robots.txt in 2026 covers every aspect of the Robots Exclusion Protocol: file syntax, directive types, pattern matching, crawl budget strategy, platform-specific deployment, testing methods, and the mistakes that cause crawl disasters. WikiPlus Robots Generator at wikiplus.co generates a valid robots.txt from a form — this guide provides the conceptual foundation to use it effectively for any site type or complexity level.
robots.txt Syntax: The Complete Reference
A valid robots.txt file is UTF-8 encoded plain text served at yourdomain.com/robots.txt with Content-Type: text/plain. The file contains one or more groups, each beginning with one or more User-agent lines followed by Disallow and Allow directives. User-agent: * targets all crawlers not covered by a more specific group. User-agent: Googlebot targets only Google's crawler, which then follows that group instead of the * group. Disallow: /path/ blocks every URL whose path begins with /path/. Allow: /path/ permits a specific path within a broader disallow; when both rules match, Google and RFC 9309 apply the longest (most specific) one. Sitemap: https://example.com/sitemap.xml appears as a standalone line, outside user-agent groups, and must be an absolute URL. Lines beginning with # are comments and are ignored by crawlers. Blank lines conventionally separate groups. Ending the file with a newline character is good practice. Keep the file under 500 KiB: Google ignores any content beyond that limit.
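As a rough illustration of how groups, Allow/Disallow, and the Sitemap line fit together, the sketch below parses a small file with Python's standard-library urllib.robotparser. The file contents, the example.com host, and the OtherBot agent name are made up for this example. Note that this parser applies the first matching rule in file order rather than the longest one, so the Allow line is placed before the broader Disallow to get the same result either way.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file contents for illustration; example.com is a placeholder.
ROBOTS_TXT = """\
# Default group: applies to crawlers that have no group of their own
User-agent: *
Allow: /admin/public/
Disallow: /admin/

# Specific group: Googlebot follows this group and ignores the * group
User-agent: Googlebot
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Ask the parser whether `agent` may fetch `path` on the placeholder host."""
    return parser.can_fetch(agent, "https://example.com" + path)


other_admin = allowed("OtherBot", "/admin/settings")      # blocked by the * group
other_public = allowed("OtherBot", "/admin/public/help")  # permitted by Allow
google_admin = allowed("Googlebot", "/admin/settings")    # * group does not apply
google_private = allowed("Googlebot", "/private/report")  # blocked by its own group
sitemaps = parser.site_maps()                              # Sitemap lines (Python 3.8+)

print(other_admin, other_public, google_admin, google_private, sitemaps)
```

The Googlebot result for /admin/settings is the one that tends to surprise people: once a crawler has its own group, the rules in the * group no longer apply to it.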
Pattern Matching and Wildcards
Googlebot, like other crawlers that follow RFC 9309, supports two special characters in path patterns. The asterisk (*) matches any sequence of characters, including none. Disallow: /search/* blocks /search/, /search/query, and /search/q=test equally; because rules are prefix matches, the trailing asterisk is redundant and Disallow: /search/ behaves the same way. Disallow: /*?* blocks any URL containing a query string. Disallow: /*.pdf$ uses the dollar sign to block only URLs ending in .pdf, since the dollar sign anchors the match to the end of the URL string. Without it, /*.pdf would also match /file.pdf?download=1. These two special characters cover the vast majority of real-world pattern needs. Important: the asterisk in User-agent: * has a completely different meaning. It is not a pattern but a literal token meaning all crawlers. Wildcards in User-agent lines are not supported by the spec, though some crawlers may handle them.
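The matching rules described here can be sketched in a few lines of Python by translating a path pattern into a regular expression. The function name rule_matches is made up for this example, and the sketch covers only the two special characters; it is not a full robots.txt evaluator and does not decide precedence between competing rules.

```python
import re


def rule_matches(pattern: str, url_path: str) -> bool:
    """Return True if a robots.txt path pattern matches a URL path plus query string."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then let each * match any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        # $ means the pattern must consume the whole URL string.
        return re.fullmatch(regex, url_path) is not None
    # Otherwise the pattern only has to match a prefix of the URL string.
    return re.match(regex, url_path) is not None


examples = [
    ("/search/*", "/search/q=test"),
    ("/*?*", "/products?color=red"),
    ("/*?*", "/products"),
    ("/*.pdf$", "/docs/guide.pdf"),
    ("/*.pdf$", "/docs/guide.pdf?download=1"),
]
for pattern, path in examples:
    print(f"{pattern:12} {path:30} {rule_matches(pattern, path)}")
```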
Crawl Budget Strategy by Site Type
Match the amount of blocking to the size and type of the site:
- Small informational sites (under 100 pages): a minimal robots.txt is enough. Allow everything and declare the sitemap.
- Medium blogs (100-1,000 pages): block the WordPress admin area, search result pages, and tag/category archives if they create duplicate content. Block social referral tracking parameters (utm_ params) if they generate unique crawlable URLs.
- Large e-commerce sites (over 1,000 pages): block parameters for all faceted navigation (color, size, sort, filter), session ID parameters, tracking parameters, and comparison and wishlist pages. Block pagination beyond page 3 only when the listed items remain reachable through the sitemap or other crawlable links; rel=next/prev is no longer an alternative for Google, which stopped using it as an indexing signal in 2019.
- News and media sites: block archive pages older than 3-6 months if that content is no longer updated, block internal search, and block author pages if they are thin.
Use robots.txt in combination with canonical tags and XML sitemaps for maximum crawl efficiency.
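For the e-commerce case, the parameter blocking described above could be generated rather than typed by hand. The sketch below is one possible approach, not a recommended rule set: the function name, the path and parameter names, and the example.com sitemap URL are all placeholders to replace with values from your own site.

```python
def build_robots_txt(blocked_paths, blocked_params, sitemap_url):
    """Build a single-group robots.txt that blocks paths and query parameters."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {path}" for path in blocked_paths]
    for param in blocked_params:
        # Two rules per parameter: one for when it is the first query
        # parameter (?color=) and one for when it follows another (&color=).
        lines.append(f"Disallow: /*?{param}=")
        lines.append(f"Disallow: /*&{param}=")
    lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"  # end the file with a newline


# Hypothetical values for illustration only.
robots_txt = build_robots_txt(
    blocked_paths=["/wishlist/", "/compare/"],
    blocked_params=["color", "size", "sort", "sessionid"],
    sitemap_url="https://example.com/sitemap.xml",
)
print(robots_txt)
```

Writing the parameter rules as ?name= and &name= rather than a looser /*?*name= keeps a rule for color from also catching an unrelated parameter such as bgcolor.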
The robots.txt Maintenance Checklist
Perform these checks quarterly and after any major site change. Fetch yourdomain.com/robots.txt and confirm the content matches your intended configuration. Open the robots.txt report in Google Search Console (under Settings; it replaced the standalone robots.txt tester) and verify the file was fetched recently with a 200 status. Test 5-10 key URLs that should be crawlable, using the URL Inspection tool or a robots.txt parser, and confirm they show as allowed. Test 3-5 URLs that should be blocked and confirm they are disallowed by the expected rule. Check the Page indexing report (formerly Coverage) for any sudden changes in blocked pages. After any CMS update, theme change, or plugin installation, re-verify the robots.txt, since these changes frequently overwrite it. Keep a version history of your robots.txt changes in a simple document or your repository changelog.
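The allowed/blocked URL checks in this list lend themselves to a small regression script. The sketch below uses Python's standard-library urllib.robotparser; the function name audit_robots, the sample file, and the example.com URLs are assumptions for illustration. This parser does plain prefix matching and does not interpret * or $ inside paths, so the sketch only suits files built from prefix rules.

```python
from urllib.robotparser import RobotFileParser


def audit_robots(robots_txt, must_allow, must_block, agent="*"):
    """Return a list of URLs whose crawl status differs from what is expected."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    problems = []
    for url in must_allow:
        if not parser.can_fetch(agent, url):
            problems.append(f"unexpectedly blocked: {url}")
    for url in must_block:
        if parser.can_fetch(agent, url):
            problems.append(f"unexpectedly allowed: {url}")
    return problems


# Hypothetical file and URL lists; in practice, fetch the live file first.
intended = "User-agent: *\nDisallow: /wp-admin/\nDisallow: /search/\n"
overwritten = "User-agent: *\nDisallow: /\n"  # e.g. left behind by a staging setup

must_allow = ["https://example.com/", "https://example.com/blog/post"]
must_block = ["https://example.com/wp-admin/options.php",
              "https://example.com/search/shoes"]

print(audit_robots(intended, must_allow, must_block))
print(audit_robots(overwritten, must_allow, must_block))
```

Running a check like this after each deployment, with the URL lists kept in the repository next to the robots.txt itself, gives the version history and the verification step a shared home.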
Frequently Asked Questions
- Is robots.txt required for a website?
- No, robots.txt is not required. If no robots.txt file exists at your domain root, crawlers treat the entire site as accessible. For simple websites with no admin areas, no duplicate parameter URLs, and no sensitive paths, a missing robots.txt causes no SEO harm. The Sitemap directive in robots.txt is the main reason to have one even on simple sites — it helps crawlers discover your XML sitemap. For WordPress and e-commerce sites, a configured robots.txt is strongly recommended to manage crawl budget.
- Can robots.txt protect sensitive data?
- No. Robots.txt is a public file — anyone can read it by visiting yourdomain.com/robots.txt. Paths listed in Disallow rules are visible to anyone who reads the file. This means security by obscurity via robots.txt is ineffective — it can actually advertise the existence of sensitive paths. For truly sensitive data, use server-side authentication or access control. Robots.txt should be thought of as a crawl efficiency tool, not a security mechanism.
- What is the difference between robots.txt and the noindex tag?
- Robots.txt controls crawling: whether a search engine fetches a URL at all. The noindex meta tag controls indexing: whether a crawled page appears in search results. A disallowed URL can still be indexed, usually without a description, if other pages link to it. A noindex tag prevents a page from appearing in results but requires the page to be crawlable so the tag can be read. For pages that should not appear in search results, the noindex tag is the correct tool. For pages that should not be crawled at all (saving crawl budget, keeping crawlers out of admin interfaces), robots.txt Disallow is correct. Never combine both on the same page, because the noindex tag on a disallowed page will never be seen.