The Best robots.txt Configuration for SEO in 2026

By the WikiPlus Editorial Team

Researched with the help of AI tools, edited and reviewed for accuracy by Sergio Robles (Founder, WikiPlus).

Published December 17, 2024. Last reviewed May 23, 2026.

The best robots.txt configuration for SEO in 2026 balances open crawl access to your most important content with strategic blocking of duplicate, low-value, and utility URLs. Getting this balance wrong in either direction costs rankings: blocking too much hides indexable content from crawlers; blocking too little spends crawl budget on pages that add no search value. WikiPlus Robots Generator at wikiplus.co helps you build the correct configuration for your site type. This guide covers best-practice configurations for common site types.

The Minimal Valid robots.txt Most Sites Need

Many sites overcomplicate robots.txt. For a simple blog or brochure website with no admin access, no dynamic search, and no duplicate parameter URLs, the optimal robots.txt is only a few lines: User-agent: *, an empty Disallow: line, a blank line, and Sitemap: https://yourdomain.com/sitemap.xml. The empty Disallow: blocks nothing and makes the allow-all intent explicit. This allows all crawlers full access and points them to your sitemap; nothing more is needed. Adding unnecessary Disallow rules on simple sites creates risk without benefit. Only add complexity when you have a specific problem to solve: admin areas accessible to crawlers, faceted navigation creating thousands of duplicate parameter URLs, or a large site where crawl budget is measurably insufficient.
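As a quick sanity check, the minimal file can be parsed with Python's standard-library urllib.robotparser. This is a sketch, not a definitive validator: yourdomain.com is the placeholder domain used in this guide, the test URL is made up, and the file includes an explicit empty Disallow: line, which blocks nothing.

```python
from urllib.robotparser import RobotFileParser

# Minimal allow-all robots.txt. The empty "Disallow:" blocks nothing.
MINIMAL_ROBOTS = """\
User-agent: *
Disallow:

Sitemap: https://yourdomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(MINIMAL_ROBOTS.splitlines())

# Any URL on the site should be fetchable by any crawler.
everything_allowed = parser.can_fetch("*", "https://yourdomain.com/blog/any-post/")

# site_maps() (Python 3.8+) returns the Sitemap URLs declared in the file.
sitemaps = parser.site_maps()

print(everything_allowed, sitemaps)
```

If the check reports a blocked URL or a missing sitemap for a file this small, the usual cause is a stray Disallow: / line or a typo in the Sitemap line.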

Best Practice robots.txt for WordPress Sites

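A well-configured WordPress robots.txt typically includes: Disallow: /wp-admin/ paired with Allow: /wp-admin/admin-ajax.php (themes and plugins call this endpoint from the front end), Disallow: /?s= (internal search result pages), and Disallow: /xmlrpc.php (this keeps well-behaved crawlers off the endpoint; robots.txt does not secure it, so restrict it at the server if security is the concern). Disallow: /wp-includes/ appears in many older configurations, but that directory also serves core JavaScript and CSS used on the front end, so blocking it can interfere with rendering for the same reason as blocking themes and plugins; leave it crawlable unless you have a specific reason to block it. Leave crawlable: your pages, posts, categories, tags (unless they duplicate page content), and images. Include a Sitemap directive pointing to your sitemap. If you use Yoast SEO, it auto-generates a sitemap index at /sitemap_index.xml; use that URL. Do not disallow /wp-content/uploads/, because your media files need to be crawlable for image search. Do not disallow /wp-content/themes/ or /wp-content/plugins/, because Google fetches the JavaScript and CSS in these directories to render your pages.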
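The WordPress rules above can be assembled into one file and spot-checked with Python's standard-library parser. This is a sketch under assumptions: yourdomain.com and the sample paths are placeholders, the sitemap URL assumes Yoast SEO, and the file leaves out /wp-includes/ so that core scripts and styles stay fetchable for rendering. The Allow line is placed before the Disallow line because Python's parser applies the first matching rule, while Google applies the most specific one; this ordering gives the same result under both.

```python
from urllib.robotparser import RobotFileParser

# Example WordPress robots.txt (placeholder domain, Yoast sitemap assumed).
WORDPRESS_ROBOTS = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
Disallow: /?s=
Disallow: /xmlrpc.php

Sitemap: https://yourdomain.com/sitemap_index.xml
"""

parser = RobotFileParser()
parser.parse(WORDPRESS_ROBOTS.splitlines())


def allowed(path: str) -> bool:
    """Return True if a generic crawler may fetch the given path."""
    return parser.can_fetch("*", "https://yourdomain.com" + path)


# Spot-check one URL per rule, plus content that must stay crawlable.
for path in (
    "/wp-admin/admin-ajax.php",
    "/wp-admin/options.php",
    "/xmlrpc.php",
    "/wp-content/uploads/photo.jpg",
    "/wp-content/themes/mytheme/style.css",
):
    print(path, "allowed" if allowed(path) else "blocked")
```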

Best Practice robots.txt for E-Commerce Sites

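E-commerce sites have more complex crawl management needs. Common blocks: /cart/, /checkout/, /account/, and /wishlist/ (per-user pages with no search value); /search/ and /compare/; and faceted navigation parameters such as sort, filter, color, and size. Important to keep crawlable: all product pages at canonical URLs, category pages at canonical URLs, blog and content pages, and sitemap files. For large product catalogs, ensure faceted navigation parameters are blocked comprehensively: an unblocked parameter like ?color=red can generate thousands of near-duplicate category URLs that consume crawl budget. Rules match from the start of the URL path, so Disallow: /?color= only covers the homepage; use the * wildcard to match the parameter on any path: Disallow: /*?color= and Disallow: /*&color= (the second catches the parameter when it is not first in the query string). Do not end these patterns with the $ anchor: $ means the URL must end at that point, so Disallow: /*?color=$ matches only URLs that end in an empty color= value and leaves /shoes?color=red crawlable. Reserve $ for exact endings such as Disallow: /*.pdf$.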
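Wildcard patterns are easy to get subtly wrong, so it helps to test what a pattern matches before deploying it. The sketch below is a simplified illustration of the * and $ matching that major crawlers such as Googlebot support; the function name rule_matches and the sample URLs are hypothetical, and it is not a full robots.txt parser (Python's built-in urllib.robotparser does not handle these wildcards).

```python
import re


def rule_matches(pattern: str, url_path: str) -> bool:
    """Return True if a robots.txt path pattern matches url_path.

    url_path is the path plus query string, e.g. "/shoes?color=red".
    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn the escaped '*' into '.*'.
    regex = re.escape(pattern).replace(r"\*", ".*")
    if anchored:
        regex += "$"
    # re.match anchors at the start, like robots.txt prefix matching.
    return re.match(regex, url_path) is not None


checks = [
    ("/*?color=", "/shoes?color=red"),          # parameter is first
    ("/*?color=", "/shoes?size=9&color=red"),   # parameter is not first
    ("/*&color=", "/shoes?size=9&color=red"),   # companion rule catches it
    ("/*?color=$", "/shoes?color=red"),         # '$' requires the URL to end here
    ("/cart/", "/cart/items"),                  # plain prefix match
]
for pattern, url in checks:
    print(f"{pattern!r} vs {url!r}: {rule_matches(pattern, url)}")
```

Running candidate patterns against a handful of real category and filter URLs from your own site is a cheap way to catch a rule that blocks too much or nothing at all.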

Monitoring robots.txt Over Time

Robots.txt requires ongoing maintenance. CMS updates, plugin changes, and migrations can silently overwrite or corrupt your robots.txt. Set up a monthly monitoring check: visit yourdomain.com/robots.txt and confirm your expected rules are present. Use a free uptime monitoring service to alert you if the robots.txt URL returns anything other than a 200 status code. Watch the Page indexing report (formerly Coverage) in Google Search Console: a spike in pages reported as blocked by robots.txt, or in crawl errors generally, often indicates a robots.txt misconfiguration. After any significant site change (platform migration, subdomain restructure, content reorganisation), review robots.txt completely and use WikiPlus Robots Generator to rebuild it from scratch if needed.
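The monthly check can be scripted. The following is a minimal sketch using only Python's standard library; the function names, the expected-rules list, and the URL are illustrative assumptions to adapt to your own site, and scheduling and alerting are left to whatever tooling you already use.

```python
import urllib.error
import urllib.request


def fetch_robots(url: str, timeout: float = 10.0) -> tuple[int, str]:
    """Fetch robots.txt and return (HTTP status, body text)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status, body
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; report the status code.
        return err.code, ""


def find_problems(status: int, body: str, expected_lines: list[str]) -> list[str]:
    """Compare a fetched robots.txt against the rules you expect to see."""
    if status != 200:
        return [f"unexpected HTTP status {status}"]
    present = {line.strip() for line in body.splitlines()}
    return [f"missing rule: {rule}" for rule in expected_lines if rule not in present]


# Example usage (hypothetical URL and rules; run from cron or a CI job):
# status, body = fetch_robots("https://yourdomain.com/robots.txt")
# for problem in find_problems(status, body, ["Disallow: /wp-admin/"]):
#     print(problem)
```

Keeping the comparison separate from the fetch makes it easy to test the rule list against a saved copy of the file before pointing the script at the live site.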

Frequently Asked Questions

Should I include the Sitemap directive in robots.txt?

Yes, always. The Sitemap directive in robots.txt tells crawlers where to find your XML sitemap. Add it as a standalone line with an absolute URL, by convention at the end of the file: Sitemap: https://yourdomain.com/sitemap.xml. It does not belong to any user-agent block; it applies globally to all crawlers. Even though Google also accepts sitemaps submitted through Search Console, the robots.txt declaration ensures every crawler that reads your robots.txt also finds your sitemap.

How do I handle robots.txt for a multilingual site?

For a multilingual site hosted on one domain with language subdirectories (e.g., /en/, /fr/, /de/), one robots.txt file at the domain root covers all language versions. Add a Sitemap directive pointing to a sitemap index that covers all language variants. For sites split across country-code top-level domains (example.fr) or subdomains (fr.example.com), each domain or subdomain needs its own robots.txt, because a robots.txt file applies only to the host it is served from. Generate separate configurations using WikiPlus Robots Generator for each.

What is the maximum size of a robots.txt file?

Google enforces a robots.txt size limit of 500 KiB (512,000 bytes). Content beyond that limit is ignored, so rules past the 500 KiB mark have no effect. Bing has a similar limit. In practice, legitimate robots.txt files are rarely close to this limit unless they contain very large numbers of individual Disallow rules. If you need to keep thousands of specific URLs out of the index, consider using the noindex meta tag on those pages rather than listing every URL in robots.txt. Those pages must stay crawlable, because a crawler that is blocked by robots.txt never sees the noindex tag.
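The size limit is simple arithmetic: 500 KiB is 500 × 1,024 = 512,000 bytes. A small check like the sketch below (the constant and function names are illustrative) can run as part of a deployment step for sites that generate robots.txt automatically.

```python
# Google's documented robots.txt limit: 500 KiB = 500 * 1024 bytes.
GOOGLE_ROBOTS_LIMIT_BYTES = 500 * 1024


def bytes_over_limit(robots_txt: str) -> int:
    """Return how many bytes the file exceeds the limit by (0 if within it)."""
    size = len(robots_txt.encode("utf-8"))
    return max(0, size - GOOGLE_ROBOTS_LIMIT_BYTES)


print(GOOGLE_ROBOTS_LIMIT_BYTES)
print(bytes_over_limit("User-agent: *\nDisallow:\n"))
```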
