WikiPlus

Robots.txt for E-commerce: What to Allow and Block

E-commerce sites face unique robots.txt challenges. Faceted navigation creates thousands of URL parameter combinations. Cart and checkout pages should never be indexed. Sorting and filtering URLs duplicate product listings. Without a well-configured robots.txt, your crawl budget disappears into low-value URLs while your actual product pages get crawled less frequently. This guide covers the complete robots.txt strategy for e-commerce, including a ready-to-use template and explanations for each rule.

The Unique robots.txt Challenges of E-commerce

E-commerce sites generate far more URL patterns than brochure websites. A clothing store with 500 products, 10 categories, and faceted navigation filters (size, color, price range, brand) can easily generate tens of thousands of distinct URLs, most of which serve very similar content.

Faceted navigation is the biggest challenge. When a user clicks filters on a category page (/shoes/womens/?color=black&size=7&brand=nike), the URL changes but the content is a subset or variation of an existing category page. Crawling every possible filter combination consumes enormous crawl budget and creates near-duplicate content in the index. For a large store with many filter dimensions, the URL space can be virtually unlimited.

Sorting and pagination create similar problems. /products/bestsellers?sort=price_asc produces essentially the same product set as /products/bestsellers?sort=price_desc, but under a different URL. Paginated category pages beyond the first two or three pages have minimal SEO value, yet many of them still get crawled.

Transactional pages (cart, checkout, order confirmation, account pages) should never appear in search results. A checkout page ranking for a transactional query is useless to users who are not already mid-purchase on your site. Keep in mind that a Disallow rule stops crawling, not indexing: a blocked URL that is linked from elsewhere can still appear in results without a description, so pages that must stay out of the index need a noindex directive or authentication as well.

Stock management creates URL churn. Products go in and out of stock, are discontinued, or change URLs. Without proper robots.txt and redirect handling, Googlebot wastes crawl budget on unavailable product URLs.

A well-configured robots.txt file for e-commerce focuses crawl budget on category pages, product pages, and important landing pages: the URLs that actually drive search traffic and revenue.
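To make the scale of faceted navigation concrete, here is a rough back-of-the-envelope sketch. The number of values per filter is an assumption chosen for illustration, not data from a real store, and the function name is hypothetical. It also assumes each filter takes at most one value and that parameter order is fixed; multi-select filters or reordered parameters would make the real URL space larger.

```python
from math import prod


def count_filter_urls(categories: int, values_per_filter: dict[str, int]) -> int:
    """Count category URLs when each filter is either unset or set to one value."""
    # Each filter contributes (n + 1) states: its n values plus "not applied".
    states_per_category = prod(n + 1 for n in values_per_filter.values())
    return categories * states_per_category


# Hypothetical filter sizes for the clothing store described above.
filters = {"size": 6, "color": 8, "price_range": 4, "brand": 10}
total = count_filter_urls(categories=10, values_per_filter=filters)
print(total)  # 34650 (includes the 10 unfiltered category pages)
```

Even with these modest numbers, ten category pages turn into tens of thousands of crawlable URLs before sorting and pagination parameters are added.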

What to Block in an E-commerce robots.txt

Here is a list of URL patterns typically worth blocking on e-commerce sites.

Faceted navigation and filter parameters:

Disallow: /*?color=
Disallow: /*?size=
Disallow: /*?sort=
Disallow: /*?ref=
Disallow: /*?from=

These patterns match only when the named parameter comes first, directly after the "?". To catch the same parameter in later positions, add a matching "&" rule for each one, for example Disallow: /*&color=

Alternatively, block all parameters:

Disallow: /*?*

(Use this with caution, and only if ALL parameter URLs on your site are low-value.)

Cart, checkout, and account flows:

Disallow: /cart
Disallow: /cart/
Disallow: /checkout
Disallow: /checkout/
Disallow: /account/
Disallow: /my-account/
Disallow: /order-confirmation/
Disallow: /wishlist/

Internal search results:

Disallow: /search
Disallow: /search/
Disallow: /*?q=
Disallow: /*?query=
Disallow: /*?s=

Admin and technical pages:

Disallow: /admin/
Disallow: /wp-admin/   # WordPress
Disallow: /backend/
Disallow: /login
Disallow: /register

Pagination parameters:

Disallow: /*?page=

(This rule blocks every paginated URL, page 2 included, because robots.txt patterns cannot express "beyond page 2". If you want Google to crawl paginated pages, omit it and monitor crawl budget instead.)

Duplicate URL variations:

Disallow: /products/*/print
Disallow: /products/*/embed
Disallow: /collections/*/print
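A rule such as Disallow: /*?color= depends on where the parameter sits in the query string, which is easy to overlook. The sketch below is a simplified matcher for the wildcard syntax described in RFC 9309 ("*" matches any run of characters, a trailing "$" anchors the end of the URL). It is an illustration only: the function names are hypothetical, it ignores percent-encoding normalization, and it is not how any particular crawler is implemented. Python's built-in urllib.robotparser is not used here because it does not interpret these wildcards.

```python
import re


def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text and turn each '*' into "any run of characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern: str, url_path: str) -> bool:
    """Return True if the pattern matches from the start of the path."""
    return rule_to_regex(pattern).match(url_path) is not None


urls = [
    "/shoes/womens/?color=black&size=7",   # color is the first parameter
    "/shoes/womens/?size=7&color=black",   # color comes second
]
for rule in ("/*?color=", "/*&color="):
    for url in urls:
        print(f"{rule:12} {url:40} {matches(rule, url)}")
```

Running this shows that /*?color= matches only the first URL and /*&color= only the second, so covering a filter parameter in every position takes both rules.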

E-commerce robots.txt Template

Here is a complete, annotated robots.txt template for a typical e-commerce site. Adjust the paths to match your platform's URL structure.

User-agent: *

# Admin and system pages
Disallow: /admin/
Disallow: /backend/
Disallow: /login
Disallow: /register
Disallow: /wp-admin/

# Transactional pages
Disallow: /cart
Disallow: /checkout
Disallow: /order-confirmation/
Disallow: /account/
Disallow: /my-account/
Disallow: /wishlist/

# Internal search results
Disallow: /search
Disallow: /*?q=
Disallow: /*?s=

# Filter and sort parameters
Disallow: /*?sort=
Disallow: /*?color=
Disallow: /*?size=
Disallow: /*?brand=
Disallow: /*?min_price=
Disallow: /*?max_price=

# Pagination (remove this if you want deep pages crawled)
Disallow: /*?page=

# Print and embed variants
Disallow: /*/print
Disallow: /*/embed

# WooCommerce/Shopify specific
Disallow: /wp-json/
Disallow: /cdn/shop/t/
Disallow: /.well-known/

# Allow important bot assets
Allow: /wp-admin/admin-ajax.php

Sitemap: https://yourstore.com/sitemap.xml
Sitemap: https://yourstore.com/sitemap-products.xml
Sitemap: https://yourstore.com/sitemap-categories.xml

This template keeps the essential product and category pages freely crawlable while preventing crawlers from wasting budget on transactional, duplicate, and low-value URLs.
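Before deploying a template like this, it can help to check a handful of real URLs against it. The sketch below assumes the precedence described in RFC 9309 and in Google's documentation: the longest matching pattern wins, and Allow wins a tie. The helper names, the shortened rule list, and the sample URLs are illustrative assumptions, and percent-encoding is ignored.

```python
import re


def to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern ('*' wildcard, trailing '$') to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules: list[tuple[str, str]], url_path: str) -> bool:
    """Longest matching pattern wins; Allow wins ties; no match means allowed."""
    best = None  # (pattern length, is_allow)
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if to_regex(pattern).match(url_path):
            candidate = (len(pattern), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]


# A few rules taken from the template above.
RULES = [
    ("Disallow", "/wp-admin/"),
    ("Allow", "/wp-admin/admin-ajax.php"),
    ("Disallow", "/cart"),
    ("Disallow", "/*?sort="),
]

for path in (
    "/wp-admin/admin-ajax.php",
    "/wp-admin/options.php",
    "/products/blue-shirt",
    "/shoes/?sort=price_asc",
    "/cartoon-socks/",
):
    print(f"{path:28} {'allowed' if is_allowed(RULES, path) else 'blocked'}")
```

The last sample URL is the one to watch: Disallow: /cart is a prefix match, so it also blocks a hypothetical /cartoon-socks/ category. If your catalog has paths that begin with the same letters as a blocked prefix, a narrower pair such as /cart$ and /cart/ may be a safer choice.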

Monitoring Crawl Budget on E-commerce Sites

After configuring your robots.txt, ongoing monitoring ensures it is working as intended and catches new URL patterns that need to be addressed.

Google Search Console Page indexing report (formerly Coverage): In the list of reasons why pages are not indexed, look for 'Blocked by robots.txt'. This shows which URLs Google knows about but is not crawling due to your Disallow rules. Review this list periodically to confirm it contains only low-value URLs (parameter pages, cart pages) and not important product or category pages.

Google Search Console Crawl Stats: Found under Settings in Search Console, this report shows how many pages Googlebot crawls per day and the file types crawled. Look for a healthy proportion of HTML pages vs other resources. If the crawl count seems low relative to your site size, check whether important pages are being blocked.

Server access logs: Your web server logs every request, including crawler requests. Filtering logs for Googlebot user-agent requests and analyzing the URLs accessed gives you a ground-truth view of what is being crawled regardless of what robots.txt specifies. This is particularly useful for identifying URLs that should be blocked but are not yet covered by your rules.

Screaming Frog or Sitebulb: These crawler tools can simulate how Googlebot views your site. Configure them to respect your robots.txt and audit which URLs are being included vs excluded. This lets you catch any Disallow rules that are too broad (blocking valuable pages) or too narrow (missing duplicate URL patterns).

After any major site changes (URL structure updates, new filter options, new page types), review your robots.txt to ensure the new URLs are handled correctly.
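The server-log check can be sketched in a few lines. This example assumes the common "combined" access log format used by Apache and Nginx; the regular expression, function name, and sample lines are illustrative and will need adjusting to your server's actual format. It counts which query parameter names appear in requests that identify themselves as Googlebot, which is a quick way to spot filter or tracking parameters your rules do not cover yet.

```python
import re
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

# Request line, status, size, referrer, user agent (combined log format assumed).
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<target>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_parameter_counts(lines):
    """Count query parameter names in requests whose user agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.search(line)
        if not m or "Googlebot" not in m.group("agent"):
            continue
        query = urlsplit(m.group("target")).query
        for name, _value in parse_qsl(query, keep_blank_values=True):
            counts[name] += 1
    return counts


GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [10/Jan/2025:10:00:00 +0000] "GET /shoes/?color=black&size=7 HTTP/1.1" 200 5123 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [10/Jan/2025:10:00:05 +0000] "GET /products/bestsellers?sort=price_asc HTTP/1.1" 200 4310 "-" "{GOOGLEBOT}"',
    '203.0.113.9 - - [10/Jan/2025:10:00:09 +0000] "GET /shoes/?color=red HTTP/1.1" 200 5001 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]
for name, hits in googlebot_parameter_counts(sample).most_common():
    print(name, hits)
```

In practice you would pass an open log file instead of the sample list. Because the user-agent string can be spoofed, treat the counts as indicative; Google recommends confirming genuine Googlebot traffic with a reverse DNS lookup on the requesting IP.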

Frequently Asked Questions

Should I block all product filter URLs from Google?
Not necessarily — it depends on whether filtered pages have unique search value. A 'red running shoes' filtered page might rank for that specific query if it has a meaningful number of products and unique content. Blocking all filter pages removes the potential for these long-tail category pages to rank. However, if filters produce near-identical content (sorting by price vs relevance showing the same products) or very thin results (one product), blocking is better. Analyze your filter pages for traffic in Search Console before deciding — some may already be generating organic traffic worth keeping.
How do I handle products that go out of stock or are discontinued?
For temporarily out-of-stock products, keep the page live with a clear 'out of stock' message, maintain the URL, and do not noindex or block it, so that Google can keep ranking the page for the product name. For permanently discontinued products, implement a 301 redirect to the most relevant replacement product or category page. This passes link equity and sends users to a useful alternative. For products with significant link equity, a 301 redirect is much better than returning a 404 or adding a Disallow rule in robots.txt, both of which waste the link equity the page has accumulated.
Should I add multiple sitemaps to robots.txt for a large e-commerce site?
Yes. Specifying multiple sitemaps in robots.txt is perfectly valid and recommended for large sites. You can add as many Sitemap: directives as needed. Organizing sitemaps by content type (products, categories, blog posts) or by update frequency (frequently updated vs static pages) makes them easier to maintain and makes it easier to see in Search Console which group of URLs has indexing problems. Each sitemap should follow the sitemaps.org XML format and be listed individually in robots.txt with its full absolute URL.