Robots.txt vs Noindex: Which Should You Use?
Robots.txt Disallow and the noindex meta tag are both used to keep pages out of search results, but they work at fundamentally different levels and suit different situations. Confusing them is one of the most common technical SEO mistakes, and it often means content that should be hidden stays visible, or content you want found gets blocked. This guide explains what each mechanism does, when to use each, and what happens when you combine them incorrectly.
How robots.txt Disallow Works
When you add a Disallow rule to robots.txt, you are telling crawlers not to fetch that URL. If a crawler respects the rule, as all major search engine bots do, it will not download the page content. It will not process the HTML, read the meta tags, follow the links, or index the text.

However, Disallow does not make the URL invisible to Google. If other pages on the web link to a disallowed URL, Google learns about the URL through those links. It can still record the URL as a known address and list it in search results, usually with no description or snippet and often as just the bare URL. Google Search Console reports such URLs as "Indexed, though blocked by robots.txt".

This behavior has important implications. If you disallow a URL hoping to keep it out of search results, but external sites link to it, you may still see it appear as a bare URL in Google. The noindex directive is the correct tool for preventing indexation.

The primary purpose of robots.txt Disallow is access control at the crawl level: preventing crawlers from accessing pages that have no reason to ever be crawled. Examples: admin panels, internal dashboards, API endpoints, checkout flows, login pages, and staging directories. These pages should not be crawled, and blocking them preserves crawl budget for pages that matter.

Disallow also keeps compliant crawlers from reading the content of a page, but it is not a security measure. A Disallow rule stops compliant crawlers; it does not stop direct access by users or malicious bots.
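As a rough illustration of how a compliant crawler evaluates Disallow rules, the sketch below feeds a small robots.txt to Python's standard-library urllib.robotparser. The rules and the example.com URLs are made up for this example and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are assumptions for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /checkout/
"""

def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# A compliant crawler asks this question before fetching each URL.
print(parser.can_fetch("Googlebot", "https://example.com/admin/users"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))    # True
```

The check only answers "may I fetch this?". It says nothing about whether the URL can still be listed in search results, which is the gap described above.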
How noindex Works
The noindex directive takes a different approach. Instead of preventing crawling, it allows the crawler to fetch the page normally but instructs it not to add the page to its search index.

noindex is delivered in one of two ways:
1. HTML meta tag in the page head: <meta name="robots" content="noindex" /> This works for all major search engines and is the most common implementation.
2. HTTP response header: X-Robots-Tag: noindex This is used for non-HTML files (PDFs, images) or when you prefer to set directives at the server level rather than in page HTML.

When Googlebot fetches a page and finds noindex, it crawls the page (follows its links, processes its HTML) but excludes it from the search index. The URL will not appear in Google Search results.

Crucially, noindex requires the page to be crawlable. If you block the page in robots.txt, Googlebot cannot read the noindex directive because it cannot access the page at all. This is why the combination of robots.txt Disallow + noindex on the same URL is counterproductive: the noindex is never seen.

noindex is the right tool for pages that are publicly accessible but should not appear in search results: thin content, duplicate pages, landing pages for specific ad campaigns, user account pages, internal site search results, or print versions of articles. These pages can be visited by users who have the URL, but they do not belong in the public search index.
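To make the two delivery methods concrete, here is a minimal sketch that checks an already-fetched response for noindex in either place. The names RobotsMetaParser and has_noindex are hypothetical helpers written for this example, and the parsing is deliberately simplified: it ignores crawler-specific meta names and user-agent prefixes inside X-Robots-Tag values.

```python
from html.parser import HTMLParser
from typing import Mapping

def split_directives(value: str) -> set:
    """Turn 'noindex, nofollow' into {'noindex', 'nofollow'}."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}

class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            self.directives |= split_directives(attr.get("content", ""))

def has_noindex(html: str, headers: Mapping[str, str]) -> bool:
    """True if noindex appears in the robots meta tag or the X-Robots-Tag header."""
    meta = RobotsMetaParser()
    meta.feed(html)
    # HTTP header names are case-insensitive, so compare in lowercase.
    header_value = next(
        (v for k, v in headers.items() if k.lower() == "x-robots-tag"), ""
    )
    return "noindex" in meta.directives or "noindex" in split_directives(header_value)

page = '<html><head><meta name="robots" content="noindex" /></head></html>'
print(has_noindex(page, {}))                           # meta tag route
print(has_noindex("", {"X-Robots-Tag": "noindex"}))    # header route, e.g. for a PDF
```

Either signal can only be read if the crawler is allowed to fetch the response in the first place.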
Decision Framework: Which One to Use
Use this framework to choose the right mechanism for any page.

Use robots.txt Disallow when:
- The page should never be crawled by anyone (admin area, internal API, login form)
- The resource has no HTML head and you cannot set an X-Robots-Tag response header for it (data endpoints, binary files served from certain paths)
- You want to reduce crawl budget consumed by hundreds or thousands of low-value URLs (parameter-based duplicates, old archive pages)
- The page should not be accessible to crawlers for any reason (staging environment, internal tool)

Use noindex when:
- The page is publicly accessible but should not appear in search results (thank-you pages, search results pages, user account pages)
- You want the page to be crawlable (so its links are followed and its canonical tags or alternate tags are processed) but not indexed
- You have duplicate pages and want the non-canonical version excluded from the index while still allowing crawling
- You want to keep content accessible to users but invisible to search (gated content teasers, registration-required content previews)

Never use both Disallow and noindex on the same URL. The combination is contradictory: Disallow prevents the crawler from ever reading the noindex, so the noindex has no effect.

Use neither when:
- The page should be fully crawled and indexed (homepage, product pages, blog posts, landing pages)
- You want the page to appear in search results
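The four combinations in the framework above can be sketched as a small lookup. The function name classify, the outcome wording, and the page inventory are illustrative assumptions; the point is only that the Disallow plus noindex combination should be flagged rather than treated as extra protection.

```python
def classify(disallowed: bool, noindex: bool) -> str:
    """Describe the likely outcome for a URL given its two settings."""
    if disallowed and noindex:
        return "conflict: noindex cannot be read because crawling is blocked"
    if disallowed:
        return "not crawled (a bare URL may still be listed if linked externally)"
    if noindex:
        return "crawled but excluded from the index"
    return "crawled and indexable"

# Hypothetical inventory: path -> (disallowed in robots.txt, carries noindex)
pages = {
    "/wp-admin/": (True, False),
    "/thank-you-page": (False, True),
    "/old-landing-page": (True, True),   # the misconfiguration to look for
    "/blog/some-post": (False, False),
}

for path, (disallowed, noindex) in pages.items():
    print(f"{path}: {classify(disallowed, noindex)}")
```

In an audit, the two booleans would come from a robots.txt check and from the page's meta tag or response headers.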
Real-World Examples of Which to Use
These examples from common website architectures illustrate the correct choice for each situation.

- /wp-admin/: Use robots.txt Disallow. This is an admin panel. Crawlers should never access it, there is no indexation value, and it should not consume crawl budget.
- /thank-you-page: Use noindex. This page is accessible to users after a conversion (form submission, purchase). It does not need to be indexed, but it must remain crawlable so that crawlers can read the noindex directive.
- /products?sort=price&page=3: Use robots.txt Disallow (via parameter pattern) or noindex. Sorted/filtered product pages are thin, duplicate content. A Disallow pattern or a noindex meta tag are both valid approaches. Note that Disallow: /*?* blocks every URL that contains a query string; a narrower pattern such as Disallow: /*?sort= is safer if other parameterized URLs should stay crawlable.
- /cart and /checkout: Use robots.txt Disallow. These are transactional pages with no value in search results. They should not be crawled.
- /blog/page/2, /blog/page/3, etc.: Use noindex. Paginated archives are accessible to users but are low-value index content. noindex allows the links on these pages to be followed. A rel=canonical pointing to page 1 is sometimes suggested as an alternative, but Google's pagination guidance advises against it because each page in the sequence has different content.
- /login and /register: Use robots.txt Disallow. These pages do not belong in search results and should not be crawled.
- /sitemap.xml and /robots.txt: Never block these. They are explicitly meant to be accessed by crawlers. robots.txt in particular should return 200 with readable content; persistent server errors on this file can cause search engines to limit or pause crawling of the site.

For large e-commerce sites, a combination approach often works best: robots.txt Disallow for the most crawl-budget-intensive URL patterns (search, filters, cart), and noindex for pages that should be individually accessible but not indexed (user profiles, order confirmations).
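Wildcard patterns are easy to make broader than intended. The sketch below is a simplified matcher for the two special characters that major search engines support in robots.txt paths: * for any run of characters and a trailing $ for end of URL. The function rule_matches is a hypothetical helper, not part of any library, and the narrower pattern /*?sort= is an assumption chosen for comparison.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Check a Disallow pattern against a URL path plus query string.

    Rules are prefix matches; '*' matches any characters and a
    trailing '$' anchors the pattern to the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and join them with '.*' in place of each '*'.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

urls = [
    "/products?sort=price&page=3",
    "/blog?utm_source=newsletter",
    "/products",
]

for pattern in ("/*?*", "/*?sort="):
    blocked = [u for u in urls if rule_matches(pattern, u)]
    print(pattern, "blocks", blocked)
```

Running a planned pattern against a sample of real URLs before deploying it is a cheap way to catch rules that block more than the sorted or filtered pages they were written for.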
Frequently Asked Questions
- What is the difference between noindex and nofollow?
- noindex tells search engines not to include the page in their index. nofollow tells search engines not to follow links, either every link on a page (as a robots meta directive) or one specific link (as a rel="nofollow" attribute on that link). They are independent directives. Used together, noindex, nofollow means: do not index this page and do not follow any of its links. noindex alone means: do not index this page but do follow and process its links. nofollow alone means: index this page but do not follow its outbound links. Most cases where you want to exclude a page use noindex; nofollow is typically used for paid links, user-generated content links, or links you do not want to pass authority to.
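The combinations described in this answer can be expressed as a tiny interpreter for the content attribute of a robots meta tag. Policy and interpret are hypothetical names for this sketch; it handles only noindex, nofollow, and the none shorthand (equivalent to noindex, nofollow) and ignores every other directive.

```python
from typing import NamedTuple

class Policy(NamedTuple):
    index: bool   # may the page be added to the index?
    follow: bool  # may the links on the page be followed?

def interpret(content: str) -> Policy:
    """Interpret a robots meta 'content' value such as 'noindex, nofollow'."""
    directives = {part.strip().lower() for part in content.split(",") if part.strip()}
    if "none" in directives:
        directives |= {"noindex", "nofollow"}
    return Policy(
        index="noindex" not in directives,
        follow="nofollow" not in directives,
    )

for value in ("noindex, nofollow", "noindex", "nofollow", ""):
    print(repr(value), interpret(value))
```

An empty or missing value leaves both defaults in place: the page may be indexed and its links followed.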
- If I use noindex on a page, will it stay out of search results forever?
- As long as the noindex directive is present and the page is crawlable, yes. However, if you later remove the noindex directive, Googlebot will eventually re-crawl the page and add it to the index. The timing depends on how frequently Googlebot crawls your site. If you want a page to remain permanently excluded from search, keep the noindex directive in place. You can also use the Removals tool in Google Search Console for urgent, temporary removal of a specific URL, but this suppresses the URL for about six months; it is not a permanent index removal.
- Can I use the robots meta tag to control individual crawlers separately?
- Yes. The robots meta tag supports per-crawler targeting using the crawler's name instead of the generic 'robots'. For example: <meta name="googlebot" content="noindex" /> applies only to Googlebot, while <meta name="bingbot" content="noindex" /> applies only to Bingbot. This allows you to have a page indexed by one search engine but not another — an unusual but occasionally useful configuration. The generic <meta name="robots"> applies to all crawlers that recognize standard meta robots directives.
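A minimal sketch of per-crawler resolution follows. CrawlerMetaParser and directives_for are hypothetical helpers, and the merge rule is an assumption of this example: directives from the generic robots tag are combined with those from the crawler-specific tag, so the more restrictive result applies. Individual search engines may resolve conflicts differently.

```python
from html.parser import HTMLParser

class CrawlerMetaParser(HTMLParser):
    """Groups the directives of every <meta name=... content=...> tag by name."""

    def __init__(self):
        super().__init__()
        self.by_name = {}  # lowercase meta name -> set of lowercase directives

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        name = attr.get("name", "").lower()
        if not name:
            return
        directives = {
            part.strip().lower()
            for part in attr.get("content", "").split(",")
            if part.strip()
        }
        self.by_name.setdefault(name, set()).update(directives)

def directives_for(html: str, crawler: str) -> set:
    """Directives that apply to one crawler: generic 'robots' plus its own tag."""
    parser = CrawlerMetaParser()
    parser.feed(html)
    generic = parser.by_name.get("robots", set())
    specific = parser.by_name.get(crawler.lower(), set())
    return generic | specific  # assumed merge rule: union of both sets

html = (
    '<head><meta name="googlebot" content="noindex" />'
    '<meta name="robots" content="nofollow" /></head>'
)
print(sorted(directives_for(html, "googlebot")))
print(sorted(directives_for(html, "bingbot")))
```

With this page, only Googlebot sees noindex, while both crawlers see the generic nofollow.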