Robots.txt and Sitemaps

How robots.txt and sitemaps work together: the Sitemap directive, correct syntax, common mistakes, and the difference between controlling crawling and controlling indexing.

Your robots.txt file and your XML sitemap serve different purposes, but they're connected in an important way. Robots.txt controls what search engines are allowed to crawl. Your sitemap tells them what you want them to find. And robots.txt is one of the primary ways search engines discover your sitemap in the first place.

Here's how the two work together, the correct syntax for referencing sitemaps in robots.txt, and the mistakes that trip people up.

How Robots.txt References Sitemaps

The robots.txt standard supports a Sitemap: directive that tells search engine crawlers where to find your XML sitemap. When a crawler reads your robots.txt (which it does before crawling anything else on your site), it picks up the sitemap URL and adds it to its processing queue.

# robots.txt
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

That's it. One line. The crawler reads it, fetches the sitemap, and processes the URLs inside.

The Sitemap directive is case-sensitive (sort of)

The original specification defines the directive as Sitemap: with a capital S. In practice, Google and Bing are forgiving about casing, but use Sitemap: with the capital S to be safe. The URL itself is case-sensitive, as all URLs are.
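The discovery step a crawler performs can be sketched in a few lines of Python: scan the robots.txt text for Sitemap: lines, matching the directive name case-insensitively (as the major crawlers do in practice) while leaving the URL untouched. This is a minimal illustration, not how any particular crawler is implemented.

```python
import re

def extract_sitemap_urls(robots_txt: str) -> list[str]:
    """Return every URL declared via a Sitemap: directive.

    The directive name is matched case-insensitively; the URL itself
    is kept exactly as written, since URLs are case-sensitive.
    """
    urls = []
    for line in robots_txt.splitlines():
        match = re.match(r"\s*sitemap\s*:\s*(\S+)", line, re.IGNORECASE)
        if match:
            urls.append(match.group(1))
    return urls

robots = """\
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/sitemap-posts.xml
"""
print(extract_sitemap_urls(robots))
# ['https://example.com/sitemap.xml', 'https://example.com/sitemap-posts.xml']
```

Both lines are picked up even though the second uses a lowercase `sitemap:`, which mirrors the lenient parsing described above.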

Why This Matters for Discovery

Search engines need to find your sitemap before they can use it. There are three main ways this happens:

  1. robots.txt Sitemap directive -- The crawler reads your robots.txt and follows the Sitemap URL.
  2. Search Console / Webmaster Tools submission -- You manually submit the sitemap URL through Google Search Console or Bing Webmaster Tools.
  3. Convention -- Some crawlers check the default location /sitemap.xml even when it isn't declared anywhere.

The robots.txt method is the most reliable passive discovery mechanism. Unlike Search Console submission, it doesn't require you to log into a tool. Unlike the convention method, it works even if your sitemap isn't at the default /sitemap.xml path.

If your sitemap lives at /sitemaps/main-sitemap.xml or uses a non-standard name, the robots.txt directive is the only way crawlers will find it automatically.

Correct Syntax

Basic Sitemap Reference

User-agent: *
Disallow: /admin/
Disallow: /private/

Sitemap: https://example.com/sitemap.xml

Key rules:

  • Use the full absolute URL including the protocol (https://)
  • The Sitemap: directive is not tied to any User-agent block -- it applies globally
  • Place it anywhere in the file, but convention puts it at the bottom
  • One sitemap URL per Sitemap: line

Multiple Sitemaps

You can list multiple sitemaps. This is useful if you split your sitemaps by content type or if you use a sitemap index alongside individual sitemaps.

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap-index.xml
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml

There's no hard limit in the protocol on how many Sitemap: directives you can include, though Google only processes the first 500 KiB of a robots.txt file. Each directive is processed independently.

Use a sitemap index instead of listing many sitemaps

If you have more than two or three sitemaps, consider using a sitemap index file and referencing just that one URL in robots.txt. It's cleaner and easier to maintain.
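A sitemap index is itself a small XML file that lists the sub-sitemaps. A minimal sketch (the filenames are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
  </sitemap>
</sitemapindex>
```

Your robots.txt then references only the index URL, and crawlers follow it to each sub-sitemap on their own.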

Cross-Domain Sitemaps

The Sitemap: directive can point to a sitemap hosted on a different domain. This is valid per the specification and is useful if you host your sitemap on a CDN or a centralized location.

# robots.txt on example.com
Sitemap: https://cdn.example.com/sitemap.xml

However, the sitemap itself must list URLs for the domain that references it. A sitemap on cdn.example.com that lists example.com URLs and is referenced from example.com/robots.txt is fine. A sitemap that lists URLs for domains it has no authority over is not.

Common Mistakes

Mistake 1: Using a Relative URL

# Wrong
Sitemap: /sitemap.xml

# Right
Sitemap: https://example.com/sitemap.xml

The Sitemap: directive requires an absolute URL with the full protocol and domain. A relative path won't work. Some crawlers might tolerate it, but the specification requires an absolute URL, and you shouldn't rely on lenient parsing.
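Checking for this mistake programmatically is straightforward: a valid Sitemap: directive value must parse to a URL with both a scheme and a host. A small sketch using the standard library:

```python
from urllib.parse import urlparse

def valid_sitemap_directive_url(url: str) -> bool:
    """A Sitemap: directive needs a full absolute URL: scheme and host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)

print(valid_sitemap_directive_url("/sitemap.xml"))                     # False
print(valid_sitemap_directive_url("https://example.com/sitemap.xml"))  # True
```

The same check catches bare domains like `example.com/sitemap.xml`, which parse with no scheme at all.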

Mistake 2: Putting Sitemap Inside a User-Agent Block

# Misleading (but technically works)
User-agent: Googlebot
Disallow: /private/
Sitemap: https://example.com/sitemap.xml

User-agent: Bingbot
Disallow: /private/

The Sitemap: directive is globally scoped regardless of where you place it in the file. Putting it inside a User-agent: Googlebot block looks like it's only for Google, but all crawlers will see it. For clarity, place your Sitemap: directives outside of any User-agent block, typically at the end of the file.

Mistake 3: HTTP/HTTPS Mismatch

# Your site is HTTPS but robots.txt says:
Sitemap: http://example.com/sitemap.xml

If your site runs on HTTPS, your sitemap URL in robots.txt should also use HTTPS. A mismatch can cause the crawler to fetch the HTTP version, which might redirect to HTTPS (adding unnecessary latency) or might not resolve at all.

Mistake 4: Referencing a Sitemap That Returns an Error

If the sitemap URL in your robots.txt returns a 404, 500, or any non-200 status, the crawler simply ignores it. No error is reported to you unless you check Search Console. This is a silent failure that can persist for months.

Mistake 5: Blocking Sitemap Access in Robots.txt

# Contradictory
User-agent: *
Disallow: /sitemaps/

Sitemap: https://example.com/sitemaps/sitemap.xml

If your robots.txt blocks the directory where your sitemap lives, crawlers won't be able to fetch it. You're pointing them to a resource and simultaneously telling them they can't access it. Make sure your sitemap URL isn't covered by any Disallow rules.
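This contradiction can be detected with Python's built-in robots.txt parser: declare the rules, then ask whether the declared sitemap URL is itself fetchable. A minimal sketch:

```python
from urllib import robotparser

# The contradictory robots.txt from above, fed to the parser as lines.
rules = """\
User-agent: *
Disallow: /sitemaps/

Sitemap: https://example.com/sitemaps/sitemap.xml
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

for url in rp.site_maps() or []:
    # A sitemap the rules themselves forbid fetching is a contradiction.
    if not rp.can_fetch("*", url):
        print(f"Contradiction: {url} is declared but blocked by a Disallow rule")
```

Here `can_fetch` returns False for the declared sitemap URL, flagging exactly the self-defeating setup shown above. (`site_maps()` requires Python 3.8+.)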

Robots.txt vs Sitemap: Controlling Access

This is where people get confused. Robots.txt and sitemaps serve different functions, and neither one controls indexing the way most people think.

What Robots.txt Controls

Robots.txt controls crawling -- whether a search engine bot is allowed to fetch a URL. It does not control indexing.

User-agent: *
Disallow: /private-page/

This tells crawlers: don't fetch /private-page/. But if Google discovers that URL through an external link, it might still index the URL (showing it in search results with a "No information is available for this page" message) even though it can't crawl the content. Robots.txt prevents crawling, not indexing.

What a Sitemap Controls

A sitemap controls discovery -- which URLs you want search engines to know about. It does not control crawling permissions or indexing.

Listing a URL in your sitemap is a suggestion. The search engine decides whether to crawl and index it based on its own signals.

The Overlap

Here's where it gets interesting: if you list a URL in your sitemap but block it in robots.txt, the search engine knows the URL exists but isn't allowed to fetch it. The result is typically that the URL doesn't get indexed, but the contradiction can cause confusion in Search Console reporting.

The rule: Every URL in your sitemap should be crawlable (not blocked by robots.txt) and should return a 200 status code. If you don't want a URL crawled, don't put it in your sitemap.
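That rule is easy to enforce mechanically: parse the sitemap, then ask the robots rules about each URL. A sketch using only the standard library (the inline sitemap stands in for one fetched from your site):

```python
import xml.etree.ElementTree as ET
from urllib import robotparser

NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/private-page/</loc></url>
</urlset>"""

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private-page/"])

# Every sitemap URL that robots.txt forbids crawling is a rule violation.
blocked = [
    loc.text
    for loc in ET.fromstring(sitemap_xml).iter(f"{NS}loc")
    if not rp.can_fetch("*", loc.text)
]
print(blocked)  # ['https://example.com/private-page/']
```

Any URL that lands in `blocked` should be either removed from the sitemap or unblocked in robots.txt, depending on which side of the contradiction was the mistake.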

Aspect            | Robots.txt             | XML Sitemap
------------------|------------------------|---------------------------
Purpose           | Control crawl access   | Suggest URLs for discovery
Type              | Directive (permission) | Hint (suggestion)
Controls crawling | Yes                    | No
Controls indexing | No                     | No
Read by crawlers  | Before crawling begins | During or after crawling
Format            | Plain text             | XML
Required          | No (but recommended)   | No (but recommended)

How to Set It Up Right

1. Create or verify your robots.txt

Your robots.txt should be at the root of your domain: https://example.com/robots.txt. If it doesn't exist, create it. If it returns a non-200 status, fix that first.

2. Add your Sitemap directive

Add Sitemap: https://yourdomain.com/sitemap.xml (using your actual sitemap URL) at the bottom of the file. Use the full absolute URL with HTTPS.

3. Verify the sitemap is accessible

Fetch the sitemap URL in your browser. It should return valid XML with a 200 status code. If it returns an error, fix the sitemap before referencing it.

4. Check for contradictions

Make sure no Disallow rule in your robots.txt blocks access to the sitemap file itself. Also verify that URLs listed in your sitemap aren't blocked by robots.txt.

5. Submit to Search Console as well

While robots.txt enables passive discovery, also submit your sitemap through Google Search Console and Bing Webmaster Tools. Belt and suspenders.

A Complete Example

Here's a well-structured robots.txt that properly references sitemaps:

# robots.txt for example.com

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /checkout/
Disallow: /account/

# Search-specific directives
User-agent: GPTBot
Disallow: /

# Sitemaps
Sitemap: https://example.com/sitemap-index.xml

Clean, readable, no contradictions. The sitemap index handles the complexity of multiple sub-sitemaps. The Disallow rules keep private areas from being crawled. And the sitemap reference ensures crawlers find your URL list automatically.


Robots.txt tells crawlers what they can access. Your sitemap tells them what you want found. Get both right.
