Robots.txt and Sitemaps
How robots.txt and sitemaps work together: the Sitemap directive, correct syntax, common mistakes, and the difference between controlling crawling and controlling indexing.
Your robots.txt file and your XML sitemap serve different purposes, but they're connected in an important way. Robots.txt controls what search engines are allowed to crawl. Your sitemap tells them what you want them to find. And robots.txt is one of the primary ways search engines discover your sitemap in the first place.
Here's how the two work together, the correct syntax for referencing sitemaps in robots.txt, and the mistakes that trip people up.
How Robots.txt References Sitemaps
The robots.txt standard supports a Sitemap: directive that tells search engine crawlers where to find your XML sitemap. When a crawler reads your robots.txt (which it does before crawling anything else on your site), it picks up the sitemap URL and adds it to its processing queue.
# robots.txt
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
That's it. One line. The crawler reads it, fetches the sitemap, and processes the URLs inside.
The Sitemap directive is case-sensitive (sort of)
The original specification defines the directive as Sitemap: with a capital S. In practice, Google and Bing are forgiving about casing, but use Sitemap: with the capital S to be safe. The URL itself is case-sensitive, as all URLs are.
Why This Matters for Discovery
Search engines need to find your sitemap before they can use it. There are three main ways this happens:
- robots.txt Sitemap directive -- The crawler reads your robots.txt and follows the Sitemap URL.
- Search Console / Webmaster Tools submission -- You manually submit the sitemap URL through Google Search Console or Bing Webmaster Tools.
- Convention -- Crawlers check the default location /sitemap.xml even without being told.
The robots.txt method is the most reliable passive discovery mechanism. Unlike Search Console submission, it doesn't require you to log into a tool. Unlike the convention method, it works even if your sitemap isn't at the default /sitemap.xml path.
If your sitemap lives at /sitemaps/main-sitemap.xml or uses a non-standard name, the robots.txt directive is the only way crawlers will find it automatically.
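Extracting the Sitemap directives from a robots.txt body is a simple line scan. Here's a minimal Python sketch of what that discovery step looks like; the function name and the forgiving lowercase match are illustrative, mirroring how major crawlers treat the directive name case-insensitively:

```python
def extract_sitemaps(robots_txt: str) -> list[str]:
    """Collect every Sitemap: directive from a robots.txt body.

    The directive name is matched case-insensitively (as Google and
    Bing do in practice); the URL itself is returned unchanged.
    """
    sitemaps = []
    for line in robots_txt.splitlines():
        # Strip inline comments, then surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if line.lower().startswith("sitemap:"):
            url = line.split(":", 1)[1].strip()
            if url:
                sitemaps.append(url)
    return sitemaps

robots = """User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
"""
print(extract_sitemaps(robots))  # ['https://example.com/sitemap.xml']
```

Python's standard library also offers `urllib.robotparser.RobotFileParser.site_maps()` (Python 3.8+), which does this for you after fetching a live robots.txt.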
Correct Syntax
Basic Sitemap Reference
User-agent: *
Disallow: /admin/
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
Key rules:
- Use the full absolute URL including the protocol (https://)
- The Sitemap: directive is not tied to any User-agent block -- it applies globally
- Place it anywhere in the file, but convention puts it at the bottom
- One sitemap URL per Sitemap: line
Multiple Sitemaps
You can list multiple sitemaps. This is useful if you split your sitemaps by content type or if you use a sitemap index alongside individual sitemaps.
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap-index.xml
Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-posts.xml
Sitemap: https://example.com/sitemap-products.xml
The sitemap protocol doesn't limit how many Sitemap: directives you can include, though crawlers do cap the robots.txt file size they'll read (Google processes the first 500 KiB). Each directive is processed independently.
Use a sitemap index instead of listing many sitemaps
If you have more than two or three sitemaps, consider using a sitemap index file and referencing just that one URL in robots.txt. It's cleaner and easier to maintain.
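A sitemap index is itself a small XML file that lists the URLs of your individual sitemaps, using the format defined by the sitemaps.org protocol. The file names here are placeholders matching the example above:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
  </sitemap>
</sitemapindex>
```

With this in place, robots.txt only needs a single line: Sitemap: https://example.com/sitemap-index.xml.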
Cross-Domain Sitemaps
The Sitemap: directive can point to a sitemap hosted on a different domain. This is valid per the specification and is useful if you host your sitemap on a CDN or a centralized location.
# robots.txt on example.com
Sitemap: https://cdn.example.com/sitemap.xml
However, the sitemap itself must list URLs for the domain that references it. A sitemap on cdn.example.com that lists example.com URLs and is referenced from example.com/robots.txt is fine. A sitemap that lists URLs for domains it has no authority over is not.
Common Mistakes
Mistake 1: Using a Relative URL
# Wrong
Sitemap: /sitemap.xml
# Right
Sitemap: https://example.com/sitemap.xml
The Sitemap: directive requires an absolute URL with the full protocol and domain. A relative path won't work. Some crawlers might tolerate it, but the specification requires an absolute URL, and you shouldn't rely on lenient parsing.
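A quick sanity check for this mistake is to confirm the URL has both a scheme and a host. A minimal sketch using Python's standard library (the function name is illustrative):

```python
from urllib.parse import urlparse

def is_valid_sitemap_url(url: str) -> bool:
    """A Sitemap: value must be absolute: scheme and host both present."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

print(is_valid_sitemap_url("/sitemap.xml"))                    # False
print(is_valid_sitemap_url("https://example.com/sitemap.xml")) # True
```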
Mistake 2: Putting Sitemap Inside a User-Agent Block
# Misleading (but technically works)
User-agent: Googlebot
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
User-agent: Bingbot
Disallow: /private/
The Sitemap: directive is globally scoped regardless of where you place it in the file. Putting it inside a User-agent: Googlebot block looks like it's only for Google, but all crawlers will see it. For clarity, place your Sitemap: directives outside of any User-agent block, typically at the end of the file.
Mistake 3: HTTP/HTTPS Mismatch
# Your site is HTTPS but robots.txt says:
Sitemap: http://example.com/sitemap.xml
If your site runs on HTTPS, your sitemap URL in robots.txt should also use HTTPS. A mismatch can cause the crawler to fetch the HTTP version, which might redirect to HTTPS (adding unnecessary latency) or might not resolve at all.
Mistake 4: Referencing a Sitemap That Returns an Error
If the sitemap URL in your robots.txt returns a 404, 500, or any non-200 status, the crawler simply ignores it. No error is reported to you unless you check Search Console. This is a silent failure that can persist for months.
Catch sitemap errors before search engines do
Validate your sitemap URL, check for HTTP errors, and confirm your robots.txt references are correct.
Mistake 5: Blocking Sitemap Access in Robots.txt
# Contradictory
User-agent: *
Disallow: /sitemaps/
Sitemap: https://example.com/sitemaps/sitemap.xml
If your robots.txt blocks the directory where your sitemap lives, crawlers won't be able to fetch it. You're pointing them to a resource and simultaneously telling them they can't access it. Make sure your sitemap URL isn't covered by any Disallow rules.
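You can reproduce this contradiction with Python's built-in robots.txt parser. In this sketch, the same rules from the example above block the very URL the Sitemap line points to:

```python
from urllib.robotparser import RobotFileParser

# The contradictory rules from the example above.
rules = """User-agent: *
Disallow: /sitemaps/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# The sitemap URL itself falls under the Disallow rule:
print(rp.can_fetch("*", "https://example.com/sitemaps/sitemap.xml"))  # False
print(rp.can_fetch("*", "https://example.com/about/"))                # True
```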
Robots.txt vs Sitemap: Controlling Access
This is where people get confused. Robots.txt and sitemaps serve different functions, and neither one controls indexing the way most people think.
What Robots.txt Controls
Robots.txt controls crawling -- whether a search engine bot is allowed to fetch a URL. It does not control indexing.
User-agent: *
Disallow: /private-page/
This tells crawlers: don't fetch /private-page/. But if Google discovers that URL through an external link, it might still index the URL (showing it in search results with a "No information is available for this page" message) even though it can't crawl the content. Robots.txt prevents crawling, not indexing.
What a Sitemap Controls
A sitemap controls discovery -- which URLs you want search engines to know about. It does not control crawling permissions or indexing.
Listing a URL in your sitemap is a suggestion. The search engine decides whether to crawl and index it based on its own signals.
The Overlap
Here's where it gets interesting: if you list a URL in your sitemap but block it in robots.txt, the search engine knows the URL exists but isn't allowed to fetch it. The result is typically that the URL doesn't get indexed, but the contradiction can cause confusion in Search Console reporting.
The rule: Every URL in your sitemap should be crawlable (not blocked by robots.txt) and should return a 200 status code. If you don't want a URL crawled, don't put it in your sitemap.
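The crawlability half of this rule is easy to automate. Here's a sketch that flags sitemap URLs blocked by robots.txt; a production check would also fetch each URL and require a 200 status, a network step omitted here. The function name is illustrative:

```python
from urllib.robotparser import RobotFileParser

def uncrawlable_urls(robots_lines, sitemap_urls, agent="*"):
    """Return the sitemap URLs that robots.txt would block for the agent."""
    rp = RobotFileParser()
    rp.parse(robots_lines)
    return [url for url in sitemap_urls if not rp.can_fetch(agent, url)]

robots = ["User-agent: *", "Disallow: /private/"]
urls = ["https://example.com/", "https://example.com/private/report"]
print(uncrawlable_urls(robots, urls))  # ['https://example.com/private/report']
```

Any URL this returns should either be removed from the sitemap or unblocked in robots.txt.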
| Aspect | Robots.txt | XML Sitemap |
|---|---|---|
| Purpose | Control crawl access | Suggest URLs for discovery |
| Type | Directive (permission) | Hint (suggestion) |
| Controls crawling | Yes | No |
| Controls indexing | No | No |
| Read by crawlers | Before crawling begins | During or after crawling |
| Format | Plain text | XML |
| Required | No (but recommended) | No (but recommended) |
How to Set It Up Right
Create or verify your robots.txt
Your robots.txt should be at the root of your domain: https://example.com/robots.txt. If it doesn't exist, create it. If it returns a non-200 status, fix that first.
Add your Sitemap directive
Add Sitemap: https://yourdomain.com/sitemap.xml (using your actual sitemap URL) at the bottom of the file. Use the full absolute URL with HTTPS.
Verify the sitemap is accessible
Fetch the sitemap URL in your browser. It should return valid XML with a 200 status code. If it returns an error, fix the sitemap before referencing it.
Check for contradictions
Make sure no Disallow rule in your robots.txt blocks access to the sitemap file itself. Also verify that URLs listed in your sitemap aren't blocked by robots.txt.
Submit to Search Console as well
While robots.txt enables passive discovery, also submit your sitemap through Google Search Console and Bing Webmaster Tools. Belt and suspenders.
A Complete Example
Here's a well-structured robots.txt that properly references sitemaps:
# robots.txt for example.com
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /checkout/
Disallow: /account/
# Search-specific directives
User-agent: GPTBot
Disallow: /
# Sitemaps
Sitemap: https://example.com/sitemap-index.xml
Clean, readable, no contradictions. The sitemap index handles the complexity of multiple sub-sitemaps. The Disallow rules keep private areas from being crawled. And the sitemap reference ensures crawlers find your URL list automatically.
Robots.txt tells crawlers what they can access. Your sitemap tells them what you want found. Get both right.