What Is an XML Sitemap?
A technical breakdown of XML sitemaps: the format, required and optional tags, size limits, the sitemap protocol, and how Googlebot actually uses them.
An XML sitemap is a structured file that lists URLs on your website in a format search engines can read. It follows a specific protocol, uses specific tags, and has specific limits. If you've ever opened a sitemap.xml file and seen angle brackets and URLs, you've seen one. But understanding what each piece does -- and what's actually required versus optional -- makes the difference between a sitemap that helps your SEO and one that just takes up server space.
The Sitemap Protocol
XML sitemaps follow the Sitemap Protocol, which Google introduced in 2005 and which Google, Yahoo, and Microsoft jointly adopted in 2006. The protocol defines the structure, the allowed tags, and the rules for how sitemaps should work.
The protocol is intentionally simple. It's an XML file with a root element, a list of URLs, and optional metadata for each URL. That's it. The simplicity is the point -- it needs to work for a five-page portfolio site and a ten-million-page e-commerce platform alike.
Every major search engine supports the Sitemap Protocol: Google, Bing, Yahoo, Yandex, and others. If you build a valid sitemap once, every crawler can read it.
The Structure of an XML Sitemap
Here's a complete, valid XML sitemap:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-02-10</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2026-01-05</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/blog/xml-sitemaps</loc>
    <lastmod>2026-02-18</lastmod>
    <changefreq>yearly</changefreq>
    <priority>0.6</priority>
  </url>
</urlset>
Let's break down every element.
Required Elements
There are only two things your sitemap absolutely must have:
<urlset>
The root element that wraps everything. It must include the namespace declaration:
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
This tells parsers that the file follows the Sitemap Protocol schema. Without this namespace, validators and crawlers may reject the file.
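To see why the namespace matters, here is a minimal Python sketch of how a namespace-aware parser reads a sitemap. The function name `extract_locs` is hypothetical and this is not how any particular crawler is implemented. It only illustrates that elements are matched by namespace plus tag name, so a file without the declaration yields nothing a strict parser recognizes.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def extract_locs(xml_text):
    """Return the <loc> values of a sitemap, matching only namespaced elements.

    Hypothetical helper for illustration; real crawlers may be more lenient.
    """
    root = ET.fromstring(xml_text)
    # ElementTree stores tags as "{namespace}name", so a root element
    # without the Sitemap Protocol namespace does not match here.
    if root.tag != "{%s}urlset" % SITEMAP_NS:
        raise ValueError("unexpected root element: %s" % root.tag)
    prefixes = {"sm": SITEMAP_NS}
    return [
        loc.text.strip()
        for loc in root.iterfind("sm:url/sm:loc", prefixes)
        if loc.text
    ]
```

Passing the same document with the `xmlns` attribute removed raises `ValueError`, because the root element is then plain `urlset` rather than `urlset` in the sitemap namespace.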
<url> and <loc>
Each page gets a <url> element containing at least a <loc> tag with the full URL:
<url>
  <loc>https://example.com/pricing</loc>
</url>
The <loc> value must be a fully qualified URL including the protocol (https://). Relative URLs are not allowed. The URL must also be properly encoded: spaces are percent-encoded as %20, ampersands are entity-escaped as &amp;, and so on.
A minimal valid sitemap only needs urlset, url, and loc
Everything else -- lastmod, changefreq, priority -- is optional. A sitemap with nothing but URLs is perfectly valid and useful.
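As a rough illustration, a minimal sitemap can be generated with Python's standard library. This is a sketch under assumptions: the function name `build_minimal_sitemap` is made up for this example, and the namespace is set as a plain `xmlns` attribute for simplicity.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_minimal_sitemap(urls):
    """Build a sitemap containing only <urlset>, <url>, and <loc>.

    Hypothetical helper; returns UTF-8 encoded bytes with an XML declaration.
    """
    # Writing xmlns as an ordinary attribute is enough for serialization.
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        url_el = ET.SubElement(urlset, "url")
        # ElementTree entity-escapes text on output, so "&" becomes "&amp;".
        ET.SubElement(url_el, "loc").text = url
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)
```

Calling `build_minimal_sitemap(["https://example.com/", "https://example.com/pricing"])` returns bytes that can be written straight to `sitemap.xml`. The `xml_declaration` argument needs Python 3.8 or later.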
Optional Elements
These provide extra information to crawlers. Some are more useful than others.
<lastmod> -- Last Modified Date
<lastmod>2026-02-18</lastmod>
Tells the crawler when the page was last meaningfully changed. This is the most useful optional tag. Google has confirmed they use lastmod when it's accurate -- but they also ignore it when it's not. If you set every page's lastmod to today's date on every sitemap regeneration, Google will learn to distrust your lastmod values and stop paying attention.
Accepted formats: YYYY-MM-DD (date only) or full W3C Datetime format like 2026-02-18T14:30:00+00:00.
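Both accepted formats can be produced with Python's `datetime` module. The helper below is a sketch with a hypothetical name (`format_lastmod`). It assumes you store the real modification time of each page and pass it in, rather than the time the sitemap was generated.

```python
from datetime import date, datetime, timezone


def format_lastmod(value):
    """Format a date or timezone-aware datetime as a lastmod string."""
    # datetime is a subclass of date, so it has to be checked first.
    if isinstance(value, datetime):
        if value.tzinfo is None:
            raise ValueError("datetime must be timezone-aware")
        # Produces e.g. 2026-02-18T14:30:00+00:00
        return value.isoformat(timespec="seconds")
    # Plain dates produce YYYY-MM-DD
    return value.isoformat()
```

For example, `format_lastmod(date(2026, 2, 18))` gives `2026-02-18`, and `format_lastmod(datetime(2026, 2, 18, 14, 30, tzinfo=timezone.utc))` gives `2026-02-18T14:30:00+00:00`.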
<changefreq> -- Change Frequency
<changefreq>weekly</changefreq>
A hint about how often the page changes. Valid values: always, hourly, daily, weekly, monthly, yearly, never.
Here's the reality: Google ignores this tag entirely. Google's own documentation says so. Googlebot determines crawl frequency based on its own observations, not your suggestions. Bing's documentation is less explicit, but there's little evidence they rely on it either. You can include it, but don't expect it to change crawler behavior.
<priority> -- Page Priority
<priority>0.8</priority>
A value from 0.0 to 1.0 indicating relative priority compared to other pages on your site. The default is 0.5.
Like changefreq, Google ignores this tag. It was an interesting idea in 2006, but in practice every webmaster set their homepage to 1.0 and everything else to 0.8, which made the data meaningless. Google learned to ignore it years ago.
| Tag | Required? | Used by Google? | Recommendation |
|---|---|---|---|
| <loc> | Yes | Yes | Always include |
| <lastmod> | No | Yes (when accurate) | Include with real dates |
| <changefreq> | No | No | Skip it |
| <priority> | No | No | Skip it |
Size and Count Limits
The Sitemap Protocol defines hard limits:
- Maximum 50,000 URLs per sitemap file. If you have more, you need multiple sitemap files referenced by a sitemap index.
- Maximum 50MB (52,428,800 bytes) per file, uncompressed. In practice, you'll hit the URL limit long before the size limit unless your URLs are very long.
If your site has more than 50,000 URLs, you split them across multiple sitemap files and create a sitemap index file that references them all:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-pages.xml</loc>
    <lastmod>2026-02-18</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
    <lastmod>2026-02-17</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-02-15</lastmod>
  </sitemap>
</sitemapindex>
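The splitting step can be sketched in a few lines of Python. The names `chunk_urls` and `build_sitemap_index` are hypothetical, and the index here omits `<lastmod>` to stay short. Treat it as one possible approach, not a reference implementation.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # limit from the Sitemap Protocol


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Split a list of URLs into lists of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(sitemap_urls):
    """Build a sitemap index that references each child sitemap file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        sitemap_el = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap_el, "loc").text = url
    return ET.tostring(index, encoding="UTF-8", xml_declaration=True)
```

A site with 120,001 URLs would produce three chunks (50,000, 50,000, and 20,001 URLs). Each chunk is written to its own file, and the index lists the public URL of each file.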
Gzip compression is allowed and recommended
You can serve your sitemap as sitemap.xml.gz. This reduces bandwidth and is supported by all major search engines. Just make sure the uncompressed size stays under 50MB.
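A small sketch of the compression step, using Python's `gzip` module. The function name `compress_sitemap` and the `limit` parameter are assumptions for this example. The size check runs on the uncompressed bytes, since that is the size the limit applies to.

```python
import gzip

MAX_UNCOMPRESSED_BYTES = 50 * 1024 * 1024  # 50MB = 52,428,800 bytes


def compress_sitemap(xml_bytes, limit=MAX_UNCOMPRESSED_BYTES):
    """Gzip a sitemap after checking its uncompressed size."""
    if len(xml_bytes) > limit:
        raise ValueError(
            "sitemap is %d bytes uncompressed; limit is %d"
            % (len(xml_bytes), limit)
        )
    return gzip.compress(xml_bytes)
```

The returned bytes can be written to `sitemap.xml.gz` in binary mode.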
Encoding Requirements
XML sitemaps must be UTF-8 encoded. URLs containing special characters need to be entity-escaped:
| Character | Escape Sequence | Example |
|---|---|---|
| Ampersand (&) | &amp; | /page?a=1&amp;b=2 |
| Single quote (') | &apos; | /page?name=O&apos;Brien |
| Double quote (") | &quot; | Rarely needed in URLs |
| Greater than (>) | &gt; | Rarely needed in URLs |
| Less than (<) | &lt; | Rarely needed in URLs |
Spaces and non-ASCII characters in URLs should be percent-encoded (e.g., a space as %20, é as %C3%A9), and the resulting URL must then be entity-escaped within the XML.
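The two encoding steps can be sketched in Python as follows. The function name `sitemap_loc` is hypothetical, and the choice of which characters to leave unencoded is an assumption: URL delimiters and existing `%XX` sequences are kept as-is so an already-encoded URL is not encoded twice.

```python
from urllib.parse import quote
from xml.sax.saxutils import escape

# URL delimiters plus "%" are left alone; everything else outside
# letters, digits, and "_.-~" is percent-encoded as UTF-8.
SAFE_CHARS = ":/?#[]@!$&'()*+,;=%"


def sitemap_loc(url):
    """Percent-encode a URL, then entity-escape it for use inside <loc>."""
    encoded = quote(url, safe=SAFE_CHARS)
    # escape() handles &, <, and >; the extra mapping covers quotes.
    return escape(encoded, {"'": "&apos;", '"': "&quot;"})
```

For instance, `sitemap_loc("https://example.com/café menu?a=1&b=2")` returns `https://example.com/caf%C3%A9%20menu?a=1&amp;b=2`. This matters when you write the XML by hand or through a text template. If you build the file with an XML library, the library entity-escapes text itself, so only the percent-encoding step should be applied.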
How Googlebot Actually Uses Your Sitemap
Understanding this process helps you make better decisions about what to include:
Googlebot treats your sitemap as a suggestion list. It adds every URL to its crawl queue, but it doesn't necessarily crawl them all immediately. Pages are crawled based on Google's assessment of their importance, the site's overall crawl budget, and how frequently the content changes.
Lastmod accelerates re-crawling. If you update a page and the lastmod date changes, Googlebot is more likely to revisit that page sooner. This is especially valuable for news sites and e-commerce sites where content changes frequently.
Sitemaps help with discovery, not ranking. Being in a sitemap doesn't give a page any ranking advantage. It only makes sure Google knows the page exists and can queue it for crawling. What happens after crawling depends entirely on the page's content, authority, and relevance.
Google cross-references your sitemap with other signals. If your sitemap lists URLs that return 404 errors, redirect to other pages, or have noindex tags, Google notes the inconsistency. Too many of these issues and Google may reduce its trust in your sitemap data overall.
Specialized XML Sitemaps
The standard sitemap handles regular web pages, but there are extensions for specific content types:
- Image sitemaps -- Include image-specific tags within your URL entries to help Google discover images that might not be found through page crawling (especially images loaded via JavaScript).
- Video sitemaps -- Provide metadata about video content including title, description, duration, and thumbnail URL.
- News sitemaps -- Recommended for sites that publish news content. List articles published in the last 48 hours with publication date and name.
- Hreflang sitemaps -- Use xhtml:link elements to indicate language and regional variants of pages.
Each extension adds its own namespace and tags to the standard sitemap format.
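As one example of an extension, the sketch below generates hreflang entries in Python. It assumes the XHTML namespace `http://www.w3.org/1999/xhtml` and the `rel="alternate"`, `hreflang`, and `href` attributes described in Google's hreflang documentation. The function name `build_hreflang_sitemap` is hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"  # assumed per Google's hreflang docs

# Control the prefixes ElementTree writes: none for the sitemap
# namespace, "xhtml" for the extension namespace.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("xhtml", XHTML_NS)


def build_hreflang_sitemap(variants):
    """Build a sitemap where each URL lists every language variant.

    `variants` maps an hreflang code (e.g. "en") to that variant's URL.
    """
    urlset = ET.Element("{%s}urlset" % SITEMAP_NS)
    for url in variants.values():
        url_el = ET.SubElement(urlset, "{%s}url" % SITEMAP_NS)
        ET.SubElement(url_el, "{%s}loc" % SITEMAP_NS).text = url
        # Each entry lists all variants, including itself.
        for lang, href in variants.items():
            ET.SubElement(
                url_el,
                "{%s}link" % XHTML_NS,
                rel="alternate",
                hreflang=lang,
                href=href,
            )
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)
```

With two variants, the output contains two `<url>` entries, each carrying two `xhtml:link` elements, and the root element declares both namespaces.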
Common Mistakes
A few errors show up repeatedly in XML sitemaps:
- Including non-canonical URLs. If a page has a canonical tag pointing elsewhere, don't include the non-canonical version in your sitemap. Include the canonical URL only.
- Including noindex pages. If you don't want a page indexed, don't put it in the sitemap. Sending mixed signals confuses crawlers.
- Including redirected URLs. Every URL in your sitemap should return a 200 status code. Redirects (301, 302) waste crawl budget.
- Inaccurate lastmod dates. Setting lastmod to the current date every time you regenerate the sitemap, regardless of actual changes, teaches Google to ignore your lastmod values.
- Exceeding size limits. If your sitemap has more than 50,000 URLs or exceeds 50MB, it will be rejected. Use a sitemap index instead.
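Several of these checks are easy to automate once you have crawl data for each URL. The sketch below assumes a hypothetical `PageRecord` holding values you have already fetched (status code, canonical URL, noindex flag). How you collect them is outside its scope.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageRecord:
    """Hypothetical crawl result for one URL listed in the sitemap."""
    url: str
    status: int
    canonical: Optional[str] = None
    noindex: bool = False


def audit_entry(page):
    """Return a list of reasons this URL should not be in the sitemap."""
    problems = []
    if 300 <= page.status < 400:
        problems.append("redirects")
    elif page.status != 200:
        problems.append("returns %d" % page.status)
    if page.canonical and page.canonical != page.url:
        problems.append("non-canonical")
    if page.noindex:
        problems.append("noindex")
    return problems
```

An empty list means the entry passed these particular checks. Anything else is a candidate for removal from the sitemap.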
The Bottom Line
An XML sitemap is a straightforward file with a simple job: tell search engines what pages exist on your site. The format is well-defined, the rules are clear, and the implementation is not complicated. Focus on the tags that matter (loc and lastmod), skip the ones that don't (changefreq and priority), keep your URLs clean and current, and your sitemap will do exactly what it's supposed to.
An XML sitemap is the most boring, most useful file on your website. Get it right and forget about it.