XML Sitemap: Help Search Engines Discover Your Pages

An XML sitemap lists your site's URLs so search engines and AI crawlers find and index them faster. Learn the format, limits, and 2026 best practices.

About Author

Thibault Besson-Magdelain

Founder of Sorank, 5+ years of experience in SEO, GEO enthusiast.

Summary: An XML sitemap is an XML file that lists the URLs you want search engines to crawl, along with metadata like the last modification date. It speeds up discovery and indexing, especially for large or frequently updated sites.

An XML sitemap (Extensible Markup Language sitemap) is a structured file that tells search engines which pages exist on your website and which ones you want them to find. Each entry holds a URL and optional metadata, wrapped in tags that machines parse instantly. Rather than relying on crawlers to follow every internal link, a sitemap hands them a clean, prioritized list of your canonical pages.

While a sitemap does not guarantee indexing, it is one of the most reliable signals you can send about your site structure. Google's sitemap documentation recommends one for sites that are large, new, have few external links, or contain rich media. In 2026, with AI crawlers from OpenAI, Anthropic, and Perplexity competing for the same server resources as Googlebot, a clean sitemap is more valuable than ever.

What an XML Sitemap Contains

The core element is the URL. Each page you want crawled sits inside a <url> block with a <loc> tag holding the fully qualified, absolute address (the complete https:// path, not a relative one). The file opens with a <urlset> declaration and must use UTF-8 encoding.

Beyond the URL, the protocol supports three optional tags: <lastmod>, <changefreq>, and <priority>. The <lastmod> value records when a page last changed in a meaningful way. The <changefreq> tag suggests how often a page updates, and <priority> hints at relative importance.

It is critical to know how Google treats these. Google uses <lastmod> if the value is consistently and verifiably accurate, but it ignores <priority> and <changefreq> entirely. The lesson: do not waste time tuning priority scores. Spend it on choosing the right URLs and keeping <lastmod> honest.
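To make the structure concrete, here is a minimal sketch that writes a sitemap with only <loc> and <lastmod>, leaving out the two tags Google ignores. The function name build_sitemap and the example.com URLs are illustrative assumptions, not part of any particular CMS or plugin.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Return sitemap XML as UTF-8 bytes.

    pages is an iterable of (absolute_url, lastmod_or_None) pairs.
    build_sitemap is a hypothetical helper written for this example.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:  # only emit <lastmod> when a real date is known
            ET.SubElement(url, "lastmod").text = lastmod
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    xml_bytes = build_sitemap([
        ("https://example.com/", "2026-01-15"),
        ("https://example.com/about", None),
    ])
    print(xml_bytes.decode("utf-8"))
```

ElementTree escapes characters such as & in URLs for you, which is one reason to generate the file with an XML library rather than string concatenation.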

Size Limits and Sitemap Index Files

A single sitemap file is capped at 50,000 URLs or 50 MB uncompressed, whichever comes first. You can compress the file with gzip to save bandwidth, but the uncompressed size still counts against the limit. Most websites never approach these ceilings, but large ecommerce and publishing sites do.

When you exceed either limit, you split your URLs across multiple sitemap files and reference them from a single sitemap index file. The index is itself an XML file that lists the location of each child sitemap, letting you submit one master URL while organizing thousands or millions of pages logically (for example, one sitemap per product category or content type).
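The split-and-index approach can be sketched roughly as follows. This version only enforces the 50,000 URL cap, not the 50 MB size cap, and the helper names and child sitemap URLs are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000  # per-file URL cap from the sitemap protocol

def chunk_urls(urls, size=MAX_URLS):
    """Split a list of URLs into consecutive groups of at most `size`."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def build_index(child_locs):
    """Return a sitemap index (UTF-8 bytes) pointing at each child sitemap."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in child_locs:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    urls = [f"https://example.com/product/{i}" for i in range(120_001)]
    chunks = chunk_urls(urls)
    # Hypothetical naming scheme for the child files.
    children = [f"https://example.com/sitemap-{n}.xml"
                for n in range(1, len(chunks) + 1)]
    print(len(chunks), "child sitemaps")
    print(build_index(children).decode("utf-8"))
```

In practice you would probably group by section (products, articles, categories) before chunking, so each child sitemap maps to something meaningful in your reports.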

Splitting sitemaps by section also makes diagnostics easier. When you check the Google Search Console Sitemaps report, you can see exactly which group of pages has discovery or indexing problems instead of staring at one giant undifferentiated list.

Which URLs to Include

List only canonical, indexable pages you actually want in search results. A sitemap is not a dump of every URL your CMS can generate. Exclude duplicate pages, paginated archives, parameter variations, thank-you pages, and anything blocked by a noindex directive. Including non-canonical URLs sends mixed signals and dilutes the file's usefulness.

A common mistake is letting a sitemap drift out of sync with the live site. If it lists URLs that now return a 404 or redirect, you erode the trust crawlers place in the file. Audit regularly so every entry resolves to a live, canonical, 200-status page. This discipline ties directly into crawl budget: a tight sitemap points crawlers at what matters and away from waste.
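One way to run that audit is sketched below. It reads the <loc> entries from a sitemap and reports every URL whose status is not 200. The get_status argument is a placeholder you supply yourself (for instance a function that sends a HEAD request without following redirects), so nothing here is tied to a specific HTTP library.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_locs(xml_text):
    """Extract every <loc> value from a standard (non-index) sitemap."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.findall("sm:url/sm:loc", NS)]

def find_stale(locs, get_status):
    """Return (url, status) for each entry that does not answer 200.

    get_status is a caller-supplied callable mapping a URL to an HTTP
    status code; it is an assumption of this sketch, not a library API.
    """
    stale = []
    for url in locs:
        status = get_status(url)
        if status != 200:
            stale.append((url, status))
    return stale

if __name__ == "__main__":
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/</loc></url>"
        "<url><loc>https://example.com/old-page</loc></url>"
        "</urlset>"
    )
    # Made-up statuses standing in for real HTTP checks.
    fake_statuses = {"https://example.com/": 200,
                     "https://example.com/old-page": 301}
    print(find_stale(sitemap_locs(sample), fake_statuses.get))
```

Anything the function returns is a candidate for removal from the sitemap or for a fix on the site itself.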

Specialized Sitemap Types

Beyond the standard page sitemap, the protocol supports media-specific extensions. Image and video sitemaps let you surface visual content that crawlers might otherwise miss, providing details like caption, title, and duration. News publishers can use a Google News sitemap to flag time-sensitive articles for faster pickup.
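As a rough illustration of an extension, the sketch below attaches image entries to a page using the image sitemap namespace. It emits only the image location and leaves out optional descriptive tags. The helper name and URLs are assumptions for the example.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"

# Ask ElementTree to serialize with the conventional prefixes.
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)

def build_image_sitemap(pages):
    """pages maps a page URL to a list of image URLs shown on that page."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)

if __name__ == "__main__":
    print(build_image_sitemap({
        "https://example.com/gallery": [
            "https://example.com/img/one.jpg",
            "https://example.com/img/two.jpg",
        ],
    }).decode("utf-8"))
```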

You generally do not need every type. If your site is image-heavy or runs a video library, the relevant extension helps those assets get discovered. For most content sites, a clean standard sitemap covering your articles, landing pages, and key resources is enough to support strong crawling and indexing.

How to Submit Your Sitemap

There are three main ways to make search engines aware of your sitemap. The most direct is Google Search Console, where the Sitemaps report lets you submit the file URL and then monitor how many pages were discovered, crawled, and indexed over time. Bing offers the same through Bing Webmaster Tools.

The second method is referencing the sitemap inside your robots.txt file with a single line: Sitemap: https://example.com/sitemap.xml. Any crawler that reads your robots.txt then knows where to find it. The third is the Search Console API for programmatic submission, useful when you regenerate sitemaps automatically.
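To check that your robots.txt actually exposes the sitemap line, a short sketch with Python's standard robots.txt parser (site_maps() requires Python 3.8 or later) might look like this. The robots.txt content is sample text, not taken from a real site.

```python
from urllib.robotparser import RobotFileParser

def sitemaps_from_robots(robots_txt):
    """Return the sitemap URLs declared in a robots.txt body (may be empty)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []

if __name__ == "__main__":
    sample = """\
User-agent: *
Disallow: /cart/
Sitemap: https://example.com/sitemap.xml
"""
    print(sitemaps_from_robots(sample))
```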

Note that Google deprecated the old unauthenticated ping endpoint, so the legacy method of pinging a URL no longer works. Submit through Search Console or robots.txt instead. IndexNow complements sitemaps by pushing instant URL change notifications to participating engines rather than waiting for the next crawl.
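For the IndexNow side, a submission is a JSON POST listing the changed URLs. The sketch below only builds the request and does not send it. The endpoint and field names follow the public IndexNow documentation as best I understand it, while the host, key, and key file location are placeholders you would replace with your own values.

```python
import json
import urllib.request

# Shared IndexNow endpoint; participating engines also expose their own.
INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"

def build_indexnow_request(host, key, urls):
    """Build (but do not send) an IndexNow submission for changed URLs.

    Assumes the key file is hosted at https://<host>/<key>.txt, which is
    an assumption of this sketch; adjust keyLocation if yours differs.
    """
    payload = {
        "host": host,
        "key": key,
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(urls),
    }
    return urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )

if __name__ == "__main__":
    request = build_indexnow_request(
        "example.com", "placeholder-key", ["https://example.com/new-post"]
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually submit: urllib.request.urlopen(request)
```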

XML Sitemaps and AI Crawlers in 2026

Sitemaps now matter for more than Google. AI engines like ChatGPT, Perplexity, Claude, and Gemini rely on crawlers such as GPTBot, OAI-SearchBot, PerplexityBot, and ClaudeBot to gather and refresh the content they cite. A current, accurate sitemap helps these systems discover your pages and surface them in AI-generated answers.

Server log data from 2025 showed AI crawler traffic rising sharply, with some bots growing several hundred percent year over year. That traffic competes for the same resources Googlebot uses. A focused sitemap that points crawlers at your best content, paired with accurate <lastmod> dates, helps every engine spend its limited crawl effort on the pages you most want discovered and cited.
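If you want a first look at how much of this traffic your own server sees, a simple tally of access log lines by crawler name is a reasonable start. This sketch matches the bot names mentioned above as plain substrings of each log line, which assumes the user agent is logged and does not verify that a request really came from the claimed crawler.

```python
from collections import Counter

# Crawler tokens named in this article; extend the tuple as needed.
AI_BOTS = ("GPTBot", "OAI-SearchBot", "PerplexityBot", "ClaudeBot")

def count_ai_hits(log_lines):
    """Count log lines per AI crawler using a substring match on the line."""
    hits = Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1
                break  # count each line once
    return hits

if __name__ == "__main__":
    # access.log is a hypothetical path; point it at your own log file.
    with open("access.log", encoding="utf-8", errors="replace") as handle:
        for bot, total in count_ai_hits(handle).most_common():
            print(bot, total)
```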

Common XML Sitemap Mistakes

Obsessing over priority and changefreq is the classic error: Google ignores both, so the effort is wasted. Other frequent problems include listing non-canonical or redirected URLs, leaving 404s in the file, forgetting to update <lastmod> when content actually changes, and exceeding the 50,000 URL limit without splitting into an index.

Another subtle issue is faking <lastmod> dates. If you stamp every URL with today's date on each regeneration, the signal becomes meaningless and Google learns to distrust it. Only update the date when the page content genuinely changes. Accuracy is what makes the tag valuable for prioritizing recrawls.
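One way to keep the date honest is to tie it to a hash of the page content, so regenerating the sitemap never moves a date unless the content did. The sketch below assumes you can get the meaningful content of each page as a string; the store dictionary stands in for whatever persistence you use between runs.

```python
import hashlib
from datetime import date

def update_lastmod(store, url, content, today=None):
    """Return the <lastmod> date to publish for `url`.

    store maps url -> (content_hash, lastmod) and is updated in place.
    The date only moves when the content hash changes.
    """
    today = today or date.today().isoformat()
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    previous = store.get(url)
    if previous and previous[0] == digest:
        return previous[1]  # unchanged content keeps its old date
    store[url] = (digest, today)
    return today

if __name__ == "__main__":
    store = {}
    url = "https://example.com/guide"
    print(update_lastmod(store, url, "first draft", today="2026-01-01"))
    print(update_lastmod(store, url, "first draft", today="2026-02-01"))
    print(update_lastmod(store, url, "revised text", today="2026-03-01"))
```

Hash only the part of the page that matters to readers. If you hash the full HTML, a changed footer or timestamp widget would count as an update and defeat the purpose.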

Monitoring and Maintaining Your Sitemap

Treat your sitemap as a living file, not a one-time setup. Check the Search Console Sitemaps report monthly to confirm the file is being read, see the discovered-versus-indexed gap, and catch parsing errors early. A widening gap between submitted and indexed URLs often points to content quality or duplicate-content issues, not a sitemap fault.

Most modern content platforms generate and update sitemaps automatically, which keeps them fresh as you publish. If you manage one manually, build a regeneration step into your publishing workflow so new pages appear quickly and removed pages disappear. Pair this with a regular technical audit to keep crawl signals clean across the whole site.

Conclusion

An XML sitemap is a simple, high-leverage file that lists the canonical URLs you want search engines and AI crawlers to find. It does not force indexing, but it accelerates discovery, especially on large, new, or frequently updated sites. Keep it limited to indexable canonical pages, respect the 50,000 URL and 50 MB limits with index files, maintain honest <lastmod> dates, and ignore priority and changefreq since Google does too. Submit through Search Console and reference it in robots.txt, then monitor the discovered-versus-indexed gap each month. In 2026, a clean sitemap helps every engine, from Google to ChatGPT, spend its crawl effort on the pages that matter most. See how Sorank's GEO SEO audit checks your sitemap and crawl health automatically.

Frequently Asked Questions

Does an XML sitemap guarantee my pages will be indexed?

No. A sitemap helps search engines discover your URLs faster, but it does not force indexing. Google still decides whether each page is worth indexing based on content quality, duplication, and relevance. Think of a sitemap as a strong discovery signal, not a command. If pages stay in the discovered-but-not-indexed state, the issue is usually content quality or duplicate content, not the sitemap itself.

How many URLs can an XML sitemap contain?

A single sitemap file is limited to 50,000 URLs or 50 MB uncompressed, whichever comes first. If your site has more pages, you split them across multiple sitemap files and list those files in a sitemap index file. You then submit only the index URL. Most sites never reach these limits, but large ecommerce and publishing sites organize their pages this way for easier monitoring.

Should I set priority and changefreq values in my sitemap?

There is no benefit, because Google ignores both the priority and changefreq tags. The only optional tag Google uses is lastmod, and only if the date is consistently and verifiably accurate. Focus your effort on listing the right canonical URLs and keeping lastmod honest, updating it only when a page genuinely changes. Faking dates on every regeneration teaches Google to distrust the value entirely.
