With this solution, you would need to spin up your own Node server in your infrastructure and then use a couple of packages to generate multiple sitemaps.
First, use the sitemap-urls npm package to pull every URL out of Webflow’s auto-generated sitemap. It parses the sitemap XML and returns the URLs as an array you can store, as sketched below.
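Here’s a minimal sketch of that step, assuming Node 18+ (for the built-in fetch) and the package’s documented extractUrls helper; swap in your own Webflow domain:

const sitemapUrls = require("sitemap-urls");

// fetch Webflow's auto-generated sitemap and pull out every <loc> URL
async function getExistingUrls() {
  const res = await fetch("https://example.com/sitemap.xml"); // your Webflow domain
  const xml = await res.text();
  return sitemapUrls.extractUrls(xml); // returns an array of URL strings
}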
Then you can use the sitemap npm package and feed it that array to generate multiple sitemaps. As the package’s documentation notes, it supports more than 50k URLs by writing a sitemap index that points at multiple sitemap files. Here’s an example code snippet:
const { SitemapAndIndexStream, SitemapStream } = require("sitemap");
const { createGzip } = require("zlib");
const { createWriteStream } = require("fs");
const { resolve } = require("path");

const sms = new SitemapAndIndexStream({
  limit: 50000, // max URLs per sitemap file; defaults to 45k if omitted
  lastmodDateOnly: false, // print date not time
  // SitemapAndIndexStream will call this user-provided function every time
  // it needs to create a new sitemap file. You merely need to return a stream
  // for it to write the sitemap urls to and the expected url where that sitemap will be hosted
  getSitemapStream: (i) => {
    const sitemapStream = new SitemapStream({
      hostname: "https://example.com",
    });
    // if your server automatically serves sitemap.xml.gz when requesting sitemap.xml, leave this line be;
    // otherwise you will need to add .gz here and remove it a couple lines below so that both the index
    // and the actual file have a .gz extension
    const path = `./sitemap-${i}.xml`;
    const ws = sitemapStream
      .pipe(createGzip()) // compress the output of the sitemap
      .pipe(createWriteStream(resolve(path + ".gz"))); // write it to sitemap-NUMBER.xml.gz
    return [
      new URL(path, "https://example.com/subdir/").toString(),
      sitemapStream,
      ws,
    ];
  },
});
Note the limit option in the example above: it caps how many URLs go into each individual sitemap file, and it defaults to 45k if you leave it out, which keeps you safely under the 50k-per-file maximum in the sitemap spec.
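To connect the two steps, pipe the index output to disk and write the collected URLs into the stream. A minimal sketch, assuming the urls array from the extraction step above:

// write the sitemap index to disk as the stream produces it
sms.pipe(createWriteStream(resolve("./sitemap-index.xml")));

// feed in every collected URL; sms splits them across files at the limit
urls.forEach((url) => sms.write({ url }));
sms.end();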
Once your sitemaps are created, you can use the reverse proxy you’re already building for the 301s to proxy the sitemap index into place on your domain, then update Google Search Console with the new sitemap address.
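If you need a simple way to expose the generated files for the proxy to pick up, a static file server is enough. A sketch assuming Express, which is just one option — any static server works:

const express = require("express"); // assumption: Express, but any static server works
const app = express();

// expose sitemap-index.xml and the sitemap-*.xml.gz files written above;
// assumes the generator script and this server share a working directory
app.use(express.static(process.cwd(), { index: false }));

app.listen(3000, () => console.log("serving sitemaps on :3000"));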
With our API v2, we have a Pages API that would allow you to take a similar approach, but instead of relying on the auto-generated sitemap you would build the URL list from the API itself. That gives you room to write logic that leaves specific pages or CMS items out of your map. Use the API to fetch your static pages and your CMS items, then store them in an array. The API returns relative slugs, so prepend your domain to each one as you store it; a rough sketch follows.
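Here’s a rough sketch of that collection step, again assuming Node 18+ for fetch. The endpoint paths follow the v2 reference (List Pages, List Collection Items), but the response field names and the /blog/ collection path are assumptions to verify against your own site, and pagination is elided for brevity:

// field names below (pages[].slug, items[].fieldData.slug, isDraft/isArchived)
// are assumptions — verify them against your own API responses
const WEBFLOW_TOKEN = process.env.WEBFLOW_TOKEN;

async function getJson(path) {
  const res = await fetch(`https://api.webflow.com/v2${path}`, {
    headers: { Authorization: `Bearer ${WEBFLOW_TOKEN}` },
  });
  if (!res.ok) throw new Error(`Webflow API ${res.status} for ${path}`);
  return res.json();
}

async function collectUrls(siteId, collectionId) {
  const urls = [];

  // static pages come back with relative slugs, so prepend your domain
  const { pages } = await getJson(`/sites/${siteId}/pages`);
  for (const page of pages) {
    urls.push(`https://example.com/${page.slug}`);
  }

  // CMS items — this is where you can write logic to leave items out
  const { items } = await getJson(`/collections/${collectionId}/items`);
  for (const item of items) {
    if (item.isDraft || item.isArchived) continue; // skip unpublished items
    urls.push(`https://example.com/blog/${item.fieldData.slug}`); // hypothetical collection path
  }

  return urls;
}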
Now that you have a list of pages, you can use the same npm package (sitemap) and then feed it the list of URLs and let it generate multiple sitemaps that you can then proxy into place.
Once everything is in place, schedule the whole pipeline with a cron job so the sitemaps regenerate on a regular basis; from there the process should be fairly hands-off.
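For example, a nightly crontab entry along these lines (the script name and paths are hypothetical) would rebuild everything at 2 AM:

# rebuild the sitemaps every night at 02:00; paths are hypothetical
0 2 * * * cd /srv/sitemaps && /usr/bin/node generate-sitemaps.js >> generate.log 2>&1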