feat(utils): add `discoverValidSitemaps` utility #3339

foxt451 · 2026-01-15T13:04:52Z

Related to apify/apify-sdk-js#486. I'm developing a generic sitemap scraper, and it's going to share a big utility function (the main chunk of logic) with WCC: `discoverValidSitemaps`. I asked @barjin if I could factor it out, and he said this util could fit into Crawlee. It's mainly copied from WCC, but to keep the dependencies unchanged, it uses got-scraping instead of impit to check whether a URL exists (I think it doesn't matter for sitemaps), and `urlExists` is inlined (until we add an HTTP client to these utils in v4, as @barjin told me). It's also turned into an async generator. Let me know if you see a better place for this util.
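For readers unfamiliar with the async-generator shape, here is a minimal sketch of the idea, not the code in this PR. It assumes a hypothetical `discoverSitemapsSketch` function and takes the existence check as an injected `urlExists` parameter (in the PR it is inlined and backed by got-scraping). It only probes the well-known paths; the `robots.txt` lookup and the check of the input URLs themselves are left out.

```typescript
// Hypothetical signature: the PR inlines its existence check; here it is injected
// so the sketch runs without any HTTP client.
type UrlExists = (url: string) => Promise<boolean>;

// Well-known locations probed for every domain (assumed list for this sketch).
const WELL_KNOWN_PATHS = ['/sitemap.xml', '/sitemap.txt'];

async function* discoverSitemapsSketch(
    urls: string[],
    urlExists: UrlExists,
): AsyncGenerator<string> {
    // Reduce the input URLs to their unique origins, keeping first-seen order.
    const origins = Array.from(new Set(urls.map((url) => new URL(url).origin)));

    for (const origin of origins) {
        for (const path of WELL_KNOWN_PATHS) {
            const candidate = `${origin}${path}`;
            // Yield each sitemap as soon as it is confirmed, so the caller
            // can start processing before the remaining probes finish.
            if (await urlExists(candidate)) yield candidate;
        }
    }
}
```

A caller would consume this with `for await (const sitemapUrl of discoverSitemapsSketch(urls, urlExists))`, which is the main practical difference from returning a fully built array.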

barjin

LGTM @foxt451, thank you!

I'll let @janbuchar have a second look as the original author of this, but if it's a direct port from WCC, I think we can merge safely.

I have just these nits:

barjin · 2026-01-16T10:38:10Z

packages/utils/test/sitemap.test.ts

+    it('extracts sitemap from robots.txt', async () => {
+        nock('http://sitemap-discovery.com')
+            .get('/robots.txt')
+            .reply(200, 'Sitemap: http://sitemap-discovery.com/sitemap.xml')


Can we change this so the robots.txt-referenced sitemap is not the well-known /sitemap.xml? This example passes even if robots.txt is missing (see test below).
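To illustrate the concern, here is a small sketch under stated assumptions; it is not the parser used in this PR. `sitemapUrlsFromRobotsTxt` and `isWellKnownSitemap` are hypothetical helpers, and `/custom-sitemap.xml` is a made-up path. The point is that a fixture only exercises the `robots.txt` branch when the referenced sitemap is not also one of the well-known locations.

```typescript
// Well-known sitemap paths that would be probed regardless of robots.txt
// (assumed list for this sketch).
const WELL_KNOWN_SITEMAP_PATHS = ['/sitemap.xml', '/sitemap.txt'];

// Collect the URLs from `Sitemap:` lines; the field name is matched case-insensitively.
function sitemapUrlsFromRobotsTxt(robotsTxt: string): string[] {
    const urls: string[] = [];
    for (const line of robotsTxt.split(/\r?\n/)) {
        const match = /^\s*sitemap\s*:\s*(\S+)/i.exec(line);
        if (match && match[1]) urls.push(match[1]);
    }
    return urls;
}

// True when the URL would be found even without reading robots.txt.
function isWellKnownSitemap(url: string): boolean {
    return WELL_KNOWN_SITEMAP_PATHS.includes(new URL(url).pathname);
}

// A fixture that can only pass via the robots.txt branch.
const robotsFixture = [
    'User-agent: *',
    'Disallow: /private',
    'Sitemap: http://sitemap-discovery.com/custom-sitemap.xml',
].join('\n');

const referenced = sitemapUrlsFromRobotsTxt(robotsFixture);
const fixtureIsMeaningful = referenced.every((url) => !isWellKnownSitemap(url));
```

With the fixture currently in the test, the same check would be false, because `/sitemap.xml` is discovered by the default probe as well.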

barjin · 2026-01-16T10:43:54Z

packages/utils/src/internals/sitemap.ts

+
+/**
+ * Given a list of URLs, discover related sitemap files for these domains by checking the `robots.txt` file,
+ * the default `sitemap.xml` file and the URLs themselves.


nit: this doesn't mention sitemap.txt

foxt451 added 5 commits January 15, 2026 15:01

feat(utils): add discoverValidSitemaps utility

d5cec84

fix: remove circular deps

8c3de80

fix: remove unused imports

8c20634

fix: use reduce instead of groupBy for Nodev18

aa2161c

chore: formatting

6ef28e3

foxt451 marked this pull request as ready for review January 16, 2026 10:31

foxt451 requested a review from B4nan January 16, 2026 10:31

barjin reviewed Jan 16, 2026

chore: pr remarks

7388f72

foxt451 requested review from janbuchar and removed request for B4nan January 16, 2026 18:34
