
Conversation

@ikreymer (Member) commented Oct 1, 2025

Fixes #884

  • Support for hash-based deduplication via a Redis provided with --redisDedupUrl (can be the same as the default Redis)
  • Support for writing WARC revisit records for duplicates (see the sketch after this list)
  • Support for a new indexer mode which imports CDXJ from one or more WACZs (refactored from replay) to populate the dedup index
  • Initial support for page-level dedup (preempting loading of entire pages when the HTML is an exact duplicate)
  • Option to set --minPageDedupDepth for page-level dedup
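
A rough sketch of how the hash-based flow could fit together, assuming a dedup index with the getHashDupe lookup used in the diff below; the DupeEntry shape and the addHash, writeRevisit and writeResponse helpers are placeholders for illustration, not the crawler's actual API:

import { createHash } from "crypto";

// Hypothetical shape of a dedup index entry (illustration only)
interface DupeEntry {
  origUrl?: string;
  origDate?: string;
}

// Hypothetical dedup index interface; getHashDupe mirrors the call in the diff
interface DedupIndex {
  getHashDupe(hash: string): Promise<DupeEntry>;
  addHash(hash: string, url: string, date: string): Promise<void>;
}

// Sketch: hash the payload, then either write a revisit record pointing at the
// original capture (duplicate) or write the full response and index the digest
async function dedupOrRecord(
  index: DedupIndex,
  url: string,
  date: string,
  payload: Uint8Array,
  writeRevisit: (url: string, origUrl: string, origDate: string, hash: string) => Promise<void>,
  writeResponse: (url: string, payload: Uint8Array, hash: string) => Promise<void>,
): Promise<void> {
  const hash = "sha256:" + createHash("sha256").update(payload).digest("hex");
  const { origUrl, origDate } = await index.getHashDupe(hash);
  if (origUrl && origDate) {
    // duplicate payload: record a WARC revisit instead of the full response
    await writeRevisit(url, origUrl, origDate, hash);
  } else {
    // new payload: record the full response and register its digest
    await writeResponse(url, payload, hash);
    await index.addHash(hash, url, date);
  }
}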

- resource dedup via page digest
- page dedup via page digest check, blocking of dupe page
indexing prep:
- move WACZLoader to wacz for reuse
- populate dedup index from remote wacz/multi wacz/multiwacz json

refactor:
- move WACZLoader to wacz to be shared with indexer
- state: move hash-based dedup to RedisDedupIndex

cli args:
- add --minPageDedupDepth to set the minimum depth at which duplicate pages are skipped

- skip repeated captures of the same URL with the same hash within the same crawl
- update to warcio 2.4.6, write WARC-Payload-Digest along with WARC-Block-Digest for revisits
- copy additional custom WARC headers to revisit from response
bump version to 1.9.0
fix typo
tests: add index import + dedup crawl to ensure digests match fully
use pending queue to support retries in case of failure (sketched below)
store both the id and the actual URL in case the URL changes on subsequent retries
update timestamp after import
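
A rough sketch of the pending-queue retry pattern described above, using the src:q, pending:q and src:d keys that appear later in this diff; the JSON entry format and the importOne callback are assumptions for illustration, not the indexer's actual code:

import Redis from "ioredis";

// Sketch: pop a source entry into a pending queue while it is processed, so a
// failed import can be retried instead of being lost.
async function importNextSource(
  redis: Redis,
  importOne: (entry: { id: string; url: string }) => Promise<void>,
): Promise<boolean> {
  // atomically move the next entry from the source queue to the pending queue
  const raw = await redis.rpoplpush("src:q", "pending:q");
  if (!raw) {
    return false;
  }
  // keep both the id and the actual URL, since the URL may change on retry
  const entry = JSON.parse(raw) as { id: string; url: string };
  try {
    await importOne(entry);
    // success: drop the entry from the pending queue and mark the source done
    await redis.lrem("pending:q", 1, raw);
    await redis.sadd("src:d", entry.id);
  } catch (e) {
    // failure: push the entry back onto the source queue for a later retry
    await redis.lrem("pending:q", 1, raw);
    await redis.lpush("src:q", raw);
  }
  return true;
}

Keeping the id stable while allowing the URL to change presumably lets a retry pick up an updated URL for the same source WACZ.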
@ikreymer requested a review from tw4l on October 1, 2025 06:17
Comment on lines 826 to 843
if (
  url === this.pageUrl &&
  reqresp.payload &&
  this.minPageDedupDepth >= 0 &&
  this.pageSeedDepth >= this.minPageDedupDepth
) {
  const hash =
    "sha256:" + createHash("sha256").update(reqresp.payload).digest("hex");
  const { origUrl } = await this.crawlState.getHashDupe(hash);
  if (origUrl) {
    const errorReason = "BlockedByResponse";
    await cdp.send("Fetch.failRequest", {
      requestId,
      errorReason,
    });
    return true;
  }
}
Help me understand how this works in relation to the revisits that we're otherwise writing - when do we want to write a revisit vs. block the page from loading? (might be good to add a comment here as well)

Comment on lines +1644 to +1647
// if (!(await this.crawlState.addIfNoDupe(WRITE_DUPE_KEY, url, hash))) {
// serializer.externalBuffer?.purge();
// return false;
// }
Safe to remove?

    return await initRedis(redisUrl);
    break;
  } catch (e) {
    //logger.fatal("Unable to connect to state store Redis: " + redisUrl);
Suggested change (remove this commented-out line):
//logger.fatal("Unable to connect to state store Redis: " + redisUrl);

}

export async function initRedisWaitForSuccess(redisUrl: string, retrySecs = 1) {
  while (true) {
Wonder if we want to potentially add a timeout to this so it doesn't hang if redis can't resolve?
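
One possible shape for that suggestion, purely as a sketch on top of the initRedis function shown above; the timeoutSecs parameter, the function name, and the error behavior are assumptions, not what the PR does:

// Sketch: same retry loop as above, but give up after timeoutSecs instead of
// looping forever if Redis never becomes reachable.
export async function initRedisWaitForSuccessWithTimeout(
  redisUrl: string,
  retrySecs = 1,
  timeoutSecs = 60,
) {
  const deadline = Date.now() + timeoutSecs * 1000;
  while (Date.now() < deadline) {
    try {
      return await initRedis(redisUrl);
    } catch (e) {
      // not ready yet: wait and retry
      await new Promise((resolve) => setTimeout(resolve, retrySecs * 1000));
    }
  }
  throw new Error(`Unable to connect to Redis at ${redisUrl} within ${timeoutSecs}s`);
}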

sourceDone = "src:d";
sourceQ = "src:q";
pendingQ = "pending:q";
sourceP = "src:p";
Is sourceP used? I'm not seeing it.

…<date> <url>'

- entry for the source index can contain the crawl id (or possibly the WACZ and crawl id)
- also store dependent sources in relation.requires in datapackage.json (illustrated below)
- tests: update tests to check for relation.requires
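
The exact datapackage.json shape isn't shown in this diff; purely as an illustration of the idea, a relation.requires entry might list the source WACZs (and crawl ids) the deduplicated crawl depends on. Every field name and value here other than relation.requires itself is a guess:

{
  "relation": {
    "requires": [
      {
        "crawlId": "example-prior-crawl",
        "url": "https://example.com/collections/prior-crawl.wacz"
      }
    ]
  }
}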


Development

Successfully merging this pull request may close these issues.

Deduplication (Initial Support).
