
Conversation

@ikreymer (Member) commented Oct 1, 2025

Fixes #884

  • Support for hash-based deduplication via a Redis provided with --redisDedupUrl (can be the same as the default Redis)
  • Support for writing WARC revisit records for duplicates (see the sketch after this list)
  • Support for a new indexer mode which imports CDXJ from one or more WACZs (refactored from replay) to populate the dedup index
  • Initial support for page-level dedup (preempting loading of entire pages when the HTML is an exact duplicate)
  • Option to set --minPageDedupDepth for page-level dedup
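
A rough sketch of how the hash-based flow could fit together, assuming a dedup index with the getHashDupe lookup used in the diff below; the DupeEntry shape and the addHash, writeRevisit and writeResponse helpers are placeholders for illustration, not the crawler's actual API:

import { createHash } from "crypto";

// Hypothetical shape of a dedup index entry (illustration only)
interface DupeEntry {
  origUrl?: string;
  origDate?: string;
}

// Hypothetical dedup index interface; getHashDupe mirrors the call in the diff
interface DedupIndex {
  getHashDupe(hash: string): Promise<DupeEntry>;
  addHash(hash: string, url: string, date: string): Promise<void>;
}

// Sketch: hash the payload, then either write a revisit record pointing at the
// original capture (duplicate) or write the full response and index the digest
async function dedupOrRecord(
  index: DedupIndex,
  url: string,
  date: string,
  payload: Uint8Array,
  writeRevisit: (url: string, origUrl: string, origDate: string, hash: string) => Promise<void>,
  writeResponse: (url: string, payload: Uint8Array, hash: string) => Promise<void>,
): Promise<void> {
  const hash = "sha256:" + createHash("sha256").update(payload).digest("hex");
  const { origUrl, origDate } = await index.getHashDupe(hash);
  if (origUrl && origDate) {
    // duplicate payload: record a WARC revisit instead of the full response
    await writeRevisit(url, origUrl, origDate, hash);
  } else {
    // new payload: record the full response and register its digest
    await writeResponse(url, payload, hash);
    await index.addHash(hash, url, date);
  }
}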

- resource dedup via page digest
- page dedup via page digest check, blocking of dupe page
indexing prep:
- move WACZLoader to wacz for reuse
- populate dedup index from remote wacz/multi wacz/multiwacz json

refactor:
- move WACZLoader to wacz to be shared with indexer
- state: move hash-based dedup to RedisDedupIndex

cli args:
- add --minPageDedupDepth to set the minimum depth at which duplicate pages are skipped

- skip repeated captures of the same URL with the same hash within the same crawl
- update to warcio 2.4.6, write WARC-Payload-Digest along with WARC-Block-Digest for revisits
- copy additional custom WARC headers to revisit from response
bump version to 1.9.0
fix typo
tests: add index import + dedup crawl to ensure digests match fully
use pending queue to support retries in case of failure (sketched below)
store both the id and the actual URL in case the URL changes on subsequent retries
update timestamp after import
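
A rough sketch of the pending-queue retry pattern described above, using the src:q, pending:q and src:d keys that appear later in this diff; the JSON entry format and the importOne callback are assumptions for illustration, not the indexer's actual code:

import Redis from "ioredis";

// Sketch: pop a source entry into a pending queue while it is processed, so a
// failed import can be retried instead of being lost.
async function importNextSource(
  redis: Redis,
  importOne: (entry: { id: string; url: string }) => Promise<void>,
): Promise<boolean> {
  // atomically move the next entry from the source queue to the pending queue
  const raw = await redis.rpoplpush("src:q", "pending:q");
  if (!raw) {
    return false;
  }
  // keep both the id and the actual URL, since the URL may change on retry
  const entry = JSON.parse(raw) as { id: string; url: string };
  try {
    await importOne(entry);
    // success: drop the entry from the pending queue and mark the source done
    await redis.lrem("pending:q", 1, raw);
    await redis.sadd("src:d", entry.id);
  } catch (e) {
    // failure: push the entry back onto the source queue for a later retry
    await redis.lrem("pending:q", 1, raw);
    await redis.lpush("src:q", raw);
  }
  return true;
}

Keeping the id stable while allowing the URL to change presumably lets a retry pick up an updated URL for the same source WACZ.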
@ikreymer requested a review from tw4l on October 1, 2025 06:17
Comment on lines 826 to 843
if (
  url === this.pageUrl &&
  reqresp.payload &&
  this.minPageDedupDepth >= 0 &&
  this.pageSeedDepth >= this.minPageDedupDepth
) {
  const hash =
    "sha256:" + createHash("sha256").update(reqresp.payload).digest("hex");
  const { origUrl } = await this.crawlState.getHashDupe(hash);
  if (origUrl) {
    const errorReason = "BlockedByResponse";
    await cdp.send("Fetch.failRequest", {
      requestId,
      errorReason,
    });
    return true;
  }
}
Help me understand how this works in relation to the revisits that we're otherwise writing - when do we want to write a revisit vs. block the page from loading? (might be good to add a comment here as well)

Comment on lines +1644 to +1647
// if (!(await this.crawlState.addIfNoDupe(WRITE_DUPE_KEY, url, hash))) {
// serializer.externalBuffer?.purge();
// return false;
// }
Safe to remove?

    return await initRedis(redisUrl);
    break;
  } catch (e) {
    //logger.fatal("Unable to connect to state store Redis: " + redisUrl);
Suggested change (remove this commented-out line):
//logger.fatal("Unable to connect to state store Redis: " + redisUrl);

}

export async function initRedisWaitForSuccess(redisUrl: string, retrySecs = 1) {
  while (true) {
Wonder if we want to potentially add a timeout to this so it doesn't hang if redis can't resolve?
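
One possible shape for that suggestion, purely as a sketch on top of the initRedis function shown above; the timeoutSecs parameter, the function name, and the error behavior are assumptions, not what the PR does:

// Sketch: same retry loop as above, but give up after timeoutSecs instead of
// looping forever if Redis never becomes reachable.
export async function initRedisWaitForSuccessWithTimeout(
  redisUrl: string,
  retrySecs = 1,
  timeoutSecs = 60,
) {
  const deadline = Date.now() + timeoutSecs * 1000;
  while (Date.now() < deadline) {
    try {
      return await initRedis(redisUrl);
    } catch (e) {
      // not ready yet: wait and retry
      await new Promise((resolve) => setTimeout(resolve, retrySecs * 1000));
    }
  }
  throw new Error(`Unable to connect to Redis at ${redisUrl} within ${timeoutSecs}s`);
}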

sourceDone = "src:d";
sourceQ = "src:q";
pendingQ = "pending:q";
sourceP = "src:p";
Is sourceP used? I'm not seeing it.

…<date> <url>'

- entry for the source index can contain the crawl id (or possibly the WACZ and crawl id)
- also store dependent sources in relation.requires in datapackage.json (illustrated below)
- tests: update tests to check for relation.requires
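
The exact datapackage.json shape isn't shown in this diff; purely as an illustration of the idea, a relation.requires entry might list the source WACZs (and crawl ids) the deduplicated crawl depends on. Every field name and value here other than relation.requires itself is a guess:

{
  "relation": {
    "requires": [
      {
        "crawlId": "example-prior-crawl",
        "url": "https://example.com/collections/prior-crawl.wacz"
      }
    ]
  }
}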


Development

Successfully merging this pull request may close these issues.

Deduplication (Initial Support).
