Dedup Initial Implementation #889

base: main

Conversation
- resource dedup via page digest; page dedup via page digest check, blocking of dupe page (see the sketch after this list)
- indexing prep:
  - move WACZLoader to wacz for reuse
  - populate dedup index from remote WACZ / multi WACZ / multi-WACZ JSON
  - refactor: move WACZLoader to wacz to be shared with indexer
  - state: move hash-based dedup to RedisDedupIndex
  - cli args: add --minPageDedupDepth to indicate when pages are skipped for dedup
  - skip same URLs with same hash within same crawl
- update to warcio 2.4.6, write WARC-Payload-Digest along with WARC-Block-Digest for revisits; copy additional custom WARC headers to revisit from response
- bump version to 1.9.0; fix typo
- …t records === number of response records
- tests: add index import + dedup crawl to ensure digests match fully
- use pending queue to support retries in case of failure; store both id and actual URL in case URL changes in subsequent retries
- update timestamp after import
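To make the hash-based dedup above concrete, here is a minimal sketch of the kind of check a Redis-backed dedup index could perform: hash the response payload as sha256, record it the first time it is seen, and look up the original capture on later hits. The class name, key layout, and return shape here are illustrative assumptions, not the crawler's actual RedisDedupIndex API.

```ts
import { createHash } from "crypto";
import { Redis } from "ioredis";

// Hypothetical dedup index: maps "sha256:<hex>" -> "<timestamp> <original URL>".
class DedupIndex {
  constructor(
    private redis: Redis,
    private key = "dupe:hashes", // assumed key name, for illustration only
  ) {}

  // Record a hash the first time it is seen; return false if already present.
  async addIfNoDupe(hash: string, url: string, ts: string): Promise<boolean> {
    const added = await this.redis.hsetnx(this.key, hash, `${ts} ${url}`);
    return added === 1;
  }

  // Look up an earlier capture with the same payload digest, if any.
  async getHashDupe(hash: string): Promise<{ origUrl?: string; origTs?: string }> {
    const val = await this.redis.hget(this.key, hash);
    if (!val) {
      return {};
    }
    const [origTs, origUrl] = val.split(" ", 2);
    return { origUrl, origTs };
  }
}

// Digest helper matching the "sha256:<hex>" format used in the diff below.
export function payloadDigest(payload: Uint8Array): string {
  return "sha256:" + createHash("sha256").update(payload).digest("hex");
}
```

A recorder could then call `getHashDupe(payloadDigest(body))` and either write a revisit record or block the page, as in the diff below.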
if (
  url === this.pageUrl &&
  reqresp.payload &&
  this.minPageDedupDepth >= 0 &&
  this.pageSeedDepth >= this.minPageDedupDepth
) {
  const hash =
    "sha256:" + createHash("sha256").update(reqresp.payload).digest("hex");
  const { origUrl } = await this.crawlState.getHashDupe(hash);
  if (origUrl) {
    const errorReason = "BlockedByResponse";
    await cdp.send("Fetch.failRequest", {
      requestId,
      errorReason,
    });
    return true;
  }
}
Help me understand how this works in relation to the revisits that we're otherwise writing - when do we want to write a revisit vs. block the page from loading? (might be good to add a comment here as well)
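For reference while that question is open, here is a sketch of how the two paths described in the commit notes could be separated: duplicate subresources keep loading but get a revisit record pointing at the original capture, while duplicate top-level pages at or past --minPageDedupDepth are blocked outright. This split and the helper below are inferred from the commits above and are illustrative, not the crawler's actual logic.

```ts
// Illustrative only: one way the two dedup paths could be separated.
// - duplicate subresource: keep loading, write a revisit record whose
//   WARC-Payload-Digest matches the original capture
// - duplicate top-level page at or past --minPageDedupDepth: block the
//   navigation response so the page is not re-crawled at all
type DedupAction = "write-response" | "write-revisit" | "block-page";

function chooseDedupAction(opts: {
  isPageNavigation: boolean;
  pageSeedDepth: number;
  minPageDedupDepth: number;
  hasDupe: boolean;
}): DedupAction {
  if (!opts.hasDupe) {
    // first time this payload digest is seen: record it normally
    return "write-response";
  }
  if (
    opts.isPageNavigation &&
    opts.minPageDedupDepth >= 0 &&
    opts.pageSeedDepth >= opts.minPageDedupDepth
  ) {
    // duplicate page deep enough in the crawl: skip it entirely
    return "block-page";
  }
  // duplicate resource (or shallow page): point back at the original capture
  return "write-revisit";
}
```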
// if (!(await this.crawlState.addIfNoDupe(WRITE_DUPE_KEY, url, hash))) {
//   serializer.externalBuffer?.purge();
//   return false;
// }
Safe to remove?
    return await initRedis(redisUrl);
    break;
  } catch (e) {
    //logger.fatal("Unable to connect to state store Redis: " + redisUrl);
  }
export async function initRedisWaitForSuccess(redisUrl: string, retrySecs = 1) { | ||
  while (true) {
Wonder if we want to potentially add a timeout to this so it doesn't hang if redis can't resolve?
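One possible shape for that, assuming an overall deadline on top of the existing retry loop; `initRedis`, the retry interval, and the error handling below are stand-ins based on the snippet above, not the crawler's actual implementation.

```ts
import { Redis } from "ioredis";

// Hypothetical stand-in for the crawler's initRedis() helper.
async function initRedis(redisUrl: string): Promise<Redis> {
  const redis = new Redis(redisUrl, { lazyConnect: true });
  await redis.connect();
  return redis;
}

// Retry connecting until success, but give up after an overall deadline
// so the caller doesn't hang forever if Redis never resolves.
export async function initRedisWaitForSuccess(
  redisUrl: string,
  retrySecs = 1,
  maxWaitSecs = 300,
): Promise<Redis> {
  const deadline = Date.now() + maxWaitSecs * 1000;

  while (true) {
    try {
      return await initRedis(redisUrl);
    } catch {
      if (Date.now() >= deadline) {
        throw new Error(
          `Unable to connect to Redis at ${redisUrl} within ${maxWaitSecs}s`,
        );
      }
      // wait before the next attempt
      await new Promise((resolve) => setTimeout(resolve, retrySecs * 1000));
    }
  }
}
```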
sourceDone = "src:d"; | ||
sourceQ = "src:q"; | ||
pendingQ = "pending:q"; | ||
sourceP = "src:p"; |
Is sourceP used? I'm not seeing it.
…-crawler into hash-based-dedup
…<date> <url>'
- entry for source index can contain the crawl id (or possibly wacz and crawl id)
- also store dependent sources in relation.requires in datapackage.json (see the sketch below)
- tests: update tests to check for relation.requires
Fixes #884
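To illustrate the datapackage.json change described above, here is a rough guess at the shape such an entry could take; the DedupSourceEntry layout, the field names beyond relation.requires, and the helper function are assumptions for illustration, not the format the crawler actually writes.

```ts
// Illustrative only: how a dedup crawl's datapackage.json could record the
// source WACZ files its revisit records depend on, per the commit note above.
interface DedupSourceEntry {
  ts: string; // "<date>" part of the "<date> <url>" source index entry
  url: string; // "<url>" part of the entry
  crawlId?: string; // possibly the crawl id (or WACZ and crawl id)
}

interface DataPackageRelation {
  // source crawls/WACZ files required to replay this crawl's revisits
  requires: DedupSourceEntry[];
}

// Merge a relation.requires section into an existing datapackage object.
function addRequiredSources(
  datapackage: Record<string, unknown>,
  sources: DedupSourceEntry[],
): Record<string, unknown> {
  const relation: DataPackageRelation = { requires: sources };
  return { ...datapackage, relation };
}
```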