Replies: 1 comment
-
Hi, in a similar situation I used an ETS table to keep additional info keyed by URL. In pipelines, you can match it by response (HTTPoison.Response contains the "request").
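A minimal sketch of that idea. The `JobContext` module and the spider names are illustrative (not part of Crawly's API); the only assumptions from the thread are that context is stored in ETS keyed by URL, and that the `HTTPoison.Response` passed to the spider carries the original request:

```elixir
defmodule JobContext do
  @table :job_context

  # Public named ETS table so both the process scheduling requests
  # and the spider worker processes can read/write it.
  def init, do: :ets.new(@table, [:named_table, :public, :set])

  # Store extra info (e.g. a DB job id) by URL before scheduling the request.
  def put(url, context), do: :ets.insert(@table, {url, context})

  def get(url) do
    case :ets.lookup(@table, url) do
      [{^url, context}] -> context
      [] -> nil
    end
  end
end

defmodule MySpider do
  use Crawly.Spider
  # ... init/0, base_url/0 omitted ...

  @impl Crawly.Spider
  def parse_item(response) do
    # HTTPoison.Response keeps the original request, so the URL
    # can be recovered and matched against the ETS table.
    _context = JobContext.get(response.request.url)
    # ... extract items here and tag them with the job id from the context ...
    %Crawly.ParsedItem{items: [], requests: []}
  end
end
```

Any follow-up requests extracted in `parse_item/1` would get their own `JobContext.put/2` call before being returned, so the job id travels with every URL of the crawl.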
-
Hi, we have a use case where a set of URLs is stored in a DB. A process fetches a batch of URLs from the DB and sends it to the crawler. After the crawler finishes the job, we want to mark the fetched-at timestamp in the DB. Let's imagine this job's id is 1.
But the problem is, one of these pages may contain links that we need to crawl further to obtain more information. Imagine we find 6 links on the first page: we need to crawl all 6 pages and extract their information before marking this job (job id 1) done. Extracting links from the first page and crawling them with Crawly is not a problem. My problem is passing our context (we have an id in the DB for each entity, and each crawl is associated with that entity) so that at some point we can mark the job done in the DB.
What should be my approach with Crawly?