Replies: 1 comment
-
Hi, in a similar situation I used an ETS table to keep additional info keyed by URL. In pipelines, you can match it by response (HTTPoison.Response contains the "request").
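A minimal sketch of that idea. The `JobContext` module and the spider names are illustrative (not part of Crawly's API); the only assumptions from the thread are that context is stored in ETS keyed by URL, and that the `HTTPoison.Response` passed to the spider carries the original request:

```elixir
defmodule JobContext do
  @table :job_context

  # Public named ETS table so both the process scheduling requests
  # and the spider worker processes can read/write it.
  def init, do: :ets.new(@table, [:named_table, :public, :set])

  # Store extra info (e.g. a DB job id) by URL before scheduling the request.
  def put(url, context), do: :ets.insert(@table, {url, context})

  def get(url) do
    case :ets.lookup(@table, url) do
      [{^url, context}] -> context
      [] -> nil
    end
  end
end

defmodule MySpider do
  use Crawly.Spider
  # ... init/0, base_url/0 omitted ...

  @impl Crawly.Spider
  def parse_item(response) do
    # HTTPoison.Response keeps the original request, so the URL
    # can be recovered and matched against the ETS table.
    _context = JobContext.get(response.request.url)
    # ... extract items here and tag them with the job id from the context ...
    %Crawly.ParsedItem{items: [], requests: []}
  end
end
```

Any follow-up requests extracted in `parse_item/1` would get their own `JobContext.put/2` call before being returned, so the job id travels with every URL of the crawl.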
-
Hi, we have a use case where a set of URLs is stored in a DB. A process fetches a batch of URLs from the DB and sends it to the crawler. After the crawler finishes the job, we want to mark the fetched-at timestamp in the DB. Let's imagine this job's id is 1.
But the problem is, one of these pages may contain links that we need to crawl further to obtain more information. Imagine we find 6 links on the first page: we need to crawl all 6 pages and extract their information before marking this job (job id 1) done. Extracting links from the first page and crawling them with Crawly is not a problem. My problem is passing our context (we have an id in the DB for each entity, and each crawl is associated with that entity) so that at some point we can mark the job done in the DB.
What should be my approach with Crawly?