WARCHdfsBolt forwarding WARC file path to StatusUpdaterBolt #1566
Replies: 4 comments
-
Hi @michaeldinzinger, this overlaps with #567 and recently I started to explore potential ways to implement a CDX indexer:
Given that there is more general interest, I'll continue to explore variant 1, but I cannot promise when or whether this will succeed. Any suggestions or help are welcome!
-
Hello @sebastian-nagel, thank you for your answer! :) Personally, I would really appreciate this, because knowing the WARC record location is an important (but not central) aspect of our use of StormCrawler. I would therefore also be willing to look into this issue at some point. What a pity that the HdfsBolt is built as a dead end.
-
Another thing that came up on our end regarding this issue: we want to trigger further processing of a WARC file once it has been completely written. So I'm wondering whether the crawler can signal to us that a WARC file is ready.
-
You could just check the filesystem for new files from time to time. This seems reasonable since WARC files usually hold several tens of thousands of records and, consequently, aren't finished very often.
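A minimal sketch of that polling idea, assuming the WARC files land in a directory reachable through `java.nio` (for S3 or HDFS the listing would have to go through the corresponding client API instead). The class name `WarcDirectoryPoller`, the `*.warc.gz` glob and the "quiet period" heuristic (treat a file as finished once it has not been modified for a while) are assumptions made for illustration, not something StormCrawler provides.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: reports each WARC file once, after it has stopped changing.
final class WarcDirectoryPoller {
    private final Path directory;
    private final Duration quietPeriod;
    // Files already handed to the caller, so that each one is reported only once.
    private final Set<Path> alreadyReported = new HashSet<>();

    WarcDirectoryPoller(Path directory, Duration quietPeriod) {
        this.directory = directory;
        this.quietPeriod = quietPeriod;
    }

    // Returns the *.warc.gz files not modified during the last quietPeriod
    // (measured against 'now') and not returned by an earlier call.
    List<Path> poll(Instant now) throws IOException {
        List<Path> finished = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                Files.newDirectoryStream(directory, "*.warc.gz")) {
            for (Path file : stream) {
                Instant lastModified = Files.getLastModifiedTime(file).toInstant();
                // Assumption: no writes for quietPeriod means the file is complete.
                boolean quiet = !lastModified.plus(quietPeriod).isAfter(now);
                if (quiet && alreadyReported.add(file)) {
                    finished.add(file);
                }
            }
        }
        Collections.sort(finished); // deterministic order for the caller
        return finished;
    }
}
```

Calling `poll(Instant.now())` from a scheduled task every few minutes would then yield the files that are ready for further processing. The quiet period should be chosen generously, because a slow crawl may leave an unfinished file untouched for some time.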
-
Hello all,
as far as I understand, the WARCHdfsBolt produces a continuous stream of records in WARC format. The resulting WARC files are written to, e.g., an S3-compatible storage according to some RotationPolicy and FilenameFormat. Within the Storm topology, the WARCHdfsBolt is a dead end and does not emit any tuples.
However, we are especially interested in knowing which file (filename) a given web page / WARC record is written to, and we would like to forward this information to the index, e.g. an OpenSearch/Elasticsearch instance.
So that we know in the end: https://stormcrawler.net/faq/ --is_stored_in--> s3://path/to/file/WARC_file_0815.warc.gz
Is this reasonable and technically possible? Probably only if the WARCHdfsBolt emits the corresponding information to the StatusUpdaterBolt and is no longer a dead end.