-
Yes, you can modify URLs and generate additional links using JavaScript by modifying the DOM. I've had a similar use case myself: I archive RFC pages, and many sites link to different renderers or variations. Ideally, I'd like a way to apply URL transformations dynamically while browsing rather than during crawling, so new rewrite rules could be added without needing to re-crawl everything.
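To make this concrete, here is a minimal sketch of view-time link rewriting in the DOM. The rule table and the reddit pattern are illustrative assumptions, not part of any real crawler config:

```javascript
// Hypothetical rules: [pattern, replacement] pairs applied at view time,
// so adding a rule later does not require re-crawling anything.
const rules = [
  [/^https:\/\/www\.reddit\.com\/(.*)$/, "https://old.reddit.com/$1"],
];

function rewriteUrl(url) {
  for (const [pattern, replacement] of rules) {
    if (pattern.test(url)) return url.replace(pattern, replacement);
  }
  return url; // no rule matched; leave the URL unchanged
}

// In a browser context you would then apply the rules to every anchor:
// document.querySelectorAll("a[href]").forEach(a => { a.href = rewriteUrl(a.href); });
```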
-
Another feature request (unless this can already be done; I couldn't find it in the docs, but maybe the JavaScript support can do it).
I'd love to be able to specify a set of regular-expression transformations to apply to a URL before it is crawled and scraped. By this I mean that when a page is crawled and a list of links is produced, before those links are loaded into the crawler queue, there is the option to "transform" each URL based on rules I create.
Here are a few examples of how I would use this:
Such a thing could also be used to remove tracking query parameters, I suppose.
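That tracking-parameter case could look something like the sketch below; the parameter list is an assumption, not an exhaustive set:

```javascript
// Sketch: normalize a URL by dropping common tracking parameters before it
// is queued. The set of parameters checked here is illustrative only.
function stripTracking(url) {
  const u = new URL(url);
  // Copy the keys first, since we delete while iterating.
  for (const key of [...u.searchParams.keys()]) {
    if (key.startsWith("utm_") || key === "fbclid" || key === "gclid") {
      u.searchParams.delete(key);
    }
  }
  return u.toString();
}
```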
I imagine a spot to put a list of rules that looks like:
URL Rewrite: "^https://www\.reddit\.com/(.*)" → "https://old.reddit.com/$1"
Bonus points if I can use this to take one URL and create multiple that are queued for crawling:
URL Rewrite: "^https://old\.reddit\.com/(.*)" → "https://old.reddit.com/$1", "https://old.reddit.com/$1.json"
^ That would let me archive both the rendered HTML and the JSON from a single rule.
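The one-to-many case above can be sketched as a function that returns a list of URLs for the queue rather than a single rewrite. The `.json` twin and the pattern are assumptions for illustration:

```javascript
// Sketch of a one-to-many rewrite: a single matched URL yields several URLs
// for the crawl queue (here, the rendered page plus a hypothetical .json twin).
function expandUrl(url) {
  const m = url.match(/^https:\/\/old\.reddit\.com\/(.+?)\/?$/);
  if (m) {
    return [url, `https://old.reddit.com/${m[1]}.json`];
  }
  return [url]; // unmatched URLs pass through as a single-element list
}
```

A crawler integrating this would simply flat-map `expandUrl` over the extracted links before enqueueing them.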