Katana doesn't check all input URLs for dn, rdn, fqdn modes #1383

mrschyte · 2025-08-10T16:00:10Z

katana version:

[INF] Current version: v1.2.1

Current Behavior:

When evaluating the dn, rdn, and fqdn scope modes, katana's scope manager checks links only against the domain currently being crawled.

This means that if two URLs are in scope (https://a.com and https://b.com) and there are cross-domain links, such as https://a.com/b -> https://b.com/resourceB or https://b.com/a -> https://a.com/resourceA, katana will fail to detect both resourceA and resourceB.

Expected Behavior:

Katana should loop through all input URLs, whether read from stdin or from the URL list / file, and treat a link as in scope if it matches any of them under the current mode (dn, rdn, or fqdn).
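
The difference between the two behaviors can be sketched as follows. This is only an illustrative model, not katana's code: the function names are hypothetical, and the comparison uses full hostnames, which is roughly what an fqdn-style check would do. The dn and rdn modes would compare a different part of the hostname.

```python
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercase hostname of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def in_scope_current_only(link: str, current_seed: str) -> bool:
    # Behavior described in the report: the link is compared only
    # against the input URL that is being crawled right now.
    return host_of(link) == host_of(current_seed)


def in_scope_any_input(link: str, seeds: list[str]) -> bool:
    # Requested behavior: the link is in scope if it matches any input URL.
    allowed = {host_of(s) for s in seeds}
    return host_of(link) in allowed


seeds = ["https://a.com", "https://b.com"]
link = "https://b.com/resourceB"  # found while crawling https://a.com/b

print(in_scope_current_only(link, "https://a.com"))  # False
print(in_scope_any_input(link, seeds))               # True
```

With the second check, a link to a third domain such as https://c.com is still rejected, so the crawl stays inside the input set.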

Steps To Reproduce:

Run katana with two input URLs whose pages link to each other across domains. Katana will fail to list the linked resources.

Anything else:

oxqnd · 2025-08-30T12:29:07Z

Hi, I would like to work on this issue if no one else is actively fixing it.
Can I take this up?

ehsandeep · 2025-08-30T12:45:04Z

ehsandeep
Aug 30, 2025
Maintainer

@oxqnd Thanks for the help! I'm not entirely sure about the issue itself.

@mrschyte, by default, crawling is scoped to the input domain to keep the output relevant.
If you want to include non-scoped URLs in the output, you can use:

-do, -display-out-scope Display external endpoints from scoped crawling

Or, if you want to disable host-based scoping entirely:

-ns, -no-scope Disable default scoped crawling

Let me know if these options don’t cover what you're trying to achieve.
I'm moving this to Discussion for now. If any new information comes up, we can convert it back to an Issue.

4 replies

mrschyte Aug 30, 2025
Author

I have a large number of domains that may have references between them, but I don't want to crawl outside these domains.
If I disable scoped crawling, katana would take forever to run.

The only workaround I could find is to create a long regular expression containing all input domains. It was surprising to me that links are evaluated against the currently scanned domain only. I thought it would make sense for katana to allow all domains in the input list when checking. This is essential in my use case as there are links that cannot be discovered from the same domain.

Implementing this is easy when the list of URLs is passed from a text file, but I'm not sure what the best option is when katana is run in "stdin mode", as this would require reading the whole input stream to get the domain list.

Perhaps adding a new command line flag for specifying the list of domains from a text file would be a good option.
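
As a rough sketch of what "reading the whole input stream" involves, the snippet below buffers every URL from a stream and collects the set of hostnames before any crawling would start. The function name `collect_hosts` is hypothetical and this is not how katana handles stdin; `io.StringIO` stands in for `sys.stdin` so the example is reproducible.

```python
import io
from typing import TextIO
from urllib.parse import urlsplit


def collect_hosts(stream: TextIO) -> tuple[list[str], set[str]]:
    """Read every URL from the stream and return (urls, hosts)."""
    urls: list[str] = []
    hosts: set[str] = set()
    for line in stream:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        urls.append(url)
        # Lines without a scheme (e.g. "a.com") have no parsed hostname
        # and contribute nothing to the host set in this sketch.
        host = urlsplit(url).hostname
        if host:
            hosts.add(host.lower())
    return urls, hosts


urls, hosts = collect_hosts(io.StringIO("https://a.com\nhttps://b.com/a\n\n"))
print(urls)           # ['https://a.com', 'https://b.com/a']
print(sorted(hosts))  # ['a.com', 'b.com']
```

The cost is that nothing can be crawled until the stream has ended, which is the concern with stdin mode raised above.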

ehsandeep Aug 30, 2025
Maintainer

> The only workaround I could find is to create a long regular expression containing all input domains. It was surprising to me that links are evaluated against the currently scanned domain only. I thought it would make sense for katana to allow all domains in the input list when checking. This is essential in my use case as there are links that cannot be discovered from the same domain.

Not sure I’m fully following, but you can configure the crawl scope or output based on your use case. The defaults support a common workflow, but you can customize them as needed.

I'm not clear on what specific implementation you're referring to, so I suggest reviewing the available options below. If those don't cover your needs, please share an example of the flow you're trying to achieve with existing flags.

SCOPE OPTIONS:
   -cs, -crawl-scope string[]        In-scope URL regex to be followed by the crawler
   -cos, -crawl-out-scope string[]   Out-of-scope URL regex to be excluded by the crawler
   -fs, -field-scope string          Predefined scope field (dn, rdn, fqdn) or custom regex (e.g. '(staging.com|prod.com)') (default: "rdn")
   -ns, -no-scope                    Disable default host-based scope
   -do, -display-out-scope           Display external endpoints discovered outside the defined scope

mrschyte Aug 30, 2025
Author

Yes, I'm aware of these options. This is what I was referring to above.

The problem is that if you want to crawl exampleA.com and exampleB.com which have cross links, you have to add a scope regex "(exampleA.com|exampleB.com)".
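
For a long domain list, a regex like this can be generated rather than written by hand. The helper below is a minimal sketch with a hypothetical name, `build_scope_regex`. Unlike the hand-written pattern above, it escapes the dots so that `.` matches only a literal dot. Whether katana applies the pattern to the host or to the full URL is not stated in this thread.

```python
import re


def build_scope_regex(domains: list[str]) -> str:
    """Join domains into one alternation, escaping regex metacharacters."""
    # De-duplicating and sorting keeps the output stable across runs.
    unique = sorted({d.strip() for d in domains if d.strip()})
    return "(" + "|".join(re.escape(d) for d in unique) + ")"


pattern = build_scope_regex(["exampleA.com", "exampleB.com"])
print(pattern)  # (exampleA\.com|exampleB\.com)
```

The resulting string can then be passed as the custom regex value of `-fs`, as described in the scope options quoted earlier.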

This is because the default "rdn" scope will only list links that point to the same domain; cross-links between input domains will be skipped.

When checking the scope, katana should allow a link if it matches any of the input domains, so that cross-links are followed.

ehsandeep Aug 30, 2025
Maintainer

@mrschyte thanks, I see what you mean. I've created an issue with details, #1385. Let me know if you have any feedback!

@oxqnd if you're still open to picking this one up, I've created an issue with details.
