Wrong indices and repeated matches when hostname contains the TLD #155

@carton-of-mice

In gen_urls(), a call to _get_tld_pos() determines the correct position of the TLD using rfind(), but this correction is never applied to tld_pos. As a result, the returned indices are incorrect and the offset used on the next loop iteration is invalid.
If the same TLD appears multiple times within a hostname, it may also match repeatedly.
For example:

>>> txt = "String bbb.aaa.bbb.aaa.aaa test string"
>>> for out in urlextract.URLExtract().gen_urls(txt, get_indices=1):
...     print(out, txt[out[1][0] : out[1][1]])
...
('bbb.aaa.bbb.aaa.aaa', (-5, 14))
('bbb.aaa.bbb.aaa.aaa', (3, 22)) ing bbb.aaa.bbb.aaa
('bbb.aaa.bbb.aaa.aaa', (7, 26)) bbb.aaa.bbb.aaa.aaa
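
The index arithmetic can be reproduced with plain string operations. The sketch below is not urlextract's code; `naive_spans` and `checked_spans` are hypothetical names used only for illustration. It assumes the start index is derived from each occurrence of the TLD in the text, while the TLD's position inside the URL comes from rfind(). Under that assumption it yields the same three index pairs as the output above.

```python
from typing import Iterator, Tuple


def naive_spans(text: str, url: str, tld: str) -> Iterator[Tuple[int, int]]:
    """Yield a span for every occurrence of tld in text (hypothetical helper)."""
    tld_in_url = url.rfind(tld)   # last occurrence: the real TLD inside the URL
    pos = text.find(tld)          # first occurrence of the TLD in the text
    while pos != -1:
        # Only correct when pos happens to be the URL's last TLD.
        start = pos - tld_in_url
        yield start, start + len(url)
        pos = text.find(tld, pos + 1)


def checked_spans(text: str, url: str, tld: str) -> Iterator[Tuple[int, int]]:
    """Same walk, but keep a span only if it slices back to the URL."""
    tld_in_url = url.rfind(tld)
    pos = text.find(tld)
    while pos != -1:
        start = pos - tld_in_url
        end = start + len(url)
        if start >= 0 and text[start:end] == url:
            yield start, end
            pos = text.find(tld, end)      # resume after the matched URL
        else:
            pos = text.find(tld, pos + 1)  # try the next occurrence


txt = "String bbb.aaa.bbb.aaa.aaa test string"
print(list(naive_spans(txt, "bbb.aaa.bbb.aaa.aaa", ".aaa")))
# [(-5, 14), (3, 22), (7, 26)]
print(list(checked_spans(txt, "bbb.aaa.bbb.aaa.aaa", ".aaa")))
# [(7, 26)]
```

Only the last occurrence lines up with the rfind() position, which is why only the third reported span slices back to the hostname.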

If the URL has a trailing part after the hostname (a path, as below, or a query), the indices are again shifted and further matches may be skipped:

>>> txt = "String http://bbb.aaa.aaa/tests test string"
>>> for out in urlextract.URLExtract().gen_urls(txt, get_indices=1):
...     print(out, txt[out[1][0] : out[1][1]])
...
('http://bbb.aaa.aaa/tests', (3, 27)) ing http://bbb.aaa.aaa/t
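
A small check along these lines could serve as a regression test. It is only a sketch: `index_mismatches` is a hypothetical helper, not part of urlextract, and the `reported` list is simply the output shown above pasted in as data.

```python
from typing import List, Tuple

Result = Tuple[str, Tuple[int, int]]


def index_mismatches(text: str, results: List[Result]) -> List[Result]:
    """Return the results whose indices do not slice back to the URL."""
    return [
        (url, (start, end))
        for url, (start, end) in results
        if start < 0 or text[start:end] != url
    ]


txt = "String http://bbb.aaa.aaa/tests test string"
url = "http://bbb.aaa.aaa/tests"
reported = [(url, (3, 27))]  # output copied from the session above

print(index_mismatches(txt, reported))
# [('http://bbb.aaa.aaa/tests', (3, 27))]

# The span that would slice back to the URL:
start = txt.index(url)
print((start, start + len(url)))
# (7, 31)
```

With correct indices, passing `list(urlextract.URLExtract().gen_urls(txt, get_indices=True))` as `results` would be expected to return an empty list.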
