URLs with # in them get prematurely cut off - Mistaken as Fragments #151

@reece394

Description

Describe the bug
As mentioned in #114 (comment), URLs that contain a # get cut off at the #.

I traced this issue to the CleanFragments function in DirectoryParser.cs.

CleanFragments treats everything after the # as a URI fragment rather than as part of a legitimate file name, so that part of the URL gets cut off. This is most likely caused by URL decoding and manipulation further up, before the URL reaches CleanFragments. The fix would probably be to make sure that any # that is not a real URI fragment is encoded as %23 before it reaches this function.
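To illustrate the suspected cause, here is a minimal sketch using plain System.Uri. It is not the project's code: the host example.com, the shortened file name, and the EncodeHash helper are hypothetical. The helper assumes the href is known to name a file, so it encodes every # it finds.

```csharp
using System;

public static class FragmentDemo
{
    // Hypothetical helper (not part of the project): percent-encodes '#'
    // in an href that is known to be a file name, so the URI parser
    // cannot mistake the rest of the name for a fragment.
    public static string EncodeHash(string href) => href.Replace("#", "%23");

    public static void Main()
    {
        var baseUri = new Uri("https://example.com/Movies/");
        const string href = "Clip%20-%20#30.m4v";

        // Raw '#': everything from '#' onward is parsed as the fragment,
        // so the path ends right before it.
        var raw = new Uri(baseUri, href);
        Console.WriteLine(raw.AbsolutePath); // /Movies/Clip%20-%20
        Console.WriteLine(raw.Fragment);     // #30.m4v

        // Encoded '#': the full file name stays in the path and the
        // fragment is empty.
        var encoded = new Uri(baseUri, EncodeHash(href));
        Console.WriteLine(encoded.AbsolutePath); // /Movies/Clip%20-%20%2330.m4v
        Console.WriteLine(encoded.Fragment.Length); // 0
    }
}
```

If the parser did something along these lines before the URL reaches CleanFragments, there would be no fragment left to strip for file names like this one. Pages that link to real fragments would need separate handling.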

As a workaround for anyone else hit by this issue: if you are unlikely to encounter real URI fragments when scraping, you can comment out the CleanFragments call in the CheckParsedResults function in DirectoryParser.cs:

if (webDirectory.Uri.Scheme != Constants.UriScheme.Ftp && webDirectory.Uri.Scheme != Constants.UriScheme.Ftps)
{
    //CleanFragments(webDirectory);
}

To Reproduce
Steps to reproduce the behavior:
See #114 (comment) for example URLs.

Expected behavior
Instead of http://mrclancy.ca/Film%20and%20TV/Movies/MST%20Clips/He-Man%20and%20the%20Masters%20of%20the%20Universe%20-%20

It should be https://mrclancy.ca/Film%20and%20TV/Movies/MST%20Clips/He-Man%20and%20the%20Masters%20of%20the%20Universe%20-%20%2330.m4v

Desktop (please complete the following information):

  • OS: macOS
  • Version: Latest Master Build (v3.5.0.0 + 2 commits)
