More efficient encode_url #439

@D-Sketon

Check List

  • I have already read README.
  • I have already searched existing issues.
  • I have already searched existing pull requests.

Feature Request

The current encode_url function in hexo-util is based on an implementation from six years ago that parses its input twice: once with url.parse and again with new URL(). This double parsing is a significant performance bottleneck. In a benchmark on a content-heavy site (8x hexo-many-posts) with the hexo-theme-reimu theme, the total build took 36 seconds, and encode_url alone accounted for 5 seconds of that.

After reading the implementation of url.parse in Node.js (https://github.com/nodejs/node/blob/main/lib/url.js), I believe we can replace the conditional check if (parse(str).protocol) with simpler logic.

The parsing of the protocol by parse is roughly as follows:

  • Asserts that the input must be a string
  • Ignores leading and trailing whitespace characters
  • Replaces backslashes before the query symbol with forward slashes
  • Uses the regex /^[a-z0-9.+-]+:/i to extract the protocol
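
The four steps above can be sketched roughly as follows. This is an illustration of the described behavior, not Node's actual source: extractProtocol is a hypothetical helper name, and lowercasing the matched protocol is an assumption of this sketch.

```typescript
// Sketch of the protocol-extraction steps listed above; not Node's actual source.
const PROTOCOL_PATTERN = /^[a-z0-9.+-]+:/i;

// `extractProtocol` is a hypothetical name used only for illustration.
function extractProtocol(input: unknown): string | null {
  // Step 1: only strings are accepted.
  if (typeof input !== 'string') {
    throw new TypeError('url must be a string');
  }
  // Step 2: ignore leading and trailing whitespace.
  const trimmed = input.trim();
  // Step 3: turn backslashes that appear before the first "?" into forward slashes.
  const queryStart = trimmed.indexOf('?');
  const beforeQuery = queryStart === -1 ? trimmed : trimmed.slice(0, queryStart);
  const fromQuery = queryStart === -1 ? '' : trimmed.slice(queryStart);
  const normalized = beforeQuery.replace(/\\/g, '/') + fromQuery;
  // Step 4: match the protocol prefix (lowercasing is an assumption of this sketch).
  const match = PROTOCOL_PATTERN.exec(normalized);
  return match ? match[0].toLowerCase() : null;
}
```

Neither a backslash nor a forward slash is in the regex's character class, so step 3 cannot change whether a protocol is found. A check that only needs a yes/no answer can therefore skip it.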

During the parsing process, three types of errors may be thrown: ERR_INVALID_ARG_TYPE (non-string input), ERR_INVALID_URL (invalid hostname), and ERR_INVALID_ARG_VALUE (invalid port). The first error can be thrown proactively by our implementation. The second error will also be thrown by new URL(). As for the third error, I have not been able to reproduce it successfully, and I suspect it might be a product of defensive programming.

I suppose we could implement the original protocol parsing logic as follows:

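// Same pattern url.parse uses to extract the protocol.
const PROTOCOL_RE = /^[a-z0-9.+-]+:/i;

const hasProtocolLikeNode = (str: unknown): boolean => {
  // Throw proactively for non-string input, mirroring ERR_INVALID_ARG_TYPE.
  if (typeof str !== 'string') throw new TypeError('url must be a string');
  // Ignore surrounding whitespace before testing for a protocol prefix.
  return PROTOCOL_RE.test(str.trim());
};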

The implementation above passes all existing unit tests and, in theory, behaves like the current implementation. Most importantly, it delivers a massive performance improvement, reducing the total time spent in encode_url from 5 seconds to around 300 ms.
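
To make the behavior concrete, here is a small usage sketch that repeats the proposed check and runs it over a few inputs. The sample inputs are illustrative assumptions and are not taken from the hexo-util test suite.

```typescript
const PROTOCOL_RE = /^[a-z0-9.+-]+:/i;

const hasProtocolLikeNode = (str: unknown): boolean => {
  if (typeof str !== 'string') throw new TypeError('url must be a string');
  return PROTOCOL_RE.test(str.trim());
};

// Illustrative inputs paired with the result the regex yields for each.
const samples: Array<[string, boolean]> = [
  ['https://hexo.io/', true],
  ['mailto:user@example.com', true],
  ['data:text/plain,hello', true],
  ['/archives/2024/', false],          // relative path, no scheme
  ['//cdn.example.com/a.js', false],   // protocol-relative URL
  ['foo bar:baz', false],              // a space is not allowed in the scheme
  ['C:\\Users\\me', true],             // a drive letter also matches the pattern
];

for (const [input, expected] of samples) {
  console.log(JSON.stringify(input), hasProtocolLikeNode(input) === expected);
}
```

The drive-letter case is worth keeping in mind: any input that starts with letters followed by a colon is treated as having a protocol and would take the new URL() branch.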

Do you think this modification is worthwhile?

Additional context

No response
