Skip to content

Conversation

knguyen1
Copy link
Contributor

@knguyen1 knguyen1 commented Aug 29, 2025

Summary

  • This is a…
    • Bug fix
    • Refactoring
    • Feature addition
    • Minor / simple change (like a typo)
    • Other
  • Describe this change in 1-2 sentences:
    • Fix date sanitization shifting of unknown components and add robust normalization for multiple input formats; refactor into small helpers and add comprehensive pytest coverage.

Problem

Philipp Wolfer: The cases 2005-00-12" => "2005-12" and "0000-00-12" => "0012" are a bug, though. This should be fixed.

  • Dates with unknown month/day were incorrectly shifted:
    • "2005-00-12" → "2005-12" and "0000-00-12" → "0012".
  • Tests lacked focused, parametric coverage for date normalization and bug cases.

Solution

  • Implement picard.util.sanitize_date that preserves unknown components ("00") without shifting, drops only truly unknown trailing parts, and supports: "YYYY", "YYYY-MM", "YYYY-MM-DD", unambiguous "YYYY-DD-MM", "DD/MM/YYYY" vs "MM/DD/YYYY" (heuristic), and compact "YYYYMMDD"/"YYYYDDMM".
  • Refactor into helpers: _parse_pure_year, _parse_iso_like, _parse_slash_separated, _parse_compact_eight, _format_from_components.
  • Add new pytest module test/test_dateutil_normalize.py using fixtures and pytest.mark.parametrize.
  • Update sanitize_date docstring (NumPy style) to match behavior and examples.

Action

Additional actions required:

  • Update Picard documentation (please include a reference to this PR)
  • Other (please specify below)

Note: Will need a follow-up PR to add option UI's and to close #2406. refactored sanitize_date sets up ability to sanitize dates or not.

@knguyen1 knguyen1 force-pushed the fix/PICARD-2729/date-sanitizations-of-formats branch from dd13241 to f93efb5 Compare August 29, 2025 04:17
@knguyen1 knguyen1 changed the title PICARD-2729: Add normalize_date; fix bug with shifting month PICARD-2729: Fix/refactor sanitize_date; fix bug with shifting unknown month Aug 29, 2025
Copy link
Collaborator

@zas zas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks a lot of more code to handle just few more cases.
Old code was only bothering with "-" separator, new one handles "/" and no separator, Where are those new features used?

y = _clamp_year(int(m.group(1)))
a = int(m.group(2))
b = int(m.group(3))
if y in (None, 0):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this test just after y declaration since it doesn't depend on a or b.
Also it would perhaps be clearer to have:

y = _clamp_year(int(m.group(1))) or None
if y is None:
    return ""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def _parse_slash_separated(value: str) -> str | None:
if "/" not in value:
return None
tokens = value.split("/")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We expect 2 slashes right? Case with more isn't really handled (1/2/3/4) and is catched by if len(tokens) != 3 below (None is returned)

I think we can simplify things a bit here:

def _parse_slash_separated(value: str) -> str | None:
    try:
        tokens = value.split("/", 2)
        a = _int_or_none(tokens[0])
        b = _int_or_none(tokens[1])
        c = _int_or_none(tokens[2])
    except IndexError:
         return None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c = _int_or_none(tokens[2])
if a is None or b is None or c is None or not (0 < c <= 9999):
return None
y = _clamp_year(c)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer:

y = _clamp_year(c) or None
if y is None:
    return ""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@knguyen1
Copy link
Contributor Author

Old code was only bothering with "-" separator, new one handles "/" and no separator, Where are those new features used?

They are common formats I've seen in the wild and added for completeness. Does MusicBrainz sanitize them upstream? If so, can be removed.

@zas
Copy link
Collaborator

zas commented Aug 29, 2025

They are common formats I've seen in the wild and added for completeness. Does MusicBrainz sanitize them upstream? If so, can be removed.

AFAIK, sanitize_date() in the current code only have to handle - as separator. And we parse mainly dates from tags read from files.

value = sanitize_date(value)

value = sanitize_date(value)

self._sanitize_date(metadata)

def _sanitize_date(self, metadata):

References:
https://www.cl.cam.ac.uk/~mgk25/iso-time.html
https://en.wikipedia.org/wiki/ISO_8601

In ISO_8601 it seems hyphens are optional, though:
"Although the standard allows both the "YYYY-MM-DD" and YYYYMMDD formats for complete calendar date representations, if the day [DD] is omitted then only the YYYY-MM format is allowed. By disallowing dates of the form YYYYMM, the standard avoids confusion with the truncated representation[25][4] YYMMDD (still often used). The 2000 version also allowed writing the truncation "--04-05" to mean "April 5"[23] but the 2004 version does not allow omitting the year when a month is present."

So supporting no hyphen notation is perhaps a good idea but apparently that's only for YYYYMMDD case.
Standard also allows week format (2020-W03) but I don't think we'll ever see that in date tags, so no need to bother.

It is important to note that current code only support hyphen format (YYYY, YYYY-MM, YYYY-MM-DD)
Also it is unclear to me which version of ISO8601 is used for tags.

@phw probably knows more about dates & tags, insight welcome.

Regarding this patch, I still think that's overcomplicated and it doesn't address the issue at all, as https://tickets.metabrainz.org/browse/PICARD-2729 is about "Allow disabling date sanitazation for APE and Vorbis tags" not about sanitizing more (or differently).

We should create a specific ticket for "The cases 2005-00-12" => "2005-12" and "0000-00-12" => "0012" are a bug, though. This should be fixed." and only address this in this PR.

@knguyen1
Copy link
Contributor Author

I still think that's overcomplicated and it doesn't address the issue at all, as https://tickets.metabrainz.org/browse/PICARD-2729 is about "Allow disabling date sanitazation for APE and Vorbis tags" not about sanitizing more (or differently).

Right, I planned on addressing that next PR, the new sanitize_date accepts a "should sanitize" disable_sanitization argument.

https://github.com/metabrainz/picard/blob/bc6a3e2babce016ef957938b79fb7256ba0237f3/picard/util/datestr_util.py#L170C1-L184C46

It would close out stale #2406

@phw
Copy link
Member

phw commented Aug 29, 2025

Sorry, only jumping in quickly due to lack of time.

We should create a specific ticket for "The cases 2005-00-12" => "2005-12" and "0000-00-12" => "0012" are a bug, though. This should be fixed." and only address this in this PR.

This is the part I don't understand, because this specific issue had been already fixed in PICARD-2712, see PR #2283. Hence I don't understand why this PR claims to fix those.

@knguyen1
Copy link
Contributor Author

My mistake. I thought it hasn't been addressed yet; was looking at the un-closed ticket. Should we close without merging?

@zas
Copy link
Collaborator

zas commented Aug 29, 2025

My mistake. I thought it hasn't been addressed yet; was looking at the un-closed ticket. Should we close without merging?

Yes, closing, since the ticket in title isn't actually addressed here anyway (and the other issue was already fixed as said by @phw)

@zas zas closed this Aug 29, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants