Check that the local part is valid after Unicode NFC normalization to prevent injection of invalid characters

JoshData · JoshData · commit 9ef1f829aa5d · 2024-06-19T12:02:49.000-04:00
We encourage callers to use the normalized email address returned by validate_email (in the `normalized` attribute). This form has had Unicode NFC normalization applied to the local part. However, all of the syntactic validation on the local part was performed before the normalization. Consequently, normalization could make the local part invalid, either by replacing valid characters with invalid ones or by lengthening the local part beyond the maximum length. Callers who use the normalized form might then unexpectedly be using an invalid address. To ensure that callers do not get an invalid address, the local part syntax checks are now repeated after Unicode normalization has been applied.

A user submitted one case where NFC normalization changes a local part from valid to invalid: the NFC normalization of U+037E (Greek Question Mark) is the ASCII semicolon. The former is otherwise a permitted character, but ASCII semicolons are not permitted in local parts. The user noted that the semicolon could cause the address to be reinterpreted as a list and change the recipient of a message.

No other single Unicode character is valid in a local part before normalization and invalid after; I checked every character. I am not sure whether there are character sequences that are valid before but not after normalization, but I can't yet find any: I checked that no Unicode character's NFD decomposition, when valid in a local part, normalizes under NFC to a sequence that is not valid. I also could not find any examples where NFC normalization changes something to or from a period, which could also change the validity of a local part.

(The string '<' or '>' followed by U+0338 (Combining Long Solidus Overlay) normalizes under NFC to ≮ U+226E (Not Less-Than) or ≯ U+226F (Not Greater-Than), respectively. The two-character sequences are not valid in a local part because < and > are not valid, although the single characters they normalize to are valid. These addresses were rejected before and continue to be rejected. Although < could be the start of a bracketed email address if display names are permitted, the two-character sequence is now (as of an earlier commit) ignored for the purposes of parsing display names.)

There are a small number of characters whose NFC normalization increases the string length, including U+FB2C (Hebrew Letter Shin With Dagesh And Shin Dot). This could also cause a local part that is valid before normalization to become invalid after it. This is now also caught by performing the syntax check again after normalization. (The whole-address length check is similarly fixed in a later commit.)

Some checks for safe Unicode characters that were previously applied only after normalization are now also applied to the un-normalized form, which may also protect callers that ignore the normalized form and use the original email address string. However, I could not find an example where normalization turns an unsafe string into a safe string.

See #142.
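The normalization behaviors described above can be reproduced with Python's standard `unicodedata` module alone. The following is a minimal sketch, not code from this commit: the helper `nfc` and the variable names are illustrative assumptions, and the 64-character limit is taken from the length tests in the diff below.

```python
import unicodedata


def nfc(s: str) -> str:
    """Return the Unicode NFC normalization of s (illustrative helper)."""
    return unicodedata.normalize("NFC", s)


# U+037E (Greek Question Mark) normalizes to the ASCII semicolon,
# which is not permitted in a local part.
semicolon = nfc("\u037e")

# '<' or '>' followed by U+0338 (Combining Long Solidus Overlay)
# composes into a single character under NFC.
composed_lt = nfc("<\u0338")
composed_gt = nfc(">\u0338")

# U+FB2C (Hebrew Letter Shin With Dagesh And Shin Dot) expands to
# several code points under NFC, so normalization lengthens the string.
expanded = nfc("\ufb2c")

# A local part of exactly 64 characters (the maximum exercised by the
# tests) therefore grows past the limit once it is normalized.
local_part = "\ufb2c" + "1" * 63
length_before = len(local_part)
length_after = len(nfc(local_part))

print(repr(semicolon), repr(composed_lt), repr(composed_gt))
print(len(expanded), length_before, length_after)
```

Because the first and last cases change validity in one direction only (valid before, invalid after), repeating the local part checks on the normalized string is enough to catch them.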
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,6 +1,7 @@
 In Development
 --------------
 
+* Email addresses with internationalized local parts could, with rare Unicode characters, be returned as valid but actually be invalid in their normalized form (returned in the `normalized` field). Local parts are now re-validated after Unicode NFC normalization to ensure that invalid characters cannot be injected into the normalized address and that characters with length-increasing NFC normalizations cannot cause a local part to exceed the maximum length after normalization.
 * A new option to parse `My Name <address@domain>` strings, i.e. a display name plus an email address in angle brackets, is now available. It is off by default.
 
 2.1.2 (June 16, 2024)
diff --git a/README.md b/README.md
@@ -20,10 +20,11 @@ Key features:
 * Supports internationalized domain names (like `@ツ.life`),
   internationalized local parts (like `ツ@example.com`),
   and optionally parses display names (e.g. `"My Name" <me@example.com>`).
-* Rejects addresses with unsafe Unicode characters, obsolete email address
-  syntax that you'd find unexpected, special use domain names like
-  `@localhost`, and domains without a dot by default. This is an
-  opinionated library!
+* Rejects addresses with invalid or unsafe Unicode characters,
+  obsolete email address syntax that you'd find unexpected,
+  special use domain names like `@localhost`,
+  and domains without a dot by default.
+  This is an opinionated library!
 * Normalizes email addresses (important for internationalized
   and quoted-string addresses! see below).
 * Python type annotations are used.
@@ -235,13 +236,9 @@ cannot combine with something outside of the email address string or with
 the @-sign). See https://qntm.org/safe and https://trojansource.codes/
 for relevant prior work. (Other than whitespace, these are checks that
 you should be applying to nearly all user inputs in a security-sensitive
-context.)
-
-These character checks are performed after Unicode normalization (see below),
-so you are only fully protected if you replace all user-provided email addresses
-with the normalized email address string returned by this library. This does not
-guard against the well known problem that many Unicode characters look alike
-(or are identical), which can be used to fool humans reading displayed text.
+context.) This does not guard against the well known problem that many
+Unicode characters look alike, which can be used to fool humans reading
+displayed text.
 
 
 Normalization
@@ -257,7 +254,7 @@ address.
 
 For example, the CJK fullwidth Latin letters are considered semantically
 equivalent in domain names to their ASCII counterparts. This library
-normalizes them to their ASCII counterparts:
+normalizes them to their ASCII counterparts (as required by IDNA):
 
 ```python
 emailinfo = validate_email("me@Ｄｏｍａｉｎ.com")
@@ -270,9 +267,7 @@ Because an end-user might type their email address in different (but
 equivalent) un-normalized forms at different times, you ought to
 replace what they enter with the normalized form immediately prior to
 going into your database (during account creation), querying your database
-(during login), or sending outbound mail. Normalization may also change
-the length of an email address, and this may affect whether it is valid
-and acceptable by your SMTP provider.
+(during login), or sending outbound mail.
 
 The normalizations include lowercasing the domain part of the email
 address (domain names are case-insensitive), [Unicode "NFC"
@@ -286,6 +281,11 @@ in the domain part, possibly other
 [UTS46](http://unicode.org/reports/tr46) mappings on the domain part,
 and conversion from Punycode to Unicode characters.
 
+Normalization may change the characters in the email address and the
+length of the email address, such that a string might be a valid address
+before normalization but invalid after, or vice versa. This library only
+permits addresses that are valid both before and after normalization.
+
 (See [RFC 6532 (internationalized email) section
 3.1](https://tools.ietf.org/html/rfc6532#section-3.1) and [RFC 5895
 (IDNA 2008) section 2](http://www.ietf.org/rfc/rfc5895.txt).)
diff --git a/email_validator/syntax.py b/email_validator/syntax.py
@@ -315,12 +315,8 @@ def validate_email_local_part(local: str, allow_smtputf8: bool = True, allow_emp
         valid = "quoted"
 
     # If the local part matches the internationalized dot-atom form or was quoted,
-    # perform normalization and additional checks for Unicode strings.
+    # perform additional checks for Unicode strings.
     if valid:
-        # RFC 6532 section 3.1 says that Unicode NFC normalization should be applied,
-        # so we'll return the normalized local part in the return value.
-        local = unicodedata.normalize("NFC", local)
-
         # Check that the local part is a valid, safe, and sensible Unicode string.
         # Some of this may be redundant with the range U+0080 to U+10FFFF that is checked
         # by DOT_ATOM_TEXT_INTL and QTEXT_INTL. Other characters may be permitted by the
@@ -385,7 +381,7 @@ def check_unsafe_chars(s: str, allow_space: bool = False) -> None:
             # Combining character in first position would combine with something
             # outside of the email address if concatenated, so they are not safe.
             # We also check if this occurs after the @-sign, which would not be
-            # sensible.
+            # sensible because it would modify the @-sign.
             if i == 0:
                 bad_chars.add(c)
         elif category == "Zs":
diff --git a/email_validator/validate_email.py b/email_validator/validate_email.py
@@ -1,4 +1,5 @@
 from typing import Optional, Union, TYPE_CHECKING
+import unicodedata
 
 from .exceptions_types import EmailSyntaxError, ValidatedEmail
 from .syntax import split_email, validate_email_local_part, validate_email_domain_name, validate_email_domain_literal, validate_email_length
@@ -86,6 +87,20 @@ def validate_email(
     ret.ascii_local_part = local_part_info["ascii_local_part"]
     ret.smtputf8 = local_part_info["smtputf8"]
 
+    # RFC 6532 section 3.1 says that Unicode NFC normalization should be applied,
+    # so we'll return the NFC-normalized local part. Since the caller may use that
+    # string in place of the original string, ensure it is also valid.
+    normalized_local_part = unicodedata.normalize("NFC", ret.local_part)
+    if normalized_local_part != ret.local_part:
+        try:
+            validate_email_local_part(normalized_local_part,
+                                      allow_smtputf8=allow_smtputf8,
+                                      allow_empty_local=allow_empty_local,
+                                      quoted_local_part=is_quoted_local_part)
+        except EmailSyntaxError as e:
+            raise EmailSyntaxError("After Unicode normalization: " + str(e)) from e
+        ret.local_part = normalized_local_part
+
     # If a quoted local part isn't allowed but is present, now raise an exception.
     # This is done after any exceptions raised by validate_email_local_part so
     # that mandatory checks have highest precedence.
diff --git a/tests/test_syntax.py b/tests/test_syntax.py
@@ -398,14 +398,16 @@ def test_domain_literal() -> None:
         ('\nmy@example.com', 'The email address contains invalid characters before the @-sign: U+000A.'),
         ('m\ny@example.com', 'The email address contains invalid characters before the @-sign: U+000A.'),
         ('my\n@example.com', 'The email address contains invalid characters before the @-sign: U+000A.'),
+        ('me.\u037e@example.com', 'After Unicode normalization: The email address contains invalid characters before the @-sign: \';\'.'),
         ('test@\n', 'The part after the @-sign contains invalid characters: U+000A.'),
         ('bad"quotes"@example.com', 'The email address contains invalid characters before the @-sign: \'"\'.'),
         ('obsolete."quoted".atom@example.com', 'The email address contains invalid characters before the @-sign: \'"\'.'),
         ('11111111112222222222333333333344444444445555555555666666666677777@example.com', 'The email address is too long before the @-sign (1 character too many).'),
         ('111111111122222222223333333333444444444455555555556666666666777777@example.com', 'The email address is too long before the @-sign (2 characters too many).'),
-        ('meme@1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.com', 'The email address is too long (4 characters too many).'),
+        ('\uFB2C111111122222222223333333333444444444455555555556666666666777777@example.com', 'After Unicode normalization: The email address is too long before the @-sign (2 characters too many).'),
         ('me@1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.11111111112222222222333333333344444444445555555555.com', 'The email address is too long after the @-sign (1 character too many).'),
         ('me@中1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444.com', 'The email address is too long after the @-sign.'),
+        ('meme@1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.com', 'The email address is too long (4 characters too many).'),
         ('my.long.address@1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.11111111112222222222333333333344444.info', 'The email address is too long (2 characters too many).'),
         ('my.long.address@λ111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.11111111112222222222333333.info', 'The email address is too long (when converted to IDNA ASCII).'),
         ('my.long.address@λ111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444.info', 'The email address is too long (at least 1 character too many).'),
@@ -439,7 +441,7 @@ def test_email_invalid_syntax(email_input: str, error_msg: str) -> None:
     # Since these all have syntax errors, deliverability
     # checks do not arise.
     with pytest.raises(EmailSyntaxError) as exc_info:
-        validate_email(email_input)
+        validate_email(email_input, check_deliverability=False)
     assert str(exc_info.value) == error_msg