Commit f8709e8

Check that email address length is valid on the original email address string since callers may continue to use that string

Previously, we checked that the ASCII email address (with IDNA ASCII) and the normalized email address satisfied the whole-address length limit. However, callers may use the original input string. Since Unicode NFC normalization typically reduces string length (if it changes the string), the post-normalization check can pass when the pre-normalization length is not valid. So we should additionally check that the original input also meets the maximum length requirement. Callers might also construct an address that has an internationalized local part and an ASCII domain, so that combination is now checked too.

The whole-address length test is revised to test each possible address format. First the original email address string (with any display name removed) is checked, so that exception messages correspond to the input string where possible. Then the normalized address is checked, since we encourage callers to use it. Finally the ASCII address is checked, since callers who send email without an SMTPUTF8-enabled stack will use it; when there is no ASCII local part, the normalized internationalized local part is combined with the ASCII domain instead.

Some length tests are added with a Unicode character whose NFC normalization is actually a decomposition: U+FB2C (Hebrew Letter Shin With Dagesh And Shin Dot) is unusual in that its NFC normalization expands it to multiple code points (https://www.unicode.org/faq/normalization.html). In these cases, the address will be valid before normalization but not valid after.

See #142.
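
The two normalization behaviours mentioned in the message can be observed with Python's standard `unicodedata` module. This is only an illustrative sketch; the helper name `nfc_lengths` is hypothetical and is not part of the library.

```python
import unicodedata


def nfc_lengths(text: str) -> tuple:
    """Return (code points before, code points after, UTF-8 bytes before,
    UTF-8 bytes after) for NFC normalization of `text`."""
    normalized = unicodedata.normalize("NFC", text)
    return (
        len(text),
        len(normalized),
        len(text.encode("utf8")),
        len(normalized.encode("utf8")),
    )


# Typical case: "s" + combining dot below + combining dot above
# composes into the single code point U+1E69, so the string shrinks.
shrinks = nfc_lengths("s\u0323\u0307")

# Unusual case: U+FB2C is excluded from composition, so NFC leaves it
# as three separate code points and the string grows.
grows = nfc_lengths("\uFB2C")

print(shrinks)  # (3, 1, 5, 3)
print(grows)    # (1, 3, 3, 6)
```

The first sequence is the one used in the new `my.\u0073\u0323\u0307.address@...` test, and the second is the `\uFB2C` case.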
1 parent 9ef1f82 commit f8709e8

5 files changed: +75 -53 lines changed

CHANGELOG.md

Lines changed: 2 additions & 1 deletion
@@ -2,6 +2,7 @@ In Development
 --------------

 * Email addresses with internationalized local parts could, with rare Unicode characters, be returned as valid but actually be invalid in their normalized form (returned in the `normalized` field). Local parts now re-validated after Unicode NFC normalization to ensure that invalid characters cannot be injected into the normalized address and that characters with length-increasing NFC normalizations cannot cause a local part to exceed the maximum length after normalization.
+* The length check for email addresses with internationalized local parts is now also applied to the original address string prior to Unicode NFC normalization, which may be longer and could exceed the maximum email address length, to protect callers who do not use the returned normalized address.
 * A new option to parse `My Name <address@domain>` strings, i.e. a display name plus an email address in angle brackets, is now available. It is off by default.

 2.1.2 (June 16, 2024)
@@ -10,7 +11,7 @@ In Development
 * The domain name length limit is corrected from 255 to 253 IDNA ASCII characters. I misread the RFCs.
 * When a domain name has no MX record but does have an A or AAAA record, if none of the IP addresses in the response are globally reachable (i.e. not Private-Use, Loopback, etc.), the response is treated as if there was no A/AAAA response and the email address will fail the deliverability check.
 * When a domain name has no MX record but does have an A or AAAA record, the mx field in the object returned by validate_email incorrectly held the IP addresses rather than the domain itself.
-* Fixes in tests.
+* Fixes in tests. Some additional tests added.

 2.1.1 (February 26, 2024)
 -------------------------

README.md

Lines changed: 0 additions & 7 deletions
@@ -300,13 +300,6 @@ they are unnecessary. For IPv6 domain literals, the IPv6 address is
 normalized to condensed form. [RFC 2142](https://datatracker.ietf.org/doc/html/rfc2142)
 also requires lowercase normalization for some specific mailbox names like `postmaster@`.

-### Length checks
-
-This library checks that the length of the email address is not longer than
-the maximum length. The check is performed on the normalized form of the
-address, which might be different from a string provided by a user. If you
-send email to the original string and not the normalized address, the email
-might be rejected because the original address could be too long.

 Examples
 --------

email_validator/syntax.py

Lines changed: 61 additions & 40 deletions
@@ -176,12 +176,11 @@ def unquote_quoted_string(text: str) -> Tuple[str, bool]:
     return display_name, local_part, domain_part, is_quoted_local_part


-def get_length_reason(addr: str, utf8: bool = False, limit: int = EMAIL_MAX_LENGTH) -> str:
+def get_length_reason(addr: str, limit: int) -> str:
     """Helper function to return an error message related to invalid length."""
     diff = len(addr) - limit
-    prefix = "at least " if utf8 else ""
     suffix = "s" if diff > 1 else ""
-    return f"({prefix}{diff} character{suffix} too many)"
+    return f"({diff} character{suffix} too many)"


 def safe_character_display(c: str) -> str:
@@ -609,44 +608,66 @@ def validate_email_domain_name(domain: str, test_environment: bool = False, glob


 def validate_email_length(addrinfo: ValidatedEmail) -> None:
-    # If the email address has an ASCII representation, then we assume it may be
-    # transmitted in ASCII (we can't assume SMTPUTF8 will be used on all hops to
-    # the destination) and the length limit applies to ASCII characters (which is
-    # the same as octets). The number of characters in the internationalized form
-    # may be many fewer (because IDNA ASCII is verbose) and could be less than 254
-    # Unicode characters, and of course the number of octets over the limit may
-    # not be the number of characters over the limit, so if the email address is
-    # internationalized, we can't give any simple information about why the address
-    # is too long.
-    if addrinfo.ascii_email and len(addrinfo.ascii_email) > EMAIL_MAX_LENGTH:
-        if addrinfo.ascii_email == addrinfo.normalized:
-            reason = get_length_reason(addrinfo.ascii_email)
-        elif len(addrinfo.normalized) > EMAIL_MAX_LENGTH:
-            # If there are more than 254 characters, then the ASCII
-            # form is definitely going to be too long.
-            reason = get_length_reason(addrinfo.normalized, utf8=True)
-        else:
-            reason = "(when converted to IDNA ASCII)"
-        raise EmailSyntaxError(f"The email address is too long {reason}.")
-
-    # In addition, check that the UTF-8 encoding (i.e. not IDNA ASCII and not
-    # Unicode characters) is at most 254 octets. If the addres is transmitted using
-    # SMTPUTF8, then the length limit probably applies to the UTF-8 encoded octets.
-    # If the email address has an ASCII form that differs from its internationalized
-    # form, I don't think the internationalized form can be longer, and so the ASCII
-    # form length check would be sufficient. If there is no ASCII form, then we have
-    # to check the UTF-8 encoding. The UTF-8 encoding could be up to about four times
-    # longer than the number of characters.
+    # There are three forms of the email address whose length must be checked:
     #
-    # See the length checks on the local part and the domain.
-    if len(addrinfo.normalized.encode("utf8")) > EMAIL_MAX_LENGTH:
-        if len(addrinfo.normalized) > EMAIL_MAX_LENGTH:
-            # If there are more than 254 characters, then the UTF-8
-            # encoding is definitely going to be too long.
-            reason = get_length_reason(addrinfo.normalized, utf8=True)
-        else:
-            reason = "(when encoded in bytes)"
-        raise EmailSyntaxError(f"The email address is too long {reason}.")
+    # 1) The original email address string. Since callers may continue to use
+    #    this string, even though we recommend using the normalized form, we
+    #    should not pass validation when the original input is not valid. This
+    #    form is checked first because it is the original input.
+    # 2) The normalized email address. We perform Unicode NFC normalization of
+    #    the local part, we normalize the domain to internationalized characters
+    #    (if originaly IDNA ASCII) which also includes Unicode normalization,
+    #    and we may remove quotes in quoted local parts. We recommend that
+    #    callers use this string, so it must be valid.
+    # 3) The email address with the IDNA ASCII representation of the domain
+    #    name, since this string may be used with email stacks that don't
+    #    support UTF-8. Since this is the least likely to be used by callers,
+    #    it is checked last. Note that ascii_email will only be set if the
+    #    local part is ASCII, but conceivably the caller may combine a
+    #    internationalized local part with an ASCII domain, so we check this
+    #    on that combination also. Since we only return the normalized local
+    #    part, we use that (and not the unnormalized local part).
+    #
+    # In all cases, the length is checked in UTF-8 because the SMTPUTF8
+    # extension to SMTP validates the length in bytes.
+
+    addresses_to_check = [
+        (addrinfo.original, None),
+        (addrinfo.normalized, "after normalization"),
+        ((addrinfo.ascii_local_part or addrinfo.local_part or "") + "@" + addrinfo.ascii_domain, "when the part after the @-sign is converted to IDNA ASCII"),
+    ]
+
+    for addr, reason in addresses_to_check:
+        addr_len = len(addr)
+        addr_utf8_len = len(addr.encode("utf8"))
+        diff = addr_utf8_len - EMAIL_MAX_LENGTH
+        if diff > 0:
+            if reason is None and addr_len == addr_utf8_len:
+                # If there is no normalization or transcoding,
+                # we can give a simple count of the number of
+                # characters over the limit.
+                reason = get_length_reason(addr, limit=EMAIL_MAX_LENGTH)
+            elif reason is None:
+                # If there is no normalization but there is
+                # some transcoding to UTF-8, we can compute
+                # the minimum number of characters over the
+                # limit by dividing the number of bytes over
+                # the limit by the maximum number of bytes
+                # per character.
+                mbpc = max(len(c.encode("utf8")) for c in addr)
+                mchars = max(1, diff // mbpc)
+                suffix = "s" if diff > 1 else ""
+                if mchars == diff:
+                    reason = f"({diff} character{suffix} too many)"
+                else:
+                    reason = f"({mchars}-{diff} character{suffix} too many)"
+            else:
+                # Since there is normalization, the number of
+                # characters in the input that need to change is
+                # impossible to know.
+                suffix = "s" if diff > 1 else ""
+                reason += f" ({diff} byte{suffix} too many)"
+            raise EmailSyntaxError(f"The email address is too long {reason}.")


 class DomainLiteralValidationResult(TypedDict):
class DomainLiteralValidationResult(TypedDict):

email_validator/validate_email.py

Lines changed: 3 additions & 1 deletion
@@ -72,7 +72,9 @@ def validate_email(

     # Collect return values in this instance.
     ret = ValidatedEmail()
-    ret.original = email
+    ret.original = ((local_part if not is_quoted_local_part
+                     else ('"' + local_part + '"'))
+                    + "@" + domain_part)  # drop the display name, if any, for email length tests at the end
     ret.display_name = display_name

     # Validate the email address's local part syntax and get a normalized form.
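
The reassembly of the address without its display name can be sketched in isolation. `address_without_display_name` is a hypothetical helper written for illustration; its three arguments are assumed to correspond to the values that the hunk above reads.

```python
def address_without_display_name(
    local_part: str, domain_part: str, is_quoted_local_part: bool
) -> str:
    """Rebuild `local@domain`, restoring the surrounding quotes when the
    local part was originally a quoted string."""
    local = '"' + local_part + '"' if is_quoted_local_part else local_part
    return local + "@" + domain_part


print(address_without_display_name("me", "example.com", False))      # me@example.com
print(address_without_display_name("my name", "example.com", True))  # "my name"@example.com
```

Because the display name and angle brackets are left out, a long display name does not count toward the whole-address length limit.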

tests/test_syntax.py

Lines changed: 9 additions & 4 deletions
@@ -409,10 +409,15 @@ def test_domain_literal() -> None:
         ('me@中1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444.com', 'The email address is too long after the @-sign.'),
         ('meme@1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.com', 'The email address is too long (4 characters too many).'),
         ('my.long.address@1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.11111111112222222222333333333344444.info', 'The email address is too long (2 characters too many).'),
-        ('my.long.address@λ111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.11111111112222222222333333.info', 'The email address is too long (when converted to IDNA ASCII).'),
-        ('my.long.address@λ111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444.info', 'The email address is too long (at least 1 character too many).'),
-        ('my.λong.address@1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.111111111122222222223333333333444.info', 'The email address is too long (when encoded in bytes).'),
-        ('my.λong.address@1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444.info', 'The email address is too long (at least 1 character too many).'),
+        ('my.long.address@λ111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444.info', 'The email address is too long (1-2 characters too many).'),
+        ('my.long.address@\uFB2C111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444.info', 'The email address is too long (1-3 characters too many).'),
+        ('my.λong.address@1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.111111111122222222223333333333444.info', 'The email address is too long (1 character too many).'),
+        ('my.λong.address@1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444.info', 'The email address is too long (1-2 characters too many).'),
+        ('my.\u0073\u0323\u0307.address@1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444.info', 'The email address is too long (1-2 characters too many).'),
+        ('my.\uFB2C.address@1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.11111111112222222222333333333344444.info', 'The email address is too long (1 character too many).'),
+        ('my.\uFB2C.address@1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.11111111112222222222333333333344.info', 'The email address is too long after normalization (1 byte too many).'),
+        ('my.long.address@λ111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.11111111112222222222333333.info', 'The email address is too long when the part after the @-sign is converted to IDNA ASCII (1 byte too many).'),
+        ('my.λong.address@λ111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.1111111111222222222233333333334444444444555555555.6666666666777777777788888888889999999999000000000.11111111112222222222333333.info', 'The email address is too long when the part after the @-sign is converted to IDNA ASCII (2 bytes too many).'),
         ('me@bad-tld-1', 'The part after the @-sign is not valid. It should have a period.'),
         ('me@bad.tld-2', 'The part after the @-sign is not valid. It is not within a valid top-level domain.'),
         ('me@xn--0.tld', 'The part after the @-sign is not valid IDNA (Invalid A-label).'),
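
The ranged messages expected above, such as `(1-2 characters too many)`, follow from dividing the byte excess by the widest character in the address. The sketch below mirrors that arithmetic as a standalone function. `excess_description` is a hypothetical name, the default limit of 254 is taken from the removed comments in `syntax.py`, and the example addresses are artificial strings chosen for easy counting rather than valid email addresses.

```python
def excess_description(addr: str, limit: int = 254) -> str:
    """Describe how far the UTF-8 encoding of `addr` exceeds `limit`,
    as a count or range of characters; empty string if it fits."""
    diff = len(addr.encode("utf8")) - limit
    if diff <= 0:
        return ""
    # Widest character in the address, in UTF-8 bytes.
    mbpc = max(len(c.encode("utf8")) for c in addr)
    # Fewest characters that could account for the excess bytes.
    mchars = max(1, diff // mbpc)
    suffix = "s" if diff > 1 else ""
    if mchars == diff:
        return f"({diff} character{suffix} too many)"
    return f"({mchars}-{diff} character{suffix} too many)"


ascii_addr = "a" * 250 + "@b.co"             # 255 bytes, all single-byte
greek_addr = "\u03bb" + "a" * 249 + "@b.co"  # 256 bytes, widest char is 2 bytes
hebrew_addr = "\uFB2C" + "a" * 249 + "@b.co" # 257 bytes, widest char is 3 bytes

print(excess_description(ascii_addr))   # (1 character too many)
print(excess_description(greek_addr))   # (1-2 characters too many)
print(excess_description(hebrew_addr))  # (1-3 characters too many)
```

The lower bound is the best case, where the user removes the widest characters; the upper bound is the case where only single-byte characters are removed.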
