Commit 1b9e867

Parse quoted-string local parts but by default keep them disallowed with better exception messages
People have opened issues several times about quoted local parts being incorrectly rejected. We can give a better error when it happens to head off questions about it by parsing them so that we know when they occur.

* Detect when a quoted-string local part might be present when splitting the address into a local part and domain part when the address has quoted @-signs in the local part rather than giving an error message about multiple @-signs.
* Remove the surrounding quotes and un-escape the string before checking the syntax of the local part. Return the un-quoted and un-escaped string as the normalized local_part in the returned ValidatedEmail object if it's valid as an unquoted local part.
* Check for invalid characters in the quoted-string (per the spec and our additional Unicode character checks) and raise exceptions.
* Add a new option to accept quoted-string local parts which is off by default. When accepting them, apply Unicode normalization as per dot-atom internationalized addresses and apply minimal backslash escaping.
* Update tests.

See #54, #92.
1 parent 8fdbaba commit 1b9e867

7 files changed

+201
-58
lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -9,6 +9,7 @@ There are no significant changes to which email addresses are considered valid/i
 * The dnspython package is no longer required if DNS checks are not used, although it will install automatically.
 * NoNameservers and NXDOMAIN DNS errors are now handled differently: NoNameservers no longer fails validation, and NXDOMAIN now skips checking for an A/AAAA fallback and goes straight to failing validation.
 * Some syntax error messages have changed because they are now checked explicitly rather than as a part of other checks.
+* The quoted-string local part syntax (e.g. multiple @-signs, spaces, etc. if surrounded by quotes) is now parsed but not considered valid by default. Better error messages are now given for quoted-string syntax since it can be confusing for a technically valid address to be rejected, and a new allow_quoted_local option is added to allow these addresses if you really need them.
 * Some other error messages have changed to not repeat the email address in the error message.
 * The library has been reorganized internally into smaller modules.
 * The tests have been reorganized and expanded. Deliverability tests now mostly use captured DNS responses so they can be run off-line.

README.md

Lines changed: 20 additions & 12 deletions
@@ -19,17 +19,18 @@ Key features:
   can display to end-users.
 * Checks deliverability (optional): Does the domain name resolve?
   (You can override the default DNS resolver to add query caching.)
-* Supports internationalized domain names and internationalized local parts.
+* Supports internationalized domain names and internationalized local parts,
+  and with an option deprecated quoted-string local parts.
   Blocks unsafe characters for your safety.
 * Normalizes email addresses (important for internationalized
-  addresses! see below).
+  and quoted-string addresses! see below).
 * Python type annotations are used.

-This library does NOT permit obsolete forms of email addresses, so
-if you need strict validation against the email specs exactly, use
+This library does NOT permit obsolete forms of email addresses by default,
+so if you need strict validation against the email specs exactly, use
 [pyIsEmail](https://github.com/michaelherold/pyIsEmail) or try
 [flanker](https://github.com/mailgun/flanker) if you are parsing the
-To: line of an email.
+"To:" line of an email.

 [![Build Status](https://github.com/JoshData/python-email-validator/actions/workflows/test_and_build.yaml/badge.svg)](https://github.com/JoshData/python-email-validator/actions/workflows/test_and_build.yaml)

@@ -103,8 +104,8 @@ But when an email address is valid, an object is returned containing
 a normalized form of the email address (which you should use!) and
 other information.

-The validator doesn't permit obsoleted forms of email addresses that no
-one uses anymore even though they are still valid and deliverable, since
+The validator doesn't, by default, permit obsoleted forms of email addresses
+that no one uses anymore even though they are still valid and deliverable, since
 they will probably give you grief if you're using email for login. (See
 later in the document about that.)

@@ -134,6 +135,8 @@ The `validate_email` function also accepts the following keyword arguments
 require the
 [SMTPUTF8](https://tools.ietf.org/html/rfc6531) extension. You can also set `email_validator.ALLOW_SMTPUTF8` to `False` to turn it off for all calls by default.

+`allow_quoted_local=False`: Set to `True` to allow obscure and potentially problematic email addresses in which the part of the address before the @-sign contains spaces, @-signs, or other surprising characters when the local part is surrounded in quotes (so-called quoted-string local parts). In the object returned by `validate_email`, the normalized local part removes any unnecessary backslash-escaping and even removes the surrounding quotes if the address would be valid without them. You can also set `email_validator.ALLOW_QUOTED_LOCAL` to `True` to turn this on for all calls by default.
+
 `allow_empty_local=False`: Set to `True` to allow an empty local part (i.e.
 `@example.com`), e.g. for validating Postfix aliases.

@@ -288,6 +291,11 @@ and conversion from Punycode to Unicode characters.
 3.1](https://tools.ietf.org/html/rfc6532#section-3.1) and [RFC 5895
 (IDNA 2008) section 2](http://www.ietf.org/rfc/rfc5895.txt).)

+Normalization is also applied to quoted-string local parts if you have
+allowed them by the `allow_quoted_local` option. Unnecessary backslash
+escaping is removed and even the surrounding quotes are removed if they
+are unnecessary.
+
 Examples
 --------

@@ -355,9 +363,9 @@ are:

 | Field | Value |
 | -----:|-------|
-| `email` | The normalized form of the email address that you should put in your database. This merely combines the `local_part` and `domain` fields (see below). |
+| `email` | The normalized form of the email address that you should put in your database. This combines the `local_part` and `domain` fields (see below). |
 | `ascii_email` | If set, an ASCII-only form of the email address by replacing the domain part with [IDNA](https://tools.ietf.org/html/rfc5891) [Punycode](https://www.rfc-editor.org/rfc/rfc3492.txt). This field will be present when an ASCII-only form of the email address exists (including if the email address is already ASCII). If the local part of the email address contains internationalized characters, `ascii_email` will be `None`. If set, it merely combines `ascii_local_part` and `ascii_domain`. |
-| `local_part` | The local part of the given email address (before the @-sign) with Unicode NFC normalization applied. |
+| `local_part` | The normalized local part of the given email address (before the @-sign). Normalization includes Unicode NFC normalization and removing unnecessary quoted-string quotes and backslashes. If `allow_quoted_local` is True and the surrounding quotes are necessary, the quotes _will_ be present in this field. |
 | `ascii_local_part` | If set, the local part, which is composed of ASCII characters only. |
 | `domain` | The canonical internationalized Unicode form of the domain part of the email address. If the returned string contains non-ASCII characters, either the [SMTPUTF8](https://tools.ietf.org/html/rfc6531) feature of your mail relay will be required to transmit the message or else the email address's domain part must be converted to IDNA ASCII first: Use `ascii_domain` field instead. |
 | `ascii_domain` | The [IDNA](https://tools.ietf.org/html/rfc5891) [Punycode](https://www.rfc-editor.org/rfc/rfc3492.txt)-encoded form of the domain part of the given email address, as it would be transmitted on the wire. |
@@ -383,9 +391,9 @@ or likely to cause trouble:
   (except see the `test_environment` parameter above).
 * Obsolete email syntaxes are rejected:
   The "quoted string" form of the local part of the email address (RFC
-  5321 4.1.2) is not permitted.
-  Quoted forms allow multiple @-signs, space characters, and other
-  troublesome conditions. The unusual [(comment) syntax](https://github.com/JoshData/python-email-validator/issues/77)
+  5321 4.1.2) is not permitted unless `allow_quoted_local=True` is given
+  (see above).
+  The unusual ["(comment)" syntax](https://github.com/JoshData/python-email-validator/issues/77)
   is also rejected. The "literal" form for the domain part of an email address (an
   IP address in brackets) is rejected. Other obsolete and deprecated syntaxes are
   rejected. No one uses these forms anymore.
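
The README changes above describe the normalization rule for quoted-string local parts: remove unnecessary backslash escapes, and drop the quotes entirely when the result is valid without them. The following is a minimal, self-contained sketch of that rule, not the library's code. The helper name `normalize_quoted_local` and the simplified ASCII-only `DOT_ATOM` pattern are assumptions made for illustration; only the re-quoting `re.sub` call mirrors what this commit adds to `syntax.py`.

```python
import re

# Assumed, simplified stand-in for the library's dot-atom rule (ASCII only).
# The library's real DOT_ATOM_TEXT constant is not shown in this diff.
ATEXT = r"a-zA-Z0-9_!#$%&'*+\-/=?^`{|}~"
DOT_ATOM = re.compile(rf"[{ATEXT}]+(?:\.[{ATEXT}]+)*\Z")


def normalize_quoted_local(quoted: str) -> str:
    """Hypothetical helper: normalize a quoted-string local part."""
    if not (len(quoted) >= 2 and quoted[0] == '"' and quoted[-1] == '"'):
        raise ValueError("not a quoted-string local part")

    # Drop the surrounding quotes, then turn each backslash pair
    # (backslash + character) back into the bare character.
    inner = re.sub(r"\\(.)", r"\1", quoted[1:-1])

    # If the un-escaped text is a plain dot-atom, the quotes were unnecessary.
    if DOT_ATOM.match(inner):
        return inner

    # Otherwise re-quote, escaping only quotes and backslashes
    # (the same substitution the commit uses in syntax.py).
    return '"' + re.sub(r'(["\\])', r"\\\1", inner) + '"'
```

Under these assumptions, `normalize_quoted_local('"user"')` gives `user`, while `normalize_quoted_local('"john doe"')` keeps its quotes because a space is not allowed in a dot-atom.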

email_validator/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -25,9 +25,10 @@ def caching_resolver(*args, **kwargs):
 # Default values for keyword arguments.

 ALLOW_SMTPUTF8 = True
+ALLOW_QUOTED_LOCAL = False
+GLOBALLY_DELIVERABLE = True
 CHECK_DELIVERABILITY = True
 TEST_ENVIRONMENT = False
-GLOBALLY_DELIVERABLE = True
 DEFAULT_TIMEOUT = 15  # secs

 # IANA Special Use Domain Names

email_validator/rfc_constants.py

Lines changed: 9 additions & 0 deletions
@@ -27,6 +27,15 @@
 DOT_ATOM_TEXT_HOSTNAME = re.compile(HOSTNAME_LABEL + r'(?:\.' + HOSTNAME_LABEL + r')*\Z')
 DOMAIN_NAME_REGEX = re.compile(r"[A-Za-z]\Z")  # all TLDs currently end with a letter

+# Quoted-string local part (RFC 5321 4.1.2, internationalized by RFC 6531 section 3.3)
+# The permitted characters in a quoted string are the characters in the range
+# 32-126, except that quotes and (literal) backslashes can only appear when escaped
+# by a backslash. When internationalized, UTF8 strings are also permitted except
+# the ASCII characters that are not previously permitted (see above).
+# QUOTED_LOCAL_PART_ADDR = re.compile(r"^\"((?:[\u0020-\u0021\u0023-\u005B\u005D-\u007E]|\\[\u0020-\u007E])*)\"@(.*)")
+QUOTED_LOCAL_PART_ADDR = re.compile(r"^\"((?:[^\"\\]|\\.)*)\"@(.*)")
+QTEXT_INTL = re.compile(r"[\u0020-\u007E\u0080-\U0010FFFF]")
+
 # Length constants
 # RFC 3696 + errata 1003 + errata 1690 (https://www.rfc-editor.org/errata_search.php?rfc=3696&eid=1690)
 # explains the maximum length of an email address is 254 octets.
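
To see what the two new constants do in isolation, here is a small sketch that copies them from the hunk above and applies them to a few strings. The helper names `split_quoted` and `invalid_quoted_chars` are hypothetical and exist only for this illustration; how the library itself calls these patterns when splitting an address is not shown in this part of the diff.

```python
import re

# Both patterns are copied verbatim from the rfc_constants.py hunk.
QUOTED_LOCAL_PART_ADDR = re.compile(r"^\"((?:[^\"\\]|\\.)*)\"@(.*)")
QTEXT_INTL = re.compile(r"[\u0020-\u007E\u0080-\U0010FFFF]")


def split_quoted(addr: str):
    """Hypothetical helper: return (local, domain) for an address with a
    quoted-string local part, or None if the address does not start with one.
    The returned local part is still backslash-escaped."""
    m = QUOTED_LOCAL_PART_ADDR.match(addr)
    if m is None:
        return None
    return m.group(1), m.group(2)


def invalid_quoted_chars(local: str):
    """Hypothetical helper: list characters outside the QTEXT_INTL range."""
    return sorted({c for c in local if not QTEXT_INTL.match(c)})
```

Because the local-part group accepts any character that is not a bare quote or backslash, an @-sign inside the quotes no longer looks like the separator: `split_quoted('"a@b c"@example.com')` yields `('a@b c', 'example.com')`. Character validity is a separate step, which is why a tab passes the split but is reported by `invalid_quoted_chars`.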

email_validator/syntax.py

Lines changed: 93 additions & 22 deletions
@@ -1,10 +1,12 @@
 from .exceptions_types import EmailSyntaxError
 from .rfc_constants import EMAIL_MAX_LENGTH, LOCAL_PART_MAX_LENGTH, DOMAIN_MAX_LENGTH, \
-    DOT_ATOM_TEXT, DOT_ATOM_TEXT_INTL, ATEXT_RE, ATEXT_INTL_RE, ATEXT_HOSTNAME_INTL, DNS_LABEL_LENGTH_LIMIT, DOT_ATOM_TEXT_HOSTNAME, DOMAIN_NAME_REGEX
+    DOT_ATOM_TEXT, DOT_ATOM_TEXT_INTL, ATEXT_RE, ATEXT_INTL_RE, ATEXT_HOSTNAME_INTL, QTEXT_INTL, \
+    DNS_LABEL_LENGTH_LIMIT, DOT_ATOM_TEXT_HOSTNAME, DOMAIN_NAME_REGEX

 import re
 import unicodedata
 import idna  # implements IDNA 2008; Python's codec is only IDNA 2003
+from typing import Optional


 def get_length_reason(addr, utf8=False, limit=EMAIL_MAX_LENGTH):
@@ -32,7 +34,8 @@ def safe_character_display(c):
     return unicodedata.name(c, h)


-def validate_email_local_part(local: str, allow_smtputf8: bool = True, allow_empty_local: bool = False):
+def validate_email_local_part(local: str, allow_smtputf8: bool = True, allow_empty_local: bool = False,
+                              quoted_local_part: bool = False):
     """Validates the syntax of the local part of an email address."""

     if len(local) == 0:
@@ -61,24 +64,32 @@ def validate_email_local_part(local: str, allow_smtputf8: bool = True, allow_emp
     # Check the local part against the non-internationalized regular expression.
     # Most email addresses match this regex so it's probably fastest to check this first.
     # (RFC 2822 3.2.4)
+    # All local parts matching the dot-atom rule are also valid as a quoted string
+    # so if it was originally quoted (quoted_local_part is True) and this regex matches,
+    # it's ok.
+    # (RFC 5321 4.1.2).
     m = DOT_ATOM_TEXT.match(local)
     if m:
-        # It's valid.
+        # It's valid. And since it's just the permitted ASCII characters,
+        # it's normalized and safe. If the local part was originally quoted,
+        # the quoting was unnecessary and it'll be returned as normalized to
+        # non-quoted form.

-        # Return the local part unchanged and flag that SMTPUTF8 is not needed.
+        # Return the local part and flag that SMTPUTF8 is not needed.
         return {
             "local_part": local,
             "ascii_local_part": local,
             "smtputf8": False,
         }

-    # The local part failed the ASCII check. Try the extended character set
+    # The local part failed the basic dot-atom check. Try the extended character set
     # for internationalized addresses. It's the same pattern but with additional
     # characters permitted.
+    # RFC 6531 section 3.3.
+    valid: Optional[str] = None
+    requires_smtputf8 = False
     m = DOT_ATOM_TEXT_INTL.match(local)
     if m:
-        # It's valid.
-
         # But international characters in the local part may not be permitted.
         if not allow_smtputf8:
             # Check for invalid characters against the non-internationalized
@@ -95,15 +106,56 @@ def validate_email_local_part(local: str, allow_smtputf8: bool = True, allow_emp
             # Although the check above should always find something, fall back to this just in case.
             raise EmailSyntaxError("Internationalized characters before the @-sign are not supported.")

-        # RFC 6532 section 3.1 also says that Unicode NFC normalization should be applied,
+        # It's valid.
+        valid = "dot-atom"
+        requires_smtputf8 = True
+
+    # There are no syntactic restrictions on quoted local parts, so if
+    # it was originally quoted, it is probably valid. More characters
+    # are allowed, like @-signs, spaces, and quotes, and there are no
+    # restrictions on the placement of dots, as in dot-atom local parts.
+    elif quoted_local_part:
+        # Check for invalid characters in a quoted string local part.
+        # (RFC 5321 4.1.2. RFC 5322 lists additional permitted *obsolete*
+        # characters which are *not* allowed here. RFC 6531 section 3.3
+        # extends the range to UTF8 strings.)
+        bad_chars = set(
+            safe_character_display(c)
+            for c in local
+            if not QTEXT_INTL.match(c)
+        )
+        if bad_chars:
+            raise EmailSyntaxError("The email address contains invalid characters in quotes before the @-sign: " + ", ".join(sorted(bad_chars)) + ".")
+
+        # See if any characters are outside of the ASCII range.
+        bad_chars = set(
+            safe_character_display(c)
+            for c in local
+            if not (32 <= ord(c) <= 126)
+        )
+        if bad_chars:
+            requires_smtputf8 = True
+
+            # International characters in the local part may not be permitted.
+            if not allow_smtputf8:
+                raise EmailSyntaxError("Internationalized characters before the @-sign are not supported: " + ", ".join(sorted(bad_chars)) + ".")
+
+        # It's valid.
+        valid = "quoted"
+
+    # If the local part matches the internationalized dot-atom form or was quoted,
+    # perform normalization and additional checks for Unicode strings.
+    if valid:
+        # RFC 6532 section 3.1 says that Unicode NFC normalization should be applied,
         # so we'll return the normalized local part in the return value.
         local = unicodedata.normalize("NFC", local)

         # Check that the local part is a valid, safe, and sensible Unicode string.
         # Some of this may be redundant with the range U+0080 to U+10FFFF that is checked
-        # by DOT_ATOM_TEXT_INTL. Other characters may be permitted by the email specs, but
-        # they may not be valid, safe, or sensible Unicode strings.
-        check_unsafe_chars(local)
+        # by DOT_ATOM_TEXT_INTL and QTEXT_INTL. Other characters may be permitted by the
+        # email specs, but they may not be valid, safe, or sensible Unicode strings.
+        # See the function for rationale.
+        check_unsafe_chars(local, allow_space=(valid == "quoted"))

         # Try encoding to UTF-8. Failure is possible with some characters like
         # surrogate code points, but those are checked above. Still, we don't
@@ -113,15 +165,22 @@ def validate_email_local_part(local: str, allow_smtputf8: bool = True, allow_emp
         except ValueError:
             raise EmailSyntaxError("The email address contains an invalid character.")

-        # Flag that SMTPUTF8 will be required for deliverability.
+        # If this address passes only by the quoted string form, re-quote it
+        # and backslash-escape quotes and backslashes (removing any unnecessary
+        # escapes). Per RFC 5321 4.1.2, "all quoted forms MUST be treated as equivalent,
+        # and the sending system SHOULD transmit the form that uses the minimum quoting possible."
+        if valid == "quoted":
+            local = '"' + re.sub(r'(["\\])', r'\\\1', local) + '"'
+
         return {
             "local_part": local,
-            "ascii_local_part": None,  # no ASCII form is possible
-            "smtputf8": True,
+            "ascii_local_part": local if not requires_smtputf8 else None,
+            "smtputf8": requires_smtputf8,
         }

-    # It's not a valid local part either non-internationalized or internationalized.
-    # Let's find out why.
+    # It's not a valid local part. Let's find out why.
+    # (Since quoted local parts are all valid or handled above, these checks
+    # don't apply in those cases.)

     # Check for invalid characters.
     # (RFC 2822 Section 3.2.4 / RFC 5322 Section 3.2.3, plus RFC 6531 section 3.3)
@@ -142,7 +201,7 @@ def validate_email_local_part(local: str, allow_smtputf8: bool = True, allow_emp
     raise EmailSyntaxError("The email address contains invalid characters before the @-sign.")


-def check_unsafe_chars(s):
+def check_unsafe_chars(s, allow_space=False):
     # Check for unsafe characters or characters that would make the string
     # invalid or non-sensible Unicode.
     bad_chars = set()
@@ -158,13 +217,25 @@ def check_unsafe_chars(s):
             # sensible.
             if i == 0:
                 bad_chars.add(c)
+        elif category == "Zs":
+            # Spaces outside of the ASCII range are not specifically disallowed in
+            # internationalized addresses as far as I can tell, but they violate
+            # the spirit of the non-internationalized specification that email
+            # addresses do not contain ASCII spaces when not quoted. Excluding
+            # ASCII spaces when not quoted is handled directly by the atom regex.
+            #
+            # In quoted-string local parts, spaces are explicitly permitted, and
+            # the ASCII space has category Zs, so we must allow it here, and we'll
+            # allow all Unicode spaces to be consistent.
+            if not allow_space:
+                bad_chars.add(c)
         elif category[0] == "Z":
-            # Spaces and line/paragraph characters (Z) outside of the ASCII range
-            # are not specifically disallowed as far as I can tell, but they
-            # violate the spirit of the non-internationalized specification that
-            # email addresses do not contain spaces or line breaks when not quoted.
+            # The two line and paragraph separator characters (in categories Zl and Zp)
+            # are not specifically disallowed in internationalized addresses
+            # as far as I can tell, but they violate the spirit of the non-internationalized
+            # specification that email addresses do not contain line breaks when not quoted.
             bad_chars.add(c)
-        elif category[0] == "C":
+        elif category[0] in ("C", "Z"):
             # Control, format, surrogate, private use, and unassigned code points (C)
             # are all unsafe in various ways. Control and format characters can affect
             # text rendering if the email address is concatenated with other text.
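
The last hunk splits the separator handling in `check_unsafe_chars` by Unicode general category: space separators (`Zs`) are tolerated only when `allow_space` is set, while line and paragraph separators (`Zl`, `Zp`) are always flagged. The sketch below isolates just those two branches so the effect is easy to try out. The function name `disallowed_separators` is hypothetical, and the real function checks several other categories that are omitted here.

```python
import unicodedata


def disallowed_separators(s: str, allow_space: bool = False) -> set:
    """Hypothetical, reduced version of the Z-category branches only."""
    bad = set()
    for c in s:
        category = unicodedata.category(c)
        if category == "Zs":
            # Space separators (ASCII space, NO-BREAK SPACE, ...) are only
            # acceptable when the local part was quoted.
            if not allow_space:
                bad.add(c)
        elif category[0] == "Z":
            # Zl / Zp: line and paragraph separators are always rejected.
            bad.add(c)
    return bad
```

With `allow_space=True`, as the commit passes for quoted local parts, `"john doe"` produces an empty set, while a string containing U+2028 LINE SEPARATOR is still reported.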
