Commit 4691a62

Parse display name <addr> syntax
Per request in #116, parse display name syntax also, but don't allow it unless a new allow_display_name option is set. Parsing according to the MIME specification probably isn't what's generally wanted since the use case is probably parsing inputs in email composition-like user interfaces. So it's in the spirit of a MIME message but not the letter. If display name syntax is permitted, return the unquoted/unescaped display name in the returned object.
1 parent 3b1b45c commit 4691a62

8 files changed: +220 -43 lines changed
CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -1,6 +1,7 @@
11
In Development
22
--------------
33

4+
* A new option to parse `My Name <address@domain>` strings, i.e. a display name plus an email address in angle brackets, is now available. It is off by default.
45
* When a domain name has no MX record but does have an A or AAAA record, if none of the IP addresses in the response are globally reachable (i.e. not Private-Use, Loopback, etc.), the response is treated as if there was no A/AAAA response and the email address will fail the deliverability check.
56
* When a domain name has no MX record but does have an A or AAAA record, the mx field in the object returned by validate_email incorrectly held the IP addresses rather than the domain itself.
67
* Fixes in tests.

README.md

Lines changed: 10 additions & 7 deletions
@@ -7,8 +7,7 @@ Python 3.8+ by [Joshua Tauberer](https://joshdata.me).
77
This library validates that a string is of the form `name@example.com`
88
and optionally checks that the domain name is set up to receive email.
99
This is the sort of validation you would want when you are identifying
10-
users by their email address like on a registration/login form (but not
11-
necessarily for composing an email message, see below).
10+
users by their email address like on a registration form.
1211

1312
Key features:
1413

@@ -18,7 +17,9 @@ Key features:
1817
can display to end-users.
1918
* Checks deliverability (optional): Does the domain name resolve?
2019
(You can override the default DNS resolver to add query caching.)
21-
* Supports internationalized domain names and internationalized local parts.
20+
* Supports internationalized domain names (like `@ツ.life`),
21+
internationalized local parts (like `ツ@example.com`),
22+
and optionally parses display names (e.g. `"My Name" <me@example.com>`).
2223
* Rejects addresses with unsafe Unicode characters, obsolete email address
2324
syntax that you'd find unexpected, special use domain names like
2425
`@localhost`, and domains without a dot by default. This is an
@@ -28,9 +29,8 @@ Key features:
2829
* Python type annotations are used.
2930

3031
This is an opinionated library. You should definitely also consider using
31-
the less-opinionated [pyIsEmail](https://github.com/michaelherold/pyIsEmail) and
32-
[flanker](https://github.com/mailgun/flanker) if they are better for your
33-
use case.
32+
the less-opinionated [pyIsEmail](https://github.com/michaelherold/pyIsEmail)
33+
if it works better for you.
3434

3535
[![Build Status](https://github.com/JoshData/python-email-validator/actions/workflows/test_and_build.yaml/badge.svg)](https://github.com/JoshData/python-email-validator/actions/workflows/test_and_build.yaml)
3636

@@ -144,6 +144,8 @@ The `validate_email` function also accepts the following keyword arguments
144144

145145
`allow_domain_literal=False`: Set to `True` to allow bracketed IPv4 and "IPv6:"-prefixed IPv6 addresses in the domain part of the email address. No deliverability checks are performed for these addresses. In the object returned by `validate_email`, the normalized domain will use the condensed IPv6 format, if applicable. The object's `domain_address` attribute will hold the parsed `ipaddress.IPv4Address` or `ipaddress.IPv6Address` object if applicable. You can also set `email_validator.ALLOW_DOMAIN_LITERAL` to `True` to turn this on for all calls by default.
146146

147+
`allow_display_name=False`: Set to `True` to allow a display name and bracketed address in the input string, like `My Name <me@example.org>`. It's implemented in the spirit but not the letter of RFC 5322 3.4, so it may be stricter or more relaxed than what you want. The display name, if present, is provided in the returned object's `display_name` field after being unquoted and unescaped. You can also set `email_validator.ALLOW_DISPLAY_NAME` to `True` to turn this on for all calls by default.
148+
147149
`allow_empty_local=False`: Set to `True` to allow an empty local part (i.e.
148150
`@example.com`), e.g. for validating Postfix aliases.
149151

@@ -395,6 +397,7 @@ are:
395397
| `domain` | The canonical internationalized Unicode form of the domain part of the email address. If the returned string contains non-ASCII characters, either the [SMTPUTF8](https://tools.ietf.org/html/rfc6531) feature of your mail relay will be required to transmit the message or else the email address's domain part must be converted to IDNA ASCII first: Use `ascii_domain` field instead. |
396398
| `ascii_domain` | The [IDNA](https://tools.ietf.org/html/rfc5891) [Punycode](https://www.rfc-editor.org/rfc/rfc3492.txt)-encoded form of the domain part of the given email address, as it would be transmitted on the wire. |
397399
| `domain_address` | If domain literals are allowed and if the email address contains one, an `ipaddress.IPv4Address` or `ipaddress.IPv6Address` object. |
400+
| `display_name` | If no display name was present and angle brackets do not surround the address, this will be `None`; otherwise, it will be set to the display name, or the empty string if there were angle brackets but no display name. If the display name was quoted, it will be unquoted and unescaped. |
398401
| `smtputf8` | A boolean indicating that the [SMTPUTF8](https://tools.ietf.org/html/rfc6531) feature of your mail relay will be required to transmit messages to this address because the local part of the address has non-ASCII characters (the local part cannot be IDNA-encoded). If `allow_smtputf8=False` is passed as an argument, this flag will always be false because an exception is raised if it would have been true. |
399402
| `mx` | A list of (priority, domain) tuples of MX records specified in the DNS for the domain (see [RFC 5321 section 5](https://tools.ietf.org/html/rfc5321#section-5)). May be `None` if the deliverability check could not be completed because of a temporary issue like a timeout. |
400403
| `mx_fallback_type` | `None` if an `MX` record is found. If no MX records are actually specified in DNS and instead are inferred, through an obsolete mechanism, from A or AAAA records, the value is the type of DNS record used instead (`A` or `AAAA`). May be `None` if the deliverability check could not be completed because of a temporary issue like a timeout. |
@@ -458,4 +461,4 @@ git push --tags
458461
License
459462
-------
460463

461-
This project is free of any copyright restrictions per the [Unlicense](https://unlicense.org/). (Prior to Feb. 4, 2024, the project was made available under the terms of the [CC0 1.0 Universal public domain dedication](http://creativecommons.org/publicdomain/zero/1.0/).) See [LICENSE](LICENSE) and [CONTRIBUTING.md](CONTRIBUTING.md).
464+
This project is free of any copyright restrictions per the [Unlicense](https://unlicense.org/). (Prior to Feb. 4, 2024, the project was made available under the terms of the [CC0 1.0 Universal public domain dedication](http://creativecommons.org/publicdomain/zero/1.0/).) See [LICENSE](LICENSE) and [CONTRIBUTING.md](CONTRIBUTING.md).
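
The `display_name` row added to the README table above distinguishes three cases: `None` (no angle brackets), the empty string (angle brackets but no name), and an unquoted, unescaped name. Because the value comes back unquoted and unescaped, a caller who wants to render the address again has to re-quote it. The sketch below shows one way to handle the three cases. `format_address` is a hypothetical helper written for illustration; it is not part of `email_validator`, and it always quotes a non-empty name for simplicity.

```python
from typing import Optional


def format_address(display_name: Optional[str], address: str) -> str:
    """Hypothetical helper (not part of email_validator).

    Rebuilds a `"Name" <addr>` style string from the `display_name`
    value described in the README table and an already-validated address.
    """
    if display_name is None:
        # No display name and no angle brackets in the original input.
        return address
    if display_name == "":
        # Angle brackets were present, but no display name.
        return f"<{address}>"
    # The field is unquoted and unescaped, so escape backslashes first
    # and then double quotes before wrapping the name in quotes again.
    escaped = display_name.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}" <{address}>'


if __name__ == "__main__":
    for name in (None, "", "My Name", 'My "Nick" Name'):
        print(repr(name), "->", format_address(name, "me@example.org"))
```

Escaping backslashes before quotes matters: doing it in the other order would double the backslashes that were just added for the quotes.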

email_validator/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ def caching_resolver(*args, **kwargs):
2525
ALLOW_SMTPUTF8 = True
2626
ALLOW_QUOTED_LOCAL = False
2727
ALLOW_DOMAIN_LITERAL = False
28+
ALLOW_DISPLAY_NAME = False
2829
GLOBALLY_DELIVERABLE = True
2930
CHECK_DELIVERABILITY = True
3031
TEST_ENVIRONMENT = False

email_validator/exceptions_types.py

Lines changed: 6 additions & 1 deletion
@@ -62,6 +62,9 @@ class ValidatedEmail:
6262
mechanism, from A or AAAA records, the value is the type of DNS record used instead (`A` or `AAAA`)."""
6363
mx_fallback_type: str
6464

65+
"""The display name in the original input text, unquoted and unescaped, or None."""
66+
display_name: str
67+
6568
"""Tests use this constructor."""
6669
def __init__(self, **kwargs):
6770
for k, v in kwargs.items():
@@ -120,6 +123,7 @@ def __eq__(self, other):
120123
and repr(sorted(self.mx) if getattr(self, 'mx', None) else None)
121124
== repr(sorted(other.mx) if getattr(other, 'mx', None) else None)
122125
and getattr(self, 'mx_fallback_type', None) == getattr(other, 'mx_fallback_type', None)
126+
and getattr(self, 'display_name', None) == getattr(other, 'display_name', None)
123127
)
124128

125129
"""This helps producing the README."""
@@ -128,7 +132,8 @@ def as_constructor(self):
128132
+ ",".join(f"\n {key}={repr(getattr(self, key))}"
129133
for key in ('normalized', 'local_part', 'domain',
130134
'ascii_email', 'ascii_local_part', 'ascii_domain',
131-
'smtputf8', 'mx', 'mx_fallback_type')
135+
'smtputf8', 'mx', 'mx_fallback_type',
136+
'display_name')
132137
if hasattr(self, key)
133138
) \
134139
+ ")"

email_validator/rfc_constants.py

Lines changed: 2 additions & 3 deletions
@@ -13,7 +13,7 @@
1313
# RFC 3629 section 4, which appear to be the Unicode code points from
1414
# U+0080 to U+10FFFF.
1515
ATEXT_INTL = ATEXT + "\u0080-\U0010FFFF"
16-
ATEXT_INTL_RE = re.compile('[.' + ATEXT_INTL + ']') # ATEXT_INTL plus dots
16+
ATEXT_INTL_DOT_RE = re.compile('[.' + ATEXT_INTL + ']') # ATEXT_INTL plus dots
1717
DOT_ATOM_TEXT_INTL = re.compile('[' + ATEXT_INTL + ']+(?:\\.[' + ATEXT_INTL + r']+)*\Z')
1818

1919
# The domain part of the email address, after IDNA (ASCII) encoding,
@@ -30,10 +30,9 @@
3030
# Quoted-string local part (RFC 5321 4.1.2, internationalized by RFC 6531 3.3)
3131
# The permitted characters in a quoted string are the characters in the range
3232
# 32-126, except that quotes and (literal) backslashes can only appear when escaped
33-
# by a backslash. When internationalized, UTF8 strings are also permitted except
33+
# by a backslash. When internationalized, UTF-8 strings are also permitted except
3434
# the ASCII characters that are not previously permitted (see above).
3535
# QUOTED_LOCAL_PART_ADDR = re.compile(r"^\"((?:[\u0020-\u0021\u0023-\u005B\u005D-\u007E]|\\[\u0020-\u007E])*)\"@(.*)")
36-
QUOTED_LOCAL_PART_ADDR = re.compile(r"^\"((?:[^\"\\]|\\.)*)\"@(.*)")
3736
QTEXT_INTL = re.compile(r"[\u0020-\u007E\u0080-\U0010FFFF]")
3837

3938
# Length constants

email_validator/syntax.py

Lines changed: 140 additions & 24 deletions
@@ -1,8 +1,7 @@
11
from .exceptions_types import EmailSyntaxError
22
from .rfc_constants import EMAIL_MAX_LENGTH, LOCAL_PART_MAX_LENGTH, DOMAIN_MAX_LENGTH, \
3-
DOT_ATOM_TEXT, DOT_ATOM_TEXT_INTL, ATEXT_RE, ATEXT_INTL_RE, ATEXT_HOSTNAME_INTL, QTEXT_INTL, \
4-
DNS_LABEL_LENGTH_LIMIT, DOT_ATOM_TEXT_HOSTNAME, DOMAIN_NAME_REGEX, DOMAIN_LITERAL_CHARS, \
5-
QUOTED_LOCAL_PART_ADDR
3+
DOT_ATOM_TEXT, DOT_ATOM_TEXT_INTL, ATEXT_RE, ATEXT_INTL_DOT_RE, ATEXT_HOSTNAME_INTL, QTEXT_INTL, \
4+
DNS_LABEL_LENGTH_LIMIT, DOT_ATOM_TEXT_HOSTNAME, DOMAIN_NAME_REGEX, DOMAIN_LITERAL_CHARS
65

76
import re
87
import unicodedata
@@ -12,31 +11,148 @@
1211

1312

1413
def split_email(email):
15-
# Return the local part and domain part of the address and
16-
# whether the local part was quoted as a three-tuple.
14+
# Return the display name, unescaped local part, and domain part
15+
# of the address, and whether the local part was quoted. If no
16+
# display name was present and angle brackets do not surround
17+
# the address, display name will be None; otherwise, it will be
18+
# set to the display name or the empty string if there were
19+
# angle brackets but no display name.
20+
21+
# Typical email addresses have a single @-sign and no quote
22+
# characters, but the awkward "quoted string" local part form
23+
# (RFC 5321 4.1.2) allows @-signs and escaped quotes to appear
24+
# in the local part if the local part is quoted.
25+
26+
# A `display name <addr>` format is also present in MIME messages
27+
# (RFC 5322 3.4) and this format is also often recognized in
28+
# mail UIs. It's not allowed in SMTP commands or in typical web
29+
# login forms, but parsing it has been requested, so it's done
30+
# here as a convenience. It's implemented in the spirit but not
31+
# the letter of RFC 5322 3.4 because MIME messages allow newlines
32+
# and comments as a part of the CFWS rule, but this is typically
33+
# not allowed in mail UIs (although comment syntax was requested
34+
# once too).
35+
#
36+
# Display names are either basic characters (the same basic characters
37+
# permitted in email addresses, but periods are not allowed and spaces
38+
# are allowed; see RFC 5322 Appendix A.1.2), or a quoted string with
39+
# the same rules as a quoted local part. (Multiple quoted strings might
40+
# be allowed? Unclear.) Optional space (RFC 5322 3.4 CFWS) and then the
41+
# email address follows in angle brackets.
42+
#
43+
# An initial quote is ambiguous between starting a display name or
44+
# a quoted local part --- fun.
45+
#
46+
# We assume the input string is already stripped of leading and
47+
# trailing CFWS.
48+
49+
def split_string_at_unquoted_special(text, specials):
50+
# Split the string at the first character in specials (an @-sign
51+
# or left angle bracket) that does not occur within quotes.
52+
inside_quote = False
53+
escaped = False
54+
left_part = ""
55+
for c in text:
56+
if inside_quote:
57+
left_part += c
58+
if c == '\\' and not escaped:
59+
escaped = True
60+
elif c == '"' and not escaped:
61+
# The only way to exit the quote is an unescaped quote.
62+
inside_quote = False
63+
escaped = False
64+
else:
65+
escaped = False
66+
elif c == '"':
67+
left_part += c
68+
inside_quote = True
69+
elif c in specials:
70+
# When unquoted, stop before a special character.
71+
break
72+
else:
73+
left_part += c
74+
75+
# The right part is whatever is left.
76+
right_part = text[len(left_part):]
77+
78+
return left_part, right_part
79+
80+
def unquote_quoted_string(text):
81+
# Remove surrounding quotes and unescape escaped backslashes
82+
# and quotes. Escapes are parsed liberally. I think only
83+
# backslashes and quotes can be escaped but we'll allow anything
84+
# to be.
85+
quoted = False
86+
escaped = False
87+
value = ""
88+
for i, c in enumerate(text):
89+
if quoted:
90+
if escaped:
91+
value += c
92+
escaped = False
93+
elif c == '\\':
94+
escaped = True
95+
elif c == '"':
96+
if i != len(text) - 1:
97+
raise EmailSyntaxError("Extra character(s) found after close quote: "
98+
+ ", ".join(safe_character_display(c) for c in text[i + 1:]))
99+
break
100+
else:
101+
value += c
102+
elif i == 0 and c == '"':
103+
quoted = True
104+
else:
105+
value += c
106+
107+
return value, quoted
108+
109+
# Split the string at the first unquoted @-sign or left angle bracket.
110+
left_part, right_part = split_string_at_unquoted_special(email, ("@", "<"))
111+
112+
# If the right part starts with an angle bracket,
113+
# then the left part is a display name and the rest
114+
# of the right part up to the final right angle bracket
115+
# is the email address.
116+
if right_part.startswith("<"):
117+
# Remove space between the display name and angle bracket.
118+
left_part = left_part.rstrip()
119+
120+
# Unquote and unescape the display name.
121+
display_name, display_name_quoted = unquote_quoted_string(left_part)
122+
123+
# Check that only basic characters are present in a
124+
# non-quoted display name.
125+
if not display_name_quoted:
126+
bad_chars = {
127+
safe_character_display(c)
128+
for c in display_name
129+
if (not ATEXT_RE.match(c) and c != ' ') or c == '.'
130+
}
131+
if bad_chars:
132+
raise EmailSyntaxError("The display name contains invalid characters when not quoted: " + ", ".join(sorted(bad_chars)) + ".")
17133

18-
# Typical email addresses have a single @-sign, but the
19-
# awkward "quoted string" local part form (RFC 5321 4.1.2)
20-
# allows @-signs (and escaped quotes) to appear in the local
21-
# part if the local part is quoted. If the address is quoted,
22-
# split it at a non-escaped @-sign and unescape the escaping.
23-
if m := QUOTED_LOCAL_PART_ADDR.match(email):
24-
local_part, domain_part = m.groups()
134+
# Check for other unsafe characters.
135+
check_unsafe_chars(display_name, allow_space=True)
25136

26-
# Since backslash-escaping is no longer needed because
27-
# the quotes are removed, remove backslash-escaping
28-
# to return in the normalized form.
29-
local_part = re.sub(r"\\(.)", "\\1", local_part)
137+
# Remove the initial and trailing angle brackets.
138+
addr_spec = right_part[1:].rstrip(">")
30139

31-
return local_part, domain_part, True
140+
# Split the email address at the first unquoted @-sign.
141+
local_part, domain_part = split_string_at_unquoted_special(addr_spec, ("@",))
32142

143+
# Otherwise there is no display name. The left part is the local
144+
# part and the right part is the domain.
33145
else:
34-
# Split at the one and only at-sign.
35-
parts = email.split('@')
36-
if len(parts) != 2:
37-
raise EmailSyntaxError("The email address is not valid. It must have exactly one @-sign.")
38-
local_part, domain_part = parts
39-
return local_part, domain_part, False
146+
display_name = None
147+
local_part, domain_part = left_part, right_part
148+
149+
if domain_part.startswith("@"):
150+
domain_part = domain_part[1:]
151+
152+
# Unquote the local part if it is quoted.
153+
local_part, is_quoted_local_part = unquote_quoted_string(local_part)
154+
155+
return display_name, local_part, domain_part, is_quoted_local_part
40156

41157

42158
def get_length_reason(addr, utf8=False, limit=EMAIL_MAX_LENGTH):
@@ -215,7 +331,7 @@ def validate_email_local_part(local: str, allow_smtputf8: bool = True, allow_emp
215331
bad_chars = {
216332
safe_character_display(c)
217333
for c in local
218-
if not ATEXT_INTL_RE.match(c)
334+
if not ATEXT_INTL_DOT_RE.match(c)
219335
}
220336
if bad_chars:
221337
raise EmailSyntaxError("The email address contains invalid characters before the @-sign: " + ", ".join(sorted(bad_chars)) + ".")
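
The two helpers that the new `split_email` relies on are easier to follow outside the diff gutter. What follows is a standalone sketch of the same quote-aware scan and unquoting logic, not the library's code verbatim. It assumes `ValueError` as a stand-in for `EmailSyntaxError`, simplifies the error message (the original formats the trailing characters with `safe_character_display`), and builds strings with lists rather than `+=`.

```python
def split_at_unquoted_special(text, specials):
    """Split `text` before the first character in `specials` that is
    not inside a double-quoted run. Backslash escapes are honoured
    only inside quotes, as in the diff above."""
    inside_quote = False
    escaped = False
    left = []
    for c in text:
        if inside_quote:
            left.append(c)
            if c == "\\" and not escaped:
                escaped = True
            elif c == '"' and not escaped:
                # Only an unescaped quote ends the quoted run.
                inside_quote = False
            else:
                escaped = False
        elif c == '"':
            left.append(c)
            inside_quote = True
        elif c in specials:
            # Unquoted special character: stop before it.
            break
        else:
            left.append(c)
    left_part = "".join(left)
    return left_part, text[len(left_part):]


def unquote_quoted_string(text):
    """Strip surrounding quotes and backslash escapes.
    Returns (value, was_quoted). Any character may be escaped."""
    quoted = False
    escaped = False
    value = []
    for i, c in enumerate(text):
        if quoted:
            if escaped:
                value.append(c)
                escaped = False
            elif c == "\\":
                escaped = True
            elif c == '"':
                if i != len(text) - 1:
                    # Stand-in for EmailSyntaxError in the library.
                    raise ValueError("Extra character(s) found after close quote.")
                break
            else:
                value.append(c)
        elif i == 0 and c == '"':
            quoted = True
        else:
            value.append(c)
    return "".join(value), quoted


if __name__ == "__main__":
    left, right = split_at_unquoted_special('"My Name" <me@example.org>', ("@", "<"))
    print(repr(left), repr(right))
    print(unquote_quoted_string(left.rstrip()))
```

The split leaves the special character at the start of the right part, which is how `split_email` tells a display name (right part starts with `<`) from a plain address (right part starts with `@`).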
