Local email addresses are of course easier to identify. You can just check if the user exists on the server. But that's not much use to websites and other online services.
Not necessarily: "checking if the user exists on the server" is ambiguous. You could check /etc/passwd, but the mail system may use virtual users for "local" delivery, where "local" is defined in this case as not requiring a domain portion. The only way to check if a user exists, even locally, is to try to send it mail.
Okay, so Perl Compatible Regular Expressions can parse context-free grammars. And context sensitive grammars. And who knows what more.
I understand there's a difference between theory and practice, but this is a plain misuse of the word "regular". PCRE should be renamed "Perl Compatible Parsing Facility" or something.
I regularly tested the "email regex du jour" at my previous job whenever these types of articles came up. IIRC, it was against 15+MM known good email addresses, and probably double that in known bads, and nearly every one tested had its issues. [edit: we had something like 150,000 distinct active domains, and probably 1/2 that of distinct MXes (if you rolled up all the google-biz and microsoft hosted stuff)... if you think getting your email delivered by gmail is difficult, try a school district in Wyoming that appeared to have a 300 baud line connecting it to the world, running an ancient version of Groupware that rejected email according to the weather report as far as we could tell...]
Most people working on the code for that sign-up page (/what have you) have neither the regex-fu necessary nor the understanding of email to write the regex correctly... So you get a lot of shitty regexes (especially at large corporations) that don't support apostrophes or dashes/plus signs in the local parts. And it doesn't matter how good your regex-fu and RFC comprehension abilities are: there are a lot of broken implementations out there, and blocking a subscriber because of their broken system isn't great business.
It took a while, but eventually we switched our signup forms to do a couple of very effective things beyond a very simple address regex:
1) auto-suggest for common misspellings of our most common domains (gmal.com, yaho.com, etc.)
2) while the "please re-type your email" step gave us enough of a delay, we did a DNS lookup of the domain, then an MX lookup. If there was a problem with either, we passed an error to the user like "Please double check the domain of your email address..."
3) check for domains you know have moved. We were B2B, so if you watched your bounces closely, you'd know that asdf.com was moving to hjkl.com, so you could update your existing records, but people have serious muscle memory, and it's worth reminding them on the signup page.
I was working on tying in our bounce database (you are keeping a record of all your bounces, right?) so that automatically flagged domains would prompt the user with an error like "We've been unable to deliver to your email domain recently, if your email address is typed correctly, we recommend using a secondary email address if you have one..."
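The DNS and moved-domain checks described above might be wired together roughly like this. This is only a sketch in Python: `MOVED_DOMAINS`, `check_signup_domain` and the `has_mx` callback are hypothetical names, not anything from the original setup. Python's standard library has no MX resolver, so the real lookup is left to the caller and a stub stands in for it here.

```python
from typing import Callable, Optional

# Hypothetical table of domains known (e.g. from bounce logs) to have moved.
MOVED_DOMAINS = {"asdf.com": "hjkl.com"}


def check_signup_domain(address: str, has_mx: Callable[[str], bool]) -> Optional[str]:
    """Return a message to show the user, or None if nothing looks wrong."""
    local, sep, domain = address.strip().rpartition("@")
    if not sep or not local or not domain:
        return "Please enter an email address containing an '@'."
    domain = domain.lower()
    if domain in MOVED_DOMAINS:
        new_domain = MOVED_DOMAINS[domain]
        return f"It looks like {domain} moved to {new_domain}; did you mean {local}@{new_domain}?"
    if not has_mx(domain):
        return "Please double check the domain of your email address..."
    return None


def fake_has_mx(domain: str) -> bool:
    """Stub for demonstration only: pretends that just example.com has MX records."""
    return domain == "example.com"


print(check_signup_domain("bob@asdf.com", fake_has_mx))
```

The point of passing the lookup in as a function is that the form logic stays testable without touching the network; in production you would plug in whatever resolver you already use.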
I worry about people putting things like this on the Internet. Any experienced developer knows it's a joke and that there are better ways to validate e-mail addresses; but there are plenty of inexperienced -- copy-and-paste -- developers out there. A colleague of mine did something similar, for example: he didn't even know what a regular expression was and I could see, as it was a much simpler pattern than this one, that it would fall quite far from the mark.
As well as traditional verification, I often use a perl script that takes an email as `ARG[1]`, runs this, and exits with `0` or `1` (for easy cross-language usage) because: 1) I don't like the idea of my frontend giving the impression my software takes garbage; 2) poking around my database and seeing obviously wrong, maliciously entered 'emails' makes the OCD in me flare up. Works well for me.
I worked on a SaaS product with a largely non-technical audience, and we had a frequent issue with people mistyping their email addresses.
We tried several things. It turned out that both a confirmation email and asking users to type their email twice hurt conversion rates badly (in our case; all audiences are different).
However, checking email for potential typos worked really well. We had a small set of rules:
1. Domain is very close to a popular email provider (@gnail.com, or @yaho.com, etc.).
2. Email contains a fragment very close to user name: pol@rodgers.tld for Poul Rodgers.
3. We had universities as customers, and a lot of students would enter "name@university-domain.com" instead of "name@university-domain.edu". We had a special check for it.
Overall, a couple lines of JavaScript helped us to get rid of 97% of mistyped email addresses.
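For illustration, the three rules could look something like the following. The original was JavaScript; this is a Python sketch using `difflib` for the "very close" test. The provider list, the 0.8 similarity cutoff, `UNIVERSITY_NAMES` and all function names are assumptions made up for the example.

```python
import difflib
import re
from typing import Optional

POPULAR_DOMAINS = ["gmail.com", "yahoo.com", "hotmail.com", "outlook.com"]  # illustrative list
UNIVERSITY_NAMES = {"university-domain"}  # hypothetical customer list
CUTOFF = 0.8  # assumed similarity threshold, would need tuning on real data


def suggest_domain(domain: str) -> Optional[str]:
    """Rule 1: domain is close to, but not exactly, a popular provider."""
    domain = domain.lower()
    if domain in POPULAR_DOMAINS:
        return None
    close = difflib.get_close_matches(domain, POPULAR_DOMAINS, n=1, cutoff=CUTOFF)
    return close[0] if close else None


def name_typo_hint(local: str, full_name: str) -> Optional[str]:
    """Rule 2: a piece of the local part nearly (not exactly) matches part of the name."""
    pieces = [p for p in re.split(r"[._+-]", local.lower()) if p]
    for part in full_name.lower().split():
        for piece in pieces:
            ratio = difflib.SequenceMatcher(None, piece, part).ratio()
            if piece != part and ratio >= CUTOFF:
                return part
    return None


def edu_hint(domain: str) -> Optional[str]:
    """Rule 3: a known university name entered with .com instead of .edu."""
    name, _, tld = domain.lower().rpartition(".")
    if tld == "com" and name in UNIVERSITY_NAMES:
        return f"{name}.edu"
    return None


print(suggest_domain("gnail.com"), name_typo_hint("pol", "Poul Rodgers"))
```

These only make sense as "did you mean...?" prompts rather than hard rejections, since a fuzzy match will also fire on some legitimate domains that happen to sit close to a popular one.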
Surely if you refuse TLDs (which I guess is what rule 2 does, checks that there's a dot in the domain part and refuses local domains and TLDs), you need at least 5 characters (a@b.c), and since TLDs are at least 2 letters that's 6. Although it'll still let through "@ff.cc"
A few TLDs have MX records. There's no reason to reject an address like, say, "postmaster@ws" - it's a perfectly valid address that could be actually working.
And why bother validating anything beyond the fact that the string's non-empty and there's an "@" character? Shoot an email: if they receive it, it's a valid address (no matter how weird it may look); if they don't, well, it's not like something bad happened.
Depends on what you're doing. If it's data capture on some kind of competition page, then a more complicated email regex can increase data quality a lot.
I have built quite a lot of data capture forms, and while no or limited regex increased number of entries, a more complicated regex combined with checking MX records improved conversion rate, because there were fewer junk entries. I used `\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b` (taken from regular-expressions.info).
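A quick way to see what that pattern does and doesn't let through, sketched in Python. The pattern is written in upper case, so it only works with a case-insensitive flag. The use of `fullmatch` and the `accepts` helper are choices made for this example, not part of the quoted source.

```python
import re

# The pattern quoted above, compiled case-insensitively because it only lists A-Z.
PATTERN = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)


def accepts(address: str) -> bool:
    """True if the whole string matches the pattern."""
    return PATTERN.fullmatch(address) is not None


for candidate in [
    "user@example.com",
    "user+tag@example.co.uk",
    "user@example.museum",   # TLD longer than 4 letters
    "o'brien@example.com",   # apostrophe in the local part
    "postmaster@ws",         # bare TLD, no dot in the domain
]:
    print(candidate, accepts(candidate))
```

The `{2,4}` limit on the last label is where the false negatives on longer TLDs come from, and the local-part character class is why apostrophes get rejected.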
It has a few false positives and a few false negatives, but overall, it optimises conversion rates, which is what I was being paid for.
This, of course, becomes an issue when you want to do everything in javascript and go down the rabbit hole of regular expressions. Sort of like deciding on a pattern from assumptions as one might make the mistake of doing with names: http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-b...
I would suspect that anyone whose email did not match that regex would have such a miserable time generally getting it rejected as invalid left right and centre that they would just cave in and get a simpler one that did.
The regexp looks for exactly one '@'-character preceded and followed by at least one character that's not an '@'-character. Or, in other words, it does not allow for an email address with more than one '@'-character.
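The regexp itself isn't quoted here, so as an assumption, a pattern equivalent to that description would be `[^@]+@[^@]+` matched against the whole string. A small Python sketch of its behaviour (`one_at_sign` is a made-up name):

```python
import re

# Assumed equivalent of the described regexp: one '@', with at least one
# non-'@' character on each side.
ONE_AT = re.compile(r"[^@]+@[^@]+")


def one_at_sign(address: str) -> bool:
    """True if the string has exactly one '@' with something on both sides."""
    return ONE_AT.fullmatch(address) is not None


for candidate in ["a@b", "a@@b", "@b", "a@", '"a@b"@example.com']:
    print(repr(candidate), one_at_sign(candidate))
```

The last case is the catch: a quoted local part may legally contain an '@', so this check turns away a few addresses the RFC grammar allows.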
This is not really "validating" emails in the sense most people think of it. The RFC is about addressing SMTP envelopes, not entering email addresses. This would not be appropriate for e.g. checking if an address entered in a signup form is "valid." This includes a bunch of things that make no sense and aren't really email addresses (like embedded comments) and meanwhile has no idea that bogus@example.com is not an address that will actually receive mail. The only way to know an address is valid is to email it.
It's mostly a joke. One might want to use this if writing a mail server, but even then...
This isn't even about SMTP envelopes. RFC 2822 is about email headers, so it's even worse. Totally invalid for any real world usage outside perhaps an email client.
My most recent encounter with idiotic email validation is that many apps don't accept anything on a recent TLD. Even the f-ing AWS SNS web console didn't let me add a perfectly valid address in a notification topic.
In brief, the address specification may look like the simple "local" @ "domain", but those subparts can be non-regular (i.e., making them hard/impossible for a regular expression engine to parse) or contain a lot of exceptions (e.g., the domain could be google.com, or it could be 12.34.56.78, or localhost, or a number of other things).
It's not a regular grammar. Regular expressions were designed to handle regular grammars, and there are cases where something that looks valid is actually invalid:
ex@256.255.255.255.
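To make that example concrete: a naive "four groups of digits" pattern is happy with 256.255.255.255, while an actual range check is not. A Python sketch using the standard `ipaddress` module; `DOTTED_QUAD` and `domain_is_valid_ipv4` are names invented for the example, and it treats the domain as a bare dotted quad the way the example above writes it.

```python
import ipaddress
import re

# Naive shape check: four dot-separated groups of one to three digits.
DOTTED_QUAD = re.compile(r"\d{1,3}(\.\d{1,3}){3}")


def domain_is_valid_ipv4(domain: str) -> bool:
    """True if the domain parses as an IPv4 address with every octet in range."""
    try:
        ipaddress.IPv4Address(domain)
    except ipaddress.AddressValueError:
        return False
    return True


domain = "ex@256.255.255.255".rpartition("@")[2]
print(DOTTED_QUAD.fullmatch(domain) is not None)  # shape looks fine
print(domain_is_valid_ipv4(domain))               # but 256 is out of range
```

You can squeeze the 0-255 range into a regex with enough alternation, but it is a good illustration of how quickly "looks like an address" and "is an address" drift apart.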
I think this partly overstates the complexity of validating an e-mail address in a registration form or similar.
If your aim is only to get a syntactically correct address to which you can try to deliver mail, you don't need to accept stuff like:
* "Name surname" <address@example.com>
* Name surname <address@example.com>
* Group name: Member 1 <one@member.com>, "2, member2"<two@member.com>, three@member.com
* guy@nonpubliclyresolvabledomain
There are many other RFC 2822-valid kinds of addresses that you don't need to accept if you are not writing an e-mail client, SMTP server, or similar.
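As a rough Python sketch of that narrower goal: accept only a bare local@domain.tld string and turn away display names, groups and dotless domains like the ones listed above. The `FORBIDDEN` character set and `is_bare_address` are assumptions for illustration. It is deliberately stricter than the RFC, and it would also reject the TLD-only addresses mentioned elsewhere in this thread.

```python
# Characters that only show up in display names, groups, comments or quoting.
FORBIDDEN = set('<>,:;"() \t')


def is_bare_address(candidate: str) -> bool:
    """True only for a plain local@domain.tld string with no extra syntax."""
    if any(ch in FORBIDDEN for ch in candidate):
        return False
    local, _, domain = candidate.rpartition("@")
    # Require a non-empty local part, a single '@', and a dot inside the domain.
    return bool(local) and "@" not in local and "." in domain.strip(".")


for candidate in [
    "address@example.com",
    '"Name surname" <address@example.com>',
    "Name surname <address@example.com>",
    "guy@nonpubliclyresolvabledomain",
]:
    print(repr(candidate), is_bare_address(candidate))
```

Whether the address actually receives mail is still a separate question; this only narrows the input to something worth trying to deliver to.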
Surely it's exactly why you might ask that — admittedly as a semi-trick question, which you should probably only direct at experienced people who really ought to recognise it as such.
Email validation is a problem with a lot of plausible answers — many of them wrong — so it has the potential to be quite a good discriminant (depending on whom you're trying to hire, of course).
RFC5322-compliant regex:
Also, if you want to validate a mail address, send a mail. There is no other way.