Monday, February 5, 2007

Internationalized Emailing with Python

It turns out that creating internationalized emails (emails including characters not in US-ASCII) is quite a bit trickier than one would hope. There are a few relevant RFCs:
http://www.faqs.org/rfcs/rfc2822.html
http://www.faqs.org/rfcs/rfc2047.html
http://www.faqs.org/rfcs/rfc2231.html

All dealing with how you encode the content, as well as the headers. Mostly because everything was assumed to run on only 7-bit compliant systems so you have to encode the heck out of everything. For the body of the email, you just add a "Content-Type: ... charset=" field, which explains what charset (codepage, encoding, etc) the content is in.

However, that doesn't work for headers, because that would require another header to define the encoding of this header. So instead they decided that "=?utf-8?b?ZsK1?=" was a good way to encode "".

This also wouldn't have been so bad, except they also decided that email addresses could not be escaped in this way, so you must use "=?utf-8?q?Joe_f=C2=B5?= " and not "=?utf-8?b?Sm9lIGbCtSA8am9lQGZvby5jb20+?=". (That is "Joe fµ " encoded as a single string).

So it turns out that the python standard library provides most of what you need, but you end up needing a bit of work to get it all together. So in the bzr-email plugin (which generates an email whenever you commit your changes) I did some work to make a nicer interface for creating and sending an email. Basically, it just assumes that you are going to use a Unicode string for the username + email address, splits them, and sets up all the right headers. It also handles connecting to a SMTP host, so you end up doing:

conn = SMTPConnection(config)
conn.send_text_email(u'Joe
',
[u'Måry '],
u'Subject Hellø',
u'Body text\n')

This is especially important because Bazaar supports fully Unicode user names, commit messages, filenames, etc. (Which was tricky enough to get right because of all the complexities of Unicode.)

But I'm happy to say it all seems to be working. Now we just need to figure out how we would change the python standard library "email" library to make it easier for everyone else. (A further complication is that they changed the naming scheme between python2.4 and python2.5. It used to be "email.Utils" and it is now "email.utils". "email.MIMEText.MIMEText" became "email.mime.text.MIMEText".) Overall, I prefer the new layout, but it does mean you need a test import to work around it.


After doing the work, Marius Gedminas showed me where he had also run into the same situation:
http://mg.pov.lt/blog/unicode-emails-in-python.html

No comments: