Confusable character detection in Prosody

Quite some time ago, during the XMPP WG meeting at IETF 84, there was some discussion on how JID mimicking could be prevented.

If you are not familiar with JID mimicking, there is a good introduction in RFC 6122. The TL;DR version is: some Unicode codepoints look alike, so an attacker can register a JID that looks just like that of a pre-existing user but is still different. For example, <test@example.com> looks just like <tеѕt@example.com>, but the second one uses Cyrillic characters in place of the 'e' and 's'. One way to mitigate this risk, suggested by Joe Hildebrand, is to use the Unicode confusables table. I was intrigued by that approach and decided to implement a prototype of it for the Prosody XMPP server. The following describes what I learned in the process.

The confusables table was originally designed for use on internationalized domain names. This makes sense, since the issue that exists for XMPP user account registration obviously exists for domain name registration as well.

Contrary to what one might expect, the confusables table is not, strictly speaking, a list of confusable characters, although it is often described as such. For example, there is an entry which maps '1' (DIGIT ONE) to 'l' (LATIN SMALL LETTER L), and an entry mapping 'I' (LATIN CAPITAL LETTER I) to 'l' (LATIN SMALL LETTER L). However, an entry mapping '1' (DIGIT ONE) to 'I' (LATIN CAPITAL LETTER I), or vice versa, does not exist.

The reason for this becomes apparent once you learn about the intended usage. The table does not let you compute all other strings that might be confusable with a given one. Instead, it lets you determine whether two given strings are confusable. To do this, both strings are converted to a so-called skeleton by applying the mappings specified in the confusables table. Two strings X and Y are said to be confusable if skeleton(X) == skeleton(Y). This is why no entry mapping '1' to 'I' is needed: both map to 'l', so their skeletons are already equal.
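To make the skeleton comparison concrete, here is a minimal Python sketch. It is not the Prosody code (that is a Lua module) and it uses a tiny hand-picked mapping in place of the real confusables table. The names CONFUSABLES, skeleton and confusable are made up for this illustration, and the NFD normalization before and after mapping follows my reading of the skeleton definition in Unicode Technical Standard #39.

```python
import unicodedata

# Illustrative subset only. A real implementation would load the full
# confusables table published with Unicode Technical Standard #39.
CONFUSABLES = {
    "1": "l",       # DIGIT ONE -> LATIN SMALL LETTER L
    "I": "l",       # LATIN CAPITAL LETTER I -> LATIN SMALL LETTER L
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE -> LATIN SMALL LETTER E
    "\u0455": "s",  # CYRILLIC SMALL LETTER DZE -> LATIN SMALL LETTER S
}


def skeleton(text):
    """Return the skeleton of text: normalize, map each character, normalize."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(CONFUSABLES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def confusable(x, y):
    """Two strings are confusable if their skeletons are equal."""
    return skeleton(x) == skeleton(y)


# '1' and 'I' have no direct entry linking them, but both map to 'l'.
print(confusable("1", "I"))
# The Cyrillic look-alike from the example above versus plain "test".
print(confusable("test", "t\u0435\u0455t"))
```

Note that the skeleton itself is only a comparison key. It is not meant to be shown to users or used as the actual account name.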

To use this algorithm to prevent JID mimicking, a skeleton of each registered username needs to be saved and indexed. In Prosody I implemented a prosodyctl command that generates this data for all pre-existing users. When someone attempts to register a new account, the skeleton of the requested username is computed and matched against the existing skeletons. If an account with the same skeleton already exists, registration is denied.

Once I had understood how this would all work together, the actual implementation was quite straightforward. The code is available in a Mercurial export right now. To enable the detection, load the mod_mimicking module. After that, execute prosodyctl mod_mimicking bootstrap <host> to generate the initial skeleton database, and you are set.

Overall I'm quite pleased with the result. I cannot provide any results from real-life deployments, but my limited testing suggests that this method should help prevent JID mimicking reasonably well.

Babelmonkeys
