If I accept full Unicode for passwords, how should I normalize the string before passing it to the hash function?
(Please pardon the unusual format; I originally posted this on Stack Overflow in 2013, but I moved it here in 2019 when SO's leadership became unpleasant.)
Goals
Without normalization, if someone sets their password to "mañana" (ma\u00F1ana) on one computer and tries to log in with "mañana" (ma\u006E\u0303ana) on another computer, the hashes will be different and the login will fail. Which of the two forms gets sent is under the control of the user agent or its operating system, not the user.
- I'd like to ensure that those hash to the same thing.
- I am not concerned about homoglyphs such as Α, А, and A (Greek, Cyrillic, Latin).
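To make the goal concrete, here is a minimal Python sketch using only the standard `unicodedata` and `hashlib` modules. SHA-256 is an assumption made to keep the example short; it stands in for whatever password hash is really used.

```python
import hashlib
import unicodedata

precomposed = "ma\u00F1ana"        # ñ as a single code point
decomposed = "ma\u006E\u0303ana"   # n followed by COMBINING TILDE


def digest(s: str) -> str:
    # SHA-256 is only a placeholder for a real password hash.
    return hashlib.sha256(s.encode("utf-8")).hexdigest()


# Without normalization, the two spellings hash differently.
raw_match = digest(precomposed) == digest(decomposed)

# After normalizing both to the same form, they hash identically.
nfc_match = digest(unicodedata.normalize("NFC", precomposed)) == digest(
    unicodedata.normalize("NFC", decomposed)
)

print(raw_match, nfc_match)
```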
Reference
Unicode normalization forms: https://unicode.org/reports/tr15/#Norm_Forms
- Any normalization procedure may cause collisions, e.g. "oﬃce" == "office" under the compatibility forms.
- Normalization can change the number of bytes in the string.
- What happens if the server receives a byte sequence that is not valid UTF-8 (or whatever encoding it expects)? Reject, since it can't be normalized?
- What happens if the server receives characters that are unassigned in its version of Unicode?
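The first two points can be checked directly. This is a small sketch with Python's standard `unicodedata` module; the variable names are illustrative only.

```python
import unicodedata

ligature = "o\uFB03ce"   # "oﬃce" spelled with U+FB03 LATIN SMALL LIGATURE FFI
plain = "office"

# The ligature collides with plain ASCII under a compatibility form,
# but not under a canonical form.
nfkc_collides = unicodedata.normalize("NFKC", ligature) == plain
nfc_collides = unicodedata.normalize("NFC", ligature) == plain

# The same password occupies a different number of UTF-8 bytes
# depending on the normalization form.
password = "ma\u00F1ana"
nfc_len = len(unicodedata.normalize("NFC", password).encode("utf-8"))
nfd_len = len(unicodedata.normalize("NFD", password).encode("utf-8"))

print(nfkc_collides, nfc_collides, nfc_len, nfd_len)
```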
Normalization is undefined for malformed inputs, such as alleged UTF-8 text that contains illegal byte sequences. Illegal bytes may be handled differently in different environments: rejection, replacement, or omission.
Recommendation #1: If possible, reject inputs that do not conform to the expected encoding. (This may be out of the application's control, however.)
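As a sketch of what rejection could look like, assuming the server expects UTF-8 and receives the password as raw bytes: Python's strict decoder raises on illegal sequences, while the `replace` and `ignore` error handlers show the other two behaviours mentioned above. The function name `decode_password` is hypothetical.

```python
def decode_password(raw: bytes) -> str:
    """Decode a password as UTF-8, rejecting malformed input outright."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError("password is not valid UTF-8") from exc


# 0xF1 is "ñ" in Latin-1 but an incomplete sequence in UTF-8.
bad = b"ma\xf1ana"

replaced = bad.decode("utf-8", errors="replace")  # substitutes U+FFFD
omitted = bad.decode("utf-8", errors="ignore")    # silently drops the byte
```

The lenient handlers turn one malformed input into two different strings, which is why rejecting seems the safer default.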
Unicode Standard Annex #15 guarantees normalization stability when the input contains assigned characters only:
11.1 Stability of Normalized Forms
For all versions, even prior to Unicode 4.1, the following policy is followed:
A normalized string is guaranteed to be stable; that is, once normalized, a string is normalized according to all future versions of Unicode.
More precisely, if a string has been normalized according to a particular version of Unicode and contains only characters allocated in that version, it will qualify as normalized according to any future version of Unicode.
Recommendation #2: Whichever normalization form is chosen, apply it via the Normalization Process for Stabilized Strings, i.e., reject any password inputs that contain unassigned characters, since their normalization is not guaranteed stable under server upgrades.
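One way to approximate that check in Python is to look for code points whose general category is `Cn` (unassigned, which also covers noncharacters). This is a sketch, not a full implementation of the process in the annex; the helper name is hypothetical, and the answer depends on the Unicode version bundled with the interpreter, exposed as `unicodedata.unidata_version`.

```python
import unicodedata


def has_unassigned(text: str) -> bool:
    """Return True if any code point is unassigned in this Unicode version."""
    # General category "Cn" marks code points with no assigned character.
    return any(unicodedata.category(ch) == "Cn" for ch in text)


# The verdict can change when the runtime's Unicode data is upgraded.
print(unicodedata.unidata_version)
print(has_unassigned("ma\u00F1ana"))
```

A password rejected today for containing an unassigned code point could be accepted after an upgrade, but a password accepted today keeps the same normalized form.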
The compatibility normalization forms seem to handle Japanese better, collapsing several decompositions into the same output where the canonical forms do not.
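For illustration, here is a sketch with three spellings of katakana "ガ". The specific characters are chosen here as an example and are not taken from the original discussion.

```python
import unicodedata

spellings = [
    "\uFF76\uFF9E",  # halfwidth KA + halfwidth voiced sound mark
    "\u30AC",        # precomposed KATAKANA LETTER GA
    "\u30AB\u3099",  # KATAKANA LETTER KA + combining voiced sound mark
]

# Canonical normalization leaves the halfwidth spelling distinct.
canonical = {unicodedata.normalize("NFC", s) for s in spellings}

# Compatibility normalization folds all three into one string.
compat = {unicodedata.normalize("NFKC", s) for s in spellings}

print(len(canonical), len(compat))
```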
The spec warns:
Normalization Forms KC and KD must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text.
However, semantics and round-tripping are not of concern here.
Recommendation #3: Apply NFKC or NFKD before hashing.
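Putting the three recommendations together, a minimal sketch might look like the following. The function names, the choice of PBKDF2 from `hashlib`, the iteration count, and the default of NFKD are all assumptions for illustration. NFKC would serve equally well as long as the same form is applied at both enrollment and login.

```python
import hashlib
import unicodedata

ITERATIONS = 100_000  # placeholder work factor, not a recommendation


def prepare_password(raw: bytes, form: str = "NFKD") -> bytes:
    """Decode, validate, and normalize a password before hashing."""
    # Recommendation #1: reject input that is not valid UTF-8.
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError("password is not valid UTF-8") from exc
    # Recommendation #2: reject unassigned code points (category Cn).
    if any(unicodedata.category(ch) == "Cn" for ch in text):
        raise ValueError("password contains unassigned code points")
    # Recommendation #3: apply a compatibility normalization form.
    return unicodedata.normalize(form, text).encode("utf-8")


def hash_password(raw: bytes, salt: bytes) -> bytes:
    # PBKDF2-HMAC-SHA256 stands in for whichever password hash is in use.
    return hashlib.pbkdf2_hmac("sha256", prepare_password(raw), salt, ITERATIONS)
```

With this in place, `hash_password` gives the same result for both spellings of "mañana" from the Goals section when called with the same salt.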
alextgordon responded and recommended NFKD since it should be stable in the face of new precomposed characters, unlike NFKC. devstuff pointed me to RFC 8264 and RFC 8265, the PRECIS framework and "Preparation, Enforcement, and Comparison of Internationalized Strings Representing Usernames and Passwords" respectively. These were finalized some time after my post, but have roots going back to at least 2005.