What Unicode normalization (and other processing) is appropriate for passwords when hashing?

Automated disclaimer: This post was written more than 10 years ago and I may not have looked at it since.

Older posts may not align with who I am today and how I would think or write, and may have been written in reaction to a cultural context that no longer applies. Some of my high school or college posts are just embarrassing. However, I have left them public because I believe in keeping old web pages alive—and it's interesting to see how I've changed.

If I accept full Unicode for passwords, how should I normalize the string before passing it to the hash function?

Further questions

What happens if the server receives characters that are unassigned in its version of Unicode?

Without normalization, if someone sets their password to "mañana" (ma\u00F1ana) on one computer and tries to log in with "mañana" (man\u0303ana) on another computer, the hashes will be different and the login will fail. Which form gets sent is under the control of the user's software, not the application.

I am not concerned about homoglyphs such as Α, А, and A (Greek, Cyrillic, Latin).
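The "mañana" mismatch can be reproduced in a few lines. This is only a sketch: SHA-256 stands in for a real password hash so the byte-level difference is visible, and the helper name `sha256_hex` is made up for the illustration.

```python
import hashlib
import unicodedata

precomposed = "ma\u00F1ana"  # ñ as the single code point U+00F1
decomposed = "man\u0303ana"  # n followed by U+0303 COMBINING TILDE


def sha256_hex(text: str) -> str:
    # Plain SHA-256 is used only to show the difference; it is not a password hash.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# The two spellings render identically but encode to different bytes.
hashes_differ = sha256_hex(precomposed) != sha256_hex(decomposed)

# After normalizing both to the same form, the strings compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

print(hashes_differ, same_after_nfc)
```

Any of the four normalization forms would make these two inputs match; which one to pick is the open question.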



What happens if the server receives a byte sequence that is not valid UTF-8 (or other format)? Reject, since it can't be normalized?

Normalization is undefined in case of malformed inputs, such as alleged UTF-8 text that contains illegal byte sequences. Illegal bytes may be interpreted differently in different environments.

Recommendation #1: If possible, reject inputs that are not valid in the expected encoding. (This may be out of the application's control, however.)

The Unicode Annex 15 guarantees normalization stability when the input contains assigned characters only:

  11.1 Stability of Normalized Forms

  For all versions, even prior to Unicode 4.1, the following policy is followed: A normalized string is guaranteed to be stable; that is, once normalized, a string is normalized according to all future versions of Unicode.

Recommendation #2: Whichever normalization form is used, reject any password that contains characters unassigned in the server's version of Unicode, since the stability guarantee does not cover them.

The compatibility forms (NFKC and NFKD) also fold compatibility variants, such as fullwidth letters and ligatures, into their plain equivalents, which loses information. However, semantics and round-tripping are not of concern here.

Recommendation #3: Apply NFKC or NFKD before hashing.
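Taken together, the recommendations amount to "validate, then normalize, then hash". Here is a minimal sketch of that pipeline, with some assumptions of mine: the input arrives as UTF-8 bytes, "unassigned" means general category `Cn` in whatever Unicode version the local `unicodedata` module ships, NFKC is picked arbitrarily over NFKD, and PBKDF2 stands in for whatever password hash is actually in use. The names `prepare_password` and `hash_password` are hypothetical.

```python
import hashlib
import unicodedata


def prepare_password(raw: bytes) -> bytes:
    """Validate and normalize a password before it reaches the hash function."""
    # Reject byte sequences that are not valid UTF-8.
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError("password is not valid UTF-8") from exc

    # Reject code points that are unassigned in this Unicode version,
    # since their normalization may change after an upgrade.
    for ch in text:
        if unicodedata.category(ch) == "Cn":
            raise ValueError(f"unassigned code point U+{ord(ch):04X}")

    # Apply compatibility normalization (assumed choice: NFKC).
    return unicodedata.normalize("NFKC", text).encode("utf-8")


def hash_password(raw: bytes, salt: bytes) -> bytes:
    # PBKDF2 is a stand-in; the iteration count here is illustrative only.
    return hashlib.pbkdf2_hmac("sha256", prepare_password(raw), salt, 100_000)
```

With this in place, both spellings of "mañana" hash to the same value for a given salt, while invalid bytes and unassigned code points are refused before hashing.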

Followup

alextgordon responded and recommended NFKD.
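For what it's worth, NFKC is just NFKD followed by canonical composition, so the two forms make the same pairs of passwords match and differ only in the bytes fed to the hash. A small sketch, using test strings of my own choosing:

```python
import unicodedata

# Three spellings of the same word; the last starts with U+FF4D FULLWIDTH LATIN SMALL LETTER M.
variants = ["ma\u00F1ana", "man\u0303ana", "\uFF4Dan\u0303ana"]

nfkc = [unicodedata.normalize("NFKC", v) for v in variants]
nfkd = [unicodedata.normalize("NFKD", v) for v in variants]

# Under either form, all three variants collapse to a single string.
all_equal_nfkc = len(set(nfkc)) == 1
all_equal_nfkd = len(set(nfkd)) == 1

# The normalized results differ in length: composed ñ versus n plus combining tilde.
utf8_lengths = (len(nfkc[0].encode("utf-8")), len(nfkd[0].encode("utf-8")))

print(all_equal_nfkc, all_equal_nfkd, utf8_lengths)
```

The practical point is to pick one form and never switch, since existing hashes were computed over that form's bytes.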



      
        

        
          
Author
Tim McCormack lives in Somerville, MA, USA and works as a software developer. (Updated 2019.)

Entry
Posted on Tuesday, April 23rd, 2013 at 11:26 (EDT)
Tags: normalization, passwords, unicode
      

      
        