NSString and Unicode from objc.io issue 9:
At its most basic level, the Unicode standard defines a unique number for every character or symbol that is used in writing, for nearly all of the world’s writing systems. The numbers are called code points and are written in the form U+xxxx, where xxxx stands for four to six hexadecimal digits. For example, the code point U+0041 (65 in decimal) stands for the letter A in the Latin alphabet (same as ASCII) and U+1F61B represents the emoji named FACE WITH STUCK-OUT TONGUE, or 😛.
Unicode was originally conceived as a 16-bit encoding, providing room for 65,536 characters.
The Unicode code space was later extended to 21 bits (U+0000 to U+10FFFF) to allow for the encoding of historic scripts and rarely used Kanji or Chinese characters.
The code space is divided into 17 planes with 65,536 code points each. Plane 0 is called the Basic Multilingual Plane (BMP) and it is where almost all characters you will encounter in the wild reside, with the notable exception of emoji.
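As a small illustration of the plane layout, the sketch below derives the plane index from a code point. Because each plane spans 0x10000 code points, the index is simply the bits above the low 16. The function name `unicode_plane` is made up for this example and is not part of any Apple or Unicode API; the code is plain C, which also compiles unchanged inside an Objective-C file.

```c
#include <stdint.h>

/* Hypothetical helper: returns the plane index (0 to 16) of a code point,
   or -1 if the value lies outside the code space U+0000..U+10FFFF. */
int unicode_plane(uint32_t code_point) {
    if (code_point > 0x10FFFF) {
        return -1;
    }
    /* Each plane spans 0x10000 (65,536) code points, so the plane
       index is whatever remains after dropping the low 16 bits. */
    return (int)(code_point >> 16);
}
```

With this helper, U+0041 falls in plane 0 (the BMP), while the emoji U+1F61B falls in plane 1.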
UTF-16 is a variable-width encoding built on 16-bit code units. Each code point in the BMP is directly mapped to one code unit. Since the BMP encompasses almost all common characters, UTF-16 typically requires only half the memory of UTF-32, which spends a fixed 32 bits on every code point. The rarely used code points in other planes are encoded with two 16-bit code units. The two code units that together represent one code point are called a surrogate pair.
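To make the mapping concrete, here is a minimal sketch of the standard UTF-16 encoding arithmetic in plain C. A BMP code point is stored as is; any other code point has 0x10000 subtracted, and the resulting 20-bit value is split into two 10-bit halves that are added to 0xD800 and 0xDC00. The function name `utf16_encode` is hypothetical and only serves this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one code point as UTF-16.
   Writes 1 or 2 code units to out and returns how many were written,
   or 0 if cp is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2]) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;  /* out of range, or a value reserved for surrogates */
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;  /* BMP: one code unit with the same value */
        return 1;
    }
    uint32_t v = cp - 0x10000;  /* 20-bit offset into planes 1 to 16 */
    out[0] = (uint16_t)(0xD800 + (v >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (v & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For example, U+0041 yields the single code unit 0x0041, whereas U+1F30D yields the surrogate pair 0xD83C 0xDF0D.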
Since virtually all characters in modern use reside in the BMP, surrogate pairs used to be very rare in the real world. However, this changed a few years ago with the inclusion of emoji into Unicode, most of which reside in Plane 1. Emoji have become so common that your code must be able to handle them correctly.
NSString *s = @"\U0001F30D"; // earth globe emoji 🌍
NSLog(@"The length of %@ is %lu", s, [s length]);
// => The length of 🌍 is 2
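The length of 2 reported above is the number of UTF-16 code units, not the number of code points. The sketch below shows one way to count code points in a raw UTF-16 buffer by treating each well-formed surrogate pair as a single item. It is plain C operating on an array of code units, not a Foundation API, and the name `utf16_code_point_count` is invented for illustration. Counting an unpaired surrogate as one item is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts code points in a buffer of UTF-16 code units.
   A high surrogate immediately followed by a low surrogate counts once.
   An unpaired surrogate is counted as one item (an assumption of this sketch). */
size_t utf16_code_point_count(const uint16_t *units, size_t length) {
    size_t count = 0;
    for (size_t i = 0; i < length; i++) {
        int is_high = units[i] >= 0xD800 && units[i] <= 0xDBFF;
        if (is_high && i + 1 < length &&
            units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            i++;  /* skip the low surrogate of a well-formed pair */
        }
        count++;
    }
    return count;
}
```

Applied to the two code units 0xD83C 0xDF0D that make up 🌍, the function returns 1, which matches the single character a user sees.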