Each of the 128 blocks of 64K characters from this set is called a plane. The first plane agrees with the 16-bit Unicode character set. The following diagram, adapted from the Linux man page by Markus Kuhn (mailto:mskuhn@cip.informatik.uni-erlangen.de), shows how the encoding works.
0x00000000 - 0x0000007F:
0xxxxxxx
00-7F
80/7F
0x00000080 - 0x000007FF:
110xxxxx 10xxxxxx
C0-DF 80-BF
E0/1F C0/3F
0x00000800 - 0x0000FFFF:
1110xxxx 10xxxxxx 10xxxxxx
E0-EF
F0/0F
0x00010000 - 0x001FFFFF:
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
F0-F7
F8/07
0x00200000 - 0x03FFFFFF:
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
F8-FB
FC/03
0x04000000 - 0x7FFFFFFF:
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
FC-FD
FE/01
The first row shows the Unicode range in hex.
The second row shows the UTF-8 encoding in binary:
the xxx bit positions are filled with the bits of the
character code number in binary representation. Only the
shortest possible multibyte sequence which can represent
the code number of the character can be used. The third
row shows the byte ranges in hex. The fourth row shows,
for each byte, the mask that selects the fixed bit positions,
followed by the mask that selects the variable (xxx) positions.
For sequences of three or more bytes only the lead byte is listed
in the third and fourth rows; every continuation byte has the
range 80-BF and the masks C0/3F.
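The masks in the fourth row can be applied directly: masking a lead byte with the fixed-bit mask and comparing the result with the low end of its byte range identifies the length of the sequence. The following is a minimal sketch of that idea. The names LEAD_TABLE, LIMITS, sequence_length and shortest_length are illustrative only and do not appear in the original code.

```python
# (fixed-bit mask, expected fixed bits, sequence length), one entry per
# group of the diagram. These names are illustrative assumptions.
LEAD_TABLE = [
    (0x80, 0x00, 1),
    (0xE0, 0xC0, 2),
    (0xF0, 0xE0, 3),
    (0xF8, 0xF0, 4),
    (0xFC, 0xF8, 5),
    (0xFE, 0xFC, 6),
]

# Exclusive upper bound of the code range for lengths 1 to 6.
LIMITS = [0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000]


def sequence_length(lead):
    """Return the sequence length announced by a lead byte.

    Returns 0 for bytes that cannot start a sequence
    (continuation bytes 80-BF, and FE/FF).
    """
    for mask, fixed, length in LEAD_TABLE:
        if lead & mask == fixed:
            return length
    return 0


def shortest_length(code):
    """Return the length of the shortest sequence that can hold code."""
    if code < 0:
        raise ValueError("code must be non-negative")
    for length, limit in enumerate(LIMITS, start=1):
        if code < limit:
            return length
    raise ValueError("code does not fit in 31 bits")


if __name__ == "__main__":
    # 0xE2 is the lead byte of a three-byte sequence, and 0x20AC
    # falls in the 0x800 - 0xFFFF range.
    print(sequence_length(0xE2), shortest_length(0x20AC))
```

A decoder can compare the two values to reject a sequence that is longer than the shortest form of the code it carries.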
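The Python function utf8 encodes an integer as a string using the UTF-8 encoding. The function seq_to_utf8 translates a Unicode string, represented as a sequence of integers, into a UTF-8 string. The function parse_utf8 decodes the character that starts at index i of a UTF-8 string and returns its code together with the index of the next character.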
#line 63 "utf8.ipk"
def utf8(i):
    # Encode the integer i; each chr() call contributes one byte value.
    if i < 0x80:
        return chr(i)
    if i < 0x800:
        return chr(0xC0 | (i>>6) & 0x1F)+\
            chr(0x80 | i & 0x3F)
    if i < 0x10000:
        return chr(0xE0 | (i>>12) & 0xF)+\
            chr(0x80 | (i>>6) & 0x3F)+\
            chr(0x80 | i & 0x3F)
    if i < 0x200000:
        return chr(0xF0 | (i>>18) & 0x7)+\
            chr(0x80 | (i>>12) & 0x3F)+\
            chr(0x80 | (i>>6) & 0x3F)+\
            chr(0x80 | i & 0x3F)
    if i < 0x4000000:
        return chr(0xF8 | (i>>24) & 0x3)+\
            chr(0x80 | (i>>18) & 0x3F)+\
            chr(0x80 | (i>>12) & 0x3F)+\
            chr(0x80 | (i>>6) & 0x3F)+\
            chr(0x80 | i & 0x3F)
    return chr(0xFC | (i>>30) & 0x1)+\
        chr(0x80 | (i>>24) & 0x3F)+\
        chr(0x80 | (i>>18) & 0x3F)+\
        chr(0x80 | (i>>12) & 0x3F)+\
        chr(0x80 | (i>>6) & 0x3F)+\
        chr(0x80 | i & 0x3F)

def seq_to_utf8(a):
    # Concatenate the encodings of a sequence of integer codes.
    s = ''
    for ch in a: s = s + utf8(ch)
    return s

def parse_utf8(s,i):
    # Decode the character starting at s[i]; return (code, next index).
    # The lead byte is masked with the variable-bit mask for its length
    # (1F, 0F, 07, 03, 01) so that the length marker bits are discarded.
    lead = ord(s[i])
    if lead & 0x80 == 0:
        return lead & 0x7F,i+1 # ASCII
    if lead & 0xE0 == 0xC0:
        return ((lead & 0x1F) << 6)|\
            (ord(s[i+1]) & 0x3F),i+2
    if lead & 0xF0 == 0xE0:
        return ((lead & 0x0F)<<12)|\
            ((ord(s[i+1]) & 0x3F) <<6)|\
            (ord(s[i+2]) & 0x3F),i+3
    if lead & 0xF8 == 0xF0:
        return ((lead & 0x07)<<18)|\
            ((ord(s[i+1]) & 0x3F) <<12)|\
            ((ord(s[i+2]) & 0x3F) <<6)|\
            (ord(s[i+3]) & 0x3F),i+4
    if lead & 0xFC == 0xF8:
        return ((lead & 0x03)<<24)|\
            ((ord(s[i+1]) & 0x3F) <<18)|\
            ((ord(s[i+2]) & 0x3F) <<12)|\
            ((ord(s[i+3]) & 0x3F) <<6)|\
            (ord(s[i+4]) & 0x3F),i+5
    if lead & 0xFE == 0xFC:
        return ((lead & 0x01)<<30)|\
            ((ord(s[i+1]) & 0x3F) <<24)|\
            ((ord(s[i+2]) & 0x3F) <<18)|\
            ((ord(s[i+3]) & 0x3F) <<12)|\
            ((ord(s[i+4]) & 0x3F) <<6)|\
            (ord(s[i+5]) & 0x3F),i+6
    return lead, i+1 # error, just use bad character
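For comparison, the same scheme can be written as a loop over the continuation bytes and checked against Python's built-in codec. This is a sketch only, assuming Python 3 and bytes objects rather than strings; encode_utf8 and decode_utf8 are hypothetical names and are not the functions defined above. The built-in codec stops at code 0x10FFFF, so the cross-check is limited to that range, while the five- and six-byte forms are checked by round trip.

```python
def encode_utf8(code):
    """Encode one integer as bytes using the 1 to 6 byte scheme (sketch)."""
    if code < 0 or code > 0x7FFFFFFF:
        raise ValueError("code out of range")
    if code < 0x80:
        return bytes([code])
    # Find the shortest length n whose range contains the code.
    n = 2
    for limit in (0x800, 0x10000, 0x200000, 0x4000000):
        if code < limit:
            break
        n += 1
    # Lead byte: n one-bits, a zero bit, then the top bits of the code.
    marker = (0xFF << (8 - n)) & 0xFF
    out = [marker | (code >> (6 * (n - 1)))]
    # Continuation bytes: 10xxxxxx, six bits each, most significant first.
    for k in range(n - 2, -1, -1):
        out.append(0x80 | ((code >> (6 * k)) & 0x3F))
    return bytes(out)


def decode_utf8(data, i=0):
    """Decode the sequence starting at data[i]; return (code, next index)."""
    lead = data[i]
    if lead < 0x80:
        return lead, i + 1
    # Count the leading one-bits of the lead byte to get the length.
    n = 0
    while (lead << n) & 0x80:
        n += 1
    if n < 2 or n > 6:
        raise ValueError("invalid lead byte")
    code = lead & (0x7F >> n)  # keep only the variable bits
    for k in range(1, n):
        b = data[i + k]
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        code = (code << 6) | (b & 0x3F)
    return code, i + n


if __name__ == "__main__":
    # Cross-check against the built-in codec where it is defined.
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
        assert encode_utf8(cp) == chr(cp).encode("utf-8")
    # Round trip for the longer forms the built-in codec does not cover.
    for cp in (0x200000, 0x4000000, 0x7FFFFFFF):
        assert decode_utf8(encode_utf8(cp)) == (cp, len(encode_utf8(cp)))
    print(encode_utf8(0x20AC).hex())  # prints e282ac
```

Unlike parse_utf8, this sketch raises an exception on a malformed sequence instead of returning the offending byte; which behaviour is preferable depends on the caller.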