Diffstat (limited to 'Documentation/unicode.txt')
-rw-r--r-- | Documentation/unicode.txt | 175 |
1 files changed, 175 insertions, 0 deletions
diff --git a/Documentation/unicode.txt b/Documentation/unicode.txt
new file mode 100644
index 000000000000..4a33f81cadb1
--- /dev/null
+++ b/Documentation/unicode.txt
@@ -0,0 +1,175 @@
Last update: 2005-01-17, version 1.4

This file is maintained by H. Peter Anvin <unicode@lanana.org> as part
of the Linux Assigned Names And Numbers Authority (LANANA) project.
The current version can be found at:

    http://www.lanana.org/docs/unicode/unicode.txt

------------------------

The Linux kernel code has been rewritten to use Unicode to map
characters to fonts. By downloading a single Unicode-to-font table,
both the eight-bit character sets and UTF-8 mode are changed to use
the font as indicated.

This changes the semantics of the eight-bit character tables subtly.
The four character tables are now:

Map symbol      Map name                        Escape code (G0)

LAT1_MAP        Latin-1 (ISO 8859-1)            ESC ( B
GRAF_MAP        DEC VT100 pseudographics        ESC ( 0
IBMPC_MAP       IBM code page 437               ESC ( U
USER_MAP        User defined                    ESC ( K

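As an illustration of the table above, the following minimal Python
sketch builds the G0 designation sequence for each map. The names
G0_FINAL and select_g0 are hypothetical helpers introduced here for
illustration; they are not part of the kernel or of any library.

```python
ESC = "\x1b"

# Final byte of the "ESC ( x" sequence for each map symbol, copied from
# the table above. The dictionary itself is an illustrative construct.
G0_FINAL = {
    "LAT1_MAP": "B",
    "GRAF_MAP": "0",
    "IBMPC_MAP": "U",
    "USER_MAP": "K",
}


def select_g0(map_symbol: str) -> str:
    """Return the escape sequence that designates map_symbol as G0."""
    return ESC + "(" + G0_FINAL[map_symbol]
```

Writing select_g0("GRAF_MAP") to a Linux console would be expected to
switch G0 to the DEC VT100 pseudographics map, and
select_g0("LAT1_MAP") to switch it back.
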
In particular, ESC ( U is no longer "straight to font", since the font
might be completely different from the IBM character set. This
permits, for example, the use of block graphics even with a Latin-1
font loaded.

Note that although these codes are similar to ISO 2022, neither the
codes nor their uses match ISO 2022; Linux has two 8-bit codes (G0 and
G1), whereas ISO 2022 has four 7-bit codes (G0-G3).

In accordance with the Unicode standard/ISO 10646, the range U+F000 to
U+F8FF has been reserved for OS-wide allocation. The Unicode Standard
refers to this as a "Corporate Zone"; since this is inaccurate for
Linux, we call it the "Linux Zone". U+F000 was picked as the starting
point since it lets the direct-mapping area start on a large
power-of-two boundary (in case 1024- or 2048-character fonts ever
become necessary). This leaves U+E000 to U+EFFF as the End User Zone.

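The split described above can be sketched as a small Python lookup.
The function name private_use_zone is an assumption made for this
example only; the range boundaries are the ones stated in the text.

```python
END_USER_ZONE = range(0xE000, 0xF000)   # U+E000..U+EFFF
LINUX_ZONE = range(0xF000, 0xF900)      # U+F000..U+F8FF


def private_use_zone(code_point: int):
    """Name the zone a code point falls in, or None if it is in neither."""
    if code_point in END_USER_ZONE:
        return "End User Zone"
    if code_point in LINUX_ZONE:
        return "Linux Zone"
    return None
```

Note that 0xF000 is a multiple of 0x800 (2048), which is what allows a
2048-character font to be direct-mapped starting at an aligned address.
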
[v1.2]: The Unicode range from U+F000 up to U+F7FF has been
hard-coded to map directly to the loaded font, bypassing the
translation table. The user-defined map now defaults to U+F000 to
U+F0FF, emulating the previous behaviour. In practice, this range
might be shorter; for example, vgacon can only handle 256-character
(U+F000..U+F0FF) or 512-character (U+F000..U+F1FF) fonts.


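A minimal Python sketch of the direct mapping described above, assuming
the glyph index is simply the offset from U+F000. This is not the
kernel's implementation; the function name direct_glyph_index and the
font_size parameter are hypothetical.

```python
DIRECT_BASE = 0xF000   # first direct-mapped code point
DIRECT_LAST = 0xF7FF   # last direct-mapped code point


def direct_glyph_index(code_point: int, font_size: int = 256):
    """Return the font position for a direct-mapped code point.

    Returns None when the code point is outside U+F000..U+F7FF or when
    the loaded font (font_size glyphs) is too small to contain it.
    """
    if not DIRECT_BASE <= code_point <= DIRECT_LAST:
        return None
    index = code_point - DIRECT_BASE
    return index if index < font_size else None
```

With a 256-character font only U+F000..U+F0FF resolve to a glyph; a
512-character font extends this to U+F1FF, matching the vgacon limits
mentioned above.
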
Actual characters assigned in the Linux Zone
--------------------------------------------

In addition, the following characters not present in Unicode 1.1.4
have been defined; these are used by the DEC VT graphics map. [v1.2]
THIS USE IS OBSOLETE AND SHOULD NO LONGER BE USED; PLEASE SEE BELOW.

U+F800 DEC VT GRAPHICS HORIZONTAL LINE SCAN 1
U+F801 DEC VT GRAPHICS HORIZONTAL LINE SCAN 3
U+F803 DEC VT GRAPHICS HORIZONTAL LINE SCAN 7
U+F804 DEC VT GRAPHICS HORIZONTAL LINE SCAN 9

The DEC VT220 uses a 6x10 character matrix, and these characters form
a smooth progression in the DEC VT graphics character set. I have
omitted the scan 5 line, since it is also used as a block-graphics
character, and hence has been coded as U+2500 FORMS LIGHT HORIZONTAL.

[v1.3]: These characters have been officially added to Unicode 3.2.0;
they are added at U+23BA, U+23BB, U+23BC, U+23BD. Linux now uses the
new values.

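For text that still contains the obsolete Linux Zone values, the
correspondence above can be applied with a translation table. The
sketch below is one possible way to do this in Python; the names
OBSOLETE_TO_STANDARD and migrate_scan_lines are illustrative only, and
the pairing assumes the new code points are listed in the same order as
the old ones (scan 1, 3, 7, 9).

```python
# Obsolete Linux Zone code point -> code point assigned in Unicode 3.2.0.
OBSOLETE_TO_STANDARD = {
    0xF800: 0x23BA,  # horizontal line, scan 1
    0xF801: 0x23BB,  # horizontal line, scan 3
    0xF803: 0x23BC,  # horizontal line, scan 7
    0xF804: 0x23BD,  # horizontal line, scan 9
}


def migrate_scan_lines(text: str) -> str:
    """Replace the obsolete scan-line characters; leave the rest as-is."""
    return text.translate(OBSOLETE_TO_STANDARD)
```
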
[v1.2]: The following characters have been added to represent common
keyboard symbols that are unlikely to ever be added to Unicode proper
since they are horribly vendor-specific. This, of course, is an
excellent example of horrible design.

U+F810 KEYBOARD SYMBOL FLYING FLAG
U+F811 KEYBOARD SYMBOL PULLDOWN MENU
U+F812 KEYBOARD SYMBOL OPEN APPLE
U+F813 KEYBOARD SYMBOL SOLID APPLE

Klingon language support
------------------------

In 1996, Linux was the first operating system in the world to add
support for the artificial language Klingon, created by Marc Okrand
for the "Star Trek" television series. This encoding was later
adopted by the ConScript Unicode Registry and proposed (but ultimately
rejected) for inclusion in Unicode Plane 1. Thus, it remains as a
Linux/CSUR private assignment in the Linux Zone.

This encoding has been endorsed by the Klingon Language Institute.
For more information, contact them at:

    http://www.kli.org/

Since the characters at the beginning of the Linux Zone have been more
of the dingbats/symbols/forms type and this is a language, I have
located it at the end, on a 16-cell boundary in keeping with standard
Unicode practice.

NOTE: This range is now officially managed by the ConScript Unicode
Registry. The normative reference is at:

    http://www.evertype.com/standards/csur/klingon.html

Klingon has an alphabet of 26 characters, a positional numeric writing
system with 10 digits, and is written left-to-right, top-to-bottom.

Several glyph forms for the Klingon alphabet have been proposed.
However, since the set of symbols appears to be consistent throughout,
with only the actual shapes being different, in keeping with standard
Unicode practice these differences are considered font variants.

U+F8D0 KLINGON LETTER A
U+F8D1 KLINGON LETTER B
U+F8D2 KLINGON LETTER CH
U+F8D3 KLINGON LETTER D
U+F8D4 KLINGON LETTER E
U+F8D5 KLINGON LETTER GH
U+F8D6 KLINGON LETTER H
U+F8D7 KLINGON LETTER I
U+F8D8 KLINGON LETTER J
U+F8D9 KLINGON LETTER L
U+F8DA KLINGON LETTER M
U+F8DB KLINGON LETTER N
U+F8DC KLINGON LETTER NG
U+F8DD KLINGON LETTER O
U+F8DE KLINGON LETTER P
U+F8DF KLINGON LETTER Q
       - Written <q> in standard Okrand Latin transliteration
U+F8E0 KLINGON LETTER QH
       - Written <Q> in standard Okrand Latin transliteration
U+F8E1 KLINGON LETTER R
U+F8E2 KLINGON LETTER S
U+F8E3 KLINGON LETTER T
U+F8E4 KLINGON LETTER TLH
U+F8E5 KLINGON LETTER U
U+F8E6 KLINGON LETTER V
U+F8E7 KLINGON LETTER W
U+F8E8 KLINGON LETTER Y
U+F8E9 KLINGON LETTER GLOTTAL STOP

U+F8F0 KLINGON DIGIT ZERO
U+F8F1 KLINGON DIGIT ONE
U+F8F2 KLINGON DIGIT TWO
U+F8F3 KLINGON DIGIT THREE
U+F8F4 KLINGON DIGIT FOUR
U+F8F5 KLINGON DIGIT FIVE
U+F8F6 KLINGON DIGIT SIX
U+F8F7 KLINGON DIGIT SEVEN
U+F8F8 KLINGON DIGIT EIGHT
U+F8F9 KLINGON DIGIT NINE

U+F8FD KLINGON COMMA
U+F8FE KLINGON FULL STOP
U+F8FF KLINGON SYMBOL FOR EMPIRE

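Because the letters and digits above occupy contiguous runs, a code
point can be classified with simple range checks, and a digit string
can be converted to an integer. The Python sketch below is only an
illustration: the function names are hypothetical, and it assumes that
the most significant digit is written first, which the text above does
not state explicitly.

```python
KLINGON_LETTERS = range(0xF8D0, 0xF8EA)   # U+F8D0..U+F8E9, 26 letters
KLINGON_DIGITS = range(0xF8F0, 0xF8FA)    # U+F8F0..U+F8F9, digits 0-9
KLINGON_OTHER = {0xF8FD, 0xF8FE, 0xF8FF}  # comma, full stop, empire symbol


def klingon_class(code_point: int):
    """Classify a code point within the Klingon block, or return None."""
    if code_point in KLINGON_LETTERS:
        return "letter"
    if code_point in KLINGON_DIGITS:
        return "digit"
    if code_point in KLINGON_OTHER:
        return "punctuation or symbol"
    return None


def klingon_number(text: str) -> int:
    """Convert a string of Klingon digits to an int.

    Assumption: positional base 10 with the most significant digit first.
    """
    if not text:
        raise ValueError("empty digit string")
    value = 0
    for ch in text:
        if ord(ch) not in KLINGON_DIGITS:
            raise ValueError(f"not a Klingon digit: U+{ord(ch):04X}")
        value = value * 10 + (ord(ch) - KLINGON_DIGITS.start)
    return value
```
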
Other Fictional and Artificial Scripts
--------------------------------------

Since the assignment of the Klingon Linux Unicode block, a registry of
fictional and artificial scripts has been established by John Cowan
<jcowan@reutershealth.com> and Michael Everson <everson@evertype.com>.
The ConScript Unicode Registry is accessible at:

    http://www.evertype.com/standards/csur/

The ranges used fall at the low end of the End User Zone and hence
cannot be normatively assigned, but it is recommended that people who
wish to encode fictional scripts use these codes, in the interest of
interoperability. For Klingon, CSUR has adopted the Linux encoding.
The CSUR people are driving the addition of Tengwar and Cirth to
Unicode Plane 1; the addition of Klingon to Unicode Plane 1 has been
rejected, and so the above encoding remains official.