<?xml version='1.0' encoding="UTF-8"?>
<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/utf-8.xml,v 1.26 2005/06/24 19:30:27 fox2mike Exp $ -->
<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">

<guide link="/doc/en/utf-8.xml">
<title>Using UTF-8 with Gentoo</title>

<author title="Author">
  <mail link="slarti@gentoo.org">Thomas Martin</mail>
</author>
<author title="Contributor">
  <mail link="devil@gentoo.org.ua">Alexander Simonov</mail>
</author>
<author title="Editor">
  <mail link="fox2mike@gentoo.org">Shyam Mani</mail>
</author>

<abstract>
This guide shows you how to set up and use the UTF-8 Unicode character set with
your Gentoo Linux system, after explaining the benefits of Unicode and more
specifically UTF-8.
</abstract>

<!-- The content of this document is licensed under the CC-BY-SA license -->
<!-- See http://creativecommons.org/licenses/by-sa/2.5 -->
<license />

<version>2.7</version>
<date>2005-07-02</date>

<chapter>
<title>Character Encodings</title>
<section>
<title>What is a Character Encoding?</title>
<body>

<p>
Computers do not understand text themselves. Instead, every character is
represented by a number. Traditionally, each set of numbers used to represent
alphabets and characters (known as a coding system, encoding or character set)
was limited in size due to limitations in computer hardware.
</p>

</body>
</section>
<section>
<title>The History of Character Encodings</title>
<body>

<p>
The most common (or at least the most widely accepted) character set is
<b>ASCII</b> (American Standard Code for Information Interchange). It is widely
held that ASCII is the most successful software standard ever. Modern ASCII
was standardised in 1986 (ANSI X3.4, RFC 20, ISO/IEC 646:1991, ECMA-6) by the
American National Standards Institute.
</p>

<p>
ASCII is strictly seven-bit, meaning that it uses bit patterns representable
with seven binary digits, which provides a range of 0 to 127 in decimal. These
include 33 non-visible control characters: 32 between 0 and 31, plus the
final control character, DEL or delete, at 127. Characters 32 to 126 are
visible characters: a space, punctuation marks, Latin letters and numbers.
</p>
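
<p>
The split described above can be checked with a few lines of Python. This is
only an illustrative sketch; the function name <c>classify_ascii</c> is our own
invention and not part of any library.
</p>

```python
def classify_ascii():
    """Split the 128 ASCII code values into control and printable sets."""
    # Control characters are 0 to 31 plus DEL at 127.
    control = [n for n in range(128) if n in range(32) or n == 127]
    # Everything else (32 to 126) is printable, starting with the space.
    printable = [n for n in range(128) if n not in control]
    return control, printable


control, printable = classify_ascii()
print(len(control), len(printable))
print(repr(chr(printable[0])), repr(chr(printable[-1])))
```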

<p>
The eighth bit in ASCII was originally used as a parity bit for error checking.
If this is not desired, it is left as 0. This means that, with ASCII, each
character is represented by a single byte.
</p>

<p>
Although ASCII was enough for communication in modern English, in other
European languages that include accented characters, things were not so easy.
The ISO 8859 standards were developed to meet these needs. They were backwards
compatible with ASCII, but instead of leaving the eighth bit blank, they used
it to allow another 128 characters in each encoding. ISO 8859's limitations
soon came to light, and there are currently 15 variants of the ISO 8859
standard (8859-1 through to 8859-16, with 8859-12 abandoned). Outside of the
ASCII-compatible byte range of these character sets, there is often conflict
between the letters represented by each byte. To complicate interoperability
between character encodings further, Windows-1252 is used in some versions of
Microsoft Windows instead for Western European languages. It is a superset of
ISO 8859-1 that places printable characters in the byte range 128 to 159, which
ISO 8859-1 reserves for control codes. All of these sets retain ASCII
compatibility, however.
</p>
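
<p>
To make the conflict concrete, the following Python sketch decodes one and the
same byte under four legacy encodings. The helper name
<c>decode_everywhere</c> and the choice of byte 0xE4 are ours, picked purely
for illustration; the codec names are those shipped with Python.
</p>

```python
# One byte value above the ASCII range, decoded under four legacy encodings.
SAMPLE = bytes([0xE4])
ENCODINGS = ["iso8859_1", "iso8859_5", "koi8_r", "cp1251"]


def decode_everywhere(data, encodings=ENCODINGS):
    """Return {encoding name: decoded text} for the same raw bytes."""
    return {name: data.decode(name) for name in encodings}


# Print code points rather than glyphs so the output is terminal-safe.
for name, text in decode_everywhere(SAMPLE).items():
    print(name, "U+%04X" % ord(text))

# Bytes inside the ASCII range decode identically in all four encodings.
print(set(decode_everywhere(b"Gentoo").values()))
```

<p>
Each encoding maps the byte to a different character, while the pure ASCII
string comes out the same every time.
</p>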

<p>
The necessary development of completely different encodings for non-Latin
alphabets, such as the multibyte EUC (Extended Unix Coding), which is used for
Japanese and Korean (and to a lesser extent Chinese), created more confusion,
while other operating systems still used different character sets for the same
languages, for example, Shift-JIS and ISO-2022-JP. Users wishing to view
Cyrillic glyphs had to choose between KOI8-R for Russian and Bulgarian or
KOI8-U for Ukrainian, as well as all the other Cyrillic encodings such as the
unsuccessful ISO 8859-5, and the common Windows-1251 set. All of these
character sets broke most compatibility with ASCII (although KOI8 encodings
place Cyrillic characters in Latin order, so if the eighth bit is stripped,
text is still decipherable on an ASCII terminal through case-reversed
transliteration).
</p>
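
<p>
The KOI8 property mentioned above can be reproduced in Python. The sketch
below clears the eighth bit of a KOI8-R encoded word; <c>strip_eighth_bit</c>
is a hypothetical helper written for this example only.
</p>

```python
def strip_eighth_bit(data):
    """Clear the top bit of every byte, as a seven-bit channel would."""
    return bytes(b % 0x80 for b in data)


word = "Привет"                 # "hello" in Russian
koi8 = word.encode("koi8_r")

# The result is plain ASCII: a case-reversed Latin transliteration.
print(strip_eighth_bit(koi8).decode("ascii"))
```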

<p>
This has led to confusion, and also to an almost total inability for
multilingual communication, especially across different alphabets. Enter
Unicode.
</p>

</body>
</section>
<section>
<title>What is Unicode?</title>
<body>

<p>
Unicode throws away the traditional single-byte limit of character sets. It
uses 17 "planes" of 65,536 code points to describe a maximum of 1,114,112
characters. As the first plane, also known as the "Basic Multilingual Plane" or
BMP, contains almost everything you will ever use, many have made the wrong
assumption that Unicode is a 16-bit character set.
</p>
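
<p>
The arithmetic behind the planes is easy to verify. The following Python
sketch uses names of our own choosing (<c>PLANE_SIZE</c>, <c>plane_of</c>) and
picks the musical G clef, U+1D11E, as an example of a character outside the
BMP.
</p>

```python
PLANE_SIZE = 0x10000      # 65,536 code points per plane
PLANE_COUNT = 17


def plane_of(char):
    """Return the Unicode plane number (0 is the BMP) of one character."""
    return ord(char) // PLANE_SIZE


# Total number of code points across all planes.
print(PLANE_COUNT * PLANE_SIZE)

# A Latin letter, a CJK ideograph and a musical symbol.
print(plane_of("A"), plane_of("\u65e5"), plane_of("\U0001D11E"))
```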

<p>
Unicode has been mapped in many different ways, but the two most common are
<b>UTF</b> (Unicode Transformation Format) and <b>UCS</b> (Universal Character
Set). A number after UTF indicates the number of bits in one unit, while the
number after UCS indicates the number of bytes: UTF-16 uses 16-bit units, for
example, and UCS-4 uses four bytes per character. UTF-8 has become the most
widespread means for the interchange of Unicode text as a result of its
eight-bit clean nature, and it is the subject of this document.
</p>

</body>
</section>
<section>
<title>UTF-8</title>
<body>

<p>
UTF-8 is a variable-length character encoding, which in this instance means
that it uses 1 to 4 bytes per symbol. Single-byte sequences are used for
encoding ASCII, giving the character set full backwards compatibility with
ASCII. UTF-8 means that ASCII and Latin characters are interchangeable with
little increase in the size of the data, because ASCII characters still occupy
only one byte each. Users of Eastern alphabets such as Japanese, who have been
assigned a higher byte range, are less happy: their characters take three bytes
in UTF-8 instead of the two used by their traditional encodings, an increase of
as much as 50% in the size of their data.
</p>
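
<p>
These sizes can be measured directly. The Python sketch below is a minimal
illustration under our own naming (<c>utf8_length</c> is not a library
function); it compares a few sample characters and a short Japanese string in
EUC-JP and UTF-8.
</p>

```python
def utf8_length(char):
    """Number of bytes UTF-8 needs for a single character."""
    return len(char.encode("utf-8"))


# An ASCII letter, an accented Latin letter, a kanji and a musical symbol.
for char in ["A", "\u00e9", "\u65e5", "\U0001D11E"]:
    print("U+%04X" % ord(char), utf8_length(char))

# ASCII text is byte-for-byte identical in ASCII and UTF-8.
print("Gentoo".encode("ascii") == "Gentoo".encode("utf-8"))

# Three kanji: two bytes each in EUC-JP, three bytes each in UTF-8.
text = "\u65e5\u672c\u8a9e"
print(len(text.encode("euc_jp")), len(text.encode("utf-8")))
```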

</body>
</section>
<section>
<title>What UTF-8 Can Do for You</title>
<body>

<p>
UTF-8 allows you to work in a standards-compliant and internationally accepted
multilingual environment, with a comparatively low data redundancy. UTF-8 is
the preferred way for transmitting non-ASCII characters over the Internet,
through email, IRC or almost any other medium. Despite this, many people regard
UTF-8 in online communication as abusive. It is always best to be aware of the
attitude towards UTF-8 in a specific channel, mailing list or Usenet group
before using <e>non-ASCII</e> UTF-8.
</p>

</body>
</section>
</chapter>

<chapter>
<title>Setting up UTF-8 with Gentoo Linux</title>
<section>
<title>Finding or Creating UTF-8 Locales</title>
<body>

<p>
Now that you understand the principles behind Unicode, you're ready to start
using UTF-8 with your system.
</p>

<p>
The preliminary requirement for UTF-8 is to have a version of glibc installed
that has national language support. The recommended means to do this is the
<path>/etc/locales.build</path> file in combination with the <c>userlocales</c>
USE flag. Explaining the usage of this file is beyond the scope of this
document, but it is well documented in the comments within it. It is also
explained in the <uri
link="/doc/en/guide-localization.xml#doc_chap3_sect3">Gentoo Localisation
Guide</uri>.
</p>

<p>
Next, we'll need to decide whether a UTF-8 locale is already available for our
language, or whether we need to create one.
</p>

<pre caption="Checking for an existing UTF-8 locale">
<comment>(Replace "en_GB" with your desired locale setting)</comment>
# <i>locale -a | grep 'en_GB'</i>
en_GB
en_GB.UTF-8
</pre>

<p>
From the output of this command line, we need to take the result with a suffix
similar to <c>.UTF-8</c>. If there is no result with a suffix similar to
<c>.UTF-8</c>, we need to create a UTF-8 compatible locale.
</p>
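
<p>
If you would rather script this check, the selection rule can be sketched in
Python as follows. The function <c>utf8_locales</c> and the sample output are
hypothetical; the function works on text you have already captured from
<c>locale -a</c> and assumes the suffix is spelled either <c>UTF-8</c> or
<c>utf8</c>.
</p>

```python
def utf8_locales(locale_output, language):
    """Pick the UTF-8 variants of `language` out of `locale -a` output."""
    found = []
    for line in locale_output.splitlines():
        name = line.strip()
        base, _, rest = name.partition(".")
        # Drop an optional modifier such as "@euro" after the codeset.
        codeset = rest.partition("@")[0]
        # Accept both the "UTF-8" and the "utf8" spelling of the suffix.
        if base == language and codeset.lower().replace("-", "") == "utf8":
            found.append(name)
    return found


# Made-up sample of what `locale -a` might print.
sample = "C\nPOSIX\nen_GB\nen_GB.UTF-8\nen_US.utf8\n"
print(utf8_locales(sample, "en_GB"))
```

<p>
An empty list means that no UTF-8 locale exists yet for that language.
</p>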

<note>
Only execute the following code listing if you do not have a UTF-8 locale
available for your language.
</note>

<pre caption="Creating a UTF-8 locale">
<comment>(Replace "en_GB" with your desired locale setting)</comment>
# <i>localedef -i en_GB -f UTF-8 en_GB.UTF-8</i>
</pre>

<p>
Another way to include a UTF-8 locale is to add it to the
<path>/etc/locales.build</path> file and rebuild <c>glibc</c> with the
<c>userlocales</c> USE flag set.
</p>

<pre caption="Line in /etc/locales.build">
en_GB.UTF-8/UTF-8
</pre>

</body>
</section>
<section>
<title>Setting the Locale</title>
<body>

<p>
There is one environment variable that needs to be set in order to use
our new UTF-8 locales: <c>LC_ALL</c> (this variable overrides the <c>LANG</c>
setting as well). There are many different ways to set it; some people
prefer to only have a UTF-8 environment for a specific user, in which case
they set it in their <path>~/.profile</path> (if you use <c>/bin/sh</c>),
<path>~/.bash_profile</path> or <path>~/.bashrc</path> (if you use
<c>/bin/bash</c>).
</p>

<p>
Others prefer to set the locale globally. One specific circumstance where
the author particularly recommends doing this is when
<path>/etc/init.d/xdm</path> is in use, because
this init script starts the display manager and desktop before any of the
aforementioned shell startup files are sourced, and so before any of the
variables are in the environment.
</p>

<p>
Setting the locale globally should be done using
<path>/etc/env.d/02locale</path>. The file should look something like the
following:
</p>

<pre caption="Demonstration /etc/env.d/02locale">
<comment>(As always, change "en_GB.UTF-8" to your locale)</comment>
LC_ALL="en_GB.UTF-8"
</pre>

<p>
Next, the environment must be updated with the change.
</p>

<pre caption="Updating the environment">
# <i>env-update</i>
>>> Regenerating /etc/ld.so.cache...
 * Caching service dependencies ...
# <i>source /etc/profile</i>
</pre>

<p>
Now, run <c>locale</c> with no arguments to see if we have the correct
variables in our environment:
</p>

<pre caption="Checking if our new locale is in the environment">
# <i>locale</i>
LANG=
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=en_GB.UTF-8
</pre>
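
<p>
Every category reports the same value because <c>LC_ALL</c> takes precedence
over both the individual <c>LC_*</c> variables and <c>LANG</c>. The Python
sketch below models that lookup order as we understand it from POSIX; the
function <c>effective_locale</c> is hypothetical, and the fallback to
<c>"C"</c> is an assumption, since the real default is implementation-defined.
</p>

```python
def effective_locale(category, env):
    """Resolve one locale category: LC_ALL first, then the category's
    own variable, then LANG."""
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:                 # unset and empty both count as "not set"
            return value
    return "C"                    # assumed default for this sketch


# Mirrors the listing above: LANG is empty, LC_ALL decides everything.
env = {"LANG": "", "LC_ALL": "en_GB.UTF-8"}
print(effective_locale("LC_CTYPE", env))

# Without LC_ALL, a category variable beats LANG for that category only.
mixed = {"LANG": "de_DE.UTF-8", "LC_TIME": "en_GB.UTF-8"}
print(effective_locale("LC_TIME", mixed), effective_locale("LC_CTYPE", mixed))
```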

<p>
That's everything. You are now using UTF-8 locales, and the next hurdle is the
configuration of the applications you use from day to day.
</p>

</body>
</section>
</chapter>

<chapter>
<title>Application Support</title>
<section>
<body>

<p>
When Unicode first started gaining momentum in the software world, multibyte
character sets were not well suited to languages like C, in which many of the
day-to-day programs people use are written. Even today, some programs are not
able to handle UTF-8 properly. Fortunately, most are!
</p>

</body>
</section>
<section>
<title>Filenames, NTFS, and FAT</title>
<body>

<p>
There are several NLS options in the Linux kernel configuration menu, but it is
important to not become confused! For the most part, the only thing you need to
do is to build UTF-8 NLS support into your kernel, and change the default NLS
option to utf8.
</p>

<pre caption="Kernel configuration steps for UTF-8 NLS">
File Systems -->
  Native Language Support -->
    (utf8) Default NLS Option
    &lt;*&gt; NLS UTF8
    <comment>(Also &lt;*&gt; other character sets that are in use in
    your FAT filesystems or Joliet CD-ROMs.)</comment>
</pre>

<p>
If you plan on mounting NTFS partitions, you may need to specify an <c>nls=</c>
option with mount. For more information, see <c>man mount</c>.
</p>

<p>
For changing the encoding of filenames, <c>app-text/convmv</c> can be used.
</p>

<pre caption="Example usage of convmv">
# <i>emerge --ask app-text/convmv</i>
<comment>(By default convmv only prints what it would do; add --notest to actually rename)</comment>
# <i>convmv -f current-encoding -t utf-8 filename</i>
</pre>

<p>
For changing the <e>contents</e> of files, use the <c>iconv</c> utility,
bundled with <c>glibc</c>:
</p>

<pre caption="Example usage of iconv">
<comment>(Substitute iso-8859-1 with the charset you are converting from)</comment>
<comment>(Check the output is sane)</comment>
# <i>iconv -f iso-8859-1 -t utf-8 filename</i>
<comment>(To convert a file, you must create another file)</comment>
# <i>iconv -f iso-8859-1 -t utf-8 filename > newfile</i>
</pre>

<p>
<c>app-text/recode</c> can also be used for this purpose.
</p>
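
<p>
What these tools do to the contents of a file comes down to decoding the bytes
from the old character set and encoding them again as UTF-8. The Python sketch
below illustrates the idea; <c>transcode</c> and <c>convert_file</c> are
hypothetical helpers of ours, not a replacement for <c>iconv</c>.
</p>

```python
def transcode(data, source="iso8859_1", target="utf-8"):
    """Re-encode raw bytes from one character set to another.

    Decoding raises UnicodeDecodeError when the bytes are not valid in
    `source`, which is a cheap sanity check before writing anything.
    """
    return data.decode(source).encode(target)


def convert_file(old_path, new_path, source="iso8859_1"):
    """Write a UTF-8 copy of old_path to new_path, leaving the original."""
    with open(old_path, "rb") as old_file:
        converted = transcode(old_file.read(), source)
    with open(new_path, "wb") as new_file:
        new_file.write(converted)


latin1 = b"caf\xe9"                 # "cafe" with an acute accent, ISO 8859-1
print(transcode(latin1))
```

<p>
As with <c>iconv</c>, the sketch writes to a second file rather than
converting in place.
</p>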

</body>
</section>
<section>
<title>The System Console</title>
<body>

<impo>
You need >=sys-apps/baselayout-1.11.9 for Unicode on the console.
</impo>

<p>
To enable UTF-8 on the console, you should edit <path>/etc/rc.conf</path> and
set <c>UNICODE="yes"</c>. Also read the comments in that file: it is
important to have a font that has a good range of characters if you plan on
making the most of Unicode.
</p>

<p>
The <c>KEYMAP</c> variable, set in <path>/etc/conf.d/keymaps</path>, should
have a Unicode keymap specified.
</p>

<pre caption="Example /etc/conf.d/keymaps snippet">
<comment>(Change "uk" to your local layout)</comment>
KEYMAP="uk"
</pre>

</body>
</section>
<section>
<title>Ncurses and Slang</title>
<body>

<note>
Ignore any mention of Slang in this section if you do not have it installed or
do not use it.
</note>

<p>
It is wise to add <c>unicode</c> to your global USE flags in
<path>/etc/make.conf</path>, and then to remerge <c>sys-libs/ncurses</c> and
<c>sys-libs/slang</c> if appropriate:
</p>

<pre caption="Emerging ncurses and slang">
<comment>(We avoid putting these libraries in our world file with --oneshot)</comment>
# <i>emerge --oneshot sys-libs/ncurses sys-libs/slang</i>
</pre>

<p>
We also need to rebuild packages that link to these, now that the USE changes
have been applied. The tool we use (<c>revdep-rebuild</c>) is part of the
<c>gentoolkit</c> package.
</p>

<pre caption="Rebuilding of programs that link to ncurses or slang">
# <i>revdep-rebuild --soname libncurses.so.5</i>
# <i>revdep-rebuild --soname libslang.so.1</i>
</pre>

</body>
</section>
<section>
<title>KDE, GNOME and Xfce</title>
<body>

<p>
All of the major desktop environments have full Unicode support, and will
require no further setup than what has already been covered in this guide. This
is because the underlying graphical toolkits (Qt or GTK+2) are UTF-8 aware.
Consequently, all applications running on top of these toolkits should be
UTF-8-aware out of the box.
</p>

<p>
The exceptions to this rule come in Xlib and GTK+1. GTK+1 requires an
iso-10646-1 FontSpec in the <path>~/.gtkrc</path>, for example
<c>-misc-fixed-*-*-*-*-*-*-*-*-*-*-iso10646-1</c>. Also, applications using
Xlib or Xaw will need to be given a similar FontSpec, otherwise they will not
work.
</p>

<note>
If you have a version of the gnome1 control center around, use that instead.
Pick any iso10646-1 font from there.
</note>

<pre caption="Example ~/.gtkrc (for GTK+1) that defines a Unicode compatible font">
style "user-font"
{
    fontset="-misc-fixed-*-*-*-*-*-*-*-*-*-*-iso10646-1"
}
widget_class "*" style "user-font"
</pre>

<p>
If an application has support for both a Qt and GTK+2 GUI, the GTK+2 GUI will
generally give better results with Unicode.
</p>

</body>
</section>
<section>
<title>X11 and Fonts</title>
<body>

<impo>
<c>x11-base/xorg-x11</c> has far better support for Unicode than XFree86
and is <e>highly</e> recommended.
</impo>

<p>
TrueType fonts have support for Unicode, and most of the fonts that ship with
Xorg have impressive character support, although, obviously, not every single
glyph available in Unicode has been created for every font. To build fonts
(including the Bitstream Vera set) with support for East Asian letters with X,
make sure you have the <c>cjk</c> USE flag set. Many other applications utilise
this flag, so it may be worthwhile to add it as a permanent USE flag.
</p>

<p>
Also, several font packages in Portage are Unicode aware.
</p>

<pre caption="Optional: Install some more Unicode-aware fonts">
# <i>emerge terminus-font intlfonts freefonts cronyx-fonts corefonts</i>
</pre>

</body>
</section>
<section>
<title>Window Managers and Terminal Emulators</title>
<body>

<p>
Window managers not built on GTK or Qt generally have very good Unicode
support, as they often use the Xft library for handling fonts. If your window
manager does not use Xft for fonts, you can still use the FontSpec mentioned in
the section on KDE, GNOME and Xfce as a Unicode font.
</p>

<p>
Terminal emulators that use Xft and support Unicode are harder to come by.
Aside from Konsole and gnome-terminal, the best options in Portage are
<c>x11-terms/rxvt-unicode</c>, <c>xfce-extra/terminal</c>,
<c>gnustep-apps/terminal</c>, <c>x11-terms/mlterm</c>, <c>x11-terms/mrxvt</c> or
plain <c>x11-terms/xterm</c> when built with the <c>unicode</c> USE flag and
invoked as <c>uxterm</c>. <c>app-misc/screen</c> supports UTF-8 too, when
invoked as <c>screen -U</c> or when the following is put into the
<path>~/.screenrc</path>:
</p>

<pre caption="~/.screenrc for UTF-8">
defutf8 on
</pre>

</body>
</section>
<section>
<title>Vim, Emacs, Xemacs and Nano</title>
<body>

<p>
Vim provides full UTF-8 support, and also has built-in detection of UTF-8 files.
For further information in Vim, use <c>:help mbyte.txt</c>.
</p>

<p>
Emacs 22.x and higher has full UTF-8 support as well. Xemacs 22.x does not
support combining characters yet.
</p>

<p>
Lower versions of Emacs and/or Xemacs might require you to install
<c>app-emacs/mule-ucs</c> and/or <c>app-xemacs/mule-ucs</c>
and add the following code to your <path>~/.emacs</path> to have support for CJK
languages in UTF-8:
</p>

<pre caption="Emacs CJK UTF-8 support">
(require 'un-define)
(require 'jisx0213)
(set-language-environment "Japanese")
(set-default-coding-systems 'utf-8)
(set-terminal-coding-system 'utf-8)
</pre>

<p>
Nano currently does not provide support for UTF-8, although it has been planned
for a long time. With luck, this will change in future. At the time of writing,
UTF-8 support is in Nano's CVS, and should be included in the next release.
</p>

</body>
</section>
<section>
<title>Shells</title>
<body>

<p>
Currently, <c>bash</c> provides full Unicode support through the GNU readline
library. Z Shell users are in a somewhat worse position: no parts of the
shell have Unicode support, although there is a concerted effort to add
multibyte character set support underway at the moment.
</p>

<p>
The C shell, <c>tcsh</c> and <c>ksh</c> do not provide UTF-8 support at all.
</p>

</body>
</section>
<section>
<title>Irssi</title>
<body>

<p>
Since 0.8.10, Irssi has complete UTF-8 support, although it does require the
user to set an option.
</p>

<pre caption="Enabling UTF-8 in Irssi">
/set term_charset UTF-8
</pre>

<p>
For channels where non-ASCII characters are often exchanged in non-UTF-8
charsets, the <c>/recode</c> command may be used to convert the characters.
Type <c>/help recode</c> for more information.
</p>

</body>
</section>
<section>
<title>Mutt</title>
<body>

<p>
The Mutt mail user agent has very good Unicode support. To use UTF-8 with Mutt,
put the following in your <path>~/.muttrc</path>:
</p>

<pre caption="~/.muttrc for UTF-8">
set send_charset="utf8" <comment>(outgoing character set)</comment>
set charset="utf8"      <comment>(display character set)</comment>
</pre>

<note>
You may still see '?' in mail you read with Mutt. This is a result of people
using a mail client which does not indicate the charset used. You can't do much
about this other than ask them to configure their client correctly.
</note>

<p>
Further information is available from the <uri
link="http://wiki.mutt.org/index.cgi?MuttFaq/Charset">Mutt Wiki</uri>.
</p>

</body>
</section>
<section>
<title>Less</title>
<body>

<p>
We all pipe the output of commands through <c>more</c> or <c>less</c> to be
able to read it comfortably, for example <c>dmesg | less</c>. While <c>more</c>
only needs the shell to be UTF-8 aware, <c>less</c> needs the
<c>LESSCHARSET</c> environment variable set to ensure that Unicode characters
are rendered correctly. This can be set in <path>/etc/profile</path> or
<path>~/.bash_profile</path>. Fire up the editor of your choice and add the
following line to one of the files mentioned above.
</p>

<pre caption="Setting up the environment variable for less">
export LESSCHARSET=utf-8
</pre>

</body>
</section>
<section>
<title>Man</title>
<body>

<p>
Man pages are an integral part of any Linux machine. To ensure that any
Unicode in your man pages renders correctly, edit <path>/etc/man.conf</path>
and replace a line as shown below.
</p>

<pre caption="man.conf changes for Unicode support">
<comment>(This is the old line)</comment>
NROFF           /usr/bin/nroff -Tascii -c -mandoc
<comment>(Replace the one above with this)</comment>
NROFF           /usr/bin/nroff -mandoc -c
</pre>

</body>
</section>
<section>
<title>elinks and links</title>
<body>

<p>
These are commonly used text-based browsers, and we shall see how we can enable
UTF-8 support on them. On <c>elinks</c> and <c>links</c>, there are two ways to
go about this: using the Setup option from within the browser, or editing the
config file. To set the option through the browser, open a site with
<c>elinks</c> or <c>links</c>, press <c>Alt+S</c> to enter the Setup Menu, then
select Terminal options or press <c>T</c>. Scroll down and select the last
option, <c>UTF-8 I/O</c>, by pressing Enter. Then save and exit the menu. On
<c>links</c> you may have to press <c>Alt+S</c> again and then press <c>S</c>
to save. The config file option is shown below.
</p>

<pre caption="Enabling UTF-8 for elinks/links">
<comment>(For elinks, edit /etc/elinks/elinks.conf or ~/.elinks/elinks.conf and
add the following line)</comment>
set terminal.linux.utf_8_io = 1

<comment>(For links, edit ~/.links/links.cfg and add the following
line)</comment>
terminal "xterm" 0 1 0 us-ascii utf-8
</pre>

</body>
</section>
<section>
<title>Testing it all out</title>
<body>

<p>
There are numerous UTF-8 test websites around. <c>net-www/w3m</c>,
<c>net-www/links</c>, <c>net-www/elinks</c>, <c>net-www/lynx</c> and all
Mozilla based browsers (including Firefox) support UTF-8. Konqueror and Opera
have full UTF-8 support too.
</p>

<p>
When using one of the text-only web browsers, make absolutely sure you are
using a Unicode-aware terminal.
</p>

<p>
If you see certain characters displayed as boxes with letters or numbers
inside, this means that your font does not have a glyph for the character that
the text calls for. Instead, it displays a box with the hexadecimal Unicode
code point of that character.
</p>

<ul>
  <li>
    <uri link="http://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html">A W3C
    UTF-8 Test Page</uri>
  </li>
  <li>
    <uri link="http://titus.uni-frankfurt.de/indexe.htm?/unicode/unitest.htm">
    A UTF-8 test page provided by the University of Frankfurt</uri>
  </li>
</ul>
734 |
|
|
|
735 |
|
|
</body> |
736 |
|
|
</section> |
737 |
|
|
<section> |
738 |
|
|
<title>Input Methods</title> |
739 |
|
|
<body> |
740 |
|
|
|
741 |
|
|
<p> |
742 |
|
|
<e>Dead keys</e> may be used to input characters in X that are not included on
your keyboard. They work by pressing your right Alt key (labelled AltGr in some
countries) together with a key from the non-alphabetical section of the
keyboard to the left of the Return key, releasing both, and then pressing a
letter. The dead key modifies that letter. Input can be further modified by
holding the Shift key while pressing AltGr and the modifier key.
748 |
|
|
</p> |
749 |
|
|
|
750 |
|
|
<p> |
751 |
|
|
To enable dead keys in X, you need a layout that supports them. Most European
752 |
|
|
layouts already have dead keys with the default variant. However, this is not |
753 |
|
|
true of North American layouts. Although there is a degree of inconsistency |
754 |
|
|
between layouts, the easiest solution seems to be to use a layout in the form |
755 |
|
|
"en_US" rather than "us", for example. The layout is set in |
756 |
|
|
<path>/etc/X11/xorg.conf</path> like so: |
757 |
|
|
</p> |
758 |
|
|
|
759 |
|
|
<pre caption="/etc/X11/xorg.conf snippet"> |
760 |
|
|
Section "InputDevice"
    Identifier "Keyboard0"
    Driver "kbd"
    Option "XkbLayout" "en_US" <comment># Rather than just "us"</comment>
    <comment>(Other Xkb options here)</comment>
EndSection
766 |
|
|
</pre> |
767 |
|
|
|
768 |
|
|
<note> |
769 |
|
|
The preceding change only needs to be applied if you are using a North American |
770 |
|
|
layout, or another layout where dead keys do not seem to be working. European |
771 |
|
|
users should have working dead keys as is. |
772 |
|
|
</note> |
773 |
|
|
|
774 |
|
|
<p> |
775 |
bennyc |
1.11 |
This change will come into effect when your X server is restarted. To apply the |
776 |
neysx |
1.1 |
change now, use the <c>setxkbmap</c> tool, for example, <c>setxkbmap en_US</c>. |
777 |
|
|
</p> |
778 |
|
|
|
779 |
|
|
<p> |
780 |
|
|
It is probably easiest to describe dead keys with examples. Although the |
781 |
bennyc |
1.11 |
results are locale dependent, the concepts should remain the same regardless of |
782 |
neysx |
1.1 |
locale. The examples contain UTF-8, so to view them you need to either tell |
783 |
|
|
your browser to view the page as UTF-8, or have a UTF-8 locale already |
784 |
|
|
configured. |
785 |
|
|
</p> |
786 |
|
|
|
787 |
|
|
<p> |
788 |
|
|
When I press AltGr and [ at once, release them, and then press a, 'ä' is
produced. When I press AltGr and [ at once, release them, and then press e,
'ë' is produced. When I press AltGr and ; at once, release them, and then
press a, 'á' is produced, and when I press AltGr and ; at once, release them,
and then press e, 'é' is produced.
792 |
neysx |
1.1 |
</p> |
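<p>
The effect of a dead key, an accent followed by a letter yielding one accented
character, resembles Unicode canonical composition. The sketch below is an
analogy only: X handles dead keys through its own keyboard tables, not through
Unicode normalization, and the helper name <c>compose</c> is an assumption made
for illustration. It uses Python's standard <c>unicodedata</c> module to merge a
base letter with a combining mark.
</p>

```python
import unicodedata

# Combining marks that correspond to the dead keys used in the examples.
DIAERESIS = "\u0308"   # COMBINING DIAERESIS
ACUTE = "\u0301"       # COMBINING ACUTE ACCENT


def compose(letter, combining_mark):
    """Merge a base letter and a combining mark into one character (NFC)."""
    return unicodedata.normalize("NFC", letter + combining_mark)


for letter, mark in (("a", DIAERESIS), ("e", DIAERESIS), ("a", ACUTE), ("e", ACUTE)):
    result = compose(letter, mark)
    # Print the code point rather than the character so the output stays ASCII.
    print(letter, "+", unicodedata.name(mark), "->", "U+%04X" % ord(result))
```

<p>
The four results are the code points of 'ä', 'ë', 'á' and 'é', the same
characters produced by the key sequences described above.
</p>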
793 |
|
|
|
794 |
|
|
<p> |
795 |
|
|
By pressing AltGr, Shift and [ at once, releasing them, and then pressing a, a
Scandinavian 'å' is produced. Similarly, when I press AltGr, Shift and [ at
once, release <e>only</e> the [, and then press it again, '˚' is produced.
Although it looks like one, this character (U+02DA) is not the same as a degree
symbol (U+00B0). The same technique works for other accents produced by dead
keys: pressing AltGr and [, releasing only the [, then pressing it again makes
'¨'.
801 |
|
|
</p> |
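<p>
Characters that look alike, such as the ring above and the degree sign, can be
told apart by their code points and names. The following is a small sketch
using Python's standard <c>unicodedata</c> module. The helper name
<c>describe</c> is an assumption chosen for illustration. For each character it
returns the code point, the official Unicode name and the UTF-8 byte sequence.
</p>

```python
import unicodedata


def describe(char):
    """Return (code point, Unicode name, UTF-8 bytes in hex) for one character."""
    code_point = "U+%04X" % ord(char)
    utf8_hex = " ".join("%02x" % byte for byte in char.encode("utf-8"))
    return code_point, unicodedata.name(char), utf8_hex


# Ring above, degree sign and diaeresis, written as escapes to keep the
# source pure ASCII.
for char in ("\u02da", "\u00b0", "\u00a8"):
    print(*describe(char))
```

<p>
The ring above and the degree sign have different names and different UTF-8
byte sequences, which is why searching for one does not find the other.
</p>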
802 |
|
|
|
803 |
|
|
<p> |
804 |
|
|
AltGr can also be used with alphabetical keys alone. For example, AltGr and m
produce a Greek lower-case letter mu: 'µ'. AltGr and s produce a scharfes s or
eszett: 'ß'. As many European users would expect (because it is marked on
their keyboard), AltGr and 4 produce a Euro sign, '€'.
808 |
neysx |
1.1 |
</p> |
809 |
|
|
|
810 |
|
|
</body> |
811 |
|
|
</section> |
812 |
|
|
<section> |
813 |
|
|
<title>Resources</title> |
814 |
|
|
<body> |
815 |
|
|
|
816 |
|
|
<ul> |
817 |
|
|
<li> |
818 |
|
|
<uri link="http://www.wikipedia.com/wiki/Unicode">The Wikipedia entry for |
819 |
|
|
Unicode</uri> |
820 |
|
|
</li> |
821 |
|
|
<li> |
822 |
|
|
<uri link="http://www.wikipedia.com/wiki/UTF-8">The Wikipedia entry for |
823 |
|
|
UTF-8</uri> |
824 |
|
|
</li> |
825 |
|
|
<li><uri link="http://www.unicode.org">Unicode.org</uri></li> |
826 |
|
|
<li><uri link="http://www.utf-8.com">UTF-8.com</uri></li> |
827 |
|
|
<li><uri link="http://www.ietf.org/rfc/rfc3629.txt">RFC 3629</uri></li> |
828 |
|
|
<li><uri link="http://www.ietf.org/rfc/rfc2277.txt">RFC 2277</uri></li> |
829 |
bennyc |
1.11 |
<li> |
830 |
|
|
<uri |
831 |
|
|
link="http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF">Characters vs. |
832 |
|
|
Bytes</uri> |
833 |
|
|
</li> |
834 |
neysx |
1.1 |
</ul> |
835 |
|
|
|
836 |
|
|
</body> |
837 |
|
|
</section> |
838 |
|
|
</chapter> |
839 |
|
|
</guide> |