1 |
<?xml version='1.0' encoding="UTF-8"?> |
2 |
<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/utf-8.xml,v 1.5 2005/02/24 14:57:18 cam Exp $ --> |
3 |
<!DOCTYPE guide SYSTEM "/dtd/guide.dtd"> |
4 |
|
5 |
<guide link="/doc/en/utf-8.xml"> |
6 |
<title>Using UTF-8 with Gentoo</title> |
7 |
|
8 |
<author title="Author"> |
9 |
<mail link="slarti@gentoo.org">Thomas Martin</mail> |
10 |
</author> |
11 |
<author title="Contributor"> |
12 |
<mail link="devil@gentoo.org.ua">Alexander Simonov</mail> |
13 |
</author> |
14 |
|
15 |
<abstract> |
16 |
This guide shows you how to set up and use the UTF-8 Unicode character set with |
17 |
your Gentoo Linux system, after explaining the benefits of Unicode and more |
18 |
specifically UTF-8. |
19 |
</abstract> |
20 |
|
21 |
<license /> |
22 |
|
23 |
<version>1.5</version> |
24 |
<date>2005-04-23</date> |
25 |
|
26 |
<chapter> |
27 |
<title>Character Encodings</title> |
28 |
<section> |
29 |
<title>What is a Character Encoding?</title> |
30 |
<body> |
31 |
|
32 |
<p> |
33 |
Computers do not understand text themselves. Instead, every character is |
34 |
represented by a number. Traditionally, each set of numbers used to represent |
35 |
alphabets and characters (known as a coding system, encoding or character set) |
36 |
was limited in size due to limitations in computer hardware. |
37 |
</p> |
38 |
|
39 |
</body> |
40 |
</section> |
41 |
<section> |
42 |
<title>The History of Character Encodings</title> |
43 |
<body> |
44 |
|
45 |
<p> |
46 |
The most common (or at least the most widely accepted) character set is |
47 |
<b>ASCII</b> (American Standard Code for Information Interchange). It is widely |
48 |
held that ASCII is the most successful software standard ever. Modern ASCII |
49 |
was standardised in 1986 (ANSI X3.4, RFC 20, ISO/IEC 646:1991, ECMA-6) by the |
50 |
American National Standards Institute. |
51 |
</p> |
52 |
|
53 |
<p> |
ASCII is strictly seven-bit, meaning that it uses bit patterns representable
with seven binary digits, which provides a range of 0 to 127 in decimal. These
include 33 non-visible control characters: 0 to 31, plus the final control
character, DEL or delete, at 127. Characters 32 to 126 are visible characters:
a space, punctuation marks, Latin letters and digits.
59 |
</p> |
60 |
|
61 |
<p> |
62 |
The eighth bit in ASCII was originally used as a parity bit for error checking. |
63 |
If this is not desired, it is left as 0. This means that, with ASCII, each |
64 |
character is represented by a single byte. |
65 |
</p> |
66 |
|
67 |
<p> |
Although ASCII was enough for communication in modern English, things were not
so easy in other European languages that include accented characters. The
ISO 8859 standards were developed to meet these needs. They were backwards
compatible with ASCII, but instead of leaving the eighth bit blank, they used
it to allow another 128 code positions in each encoding. ISO 8859's limitations
soon came to light, and there are currently 15 variants of the ISO 8859
standard (8859-1 through to 8859-16; part 12 was never published). Outside of
the ASCII-compatible byte range of these character sets, the same byte often
represents different letters in different variants. To complicate
interoperability between character encodings further, some versions of
Microsoft Windows use Windows-1252 for Western European languages instead. It
is a superset of ISO 8859-1 that assigns printable characters to most of the
byte range 0x80 to 0x9F, which ISO 8859-1 reserves for control codes. All of
these sets do retain ASCII compatibility, however.
81 |
</p> |
82 |
|
83 |
<p> |
The necessary development of completely different multibyte encodings for
non-Latin scripts, such as EUC (Extended Unix Coding), which is used for
Japanese and Korean (and to a lesser extent Chinese), created more confusion,
while other operating systems still used different character sets for the same
languages, for example Shift-JIS and ISO-2022-JP. Users wishing to view
Cyrillic glyphs had to choose between KOI8-R for Russian and Bulgarian or
KOI8-U for Ukrainian, as well as all the other Cyrillic encodings such as the
unsuccessful ISO 8859-5 and the common Windows-1251 set. Several of these
character sets broke compatibility with ASCII to some degree (although KOI8
encodings place Cyrillic characters in Latin order, so if the eighth bit is
stripped, text is still decipherable on an ASCII terminal through case-reversed
transliteration).
96 |
</p> |
97 |
|
98 |
<p> |
99 |
This has led to confusion, and also to an almost total inability for |
100 |
multilingual communication, especially across different alphabets. Enter |
101 |
Unicode. |
102 |
</p> |
103 |
|
104 |
</body> |
105 |
</section> |
106 |
<section> |
107 |
<title>What is Unicode?</title> |
108 |
<body> |
109 |
|
110 |
<p> |
Unicode throws away the traditional single-byte limit of character sets. Even
two bytes per character allow a maximum of only 65,536 characters. Although
this number is extremely high when compared to seven-bit and eight-bit
encodings, it is still not enough for a character set that is also designed to
cover scripts used only by scholars and symbols used only in mathematics and
other specialised fields. Unicode therefore defines code points up to U+10FFFF,
well beyond what two bytes can hold.
117 |
</p> |
118 |
|
119 |
<p> |
120 |
Unicode has been mapped in many different ways, but the two most common are |
121 |
<b>UTF</b> (Unicode Transformation Format) and <b>UCS</b> (Universal Character |
122 |
Set). A number after UTF indicates the number of bits in one unit, while the |
123 |
number after UCS indicates the number of bytes. UTF-8 has become the most |
124 |
widespread means for the interchange of Unicode text as a result of its |
125 |
eight-bit clean nature, and it is the subject of this document. |
126 |
</p> |
127 |
|
128 |
</body> |
129 |
</section> |
130 |
<section> |
131 |
<title>UTF-8</title> |
132 |
<body> |
133 |
|
134 |
<p> |
UTF-8 is a variable-length character encoding, which in this instance means
that it uses 1 to 4 bytes per symbol. Single-byte UTF-8 sequences encode
ASCII, giving the character set full backwards compatibility with ASCII. Text
made up of ASCII and Latin characters can therefore be stored as UTF-8 with
little increase in the size of the data, because most of its characters need
only a single byte. Users of East Asian scripts such as Japanese, whose
characters have been assigned a higher range, are less happy: their characters
need three bytes each in UTF-8 instead of the two bytes of older encodings,
which results in as much as 50% redundancy in their data.
143 |
</p> |
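
<p>
As a rough illustration of these lengths, the byte count of individual
characters can be checked from a shell. This is only a sketch: the sample
characters are our own choice, and they are written as octal byte escapes so
that the commands do not depend on the encoding of your terminal.
</p>

<pre caption="Counting the bytes of UTF-8 encoded characters (illustrative)">
<comment>(ASCII "a": one byte)</comment>
$ <i>printf 'a' | wc -c</i>
1
<comment>(Latin "é", U+00E9: two bytes)</comment>
$ <i>printf '\303\251' | wc -c</i>
2
<comment>(Japanese "あ", U+3042: three bytes)</comment>
$ <i>printf '\343\201\202' | wc -c</i>
3
</pre>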
144 |
|
145 |
</body> |
146 |
</section> |
147 |
<section> |
148 |
<title>What UTF-8 Can Do for You</title> |
149 |
<body> |
150 |
|
151 |
<p> |
UTF-8 allows you to work in a standards-compliant and internationally accepted
multilingual environment, with a comparatively low data redundancy. UTF-8 is
the preferred way of transmitting non-ASCII characters over the Internet,
through email, IRC or almost any other medium. Despite this, many people regard
UTF-8 in online communication as abusive. It is always best to be aware of the
attitude towards UTF-8 in a specific channel, mailing list or Usenet group
before using <e>non-ASCII</e> UTF-8.
159 |
</p> |
160 |
|
161 |
</body> |
162 |
</section> |
163 |
</chapter> |
164 |
|
165 |
<chapter> |
166 |
<title>Setting up UTF-8 with Gentoo Linux</title> |
167 |
<section> |
168 |
<title>Finding or Creating UTF-8 Locales</title> |
169 |
<body> |
170 |
|
171 |
<p> |
172 |
Now that you understand the principles behind Unicode, you're ready to start |
173 |
using UTF-8 with your system. |
174 |
</p> |
175 |
|
176 |
<p> |
The preliminary requirement for UTF-8 is to have a version of glibc installed
that has national language support. The recommended means to do this is the
<path>/etc/locales.build</path> file in combination with the <c>userlocales</c>
USE flag. Explaining the usage of this file is beyond the scope of this
document, but luckily it is well documented in the comments within it. It is
also explained in the <uri
link="/doc/en/guide-localization.xml#doc_chap3_sect3">Gentoo Localisation
Guide</uri>.
185 |
</p> |
186 |
|
187 |
<p> |
188 |
Next, we'll need to decide whether a UTF-8 locale is already available for our |
189 |
language, or whether we need to create one. |
190 |
</p> |
191 |
|
192 |
<pre caption="Checking for an existing UTF-8 locale"> |
193 |
<comment>(Replace "en_GB" with your desired locale setting)</comment> |
194 |
# <i>locale -a | grep 'en_GB'</i> |
195 |
en_GB |
196 |
en_GB.utf8 |
197 |
</pre> |
198 |
|
199 |
<p> |
200 |
From the output of this command line, we need to take the result with a suffix |
201 |
similar to <c>.utf8</c>. If there is no result with a suffix similar to |
202 |
<c>.utf8</c>, we need to create a UTF-8 compatible locale. |
203 |
</p> |
204 |
|
205 |
<note> |
206 |
Only execute the following code listing if you do not have a UTF-8 locale |
207 |
available for your language. |
208 |
</note> |
209 |
|
210 |
<pre caption="Creating a UTF-8 locale"> |
211 |
<comment>(Replace "en_GB" with your desired locale setting)</comment> |
212 |
# <i>localedef -i en_GB -f UTF-8 en_GB.utf8</i> |
213 |
</pre> |
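
<p>
Whether the locale is usable can be checked by asking for its character map.
This is a small sketch that assumes the en_GB example from above; if the
locale is missing, glibc typically falls back to the C locale and reports
<c>ANSI_X3.4-1968</c> instead.
</p>

<pre caption="Verifying a UTF-8 locale (illustrative)">
<comment>(Replace "en_GB.utf8" with the locale you found or created)</comment>
$ <i>LC_ALL=en_GB.utf8 locale charmap</i>
UTF-8
</pre>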
214 |
|
215 |
</body> |
216 |
</section> |
217 |
<section> |
218 |
<title>Setting the Locale</title> |
219 |
<body> |
220 |
|
221 |
<p> |
222 |
There are two environment variables that need to be set in order to use |
223 |
our new UTF-8 locales: <c>LANG</c> and <c>LC_ALL</c>. There are also |
224 |
many different ways to set them; some people prefer to only have a UTF-8 |
225 |
environment for a specific user, in which case they set them in their |
226 |
<path>~/.profile</path> or <path>~/.bashrc</path>. Others prefer to set the |
227 |
locale globally. One specific circumstance where the author particularly |
228 |
recommends doing this is when <path>/etc/init.d/xdm</path> is in use, because |
229 |
this init script starts the display manager and desktop before any of the |
230 |
aforementioned shell startup files are sourced, and so before any of the |
231 |
variables are in the environment. |
232 |
</p> |
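
<p>
For the per-user case, a minimal sketch of the lines to add might look like
this. It assumes a Bourne-compatible shell such as bash and uses "en_GB.UTF-8"
purely as an example value.
</p>

<pre caption="Example ~/.bashrc or ~/.profile lines (illustrative)">
<comment>(Change "en_GB.UTF-8" to your locale)</comment>
export LANG="en_GB.UTF-8"
export LC_ALL="en_GB.UTF-8"
</pre>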
233 |
|
234 |
<p> |
Setting the locale globally should be done using
<path>/etc/env.d/02locale</path>. The file should look something like the
following:
238 |
</p> |
239 |
|
<pre caption="Demonstration /etc/env.d/02locale">
<comment>(As always, change "en_GB.UTF-8" to your locale)</comment>
LANG="en_GB.UTF-8"
LC_ALL="en_GB.UTF-8"
</pre>
245 |
|
246 |
<p> |
247 |
Next, the environment must be updated with the change. |
248 |
</p> |
249 |
|
250 |
<pre caption="Updating the environment"> |
251 |
# <i>env-update</i> |
252 |
>>> Regenerating /etc/ld.so.cache... |
253 |
* Caching service dependencies ... |
254 |
# <i>source /etc/profile</i> |
255 |
</pre> |
256 |
|
257 |
<p> |
258 |
Now, run <c>locale</c> with no arguments to see if we have the correct |
259 |
variables in our environment: |
260 |
</p> |
261 |
|
262 |
<pre caption="Checking if our new locale is in the environment"> |
263 |
# <i>locale</i> |
264 |
LANG=en_GB.UTF-8 |
265 |
LC_CTYPE="en_GB.UTF-8" |
266 |
LC_NUMERIC="en_GB.UTF-8" |
267 |
LC_TIME="en_GB.UTF-8" |
268 |
LC_COLLATE="en_GB.UTF-8" |
269 |
LC_MONETARY="en_GB.UTF-8" |
270 |
LC_MESSAGES="en_GB.UTF-8" |
271 |
LC_PAPER="en_GB.UTF-8" |
272 |
LC_NAME="en_GB.UTF-8" |
273 |
LC_ADDRESS="en_GB.UTF-8" |
274 |
LC_TELEPHONE="en_GB.UTF-8" |
275 |
LC_MEASUREMENT="en_GB.UTF-8" |
276 |
LC_IDENTIFICATION="en_GB.UTF-8" |
277 |
LC_ALL=en_GB.UTF-8 |
278 |
</pre> |
279 |
|
280 |
<p> |
281 |
That is all. You are now using UTF-8 locales, and the next hurdle is the |
282 |
configuration of the applications you use from day to day. |
283 |
</p> |
284 |
|
285 |
</body> |
286 |
</section> |
287 |
</chapter> |
288 |
|
289 |
<chapter> |
290 |
<title>Application Support</title> |
291 |
<section> |
292 |
<body> |
293 |
|
294 |
<p> |
295 |
When Unicode first started gaining momentum in the software world, multibyte |
296 |
character sets were not well suited to languages like C, in which many of the |
297 |
day-to-day programs people use are written. Even today, some programs are not |
298 |
able to handle UTF-8 properly. Fortunately, most are! |
299 |
</p> |
300 |
|
301 |
</body> |
302 |
</section> |
303 |
<section> |
304 |
<title>Filenames, NTFS, and FAT</title> |
305 |
<body> |
306 |
|
307 |
<p> |
308 |
There are several NLS options in the Linux kernel configuration menu, but it is |
309 |
important to not become confused! For the most part, the only thing you need to |
310 |
do is to build UTF-8 NLS support into your kernel, and change the default NLS |
311 |
option to utf8. |
312 |
</p> |
313 |
|
<pre caption="Kernel configuration steps for UTF-8 NLS">
File Systems --&gt;
  Native Language Support --&gt;
    (utf8) Default NLS Option
    &lt;*&gt; NLS UTF8
    <comment>(Also &lt;*&gt; other character sets that are in use in
    your FAT filesystems or Joliet CD-ROMs.)</comment>
</pre>
322 |
|
323 |
<p> |
324 |
If you plan on mounting NTFS partitions, you may need to specify an <c>nls=</c> |
325 |
option with mount. For more information, see <c>man mount</c>. |
326 |
</p> |
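
<p>
As a sketch only, such a mount might look like the following. The device
<path>/dev/hda1</path> and the mount point <path>/mnt/windows</path> are
placeholders, and the options accepted depend on the NTFS driver in your
kernel, so check <c>man mount</c> before relying on it.
</p>

<pre caption="Mounting an NTFS partition with UTF-8 filenames (illustrative)">
# <i>mount -t ntfs -o nls=utf8 /dev/hda1 /mnt/windows</i>
</pre>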
327 |
|
328 |
<p> |
329 |
For changing the encoding of filenames, <c>app-text/convmv</c> can be used. |
330 |
</p> |
331 |
|
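<pre caption="Example usage of convmv">
# <i>emerge --ask app-text/convmv</i>
# <i>convmv -f current-encoding -t utf-8 filename</i>
<comment>(convmv only previews by default; add --notest to actually rename)</comment>
# <i>convmv --notest -f current-encoding -t utf-8 filename</i>
</pre>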
336 |
|
337 |
<p> |
338 |
For changing the <e>contents</e> of files, use the <c>iconv</c> utility, |
339 |
bundled with <c>glibc</c>: |
340 |
</p> |
341 |
|
342 |
<pre caption="Example usage of iconv"> |
343 |
<comment>(substitute iso-8859-1 with the charset you are converting from)</comment> |
344 |
<comment>(Check the output is sane)</comment> |
345 |
# <i>iconv -f iso-8859-1 -t utf-8 filename</i> |
346 |
<comment>(Convert a file, you must create another file)</comment> |
347 |
# <i>iconv -f iso-8859-1 -t utf-8 filename > newfile</i> |
348 |
</pre> |
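
<p>
To see what a conversion does at the byte level, a short string can be piped
through <c>iconv</c>. In this sketch the sample word "café" is our own; its
"é" is written as the single ISO 8859-1 byte <c>\351</c>, which becomes two
bytes in UTF-8.
</p>

<pre caption="Watching iconv change the byte count (illustrative)">
<comment>(Five bytes in ISO 8859-1, including the newline)</comment>
$ <i>printf 'caf\351\n' | wc -c</i>
5
<comment>(Six bytes after conversion to UTF-8)</comment>
$ <i>printf 'caf\351\n' | iconv -f iso-8859-1 -t utf-8 | wc -c</i>
6
</pre>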
349 |
|
350 |
<p> |
351 |
<c>app-text/recode</c> can also be used for this purpose. |
352 |
</p> |
353 |
|
354 |
</body> |
355 |
</section> |
356 |
<section> |
357 |
<title>The System Console</title> |
358 |
<body> |
359 |
|
360 |
<impo> |
361 |
You need >=sys-apps/baselayout-1.11.9 for Unicode on the console. |
362 |
</impo> |
363 |
|
364 |
<p> |
365 |
To enable UTF-8 on the console, you should edit <path>/etc/rc.conf</path> and |
366 |
set <c>UNICODE="yes"</c>, and also read the comments in that file -- it is |
367 |
important to have a font that has a good range of characters if you plan on |
368 |
making the most of Unicode. |
369 |
</p> |
370 |
|
371 |
<p> |
372 |
The <c>KEYMAP</c> variable, set in <path>/etc/conf.d/keymaps</path>, should |
373 |
have a Unicode keymap specified. To do this, simply prepend the keymap already |
374 |
specified there with -u. |
375 |
</p> |
376 |
|
<pre caption="Example /etc/conf.d/keymaps snippet">
<comment>(Change "uk" to your local layout)</comment>
KEYMAP="-u uk"
</pre>
381 |
|
382 |
</body> |
383 |
</section> |
384 |
<section> |
385 |
<title>Ncurses and Slang</title> |
386 |
<body> |
387 |
|
388 |
<note> |
389 |
Ignore any mention of Slang in this section if you do not have it installed or |
390 |
do not use it. |
391 |
</note> |
392 |
|
393 |
<p> |
394 |
It is wise to add <c>unicode</c> to your global USE flags in |
395 |
<path>/etc/make.conf</path>, and then to remerge <c>sys-libs/ncurses</c> and |
396 |
also <c>sys-libs/slang</c> if appropriate: |
397 |
</p> |
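
<p>
The relevant line might look like this. The other flags shown are only
placeholders for whatever your USE variable already contains.
</p>

<pre caption="Example /etc/make.conf snippet (illustrative)">
<comment>(Keep your existing flags and append "unicode")</comment>
USE="X gtk unicode"
</pre>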
398 |
|
399 |
<pre caption="Emerging ncurses and slang"> |
400 |
<comment>(We avoid putting these libraries in our world file with --oneshot)</comment> |
401 |
# <i>emerge --oneshot --verbose --ask sys-libs/ncurses sys-libs/slang</i> |
402 |
</pre> |
403 |
|
404 |
<p> |
We also need to rebuild packages that link to these, now that the USE changes
have been applied.
407 |
</p> |
408 |
|
409 |
<pre caption="Rebuilding of programs that link to ncurses or slang"> |
410 |
# <i>revdep-rebuild --soname libncurses.so.5</i> |
411 |
# <i>revdep-rebuild --soname libslang.so.1</i> |
412 |
</pre> |
413 |
|
414 |
</body> |
415 |
</section> |
416 |
<section> |
417 |
<title>KDE, GNOME and Xfce</title> |
418 |
<body> |
419 |
|
420 |
<p> |
All of the major desktop environments have full Unicode support, and will
require no further setup than what has already been covered in this guide. This
is because the underlying graphical toolkits (Qt or GTK+2) are UTF-8 aware.
Consequently, all applications running on top of these toolkits should be
UTF-8 aware out of the box.
426 |
</p> |
427 |
|
428 |
<p> |
The exceptions to this rule come in Xlib and GTK+1. GTK+1 requires an
iso10646-1 FontSpec in <path>~/.gtkrc</path>, for example
<c>-misc-fixed-*-*-*-*-*-*-*-*-*-*-iso10646-1</c>. Also, applications using
Xlib or Xaw will need to be given a similar FontSpec, otherwise they will not
work.
434 |
</p> |
435 |
|
436 |
<note> |
437 |
If you have a version of the gnome1 control center around, use that instead. |
438 |
Pick any iso10646-1 font from there. |
439 |
</note> |
440 |
|
441 |
<pre caption="Example ~/.gtkrc (for GTK+1) that defines a Unicode compatible font"> |
442 |
style "user-font" |
443 |
{ |
444 |
fontset="-misc-fixed-*-*-*-*-*-*-*-*-*-*-iso10646-1" |
445 |
} |
446 |
widget_class "*" style "user-font" |
447 |
</pre> |
448 |
|
449 |
<p> |
450 |
If an application has support for both a Qt and GTK+2 GUI, the GTK+2 GUI will |
451 |
generally give better results with Unicode. |
452 |
</p> |
453 |
|
454 |
</body> |
455 |
</section> |
456 |
<section> |
457 |
<title>X11 and Fonts</title> |
458 |
<body> |
459 |
|
460 |
<p> |
461 |
TrueType fonts have support for Unicode, and most of the fonts that ship with |
462 |
Xorg have impressive character support, although, obviously, not every single |
463 |
glyph available in Unicode has been created for that font. To build fonts |
464 |
(including the Bitstream Vera set) with support for East Asian letters with X, |
465 |
make sure you have the <c>cjk</c> USE flag set. Many other applications utilise |
466 |
this flag, so it may be worthwhile to add it as a permanent USE flag. |
467 |
</p> |
468 |
|
469 |
<p> |
470 |
Also, several font packages in Portage are Unicode aware. |
471 |
</p> |
472 |
|
473 |
<pre caption="Optional: Install some more Unicode-aware fonts"> |
474 |
# <i>emerge terminus-font intlfonts freefonts cronyx-fonts corefonts</i> |
475 |
</pre> |
476 |
|
477 |
</body> |
478 |
</section> |
479 |
<section> |
480 |
<title>Window Managers and Terminal Emulators</title> |
481 |
<body> |
482 |
|
483 |
<p> |
484 |
Window managers, even those not built on GTK or Qt, generally have very |
485 |
good Unicode support, as they often use the Xft library for handling |
486 |
fonts. If your window manager does not use Xft for fonts, you can still |
487 |
use the FontSpec mentioned in the previous section as a Unicode font. |
488 |
</p> |
489 |
|
490 |
<p> |
491 |
Terminal emulators that use Xft and support Unicode are harder to come by. |
492 |
Aside from Konsole and gnome-terminal, the best options in Portage are |
493 |
<c>x11-terms/rxvt-unicode</c>, <c>xfce-extra/terminal</c>, |
494 |
<c>gnustep-apps/terminal</c>, <c>x11-terms/mlterm</c>, <c>x11-terms/mrxvt</c> or |
495 |
plain <c>x11-terms/xterm</c> when built with the <c>unicode</c> USE flag and |
496 |
invoked as <c>uxterm</c>. <c>app-misc/screen</c> supports UTF-8 too, when |
497 |
invoked as <c>screen -U</c> or the following is put into the |
498 |
<path>~/.screenrc</path>: |
499 |
</p> |
500 |
|
501 |
<pre caption="~/.screenrc for UTF-8"> |
502 |
defutf8 on |
503 |
</pre> |
504 |
|
505 |
</body> |
506 |
</section> |
507 |
<section> |
508 |
<title>Vim, Emacs, Xemacs and Nano</title> |
509 |
<body> |
510 |
|
511 |
<p> |
512 |
Vim, Emacs and Xemacs provide full UTF-8 support, and also have builtin |
513 |
detection of UTF-8 files. For further information in Vim, use <c>:help |
514 |
mbyte.txt</c>. |
515 |
</p> |
516 |
|
517 |
<p> |
Nano does not yet provide support for UTF-8 in a released version, although it
has been planned for a long time. At the time of writing, UTF-8 support is in
Nano's CVS, and should be included in the next release.
521 |
</p> |
522 |
|
523 |
</body> |
524 |
</section> |
525 |
<section> |
526 |
<title>Shells</title> |
527 |
<body> |
528 |
|
529 |
<p> |
530 |
Currently, <c>bash</c> provides full Unicode support through the GNU readline |
531 |
library. Z Shell users are in a somewhat worse position -- no parts of the |
532 |
shell have Unicode support, although there is a concerted effort to add |
533 |
multibyte character set support underway at the moment. |
534 |
</p> |
535 |
|
536 |
<p> |
537 |
The C shell, <c>tcsh</c> and <c>ksh</c> do not provide UTF-8 support at all. |
538 |
</p> |
539 |
|
540 |
<note> |
Although not strictly related to shells, many of the GNU text-processing
programs in your system (<c>tr</c>, <c>grep</c>, etc.) can be much slower
when processing Unicode. In nearly every case the difference is not
noticeable, but if you are ever hit by the slowdown then at least you will
know what is causing it. Perl also tends to be slower when operating on
multibyte characters. The author knows of one other gotcha: <c>tr</c> will
not convert three-byte UTF-8 characters to two-byte UTF-8 characters.
549 |
</note> |
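
<p>
If such a slowdown does become a problem and the pattern contains only ASCII
characters, one common workaround is to run the command in the byte-oriented
C locale. This is a sketch rather than a recommendation from the author of
this guide; <c>pattern</c> and <c>logfile</c> are placeholders. Bear in mind
that in the C locale, non-ASCII characters are treated as individual bytes.
</p>

<pre caption="Running grep in the C locale (illustrative)">
$ <i>LC_ALL=C grep 'pattern' logfile</i>
</pre>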
550 |
|
551 |
</body> |
552 |
</section> |
553 |
<section> |
554 |
<title>Irssi</title> |
555 |
<body> |
556 |
|
557 |
<p> |
558 |
Since 0.8.10, Irssi has complete UTF-8 support, although it does require a user |
559 |
to set an option. |
560 |
</p> |
561 |
|
562 |
<pre caption="Enabling UTF-8 in Irssi"> |
563 |
/set term_charset UTF-8 |
564 |
</pre> |
565 |
|
566 |
<p> |
567 |
For channels where non-ASCII characters are often exchanged in non-UTF-8 |
568 |
charsets, the <c>/recode</c> command may be used to convert the characters. |
569 |
Type <c>/help recode</c> for more information. |
570 |
</p> |
571 |
|
572 |
</body> |
573 |
</section> |
574 |
<section> |
575 |
<title>Mutt</title> |
576 |
<body> |
577 |
|
578 |
<p> |
579 |
The Mutt mail user agent has very good Unicode support. To use UTF-8 with Mutt, |
580 |
put the following in your <path>~/.muttrc</path>: |
581 |
</p> |
582 |
|
583 |
<pre caption="~/.muttrc for UTF-8"> |
584 |
set send_charset="utf8" <comment>(outgoing character set)</comment> |
585 |
set charset="utf8" <comment>(display character set)</comment> |
586 |
</pre> |
587 |
|
588 |
<note> |
You may still see '?' in mail you read with Mutt. This is a result of people
using Latin (ISO 8859) or another charset for email transmission. It is best to
tell them to use UTF-8 for mail, and point them to the IETF RFC 2277 (see
Resources at the end of this document). Also note that in some lists,
subscribers may not like UTF-8. Be sure that the group or person you are
communicating with does not mind UTF-8.
595 |
</note> |
596 |
|
597 |
<p> |
598 |
Further information is available from the <uri |
599 |
link="http://wiki.mutt.org/index.cgi?MuttFaq/Charset"> Mutt WikiWiki</uri>. |
600 |
</p> |
601 |
|
602 |
</body> |
603 |
</section> |
604 |
<section> |
605 |
<title>Testing it all out</title> |
606 |
<body> |
607 |
|
608 |
<p> |
609 |
There are numerous UTF-8 test websites around. <c>net-www/w3m</c>, |
610 |
<c>net-www/links</c>, <c>net-www/elinks</c>, <c>net-www/lynx</c> and all |
611 |
Mozilla based browsers (including Firefox) support UTF-8. Konqueror and Opera |
612 |
have full UTF-8 support too. |
613 |
</p> |
614 |
|
615 |
<p> |
616 |
When using one of the text-only web browsers, make absolutely sure you are |
617 |
using a Unicode-aware terminal. |
618 |
</p> |
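
<p>
A quick way to check the terminal is to print a few non-ASCII characters by
their UTF-8 bytes. The characters chosen here are only examples; a
Unicode-aware terminal should display them as shown, while anything else will
typically show two or three garbled characters for each.
</p>

<pre caption="Printing UTF-8 test characters in a terminal (illustrative)">
$ <i>printf '\303\244 \303\245 \342\202\254\n'</i>
ä å €
</pre>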
619 |
|
620 |
<p> |
If you see certain characters displayed as boxes with letters or numbers
inside, this means that your font does not have a glyph for a character in
the UTF-8 text. Instead, it displays a box containing the hexadecimal code of
that character.
625 |
</p> |
626 |
|
627 |
<ul> |
628 |
<li> |
629 |
<uri link="http://www.w3.org/2001/06/utf-8-test/UTF-8-demo.html">A W3C |
630 |
UTF-8 Test Page</uri> |
631 |
</li> |
632 |
<li> |
633 |
<uri link="http://titus.uni-frankfurt.de/indexe.htm?/unicode/unitest.htm"> |
634 |
A UTF-8 test page provided by the University of Frankfurt</uri> |
635 |
</li> |
636 |
</ul> |
637 |
|
638 |
</body> |
639 |
</section> |
640 |
<section> |
641 |
<title>Input Methods</title> |
642 |
<body> |
643 |
|
644 |
<p> |
645 |
<e>Dead keys</e> may be used to input characters in X that are not included on |
646 |
your keyboard. These work by pressing your right Alt key (or in some countries, |
647 |
AltGr) and an optional key from the non-alphabetical section of the keyboard to |
648 |
the left of the return key at once, releasing them, and then pressing a letter. |
649 |
The dead key should modify it. Input can be further modified by using the Shift |
650 |
key at the same time as pressing the AltGr and modifier. |
651 |
</p> |
652 |
|
653 |
<p> |
To enable dead keys in X, you need a layout that supports them. Most European
layouts already have dead keys with the default variant. However, this is not
true of North American layouts. Although there is a degree of inconsistency
between layouts, the easiest solution seems to be to use a layout in the form
"en_US" rather than "us", for example. The layout is set in
<path>/etc/X11/xorg.conf</path> like so:
660 |
</p> |
661 |
|
662 |
<pre caption="/etc/X11/xorg.conf snippet"> |
663 |
Section "InputDevice" |
664 |
Identifier "Keyboard0" |
665 |
Driver "kbd" |
666 |
Option "XkbLayout" "en_US" <comment># Rather than just "us"</comment> |
667 |
<comment>(Other Xkb options here)</comment> |
668 |
EndSection |
669 |
</pre> |
670 |
|
671 |
<note> |
672 |
The preceding change only needs to be applied if you are using a North American |
673 |
layout, or another layout where dead keys do not seem to be working. European |
674 |
users should have working dead keys as is. |
675 |
</note> |
676 |
|
677 |
<p> |
678 |
This change will come into effect when the X server is restarted. To apply the |
679 |
change now, use the <c>setxkbmap</c> tool, for example, <c>setxkbmap en_US</c>. |
680 |
</p> |
681 |
|
682 |
<p> |
683 |
It is probably easiest to describe dead keys with examples. Although the |
684 |
results are layout dependent, the concepts should remain the same regardless of |
685 |
locale. The examples contain UTF-8, so to view them you need to either tell |
686 |
your browser to view the page as UTF-8, or have a UTF-8 locale already |
687 |
configured. |
688 |
</p> |
689 |
|
690 |
<p> |
When I press AltGr and [ at once, release them, and then press a, 'ä' is
produced. When I press AltGr and [ at once, release them, and then press e,
'ë' is produced. When I press AltGr and ; at once, release them, and then
press a, 'á' is produced, and when I press AltGr and ; at once, release them,
and then press e, 'é' is produced.
696 |
</p> |
697 |
|
698 |
<p> |
By pressing AltGr, Shift and [ at once, releasing them, and then pressing a, a
Scandinavian 'å' is produced. Similarly, when I press AltGr, Shift and [ at
once, release <e>only</e> the [, and then press it again, '˚' is produced.
Although it looks like one, this (U+02DA) is not the same as a degree symbol
(U+00B0). This works for other accents produced by dead keys: pressing AltGr
and [, releasing only the [, then pressing it again makes '¨'.
705 |
</p> |
706 |
|
707 |
<p> |
AltGr can be used with alphabetical keys alone. For example, when AltGr and m
are pressed, a Greek lower-case letter mu is produced: 'µ'. AltGr and s
produce a scharfes s (sharp s): 'ß'. As many European users would expect
(because it is marked on their keyboard), AltGr and 4 produces a Euro sign,
'€'.
712 |
</p> |
713 |
|
714 |
</body> |
715 |
</section> |
716 |
<section> |
717 |
<title>Resources</title> |
718 |
<body> |
719 |
|
720 |
<ul> |
721 |
<li> |
722 |
<uri link="http://www.wikipedia.com/wiki/Unicode">The Wikipedia entry for |
723 |
Unicode</uri> |
724 |
</li> |
725 |
<li> |
726 |
<uri link="http://www.wikipedia.com/wiki/UTF-8">The Wikipedia entry for |
727 |
UTF-8</uri> |
728 |
</li> |
729 |
<li><uri link="http://www.unicode.org">Unicode.org</uri></li> |
730 |
<li><uri link="http://www.utf-8.com">UTF-8.com</uri></li> |
731 |
<li><uri link="http://www.ietf.org/rfc/rfc3629.txt">RFC 3629</uri></li> |
732 |
<li><uri link="http://www.ietf.org/rfc/rfc2277.txt">RFC 2277</uri></li> |
733 |
</ul> |
734 |
|
735 |
</body> |
736 |
</section> |
737 |
</chapter> |
738 |
</guide> |