Diff of /xml/htdocs/doc/en/utf-8.xml


Revision 1.16 Revision 1.41
1<?xml version='1.0' encoding="UTF-8"?> 1<?xml version='1.0' encoding="UTF-8"?>
2<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/utf-8.xml,v 1.16 2005/06/02 18:36:28 swift Exp $ --> 2<!-- $Header: /var/cvsroot/gentoo/xml/htdocs/doc/en/utf-8.xml,v 1.41 2006/07/15 17:22:49 fox2mike Exp $ -->
3<!DOCTYPE guide SYSTEM "/dtd/guide.dtd"> 3<!DOCTYPE guide SYSTEM "/dtd/guide.dtd">
4 4
5<guide link="/doc/en/utf-8.xml"> 5<guide link="/doc/en/utf-8.xml">
6<title>Using UTF-8 with Gentoo</title> 6<title>Using UTF-8 with Gentoo</title>
7 7
9 <mail link="slarti@gentoo.org">Thomas Martin</mail> 9 <mail link="slarti@gentoo.org">Thomas Martin</mail>
10</author> 10</author>
11<author title="Contributor"> 11<author title="Contributor">
12 <mail link="devil@gentoo.org.ua">Alexander Simonov</mail> 12 <mail link="devil@gentoo.org.ua">Alexander Simonov</mail>
13</author> 13</author>
14<author title="Editor">
15 <mail link="fox2mike@gentoo.org">Shyam Mani</mail>
16</author>
14 17
15<abstract> 18<abstract>
16This guide shows you how to set up and use the UTF-8 Unicode character set with 19This guide shows you how to set up and use the UTF-8 Unicode character set with
17your Gentoo Linux system, after explaining the benefits of Unicode and more 20your Gentoo Linux system, after explaining the benefits of Unicode and more
18specifically UTF-8. 21specifically UTF-8.
19</abstract> 22</abstract>
20 23
24<!-- The content of this document is licensed under the CC-BY-SA license -->
25<!-- See http://creativecommons.org/licenses/by-sa/2.5 -->
21<license /> 26<license />
22 27
23<version>2.1</version> 28<version>2.20</version>
24<date>2005-06-02</date> 29<date>2006-07-15</date>
25 30
26<chapter> 31<chapter>
27<title>Character Encodings</title> 32<title>Character Encodings</title>
28<section> 33<section>
29<title>What is a Character Encoding?</title> 34<title>What is a Character Encoding?</title>
226<section> 231<section>
227<title>Setting the Locale</title> 232<title>Setting the Locale</title>
228<body> 233<body>
229 234
230<p> 235<p>
231There is one environment variables that needs to be set in order to use 236There is one environment variable that needs to be set in order to use
232our new UTF-8 locales: <c>LC_ALL</c> (this variable overrides the <c>LANG</c> setting as well). There are also 237our new UTF-8 locales: <c>LC_ALL</c> (this variable overrides the <c>LANG</c>
233many different ways to set it; some people prefer to only have a UTF-8 238setting as well). There are also many different ways to set it; some people
234environment for a specific user, in which case they set them in their 239prefer to only have a UTF-8 environment for a specific user, in which case
240they set it in their <path>~/.profile</path> (if you use <c>/bin/sh</c>),
235<path>~/.profile</path> or <path>~/.bashrc</path>. Others prefer to set the 241<path>~/.bash_profile</path> or <path>~/.bashrc</path> (if you use
236locale globally. One specific circumstance where the author particularly 242<c>/bin/bash</c>).
243</p>
244
245<p>
246Others prefer to set the locale globally. One specific circumstance where
247the author particularly recommends doing this is when
237recommends doing this is when <path>/etc/init.d/xdm</path> is in use, because 248<path>/etc/init.d/xdm</path> is in use, because
238this init script starts the display manager and desktop before any of the 249this init script starts the display manager and desktop before any of the
239aforementioned shell startup files are sourced, and so before any of the 250aforementioned shell startup files are sourced, and so before any of the
240variables are in the environment. 251variables are in the environment.
241</p> 252</p>
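As a rough illustration of the precedence described above, the following Python sketch resolves which locale setting takes effect for character encoding and checks whether it names a UTF-8 codeset. The helper names `effective_ctype_locale` and `is_utf8_locale` are hypothetical, and the lookup order (LC_ALL, then LC_CTYPE, then LANG) is the usual POSIX behaviour rather than anything specific to this guide.

```python
import os


def effective_ctype_locale(env=None):
    """Return the locale name that governs character encoding.

    LC_ALL overrides LC_CTYPE, which in turn overrides LANG.
    Empty values are treated as unset.
    """
    if env is None:
        env = os.environ
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"  # POSIX default when nothing is set


def is_utf8_locale(locale_name):
    """Check whether a name such as 'en_GB.UTF-8' specifies a UTF-8 codeset."""
    base = locale_name.split("@", 1)[0]  # drop a modifier like '@euro'
    if "." not in base:
        return False  # no explicit codeset given
    codeset = base.split(".", 1)[1]
    return codeset.lower().replace("-", "") == "utf8"


if __name__ == "__main__":
    current = effective_ctype_locale()
    print(current, "->", "UTF-8" if is_utf8_locale(current) else "not UTF-8")
```

Run from a login shell and again from a terminal started under the display manager, a check like this may help confirm whether both environments really see the same setting.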
242 253
328 your FAT filesystems or Joliet CD-ROMs.)</comment> 339 your FAT filesystems or Joliet CD-ROMs.)</comment>
329</pre> 340</pre>
330 341
331<p> 342<p>
332If you plan on mounting NTFS partitions, you may need to specify an <c>nls=</c> 343If you plan on mounting NTFS partitions, you may need to specify an <c>nls=</c>
333option with mount. For more information, see <c>man mount</c>. 344option with mount. If you plan on mounting FAT partitions, you may need to
345specify a <c>codepage=</c> option with mount. Optionally, you can also set a
346default codepage for FAT in the kernel configuration. Note that the
347<c>codepage</c> option with mount will override the kernel settings.
348</p>
349
350<pre caption="FAT settings in kernel configuration">
351File Systems --&gt;
352 DOS/FAT/NT Filesystems --&gt;
353 (437) Default codepage for fat
354</pre>
355
356<p>
357You should avoid setting <c>Default iocharset for fat</c> to UTF-8, as it is
358not recommended. Instead, you may want to pass the option utf8=true when
359mounting your FAT partitions. For further information, see <c>man mount</c> and
360the kernel documentation at
361<path>/usr/src/linux/Documentation/filesystems/vfat.txt</path>.
334</p> 362</p>
335 363
336<p> 364<p>
337For changing the encoding of filenames, <c>app-text/convmv</c> can be used. 365For changing the encoding of filenames, <c>app-text/convmv</c> can be used.
338</p> 366</p>
339 367
340<pre caption="Example usage of convmv"> 368<pre caption="Example usage of convmv">
341# <i>emerge --ask app-text/convmv</i> 369# <i>emerge --ask app-text/convmv</i>
370<comment>(Command format)</comment>
342# <i>convmv -f current-encoding -t utf-8 filename</i> 371# <i>convmv -f &lt;current-encoding&gt; -t utf-8 &lt;filename&gt;</i>
372<comment>(Substitute iso-8859-1 with the charset you are converting
373from)</comment>
374# <i>convmv -f iso-8859-1 -t utf-8 filename</i>
343</pre> 375</pre>
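Before running convmv it can help to know which names actually need converting. The Python sketch below lists directory entries whose raw byte names are not valid UTF-8. The function names are illustrative only and are not part of convmv. Pure ASCII names always pass, since ASCII is a subset of UTF-8.

```python
import os


def non_utf8_names(names):
    """Return the byte-string names that do not decode as UTF-8."""
    bad = []
    for name in names:
        try:
            name.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(name)
    return bad


def scan_directory(path):
    """List entries of `path` whose names are not valid UTF-8.

    Passing a bytes path makes os.listdir() return raw bytes names.
    """
    return non_utf8_names(os.listdir(os.fsencode(path)))


if __name__ == "__main__":
    for name in scan_directory("."):
        print(repr(name))
```

This only reports candidates; it does not guess the original encoding, which still has to be supplied to convmv with `-f`.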
344 376
345<p> 377<p>
346For changing the <e>contents</e> of files, use the <c>iconv</c> utility, 378For changing the <e>contents</e> of files, use the <c>iconv</c> utility,
347bundled with <c>glibc</c>: 379bundled with <c>glibc</c>:
376making the most of Unicode. 408making the most of Unicode.
377</p> 409</p>
378 410
379<p> 411<p>
380The <c>KEYMAP</c> variable, set in <path>/etc/conf.d/keymaps</path>, should 412The <c>KEYMAP</c> variable, set in <path>/etc/conf.d/keymaps</path>, should
381have a Unicode keymap specified. To do this, simply prepend the keymap already 413have a Unicode keymap specified.
382specified there with -u.
383</p> 414</p>
384 415
385<pre caption="Example /etc/conf.d/keymaps snippet"> 416<pre caption="Example /etc/conf.d/keymaps snippet">
386<comment>(Change "uk" to your local layout)</comment> 417<comment>(Change "uk" to your local layout)</comment>
387KEYMAP="-u uk" 418KEYMAP="uk"
388</pre> 419</pre>
389 420
390</body> 421</body>
391</section> 422</section>
392<section> 423<section>
399</note> 430</note>
400 431
401<p> 432<p>
402It is wise to add <c>unicode</c> to your global USE flags in 433It is wise to add <c>unicode</c> to your global USE flags in
403<path>/etc/make.conf</path>, and then to remerge <c>sys-libs/ncurses</c> and 434<path>/etc/make.conf</path>, and then to remerge <c>sys-libs/ncurses</c> and
404also <c>sys-libs/slang</c> if appropriate: 435<c>sys-libs/slang</c> if appropriate. Portage will do this automatically when
436you update your system:
405</p> 437</p>
406 438
407<pre caption="Emerging ncurses and slang"> 439<pre caption="Updating your system">
408<comment>(We avoid putting these libraries in our world file with --oneshot)</comment> 440# <i>emerge --update --deep --newuse world</i>
409# <i>emerge --oneshot --verbose --ask sys-libs/ncurses sys-libs/slang</i>
410</pre> 441</pre>
411 442
412<p> 443<p>
413We also need to rebuild packages that link to these, now the USE changes have 444We also need to rebuild packages that link to these, now the USE changes have
414been applied. The tool we use (<c>revdep-rebuild</c>) is part of the 445been applied. The tool we use (<c>revdep-rebuild</c>) is part of the
503 534
504<p> 535<p>
505Terminal emulators that use Xft and support Unicode are harder to come by. 536Terminal emulators that use Xft and support Unicode are harder to come by.
506Aside from Konsole and gnome-terminal, the best options in Portage are 537Aside from Konsole and gnome-terminal, the best options in Portage are
507<c>x11-terms/rxvt-unicode</c>, <c>xfce-extra/terminal</c>, 538<c>x11-terms/rxvt-unicode</c>, <c>xfce-extra/terminal</c>,
508<c>gnustep-apps/terminal</c>, <c>x11-terms/mlterm</c>, <c>x11-terms/mrxvt</c> or 539<c>gnustep-apps/terminal</c>, <c>x11-terms/mlterm</c>, or plain
509plain <c>x11-terms/xterm</c> when built with the <c>unicode</c> USE flag and 540<c>x11-terms/xterm</c> when built with the <c>unicode</c> USE flag and invoked
510invoked as <c>uxterm</c>. <c>app-misc/screen</c> supports UTF-8 too, when 541as <c>uxterm</c>. <c>app-misc/screen</c> supports UTF-8 too, when invoked as
511invoked as <c>screen -u</c> or the following is put into the 542<c>screen -U</c> or the following is put into the <path>~/.screenrc</path>:
512<path>~/.screenrc</path>:
513</p> 543</p>
514 544
515<pre caption="~/.screenrc for UTF-8"> 545<pre caption="~/.screenrc for UTF-8">
516defutf8 on 546defutf8 on
517</pre> 547</pre>
521<section> 551<section>
522<title>Vim, Emacs, Xemacs and Nano</title> 552<title>Vim, Emacs, Xemacs and Nano</title>
523<body> 553<body>
524 554
525<p> 555<p>
526Vim, Emacs and Xemacs provide full UTF-8 support, and also have builtin 556Vim provides full UTF-8 support, and also has builtin detection of UTF-8 files.
527detection of UTF-8 files. For further information in Vim, use <c>:help 557For further information in Vim, use <c>:help mbyte.txt</c>.
528mbyte.txt</c>.
529</p>
530
531<p> 558</p>
532Nano currently does not provide support for UTF-8, although it has been planned 559
533for a long time. With luck, this will change in future. At the time of writing, 560<p>
534UTF-8 support is in Nano's CVS, and should be included in the next release. 561Emacs 22.x and higher has full UTF-8 support as well. Xemacs 22.x does not
562support combining characters yet.
563</p>
564
565<p>
566Lower versions of Emacs and/or Xemacs might require you to install
567<c>app-emacs/mule-ucs</c> and/or <c>app-xemacs/mule-ucs</c>
568and add the following code to your <path>~/.emacs</path> to have support for CJK
569languages in UTF-8:
570</p>
571
572<pre caption="Emacs CJK UTF-8 support">
573(require 'un-define)
574(require 'jisx0213)
575(set-language-environment "Japanese")
576(set-default-coding-systems 'utf-8)
577(set-terminal-coding-system 'utf-8)
578</pre>
579
580<p>
581Nano has provided full UTF-8 support since version 1.3.6.
535</p> 582</p>
536 583
537</body> 584</body>
538</section> 585</section>
539<section> 586<section>
556<section> 603<section>
557<title>Irssi</title> 604<title>Irssi</title>
558<body> 605<body>
559 606
560<p> 607<p>
561Since 0.8.10, Irssi has complete UTF-8 support, although it does require a user 608Irssi has complete UTF-8 support, although it does require a user
562to set an option. 609to set an option.
563</p> 610</p>
564 611
565<pre caption="Enabling UTF-8 in Irssi"> 612<pre caption="Enabling UTF-8 in Irssi">
566/set term_charset UTF-8 613/set term_charset UTF-8
578<title>Mutt</title> 625<title>Mutt</title>
579<body> 626<body>
580 627
581<p> 628<p>
582The Mutt mail user agent has very good Unicode support. To use UTF-8 with Mutt, 629The Mutt mail user agent has very good Unicode support. To use UTF-8 with Mutt,
583put the following in your <path>~/.muttrc</path>: 630you don't need to put anything in your configuration files. Mutt will work
584</p> 631in a Unicode environment without modification if all your configuration files
585 632(signature included) are UTF-8 encoded.
586<pre caption="~/.muttrc for UTF-8">
587set send_charset="utf8" <comment>(outgoing character set)</comment>
588set charset="utf8" <comment>(display character set)</comment>
589</pre> 633</p>
590 634
591<note> 635<note>
592You may still see '?' in mail you read with Mutt. This is a result of people 636You may still see '?' in mail you read with Mutt. This is a result of people
593using a mail client which does not indicate the used charset. You can't do much 637using a mail client which does not indicate the used charset. You can't do much
594about this other than ask them to configure their client correctly. 638about this other than ask them to configure their client correctly.
595</note> 639</note>
596 640
597<p> 641<p>
598Further information is available from the <uri 642Further information is available from the <uri
599link="http://wiki.mutt.org/index.cgi?MuttFaq/Charset"> Mutt WikiWiki</uri>. 643link="http://wiki.mutt.org/index.cgi?MuttFaq/Charset">Mutt Wiki</uri>.
600</p> 644</p>
601 645
602</body> 646</body>
603</section> 647</section>
604<section> 648<section>
635<pre caption="man.conf changes for Unicode support"> 679<pre caption="man.conf changes for Unicode support">
636<comment>(This is the old line)</comment> 680<comment>(This is the old line)</comment>
637NROFF /usr/bin/nroff -Tascii -c -mandoc 681NROFF /usr/bin/nroff -Tascii -c -mandoc
638<comment>(Replace the one above with this)</comment> 682<comment>(Replace the one above with this)</comment>
639NROFF /usr/bin/nroff -mandoc -c 683NROFF /usr/bin/nroff -mandoc -c
684</pre>
685
686</body>
687</section>
688<section>
689<title>elinks and links</title>
690<body>
691
692<p>
693These are commonly used text-based browsers, and we shall see how we can enable
694UTF-8 support on them. On <c>elinks</c> and <c>links</c>, there are two ways to
695go about this: using the Setup option from within the browser, or editing the
696config file. To set the option through the browser, open a site with
697<c>elinks</c> or <c>links</c> and press <c>Alt+S</c> to enter the Setup Menu, then
698select Terminal options, or press <c>T</c>. Scroll down and select the last
699option <c>UTF-8 I/O</c> by pressing Enter. Then Save and exit the menu. On
700<c>links</c> you may have to press <c>Alt+S</c> again and then press <c>S</c> to
701save. The config file option is shown below.
702</p>
703
704<pre caption="Enabling UTF-8 for elinks/links">
705<comment>(For elinks, edit /etc/elinks/elinks.conf or ~/.elinks/elinks.conf and
706add the following line)</comment>
707set terminal.linux.utf_8_io = 1
708
709<comment>(For links, edit ~/.links/links.cfg and add the following
710line)</comment>
711terminal "xterm" 0 1 0 us-ascii utf-8
712</pre>
713
714</body>
715</section>
716<section>
717<title>Samba</title>
718<body>
719
720<p>
721Samba is a software suite which implements the SMB (Server Message Block)
722protocol for UNIX systems such as Macs, Linux and FreeBSD. The protocol
723is also sometimes referred to as the Common Internet File System (CIFS). Samba
724also includes the NetBIOS system, used for file sharing over Windows networks.
725</p>
726
727<pre caption="Enabling UTF-8 for Samba">
728<comment>(Edit /etc/samba/smb.conf and add the following under the [global] section)</comment>
729dos charset = 1255
730unix charset = UTF-8
731display charset = UTF-8
640</pre> 732</pre>
641 733
642</body> 734</body>
643</section> 735</section>
644<section> 736<section>
745 837
746<p> 838<p>
747AltGr can be used with alphabetical keys alone. For example, AltGr and m 839AltGr can be used with alphabetical keys alone. For example, AltGr and m
748produce a Greek lower-case letter mu: 'µ'. AltGr and s produce a 840produce a Greek lower-case letter mu: 'µ'. AltGr and s produce a
749scharfes s or eszett: 'ß'. As many European users would expect (because 841scharfes s or eszett: 'ß'. As many European users would expect (because
750it is marked on their keyboard), AltGr and 4 produces a Euro sign, '€'. 842it is marked on their keyboard), AltGr and 4 (or E depending on the keyboard
843layout) produces a Euro sign, '€'.
751</p> 844</p>
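To make the link to UTF-8 concrete, this short Python sketch shows how the three characters mentioned above are encoded. It assumes AltGr and m yields U+00B5 (the micro sign found in Latin-1), which is what many layouts produce; a true Greek mu would be U+03BC.

```python
# Characters from the paragraph above, written as escapes so the example
# does not depend on the encoding of this file.
samples = {
    "mu (micro sign, assumed U+00B5)": "\u00b5",
    "eszett": "\u00df",
    "euro sign": "\u20ac",
}


def utf8_hex(char):
    """Return the UTF-8 bytes of `char` as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in char.encode("utf-8"))


for label, char in samples.items():
    print(f"{label}: U+{ord(char):04X} -> {utf8_hex(char)}")
```

The first two characters occupy two bytes each in UTF-8 and the euro sign occupies three.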
752 845
753</body> 846</body>
754</section> 847</section>
755<section> 848<section>
756<title>Resources</title> 849<title>Resources</title>
757<body> 850<body>
758 851
759<ul> 852<ul>
760 <li> 853 <li>
761 <uri link="http://www.wikipedia.com/wiki/Unicode">The Wikipedia entry for 854 <uri link="http://en.wikipedia.org/wiki/Unicode">The Wikipedia entry for
762 Unicode</uri> 855 Unicode</uri>
763 </li> 856 </li>
764 <li> 857 <li>
765 <uri link="http://www.wikipedia.com/wiki/UTF-8">The Wikipedia entry for 858 <uri link="http://en.wikipedia.org/wiki/UTF-8">The Wikipedia entry for
766 UTF-8</uri> 859 UTF-8</uri>
767 </li> 860 </li>
768 <li><uri link="http://www.unicode.org">Unicode.org</uri></li> 861 <li><uri link="http://www.unicode.org">Unicode.org</uri></li>
769 <li><uri link="http://www.utf-8.com">UTF-8.com</uri></li> 862 <li><uri link="http://www.utf-8.com">UTF-8.com</uri></li>
770 <li><uri link="http://www.ietf.org/rfc/rfc3629.txt">RFC 3629</uri></li> 863 <li><uri link="http://www.ietf.org/rfc/rfc3629.txt">RFC 3629</uri></li>
