GLEP: 31
Title: Character Sets for Portage Tree Items
Version: $Revision: 1.2 $
Author: Ciaran McCreesh <ciaranm@gentoo.org>
Last-Modified: $Date: 2004/11/01 20:24:06 $
Status: Approved
Type: Standards Track
Content-Type: text/x-rst
Created: 27-October-2004
Post-Date: 28-October-2004, 1-November-2004, 11-November-2004

Abstract
========

Guidelines are required on which characters are permissible in the
portage tree and how they should be encoded.

Status
======

Approved on 8-Nov-2004 assuming that implementation will include
documentation for correctly encoding files within nano.

Motivation
==========

At present we have several developers and many more users whose names
require characters (for example, accents) which are not part of the
standard 'safe' 0..127 ASCII range. There is no current standard on how
these should be represented, leading to inconsistency across the tree.

Although the issues involved have been discussed informally many times, no
official decision has been made.

Specification
=============

ChangeLog and Metadata Character Sets
-------------------------------------

It is proposed that UTF-8 ([1]_) is used for encoding ChangeLog and
metadata.xml files inside the portage tree.

UTF-8 allows the full range of Unicode ([2]_) characters to be expressed,
which is necessary given the diversity of the Gentoo developer- and
user-base. It is character-compatible with ASCII for the 0..127
characters and does not significantly increase the storage requirements
for files which consist mainly of American English characters. It is
widely supported, widely used and an official standard.

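The ASCII compatibility and storage claims can be checked directly. The
following is a minimal sketch in Python; the sample strings and the
``utf8_overhead`` helper are invented for illustration and are not part of
any Gentoo tool.

```python
# Illustrative only: the sample strings below are made up for this sketch.
ascii_text = "Fix build with newer compilers"
accented_name = "Jos\u00e9 Mart\u00ednez"  # contains two accented letters

# For characters in the 0..127 range, UTF-8 bytes are identical to ASCII.
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")


def utf8_overhead(text):
    """Return how many bytes UTF-8 needs beyond one byte per character."""
    return len(text.encode("utf-8")) - len(text)


print(utf8_overhead(ascii_text))     # pure ASCII text costs no extra bytes
print(utf8_overhead(accented_name))  # one extra byte per accented letter here
```

Text that is mostly ASCII therefore grows only by the bytes needed for its
few non-ASCII characters.
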

The ISO-8859-* character sets ([3]_) would *not* be appropriate since they
cannot express the full range of required characters.

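To illustrate the limitation, the sketch below tests whether a few names
can be encoded in two ISO-8859 parts and in UTF-8. The names and the
``can_encode`` helper are hypothetical examples; only Python's standard
codecs are assumed.

```python
def can_encode(text, encoding):
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Hypothetical names chosen for illustration: a Spanish, a Polish and a
# Japanese name, written with escapes so this file itself stays ASCII.
names = ["Jos\u00e9", "Micha\u0142", "\u5c71\u7530"]

for name in names:
    support = {
        encoding: can_encode(name, encoding)
        for encoding in ("iso-8859-1", "iso-8859-2", "utf-8")
    }
    # ascii() keeps the output printable on any terminal encoding.
    print(ascii(name), support)
```

No single ISO-8859 part covers all three names, whereas UTF-8 covers each
of them.
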

Ebuild and Eclass Character Sets
--------------------------------

For the same reasons as above, it is proposed that UTF-8 is used as
the official encoding for ebuild and eclass files.

However, developers should be warned that any code which is parsed by bash
(in other words, non-comments), and any output which is echoed to the
screen (for example, einfo messages) or given to portage (for example any
of the standard global variables) must not use anything outside the
regular ASCII 0..127 range for compatibility purposes.

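A check for this restriction might look like the sketch below. It is a
rough heuristic rather than an existing tool: the function name is
invented, and only lines whose first non-blank character is ``#`` are
treated as comments, so trailing comments are flagged as well.

```python
def non_ascii_code_lines(ebuild_text):
    """Return (line number, line) pairs for non-comment lines containing
    characters outside the ASCII 0..127 range.

    Heuristic: only lines whose first non-blank character is '#' count as
    comments. Trailing comments are not recognised, so they are flagged too.
    """
    offenders = []
    for number, line in enumerate(ebuild_text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue
        if not line.isascii():
            offenders.append((number, line))
    return offenders


# Hypothetical ebuild fragment, for illustration only.
sample = (
    "# Maintainer: Jos\u00e9 (a comment, so non-ASCII is acceptable)\n"
    'DESCRIPTION="A caf\u00e9 finder"\n'
    'einfo "Installation complete"\n'
)

# Only the DESCRIPTION line is reported.
print([number for number, _ in non_ascii_code_lines(sample)])
```

Erring on the side of flagging too much seems the safer choice here, since
a false positive only costs a manual look at the line.
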

files/ Entries Character Sets
-----------------------------

Patches must clearly be in the same character set as the file they are
patching. For other files/ entries (for example, GNOME desktop files),
consistency with the upstream-recommended character set is most sensible.

Suitable Characters for File and Directory Names
------------------------------------------------

Characters outside the ASCII 0..127 range cannot safely be used for file
or directory names. (Of course, not all characters inside the ASCII 0..127
range can be used safely either.)

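One way to apply this rule is a whitelist check, sketched below. The set of
permitted characters is an assumption made for this example; this document
does not enumerate which ASCII characters are safe.

```python
import string

# Assumption: a conservative whitelist invented for this sketch. The GLEP
# itself does not list which ASCII characters are safe in names.
SAFE_CHARS = frozenset(string.ascii_letters + string.digits + "._+-")


def is_safe_name(name):
    """Return True if name is non-empty and uses only whitelisted characters."""
    return bool(name) and all(ch in SAFE_CHARS for ch in name)


print(is_safe_name("foo-1.2.3.ebuild"))  # letters, digits, '.', '-' only
print(is_safe_name("caf\u00e9.patch"))   # contains a non-ASCII character
print(is_safe_name("my file.patch"))     # space is ASCII but not whitelisted
```

A whitelist rejects both non-ASCII characters and the problematic ASCII
ones (spaces, control characters, shell metacharacters) in a single test.
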

Backwards Compatibility
=======================

The existing tree uses a mixture of encodings. It would be straightforward
to fix existing ChangeLogs and metadata files to use UTF-8.

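Such a conversion could be sketched as follows. The ``to_utf8`` helper and
its ISO-8859-1 default are assumptions for illustration; the legacy
encoding is not detected, so it must be known for each file.

```python
def to_utf8(raw, legacy_encoding="iso-8859-1"):
    """Return raw re-encoded as UTF-8.

    Bytes that already decode as UTF-8 are returned unchanged; anything
    else is assumed (not detected) to be in legacy_encoding.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(legacy_encoding).encode("utf-8")


# The same name stored as ISO-8859-1 and as UTF-8 ends up identical.
latin1_bytes = b"Jos\xe9"
utf8_bytes = b"Jos\xc3\xa9"
print(to_utf8(latin1_bytes) == utf8_bytes)
print(to_utf8(utf8_bytes) == utf8_bytes)
```

Passing the wrong ``legacy_encoding`` would silently produce wrong
characters, so converted files are worth reviewing by eye.
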

The ``echangelog`` tool is character-set agnostic. In order to properly
enter UTF-8, developers would have to switch to a UTF-8 shell session.
This only applies if the developer is entering new text which uses 'fancy'
characters -- existing characters are not mangled.

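Whether a session is a UTF-8 one can be guessed from the locale variables.
The sketch below is an assumption-laden illustration, not part of
``echangelog``: it inspects ``LC_ALL``, ``LC_CTYPE`` and ``LANG`` in the
usual POSIX order of precedence and looks at the codeset part of the first
one that is set.

```python
import os


def session_is_utf8(env=None):
    """Guess whether the shell session uses a UTF-8 locale.

    Follows the usual POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
    Empty values are treated as unset.
    """
    env = os.environ if env is None else env
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            # Locale names look like language_TERRITORY.codeset@modifier.
            codeset = value.partition(".")[2].partition("@")[0]
            return codeset.lower().replace("-", "") == "utf8"
    return False


print(session_is_utf8({"LANG": "en_GB.UTF-8"}))
print(session_is_utf8({"LC_ALL": "C", "LANG": "en_GB.UTF-8"}))
```

Taking the environment as a parameter keeps the check testable; calling
``session_is_utf8()`` with no argument inspects the real environment.
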

Certain text editors are incapable of handling UTF-8 cleanly. However,
since the ``echangelog`` tool is generally the correct way to generate
ChangeLog entries, this should not be a major problem. Generating
metadata.xml files correctly in these editors could become problematic.
The ``vim`` and ``emacs`` editors, which appear to be most widely used,
are both capable of handling UTF-8 cleanly -- for vim, this could be
configured automatically via the ``gentoo-syntax`` ([4]_) package.

References
==========

.. [1] RFC 3629: UTF-8, a transformation format of ISO 10646
       http://www.ietf.org/rfc/rfc3629.txt
.. [2] ISO/IEC 10646 (Universal Multiple-Octet Coded Character Set)
.. [3] ISO/IEC 8859 (8-bit single-byte coded graphic character sets)
.. [4] The app-vim/gentoo-syntax package,
       https://developer.berlios.de/projects/gentoo-syntax/

Copyright
=========

This document has been placed in the public domain.

.. vim: set tw=74 fileencoding=utf-8 :