MULE is the name originally given to the version of GNU Emacs extended for multi-lingual (and in particular Asian-language) support. "MULE" is short for "MUlti-Lingual Emacs". It is an extension and complete rewrite of Nemacs ("Nihon Emacs" where "Nihon" is the Japanese word for "Japan"), which only provided support for Japanese. XEmacs refers to its multi-lingual support as MULE support since it is based on MULE.
| 63.1 Internationalization Terminology | Definition of various internationalization terms. | |
| 63.2 Charsets | Sets of related characters. | |
| 63.3 MULE Characters | Working with characters in XEmacs/MULE. | |
| 63.4 Composite Characters | Making new characters by overstriking other ones. | |
| 63.5 Coding Systems | Ways of representing a string of chars using integers. | |
| 63.7 CCL | A special language for writing fast converters. | |
| 63.8 Category Tables | Subdividing charsets into groups. | |
| 63.9 Unicode Support | The universal coded character set. | |
| 63.10 Character Set Unification | Handling overlapping character sets. | |
| 63.12.5 Charsets and Coding Systems | Tables and reference information. |
In internationalization terminology, a string of text is divided up into characters, which are the printable units that make up the text. A single character is (for example) a capital `A', the number `2', a Katakana character, a Hangul character, a Kanji ideograph (an ideograph is a "picture" character, such as is used in Japanese Kanji, Chinese Hanzi, and Korean Hanja; typically there are thousands of such ideographs in each language), etc. The basic property of a character is that it is the smallest unit of text with semantic significance in text processing--i.e., characters are abstract units defined by their meaning, not by their exact appearance.
Human beings normally process text visually, so to a first approximation a character may be identified with its shape. Note that the same character may be drawn by two different people (or in two different fonts) in slightly different ways, although the "basic shape" will be the same. But consider the works of Scott Kim; human beings can recognize hugely variant shapes as the "same" character. Sometimes, especially where characters are extremely complicated to write, completely different shapes may be defined as the "same" character in national standards. The Taiwanese variant of Hanzi is generally the most complicated; over the centuries, the Japanese, Koreans, and the People's Republic of China have adopted simplifications of the shape, but the line of descent from the original shape is recorded, and the meanings and pronunciation of different forms of the same character are considered to be identical within each language. (Of course, it may take a specialist to recognize the related form; the point is that the relations are standardized, despite the differing shapes.)
In some cases, the differences will be significant enough that it is actually possible to identify two or more distinct shapes that both represent the same character. For example, the lowercase letters `a' and `g' each have two distinct possible shapes--the `a' can optionally have a curved tail projecting off the top, and the `g' can be formed either of two loops, or of one loop and a tail hanging off the bottom. Such distinct possible shapes of a character are called glyphs. The important characteristic of two glyphs making up the same character is that the choice between one or the other is purely stylistic and has no linguistic effect on a word (this is the reason why a capital `A' and lowercase `a' are different characters rather than different glyphs--e.g. `Aspen' is a city while `aspen' is a kind of tree).
Note that character and glyph are used differently here than elsewhere in XEmacs.
A character set is essentially a set of related characters. ASCII, for example, is a set of 94 characters (or 128, if you count non-printing characters). Other character sets are ISO8859-1 (ASCII plus various accented characters and other international symbols), JIS X 0201 (ASCII, more or less, plus half-width Katakana), JIS X 0208 (Japanese Kanji), JIS X 0212 (a second set of less-used Japanese Kanji), GB2312 (Mainland Chinese Hanzi), etc.
The definition of a character set will implicitly or explicitly give it an ordering, a way of assigning a number to each character in the set. For many character sets, there is a natural ordering, for example the "ABC" ordering of the Roman letters. But it is not clear whether digits should come before or after the letters, and in fact different European languages treat the ordering of accented characters differently. It is useful to use the natural order where available, of course. The number assigned to any particular character is called the character's code point. (Within a given character set, each character has a unique code point. Thus the word "set" is ill-chosen; different orderings of the same characters are different character sets. Identifying characters is simple enough for alphabetic character sets, but the difference in ordering can cause great headaches when the same thousands of characters are used by different cultures as in the Hanzi.)
It's important to understand that a character is defined not by any number attached to it, but by its meaning. For example, ASCII and EBCDIC are two charsets containing exactly the same characters (lowercase and uppercase letters, numbers 0 through 9, particular punctuation marks) but with different numberings. The `comma' character in ASCII and EBCDIC, for instance, is the same character despite having a different numbering. Conversely, when comparing ASCII and JIS-Roman, which look the same except that the latter has a yen sign substituted for the backslash, we would say that the backslash and yen sign are not the same characters, despite having the same number (92) and despite the fact that all other characters are present in both charsets, with the same numbering. ASCII and JIS-Roman, then, do not have exactly the same characters in them (ASCII has a backslash character but no yen-sign character, and vice-versa for JIS-Roman), unlike ASCII and EBCDIC, even though the numberings in ASCII and JIS-Roman are closer.
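As a small illustration of the difference between a character and its number, the following Python sketch looks up the comma in ASCII and in one EBCDIC variant. Both the use of Python and the choice of the `cp037` codec as a stand-in for EBCDIC are assumptions of this example; neither is part of XEmacs.

```python
# Sketch: one character, two numberings.
# Assumption: Python's "cp037" codec stands in for EBCDIC here.

def code_point(char: str, codec: str) -> int:
    """Return the single-byte code that CODEC assigns to CHAR."""
    encoded = char.encode(codec)
    if len(encoded) != 1:
        raise ValueError(f"{char!r} is not a single byte in {codec}")
    return encoded[0]

comma_ascii = code_point(",", "ascii")
comma_ebcdic = code_point(",", "cp037")

# The two charsets assign different numbers to the comma...
print(comma_ascii, comma_ebcdic)

# ...but each number, read back in its own charset, is the same character.
same_character = (
    bytes([comma_ascii]).decode("ascii")
    == bytes([comma_ebcdic]).decode("cp037")
)
print(same_character)
```

The reverse situation (same number, different character) cannot be shown as directly, because it depends on which charset the reader of the byte assumes.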
Sometimes, a code point is not a single number, but instead a group of numbers, called position codes. In such cases, the number of position codes required to index a particular character in a character set is called the dimension of the character set. Character sets indexed by more than one position code typically use byte-sized position codes. Small character sets, e.g. ASCII, invariably use a single position code, but for larger character sets, the choice of whether to use multiple position codes or a single large (16-bit or 32-bit) number is arbitrary. Unicode typically uses a single large number, but language-specific or "national" character sets often use multiple (usually two) position codes. For example, JIS X 0208, i.e. Japanese Kanji, has thousands of characters, and is of dimension two -- every character is indexed by two position codes, each in the range 1 through 94. (This number "94" is not a coincidence; it is the same as the number of printable characters in ASCII, and was chosen so that JIS characters could be directly encoded using two printable ASCII characters.) Note that the choice of the range here is somewhat arbitrary -- it could just as easily be 0 through 93, 2 through 95, etc. In fact, the range for JIS position codes (and for other character sets modeled after it) is often given as range 33 through 126, so as to directly match ASCII printing characters.
An encoding is a way of numerically representing characters from one or more character sets into a stream of like-sized numerical values called words -- typically 8-bit bytes, but sometimes 16-bit or 32-bit quantities. In a context where dealing with Japanese motivates much of XEmacs' design in this area, it's important to clearly distinguish between charsets and encodings. For a simple charset like ASCII, there is only one encoding normally used -- each character is represented by a single byte, with the same value as its code point. For more complicated charsets, however, or when a single encoding needs to represent more than one charset, things are not so obvious. Unicode version 2, for example, is a large charset with thousands of characters, each indexed by a 16-bit number, often represented in hex, e.g. 0x05D0 for the Hebrew letter "aleph". One obvious encoding (actually two encodings, depending on which of the two possible byte orderings is chosen) simply uses two bytes per character. This encoding is convenient for internal processing of Unicode text; however, it's incompatible with ASCII, and thus external text (files, e-mail, etc.) that is encoded this way is completely uninterpretable by programs lacking Unicode support. For this reason, a different, ASCII-compatible encoding, e.g. UTF-8, is usually used for external text. UTF-8 represents Unicode characters with one to three bytes (often extended to six bytes to handle characters with up to 31-bit indices). Unicode characters 00 to 7F (identical with ASCII) are directly represented with one byte, and other characters with two or more bytes, each in the range 80 to FF. Applications that don't understand Unicode will still be able to process ASCII characters represented in UTF-8-encoded text, and will typically ignore (and hopefully preserve) the high-bit characters.
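The aleph example can be checked with a few lines of Python. This is only a sketch using Python's standard codecs (an assumption of the example, not an XEmacs facility); it shows the two byte orders of the two-byte encoding next to the UTF-8 form.

```python
# Sketch: three encodings of U+05D0 (Hebrew letter aleph).
ALEPH = "\u05d0"

def encode_all(text: str) -> dict:
    """Encode TEXT in both 16-bit byte orders and in UTF-8."""
    return {
        "utf-16-be": text.encode("utf-16-be"),
        "utf-16-le": text.encode("utf-16-le"),
        "utf-8": text.encode("utf-8"),
    }

aleph_bytes = encode_all(ALEPH)
for name, data in aleph_bytes.items():
    print(name, data.hex())

# A program that understands only ASCII can still pick out the ASCII
# characters of UTF-8 text: every byte below 0x80 is a whole ASCII character.
mixed_utf8 = ("A" + ALEPH).encode("utf-8")
ascii_only = bytes(b for b in mixed_utf8 if b < 0x80)
print(ascii_only)
```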
Similarly, Shift-JIS and EUC-JP are different encodings normally used to encode the same character set(s), these character sets being subsets of Unicode. However, the obvious approach of unifying XEmacs' internal encoding across character sets, as was part of the motivation behind Unicode, wasn't taken. This means that characters in these character sets that are identical to characters in other character sets--for example, the Greek alphabet is in the large Japanese character sets and at least one European character set--are unfortunately disjoint.
Naive use of code points is also not possible if more than one character set is to be used in the encoding. For example, printed Japanese text typically requires characters from multiple character sets -- ASCII, JIS X 0208, and JIS X 0212, to be specific. Each of these is indexed using one or more position codes in the range 1 through 94 (or 33 through 126), so the position codes could not be used directly or there would be no way to tell which character was meant. Different Japanese encodings handle this differently -- JIS uses special escape characters to denote different character sets; EUC sets the high bit of the position codes for JIS X 0208 and JIS X 0212, and puts a special extra byte before each JIS X 0212 character; etc.
The encodings described above are all 7-bit or 8-bit encodings. The fixed-width Unicode encoding previously described, however, is sometimes considered to be a 16-bit encoding, in which case the issue of byte ordering does not come up. (Imagine, for example, that the text is represented as an array of shorts.) Similarly, Unicode version 3 (which has characters with indices above 0xFFFF), and other very large character sets, may be represented internally as 32-bit encodings, i.e. arrays of ints. However, it does not make too much sense to talk about 16-bit or 32-bit encodings for external data, since nowadays 8-bit data is a universal standard -- the closest you can get is fixed-width encodings using two or four bytes to encode 16-bit or 32-bit values. (A "7-bit" encoding is used when it cannot be guaranteed that the high bit of 8-bit data will be correctly preserved. Some e-mail gateways, for example, strip the high bit of text passing through them. These same gateways often handle non-printable characters incorrectly, and so 7-bit encodings usually avoid using bytes with such values.)
A general method of handling text using multiple character sets (whether for multilingual text, or simply text in an extremely complicated single language like Japanese) is defined in the international standard ISO 2022. ISO 2022 will be discussed in more detail later (see section 63.6 ISO 2022), but for now suffice it to say that text needs control functions (at least spacing), and if escape sequences are to be used, an escape sequence introducer. It was decided to make all text streams compatible with ASCII in the sense that the codes 0-31 (and 128-159) would always be control codes, never graphic characters, and where defined by the character set the `SPC' character would be assigned code 32, and `DEL' would be assigned 127. Thus there are 94 code points remaining if 7 bits are used. This is the reason that most character sets are defined using position codes in the range 1 through 94. Then ISO 2022 compatible encodings are produced by shifting the position codes 1 to 94 into character codes 33 to 126, or (if 8 bit codes are available) into character codes 161 to 254.
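The shifting of position codes can be sketched in Python as follows. The helper names `to_7bit` and `to_8bit` are hypothetical, and the example assumes that position codes (4, 2) in JIS X 0208 denote the hiragana letter A and that Python's `iso2022_jp` and `euc_jp` codecs are acceptable stand-ins for the JIS and EUC encodings discussed here.

```python
# Sketch: shifting position codes 1..94 into byte values.
# to_7bit and to_8bit are hypothetical helper names for this example.

def _check(code: int) -> None:
    if not 1 <= code <= 94:
        raise ValueError(f"position code {code} is outside 1..94")

def to_7bit(row: int, cell: int) -> bytes:
    """Shift two position codes into the range 33..126."""
    _check(row)
    _check(cell)
    return bytes([row + 32, cell + 32])

def to_8bit(row: int, cell: int) -> bytes:
    """Shift two position codes into the range 161..254."""
    _check(row)
    _check(cell)
    return bytes([row + 160, cell + 160])

# Assumed example character: position codes (4, 2) in JIS X 0208.
seven = to_7bit(4, 2)
eight = to_8bit(4, 2)

# The 8-bit form is what EUC-JP stores; the 7-bit form needs escape
# sequences around it so a decoder knows which charset is meant.
from_euc = eight.decode("euc_jp")
from_jis = (b"\x1b$B" + seven + b"\x1b(B").decode("iso2022_jp")
print(seven, eight.hex(), from_euc == from_jis)
```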
Encodings are classified as either modal or non-modal. In a modal encoding, there are multiple states that the encoding can be in, and the interpretation of the values in the stream depends on the current global state of the encoding. Special values in the encoding, called escape sequences, are used to change the global state. JIS, for example, is a modal encoding. The bytes `ESC $ B' indicate that, from then on, bytes are to be interpreted as position codes for JIS X 0208, rather than as ASCII. This effect is cancelled using the bytes `ESC ( B', which mean "switch from whatever the current state is to ASCII". To switch to JIS X 0212, the escape sequence `ESC $ ( D' is used. (Note that here, as is common, the escape sequences do in fact begin with `ESC'. This is not necessarily the case, however. Some encodings use control characters called "locking shifts" (effect persists until cancelled) to switch character sets.)
A non-modal encoding has no global state that extends past the character currently being interpreted. EUC, for example, is a non-modal encoding. Characters in JIS X 0208 are encoded by setting the high bit of the position codes, and characters in JIS X 0212 are encoded by doing the same but also prefixing the character with the byte 0x8F.
The advantage of a modal encoding is that it is generally more space-efficient, and is easily extendible because there are essentially an arbitrary number of escape sequences that can be created. The disadvantage, however, is that it is much more difficult to work with if it is not being processed in a sequential manner. In the non-modal EUC encoding, for example, the byte 0x41 always refers to the letter `A'; whereas in JIS, it could either be the letter `A', or one of the two position codes in a JIS X 0208 character, or one of the two position codes in a JIS X 0212 character. Determining exactly which one is meant could be difficult and time-consuming if the previous bytes in the string have not already been processed, or impossible if they are drawn from an external stream that cannot be rewound.
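The ambiguity of a byte such as 0x41 in a modal encoding can be seen in a short Python sketch. It assumes Python's `iso2022_jp` codec as a stand-in for the JIS encoding described above and `euc_jp` for EUC; the byte pair 0x30 0x41 is chosen only because it contains 0x41.

```python
# Sketch: the same two bytes mean different things in different states.
ESC_TO_JISX0208 = b"\x1b$B"   # `ESC $ B'
ESC_TO_ASCII = b"\x1b(B"      # `ESC ( B'

pair = b"\x30\x41"

# In the initial (ASCII) state the bytes are the two characters "0" and "A".
as_ascii = pair.decode("iso2022_jp")

# After `ESC $ B' the same bytes are the two position codes of one
# JIS X 0208 character.
as_kanji = (ESC_TO_JISX0208 + pair + ESC_TO_ASCII).decode("iso2022_jp")

# In non-modal EUC the same character has the high bit set on both bytes,
# so a byte 0x41 in an EUC stream can only be the letter `A'.
kanji_euc = as_kanji.encode("euc_jp")

print(repr(as_ascii), len(as_kanji), kanji_euc.hex())
```

A decoder that starts in the middle of a JIS stream has no way to choose between the two readings without scanning backwards for the most recent escape sequence.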
Non-modal encodings are further divided into fixed-width and variable-width formats. A fixed-width encoding always uses the same number of words per character, whereas a variable-width encoding does not. EUC is a good example of a variable-width encoding: one to three bytes are used per character, depending on the character set. 16-bit and 32-bit encodings are nearly always fixed-width, and this is in fact one of the main reasons for using an encoding with a larger word size. The advantages of fixed-width encodings should be obvious. The advantages of variable-width encodings are that they are generally more space-efficient and allow for compatibility with existing 8-bit encodings such as ASCII. (For example, in Unicode ASCII characters are simply promoted to a 16-bit representation. That means that every ASCII character contains a `NUL' byte; evidently all of the standard string manipulation functions will lose badly in a fixed-width Unicode environment.)
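The following Python sketch measures the width of a few characters in EUC and shows the `NUL' byte that appears when ASCII is promoted to 16 bits. It assumes Python's `euc_jp` codec as a stand-in for EUC, and it obtains a JIS X 0212 character by decoding the bytes 0x8F 0xB0 0xA1 rather than by naming one, which is a choice made for this example.

```python
# Sketch: variable width in EUC, fixed width in a 16-bit encoding.

def euc_width(char: str) -> int:
    """Number of bytes CHAR occupies in EUC-JP."""
    return len(char.encode("euc_jp"))

ascii_char = "A"
jisx0208_char = "\u3042"                           # hiragana letter A
jisx0212_char = b"\x8f\xb0\xa1".decode("euc_jp")   # 0x8F prefix, see above

widths = [euc_width(c) for c in (ascii_char, jisx0208_char, jisx0212_char)]
print(widths)

# Fixed-width 16-bit text: every character takes two bytes, and every
# ASCII character therefore carries a zero byte.
sixteen_bit = "AB".encode("utf-16-be")
print(sixteen_bit, b"\x00" in sixteen_bit)
```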
The bytes in an 8-bit encoding are often referred to as octets rather than simply as bytes. This terminology dates back to the days before 8-bit bytes were universal, when some computers had 9-bit bytes, others had 10-bit bytes, etc.
A charset in MULE is an object that encapsulates a particular character set as well as an ordering of those characters. Charsets are permanent objects and are named using symbols, like faces.
Returns non-nil if object is a charset.
| 63.2.1 Charset Properties | Properties of a charset. | |
| 63.2.2 Basic Charset Functions | Functions for working with charsets. | |
| 63.2.3 Charset Property Functions | Functions for accessing charset properties. | |
| 63.2.4 Predefined Charsets | Predefined charset objects. |
Charsets have the following properties:
name
doc-string
registry
ascii and latin-iso8859-1
charsets use the registry "ISO8859-1". This field is used to
choose an appropriate font when the user gives a general font
specification such as `-*-courier-medium-r-*-140-*', i.e. a
14-point upright medium-weight Courier font.
dimension
chars
columns
direction
l2r (left-to-right) or r2l
(right-to-left). Defaults to l2r. This specifies the
direction that the text should be displayed in, and will be
left-to-right for most charsets but right-to-left for Hebrew
and Arabic. (Right-to-left display is not currently implemented.)
final
graphic
graphic set to 0,
position codes 33 through 126 map to font indices 33 through 126; with
it set to 1, position codes 33 through 126 map to font indices 161
through 254 (i.e. the same number but with the high bit set). For
example, for a font whose registry is ISO8859-1, the left half of the
font (octets 0x20 - 0x7F) is the ascii charset, while the right
half (octets 0xA0 - 0xFF) is the latin-iso8859-1 charset.
ccl-program
graphic
property. If a CCL program is defined, the position codes of a
character will first be processed according to graphic and
then passed through the CCL program, with the resulting values used
to index the font.
This is used, for example, in the Big5 character set (used in Taiwan).
This character set is not ISO-2022-compliant, and its size (94x157) does
not fit within the maximum 96x96 size of ISO-2022-compliant character
sets. As a result, XEmacs/MULE splits it (in a rather complex fashion,
so as to group the most commonly used characters together) into two
charset objects (big5-1 and big5-2), each of size 94x94,
and each charset object uses a CCL program to convert the modified
position codes back into standard Big5 indices to retrieve a character
from a Big5 font.
Most of the above properties can only be set when the charset is initialized, and cannot be changed later. See section 63.2.3 Charset Property Functions.
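The effect of the graphic property on font lookup, as described in the list above, amounts to setting or clearing the high bit of each position code. A minimal Python sketch follows; `font_index` is a hypothetical name used only for this illustration and is not an XEmacs function.

```python
# Sketch: how the `graphic' property maps a position code to a font index.
# font_index is a hypothetical helper, not part of XEmacs.

def font_index(position_code: int, graphic: int) -> int:
    """Map a position code in 33..126 to a font index."""
    if not 33 <= position_code <= 126:
        raise ValueError("position code must be in 33..126")
    if graphic == 0:
        return position_code          # left half of the font
    if graphic == 1:
        return position_code | 0x80   # right half: same number, high bit set
    raise ValueError("graphic must be 0 or 1")

print(font_index(33, 0), font_index(33, 1), font_index(126, 1))
```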
nil is returned. Otherwise the associated charset
object is returned.
find-charset except an error is signalled if there is no such
charset instead of returning nil.
registry, dimension, columns,
chars, final, graphic, direction, and
ccl-program, as previously described.
All of these functions accept either a charset name or charset object.
Convenience functions are also provided for retrieving individual properties of a charset.
l2r or r2l.
The two properties of a charset that can currently be set after the charset has been created are the CCL program and the font registry.
ccl-program property of charset to
ccl-program.
registry property of charset to
registry.
The following charsets are predefined in the C code.
Name                     Type   Fi  Gr  Dir  Registry
--------------------------------------------------------------
ascii                    94     B   0   l2r  ISO8859-1
control-1                94         0   l2r  ---
latin-iso8859-1          96     A   1   l2r  ISO8859-1
latin-iso8859-2          96     B   1   l2r  ISO8859-2
latin-iso8859-3          96     C   1   l2r  ISO8859-3
latin-iso8859-4          96     D   1   l2r  ISO8859-4
cyrillic-iso8859-5       96     L   1   l2r  ISO8859-5
arabic-iso8859-6         96     G   1   r2l  ISO8859-6
greek-iso8859-7          96     F   1   l2r  ISO8859-7
hebrew-iso8859-8         96     H   1   r2l  ISO8859-8
latin-iso8859-9          96     M   1   l2r  ISO8859-9
thai-tis620              96     T   1   l2r  TIS620
katakana-jisx0201        94     I   1   l2r  JISX0201.1976
latin-jisx0201           94     J   0   l2r  JISX0201.1976
japanese-jisx0208-1978   94x94  @   0   l2r  JISX0208.1978
japanese-jisx0208        94x94  B   0   l2r  JISX0208.19(83|90)
japanese-jisx0212        94x94  D   0   l2r  JISX0212
chinese-gb2312           94x94  A   0   l2r  GB2312
chinese-cns11643-1       94x94  G   0   l2r  CNS11643.1
chinese-cns11643-2       94x94  H   0   l2r  CNS11643.2
chinese-big5-1           94x94  0   0   l2r  Big5
chinese-big5-2           94x94  1   0   l2r  Big5
korean-ksc5601           94x94  C   0   l2r  KSC5601
composite                96x96      0   l2r  ---
The following charsets are predefined in the Lisp code.
Name                     Type   Fi  Gr  Dir  Registry
--------------------------------------------------------------
arabic-digit             94     2   0   l2r  MuleArabic-0
arabic-1-column          94     3   0   r2l  MuleArabic-1
arabic-2-column          94     4   0   r2l  MuleArabic-2
sisheng                  94     0   0   l2r  sisheng_cwnn\|OMRON_UDC_ZH
chinese-cns11643-3       94x94  I   0   l2r  CNS11643.1
chinese-cns11643-4       94x94  J   0   l2r  CNS11643.1
chinese-cns11643-5       94x94  K   0   l2r  CNS11643.1
chinese-cns11643-6       94x94  L   0   l2r  CNS11643.1
chinese-cns11643-7       94x94  M   0   l2r  CNS11643.1
ethiopic                 94x94  2   0   l2r  Ethio
ascii-r2l                94     B   0   r2l  ISO8859-1
ipa                      96     0   1   l2r  MuleIPA
vietnamese-viscii-lower  96     1   1   l2r  VISCII1.1
vietnamese-viscii-upper  96     2   1   l2r  VISCII1.1
For all of the above charsets, the dimension and number of columns are the same.
Note that ASCII, Control-1, and Composite are handled specially. This is why some of the fields are blank; and some of the filled-in fields (e.g. the type) are not really accurate.
Composite characters are not yet completely implemented.
A coding system is an object that defines how text containing multiple character sets is encoded into a stream of (typically 8-bit) bytes. The coding system is used to decode the stream into a series of characters (which may be from multiple charsets) when the text is read from a file or process, and is used to encode the text back into the same format when it is written out to a file or process.
For example, many ISO-2022-compliant coding systems (such as Compound
Text, which is used for inter-client data under the X Window System) use
escape sequences to switch between different charsets -- Japanese Kanji,
for example, is invoked with `ESC $ ( B'; ASCII is invoked with
`ESC ( B'; and Cyrillic is invoked with `ESC - L'. See
make-coding-system for more information.
Coding systems are normally identified using a symbol, and the symbol is accepted in place of the actual coding system object whenever a coding system is called for. (This is similar to how faces and charsets work.)
Returns non-nil if object is a coding system.
| 63.5.1 Coding System Types | Classifying coding systems. | |
| 63.6 ISO 2022 | An international standard for charsets and encodings. | |
| 63.6.1 EOL Conversion | Dealing with different ways of denoting the end of a line. | |
| 63.6.2 Coding System Properties | Properties of a coding system. | |
| 63.6.3 Basic Coding System Functions | Working with coding systems. | |
| 63.6.4 Coding System Property Functions | Retrieving a coding system's properties. | |
| 63.6.5 Encoding and Decoding Text | Encoding and decoding text. | |
| 63.6.6 Detection of Textual Encoding | Determining how text is encoded. | |
| 63.6.7 Big5 and Shift-JIS Functions | Special functions for these non-standard encodings. | |
| 63.6.8 Coding Systems Implemented | Coding systems implemented by MULE. |
The coding system type determines the basic algorithm XEmacs will use to decode or encode a data stream. Character encodings will be converted to the MULE encoding, escape sequences processed, and newline sequences converted to XEmacs's internal representation. There are three basic classes of coding system type: no-conversion, ISO-2022, and special.
No conversion allows you to look at the file's internal representation. Since XEmacs is basically a text editor, "no conversion" does convert newline conventions by default. (Use the 'binary coding-system if this is not desired.)
ISO 2022 (see section 63.6 ISO 2022) is the basic international standard regulating use of "coded character sets for the exchange of data", i.e., text streams. ISO 2022 contains functions that make it possible to encode text streams to comply with restrictions of the Internet mail system and de facto restrictions of most file systems (e.g., use of the separator character in file names). Coding systems which are not ISO 2022 conformant can be difficult to handle. Perhaps more important, they are not adaptable to multilingual information interchange, with the obvious exception of ISO 10646 (Unicode). (Unicode is partially supported by XEmacs with the addition of the Lisp package ucs-conv.)
The special class of coding systems includes automatic detection, CCL (a "little language" embedded as an interpreter, useful for translating between variants of a single character set), non-ISO-2022-conformant encodings like Unicode, Shift JIS, and Big5, and MULE internal coding. (NB: this list is based on XEmacs 21.2. Terminology may vary slightly for other versions of XEmacs and for GNU Emacs 20.)
no-conversion
iso2022
ucs-4
utf-8
undecided
shift-jis
big5
ccl
internal
DEBUG_XEMACS set
(the `--debug' configure option). Warning: Reading in a
file using internal conversion can result in an internal
inconsistency in the memory representing a buffer's text, which will
produce unpredictable results and may cause XEmacs to crash. Under
normal circumstances you should never use internal conversion.
This section briefly describes the ISO 2022 encoding standard. A more thorough treatment is available in the original document of ISO 2022 as well as various national standards (such as JIS X 0202).
Character sets (charsets) are classified into the following four categories, according to the number of characters in the charset: 94-charset, 96-charset, 94x94-charset, and 96x96-charset. This means that although an ISO 2022 coding system may have variable width characters, each charset used is fixed-width (in contrast to the MULE character set and UTF-8, for example).
ISO 2022 provides for switching between character sets via escape sequences. This switching is somewhat complicated, because ISO 2022 provides for both legacy applications like Internet mail that accept only 7 significant bits in some contexts (RFC 822 headers, for example), and more modern "8-bit clean" applications. It also provides for compact and transparent representation of languages like Japanese which mix ASCII and a national script (even outside of computer programs).
First, ISO 2022 codified prevailing practice by dividing the code space into "control" and "graphic" regions. The code points 0x00-0x1F and 0x80-0x9F are reserved for "control characters", while "graphic characters" must be assigned to code points in the regions 0x20-0x7F and 0xA0-0xFF. The positions 0x20 and 0x7F are special, and under some circumstances must be assigned the graphic character "ASCII SPACE" and the control character "ASCII DEL" respectively.
The various regions are given the names C0 (0x00-0x1F), GL (0x20-0x7F), C1 (0x80-0x9F), and GR (0xA0-0xFF). GL and GR stand for "graphic left" and "graphic right", respectively, because of the standard method of displaying graphic character sets in tables with the high-order bits of each byte indexing columns and the low-order bits indexing rows. I don't find it very intuitive, but these are called "registers".
An ISO 2022-conformant encoding for a graphic character set must use a fixed number of bytes per character, and the values must fit into a single register; that is, each byte must range over either 0x20-0x7F, or 0xA0-0xFF. It is not allowed to extend the range of the repertoire of a character set by using both ranges at the same time. This is why a standard character set such as ISO 8859-1 is actually considered by ISO 2022 to be an aggregation of two character sets, ASCII and LATIN-1, and why it is technically incorrect to refer to ISO 8859-1 as "Latin 1". Also, a single character's bytes must all be drawn from the same register; this is why Shift JIS (for Japanese) and Big 5 (for Chinese) are not ISO 2022-compatible encodings.
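The single-register rule can be checked mechanically. The Python sketch below classifies each byte of an encoded character into C0, GL, C1, or GR, using the ranges given above; `region` and `regions` are hypothetical helper names, and Python's `euc_jp` and `shift_jis` codecs are assumed as stand-ins for the encodings named in the text.

```python
# Sketch: which region does each byte of a character fall in?
# region and regions are hypothetical helper names for this example.

def region(byte: int) -> str:
    """Name the ISO 2022 region that contains BYTE."""
    if byte <= 0x1F:
        return "C0"
    if byte <= 0x7F:
        return "GL"
    if byte <= 0x9F:
        return "C1"
    return "GR"

def regions(data: bytes) -> list:
    return [region(b) for b in data]

hiragana_a = "\u3042"

# EUC-JP keeps both bytes of the character in GR...
euc_regions = regions(hiragana_a.encode("euc_jp"))
# ...while Shift JIS spreads them over two different regions.
sjis_regions = regions(hiragana_a.encode("shift_jis"))

print(euc_regions, sjis_regions)
```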
The reason for this restriction becomes clear when you attempt to define an efficient, robust encoding for a language like Japanese. Like ISO 8859, Japanese encodings are aggregations of several character sets. In practice, the vast majority of characters are drawn from the "JIS Roman" character set (a derivative of ASCII; it won't hurt to think of it as ASCII) and the JIS X 0208 standard "basic Japanese" character set including not only ideographic characters ("kanji") but syllabic Japanese characters ("kana"), a wide variety of symbols, and many alphabetic characters (Roman, Greek, and Cyrillic) as well. Although JIS X 0208 includes the whole Roman alphabet, as a 2-byte code it is not suited to programming; thus the inclusion of ASCII in the standard Japanese encodings.
For normal Japanese text such as in newspapers, a broad repertoire of approximately 3000 characters is used. Evidently this won't fit into one byte; two must be used. But much of the text processed by Japanese computers is computer source code, nearly all of which is ASCII. A not insignificant portion of ordinary text is English (as such or as borrowed Japanese vocabulary) or other languages which can be represented at least approximately in ASCII, as well. It seems reasonable then to represent ASCII in one byte, and JIS X 0208 in two. And this is exactly what the Extended Unix Code for Japanese (EUC-JP) does. ASCII is invoked to the GL register, and JIS X 0208 is invoked to the GR register. Thus, each byte can be tested for its character set by looking at the high bit; if set, it is Japanese, if clear, it is ASCII. Furthermore, since control characters like newline can never be part of a graphic character, even in the case of corruption in transmission the stream will be resynchronized at every line break, on the order of 60-80 bytes. This coding system requires no escape sequences or special control codes to represent 99.9% of all Japanese text.
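Both properties of EUC-JP mentioned here, the high-bit test and resynchronization at line breaks, can be tried out in Python. This is a sketch under the assumption that Python's `euc_jp` codec matches the EUC-JP described in this section; the sample strings are arbitrary.

```python
# Sketch: the high-bit test and line-break resynchronization in EUC-JP.

def classify(data: bytes) -> list:
    """Label each byte by charset, looking only at its high bit."""
    return ["japanese" if b & 0x80 else "ascii" for b in data]

labels = classify("abc\u65e5\u672c".encode("euc_jp"))
print(labels)

# Simulate corruption: the first byte of a two-line stream is lost, so the
# first line no longer decodes cleanly.
stream = "\u65e5\u672c\n\u8a9e".encode("euc_jp")
damaged = stream[1:]
first_line, second_line = damaged.split(b"\n")

try:
    first_line.decode("euc_jp")
    first_line_ok = True
except UnicodeDecodeError:
    first_line_ok = False

# The newline cannot be part of a two-byte character, so decoding can
# restart cleanly on the next line.
recovered = second_line.decode("euc_jp")
print(first_line_ok, recovered == "\u8a9e")
```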
Note carefully the distinction between the character sets (ASCII and JIS X 0208), the encoding (EUC-JP), and the coding system (ISO 2022). The JIS X 0208 character set is used in three different encodings for Japanese, but in ISO-2022-JP it is invoked into GL (so the high bit is always clear), in EUC-JP it is invoked into GR (setting the high bit in the process), and in Shift JIS the high bit may be set or reset, and the significant bits are shifted within the 16-bit character so that the two main character sets can coexist with a third (the "halfwidth katakana" of JIS X 0201). As the name implies, the ISO-2022-JP encoding is also a version of the ISO-2022 coding system.
In order to systematically treat subsidiary character sets (like the "halfwidth katakana" already mentioned, and the "supplementary kanji" of JIS X 0212), four further registers are defined: G0, G1, G2, and G3. Unlike GL and GR, they are not logically distinguished by internal format. Instead, the process of "invocation" mentioned earlier is broken into two steps: first, a character set is designated to one of the registers G0-G3 by use of an escape sequence of the form:
ESC [I] I F
where I is an intermediate character or characters in the range 0x20-0x2F, and F, from the range 0x30-0x7E, is the final character identifying this charset. (Final characters in the range 0x30-0x3F are reserved for private use and will never have a publicly registered meaning.)
Then that register is invoked to either GL or GR, either automatically (designations to G0 normally involve invocation to GL as well), or by use of shifting (affecting only the following character in the data stream) or locking (effective until the next designation or locking) control sequences. An encoding conformant to ISO 2022 is typically defined by designating the initial contents of the G0-G3 registers, specifying a 7 or 8 bit environment, and specifying whether further designations will be recognized.
Some examples of character sets and the registered final characters F used to designate them:
The meanings of the various characters in these sequences, where not specified by the ISO 2022 standard (such as the ESC character), are assigned by ECMA, the European Computer Manufacturers Association.
The meanings of the intermediate characters are:
$ [0x24]: indicate charset of dimension 2 (94x94 or 96x96).
( [0x28]: designate to G0 a 94-charset whose final byte is F.
) [0x29]: designate to G1 a 94-charset whose final byte is F.
* [0x2A]: designate to G2 a 94-charset whose final byte is F.
+ [0x2B]: designate to G3 a 94-charset whose final byte is F.
, [0x2C]: designate to G0 a 96-charset whose final byte is F.
- [0x2D]: designate to G1 a 96-charset whose final byte is F.
. [0x2E]: designate to G2 a 96-charset whose final byte is F.
/ [0x2F]: designate to G3 a 96-charset whose final byte is F.
The comma may be used in files read and written only by MULE, as a MULE extension, but this is illegal in ISO 2022. (The reason is that in ISO 2022 G0 must be a 94-member character set, with 0x20 assigned the value SPACE, and 0x7F assigned the value DEL.)
Here are examples of designations:
ESC ( B : designate to G0 ASCII
ESC - A : designate to G1 Latin-1
ESC $ ( A or ESC $ A : designate to G0 GB2312
ESC $ ( B or ESC $ B : designate to G0 JISX0208
ESC $ ) C : designate to G1 KSC5601
(The short forms used to designate GB2312 and JIS X 0208 are for backwards compatibility; the long forms are preferred.)
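The designation rules above can be collected into a small parser. This is a minimal sketch in Python, not the MULE implementation; the name `parse_designation` and the returned tuple layout are assumptions made for illustration. It accepts the long forms and the short `ESC $ F` form, and it also accepts the comma intermediate that the text describes as a MULE extension.

```python
ESC = 0x1B

# Intermediate byte -> (register, number of characters per dimension)
INTERMEDIATES = {
    0x28: ("G0", 94), 0x29: ("G1", 94), 0x2A: ("G2", 94), 0x2B: ("G3", 94),
    0x2C: ("G0", 96), 0x2D: ("G1", 96), 0x2E: ("G2", 96), 0x2F: ("G3", 96),
}


def parse_designation(seq: bytes):
    """Return (register, set_size, dimension, final) for one designation."""
    if len(seq) < 3 or seq[0] != ESC:
        raise ValueError("not a designation sequence")
    body = seq[1:]
    dimension = 1
    if body[0] == 0x24:            # '$' marks a charset of dimension 2
        dimension = 2
        body = body[1:]
    if len(body) == 1 and dimension == 2:
        # Short form ESC $ F: the register is implicitly G0.
        register, size, final = "G0", 94, body[0]
    elif len(body) == 2 and body[0] in INTERMEDIATES:
        register, size = INTERMEDIATES[body[0]]
        final = body[1]
    else:
        raise ValueError("unrecognized designation sequence")
    return register, size, dimension, chr(final)


for seq in (b"\x1b(B", b"\x1b-A", b"\x1b$(B", b"\x1b$B", b"\x1b$)C"):
    print(seq, parse_designation(seq))
```

Mapping the final character to a charset name is left out on purpose, since that requires the registry of final characters.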
To use a charset designated to G2 or G3, and to use a charset designated to G1 in a 7-bit environment, you must explicitly invoke G1, G2, or G3 into GL. There are two types of invocation, Locking Shift (forever) and Single Shift (one character only).
Locking Shift is done as follows:
LS0 or SI (0x0F): invoke G0 into GL
LS1 or SO (0x0E): invoke G1 into GL
LS2: invoke G2 into GL
LS3: invoke G3 into GL
LS1R: invoke G1 into GR
LS2R: invoke G2 into GR
LS3R: invoke G3 into GR
Single Shift is done as follows:
SS2 or ESC N: invoke G2 into GL
SS3 or ESC O: invoke G3 into GL
The shift functions (such as LS1R and SS3) are represented by control characters (from C1) in 8 bit environments and by escape sequences in 7 bit environments.
(#### Ben says: I think the above is slightly incorrect. It appears that SS2 invokes G2 into GR and SS3 invokes G3 into GR, whereas ESC N and ESC O behave as indicated. The above definitions will not parse EUC-encoded text correctly, and it looks like the code in mule-coding.c has similar problems.)
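As an illustration of locking shifts in a 7-bit environment, the sketch below tracks which register is invoked into GL while scanning a byte stream. It handles only SO (LS1) and SI (LS0), assumes G0 is invoked into GL at the start, and ignores single shifts entirely; the function name `tag_gl_registers` is hypothetical.

```python
SO, SI = 0x0E, 0x0F   # LS1 and LS0 as single control characters


def tag_gl_registers(data: bytes):
    """Label each graphic byte with the register currently invoked into GL."""
    gl = "G0"                       # assumption: G0 is in GL initially
    tagged = []
    for byte in data:
        if byte == SO:
            gl = "G1"               # locking: stays in effect until changed
        elif byte == SI:
            gl = "G0"
        elif 0x21 <= byte <= 0x7E:  # graphic character position in GL
            tagged.append((gl, byte))
        # other control characters, SPACE, and DEL are skipped in this sketch
    return tagged


stream = b"a" + bytes([SO]) + b"GQ" + bytes([SI]) + b"b"
tagged = tag_gl_registers(stream)
print(tagged)
```

The two bytes between SO and SI are interpreted in whatever charset has been designated to G1, even though their values fall in the ASCII range.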
Evidently there are a lot of ISO-2022-compliant ways of encoding multilingual text. Many coding systems in actual use, such as X11's Compound Text, Japanese JUNET code, and so-called EUC (Extended UNIX Code), are variants of ISO 2022.
In MULE, we characterize a version of ISO 2022 by the following attributes:
(The last two are only for Japanese.)
By specifying these attributes, you can create any variant of ISO 2022.
Here are several examples:
ISO-2022-JP -- Coding system used in Japanese email (RFC 1468).
1. G0 <- ASCII, G1..3 <- never used
2. Yes.
3. Yes.
4. Yes.
5. 7-bit environment
6. No.
7. Use ASCII
8. Use JIS X 0208-1983
ctext -- X11 Compound Text
1. G0 <- ASCII, G1 <- Latin-1, G2,3 <- never used.
2. No.
3. No.
4. Yes.
5. 8-bit environment.
6. No.
7. Use ASCII.
8. Use JIS X 0208-1983.
euc-china -- Chinese EUC. Often called the "GB encoding", but that is
technically incorrect.
1. G0 <- ASCII, G1 <- GB 2312, G2,3 <- never used.
2. No.
3. Yes.
4. Yes.
5. 8-bit environment.
6. No.
7. Use ASCII.
8. Use JIS X 0208-1983.
ISO-2022-KR -- Coding system used in Korean email.
1. G0 <- ASCII, G1 <- KSC 5601, G2,3 <- never used.
2. No.
3. Yes.
4. Yes.
5. 7-bit environment.
6. Yes.
7. Use ASCII.
8. Use JIS X 0208-1983.
MULE creates all of these coding systems by default.
nil
name-unix,
name-dos, and name-mac, that are identical to
this coding system but have an EOL-TYPE value of lf, crlf,
and cr, respectively.
lf
crlf
cr
t
nil when stored
internally, and coding-system-property will return nil.)
mnemonic
eol-type
eol-lf
eol-crlf
eol-cr
post-read-conversion
pre-write-conversion
The following additional properties are recognized if type is
iso2022:
charset-g0
charset-g1
charset-g2
charset-g3
nil (do not ever use this register)
t (no character set is initially designated to the register, but
may be later on; this automatically sets the corresponding
force-g*-on-output property)
force-g0-on-output
force-g1-on-output
force-g2-on-output
force-g3-on-output
If non-nil, send an explicit designation sequence on output
before using the specified register.
short
If non-nil, use the short forms `ESC $ @', `ESC $ A',
and `ESC $ B' on output in place of the full designation sequences
`ESC $ ( @', `ESC $ ( A', and `ESC $ ( B'.
no-ascii-eol
If non-nil, don't designate ASCII to G0 at each end of line on
output. Setting this to non-nil also suppresses other
state-resetting that normally happens at the end of a line.
no-ascii-cntl
If non-nil, don't designate ASCII to G0 before control chars on
output.
seven
If non-nil, use 7-bit environment on output. Otherwise, use 8-bit
environment.
lock-shift
If non-nil, use locking-shift (SO/SI) instead of single-shift or
designation by escape sequence.
no-iso6429
If non-nil, don't use ISO6429's direction specification.
escape-quoted
If non-nil, literal control characters that are the same as the
beginning of a recognized ISO 2022 or ISO 6429 escape sequence (in
particular, ESC (0x1B), SO (0x0E), SI (0x0F), SS2 (0x8E), SS3 (0x8F),
and CSI (0x9B)) are "quoted" with an escape character so that they can
be properly distinguished from an escape sequence. (Note that doing
this results in a non-portable encoding.) This encoding flag is used for
byte-compiled files. Note that ESC is a good choice for a quoting
character because there are no escape sequences whose second byte is a
character from the Control-0 or Control-1 character sets; this is
explicitly disallowed by the ISO 2022 standard.
input-charset-conversion
output-charset-conversion
input-charset-conversion.
The following additional properties are recognized (and required) if
type is ccl:
decode
encode
The following properties are used internally: eol-cr, eol-crlf, eol-lf, and base.
If coding-system-or-name is a coding-system object, it is simply
returned. Otherwise, coding-system-or-name should be a symbol.
If there is no such coding system, nil is returned. Otherwise
the associated coding system object is returned.
find-coding-system except an error is signalled if there is no
such coding system instead of returning nil.
type describes the conversion method used and should be one of the types listed in 63.5.1 Coding System Types.
doc-string is a string describing the coding system.
props is a property list describing the specific nature of the coding system. Recognized properties are as in 63.6.2 Coding System Properties.
binary or no-conversion coding
system, so that it shows up as `^[$B!<!+^[(B'). The length of the
encoded text is returned. buffer defaults to the current buffer
if unspecified.
autodetect or one of its subsidiary coding systems
according to a detected end-of-line type. Optional arg buffer
defaults to the current buffer.
These are special functions for working with the non-standard Shift-JIS and Big5 encodings.
MULE initializes most of the commonly used coding systems at XEmacs's startup. A few others are initialized only when the relevant language environment is selected and support libraries are loaded. (NB: The following list is based on XEmacs 21.2.19, the development branch at the time of writing. The list may be somewhat different for other versions. Recent versions of GNU Emacs 20 implement a few more rare coding systems; work is being done to port these to XEmacs.)
Unfortunately, there is not a consistent naming convention for character sets, and for practical purposes coding systems often take their name from their principal character sets (ASCII, KOI8-R, Shift JIS). Others take their names from the coding system (ISO-2022-JP, EUC-KR), and a few from their non-text usages (internal, binary). To provide for this, and for the fact that many coding systems have several common names, an aliasing system is provided. Finally, some effort has been made to use names that are registered as MIME charsets (this is why the name 'shift_jis contains that un-Lisp-y underscore).
There is a systematic naming convention regarding end-of-line (EOL) conventions for different systems. A coding system whose name ends in "-unix" forces the assumption that lines are broken by newlines (0x0A). A coding system whose name ends in "-mac" forces the assumption that lines are broken by ASCII CRs (0x0D). A coding system whose name ends in "-dos" forces the assumption that lines are broken by CRLF sequences (0x0D 0x0A). These subsidiary coding systems are automatically derived from a base coding system. Use of the base coding system implies autodetection of the text file convention. (The fact that the -unix, -mac, and -dos coding systems are derived from a base system results in them showing up as "aliases" in `list-coding-systems'.) These subsidiaries have a consistent modeline indicator as well. "-dos" coding systems have ":T" appended to their modeline indicator, while "-mac" coding systems have ":t" appended (e.g., "ISO8:t" for iso-2022-8-mac).
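The suffix convention can be expressed as a simple lookup. The Python sketch below is illustrative only: the function names are hypothetical, and `detect_eol` is a deliberately naive stand-in (it looks at the first line break only), not the autodetection algorithm XEmacs actually uses.

```python
EOL_SUFFIXES = {"-unix": "lf", "-dos": "crlf", "-mac": "cr"}


def eol_type_from_name(name: str):
    """Return the EOL type forced by a coding system name, or None."""
    for suffix, eol in EOL_SUFFIXES.items():
        if name.endswith(suffix):
            return eol
    return None   # base coding system: the convention is autodetected


def detect_eol(data: bytes):
    """Naive guess of the EOL convention from the first line break."""
    for i, byte in enumerate(data):
        if byte == 0x0A:
            return "lf"
        if byte == 0x0D:
            return "crlf" if data[i + 1:i + 2] == b"\n" else "cr"
    return None   # no line break seen


print(eol_type_from_name("iso-2022-8-mac"))
print(eol_type_from_name("euc-jp"))
print(detect_eol(b"first\r\nsecond\r\n"))
```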
In the following table, each coding system is given with its mode line indicator in parentheses. Non-textual coding systems are listed first, followed by textual coding systems and their aliases. (The coding system subsidiary modeline indicators ":T" and ":t" will be omitted from the table of coding systems.)
### SJT 1999-08-23 Maybe should order these by language? Definitely need language usage for the ISO-8859 family.
Note that although true coding system aliases have been implemented for XEmacs 21.2, the coding system initialization has not yet been converted as of 21.2.19. So coding systems described as aliases have the same properties as the aliased coding system, but will not be equal as Lisp objects.
automatic-conversion
undecided
undecided-dos
undecided-mac
undecided-unix
Modeline indicator: Auto. A type undecided coding system.
Attempts to determine an appropriate coding system from file contents or
the environment.
raw-text
no-conversion
raw-text-dos
raw-text-mac
raw-text-unix
no-conversion-dos
no-conversion-mac
no-conversion-unix
Modeline indicator: Raw. A type no-conversion coding system,
which converts only line-break-codes. An implementation quirk means
that this coding system is also used for ISO8859-1.
binary
Modeline indicator: Binary. A type no-conversion coding
system which does no character coding or EOL conversions. An alias for
raw-text-unix.
alternativnyj
alternativnyj-dos
alternativnyj-mac
alternativnyj-unix
Modeline indicator: Cy.Alt. A type ccl coding system used for
Alternativnyj, an encoding of the Cyrillic alphabet.
big5
big5-dos
big5-mac
big5-unix
Modeline indicator: Zh/Big5. A type big5 coding system used for
BIG5, the most common encoding of traditional Chinese as used in Taiwan.
cn-gb-2312
cn-gb-2312-dos
cn-gb-2312-mac
cn-gb-2312-unix
Modeline indicator: Zh-GB/EUC. A type iso2022 coding system used
for simplified Chinese (as used in the People's Republic of China), with
the ascii (G0), chinese-gb2312 (G1), and sisheng
(G2) character sets initially designated. Chinese EUC (Extended Unix
Code).
ctext-hebrew
ctext-hebrew-dos
ctext-hebrew-mac
ctext-hebrew-unix
Modeline indicator: CText/Hbrw. A type iso2022 coding system
with the ascii (G0) and hebrew-iso8859-8 (G1) character
sets initially designated for Hebrew.
ctext
ctext-dos
ctext-mac
ctext-unix
Modeline indicator: CText. A type iso2022 8-bit coding system
with the ascii (G0) and latin-iso8859-1 (G1) character
sets initially designated. X11 Compound Text Encoding. Often
mistakenly recognized instead of EUC encodings; usual cause is
inappropriate setting of coding-priority-list.
escape-quoted
Modeline indicator: ESC/Quot. A type iso2022 8-bit coding
system with the ascii (G0) and latin-iso8859-1 (G1)
character sets initially designated and escape quoting. Unix EOL
conversion (ie, no conversion). It is used for .ELC files.
euc-jp
euc-jp-dos
euc-jp-mac
euc-jp-unix
Modeline indicator: Ja/EUC. A type iso2022 8-bit coding system
with ascii (G0), japanese-jisx0208 (G1),
katakana-jisx0201 (G2), and japanese-jisx0212 (G3)
initially designated. Japanese EUC (Extended Unix Code).
euc-kr
euc-kr-dos
euc-kr-mac
euc-kr-unix
Modeline indicator: ko/EUC. A type iso2022 8-bit coding system
with ascii (G0) and korean-ksc5601 (G1) initially
designated. Korean EUC (Extended Unix Code).
hz-gb-2312
Modeline indicator: Zh-GB/Hz. A type no-conversion coding
system with Unix EOL convention (ie, no conversion) using
post-read-decode and pre-write-encode functions to translate the Hz/ZW
coding system used for Chinese.
iso-2022-7bit
iso-2022-7bit-unix
iso-2022-7bit-dos
iso-2022-7bit-mac
iso-2022-7
Modeline indicator: ISO7. A type iso2022 7-bit coding system
with ascii (G0) initially designated. Other character sets must
be explicitly designated to be used.
iso-2022-7bit-ss2
iso-2022-7bit-ss2-dos
iso-2022-7bit-ss2-mac
iso-2022-7bit-ss2-unix
Modeline indicator: ISO7/SS. A type iso2022 7-bit coding system
with ascii (G0) initially designated. Other character sets must
be explicitly designated to be used. SS2 is used to invoke a
96-charset, one character at a time.
iso-2022-8
iso-2022-8-dos
iso-2022-8-mac
iso-2022-8-unix
Modeline indicator: ISO8. A type iso2022 8-bit coding system
with ascii (G0) and latin-iso8859-1 (G1) initially
designated. Other character sets must be explicitly designated to be
used. No single-shift or locking-shift.
iso-2022-8bit-ss2
iso-2022-8bit-ss2-dos
iso-2022-8bit-ss2-mac
iso-2022-8bit-ss2-unix
Modeline indicator: ISO8/SS. A type iso2022 8-bit coding system
with ascii (G0) and latin-iso8859-1 (G1) initially
designated. Other character sets must be explicitly designated to be
used. SS2 is used to invoke a 96-charset, one character at a time.
iso-2022-int-1
iso-2022-int-1-dos
iso-2022-int-1-mac
iso-2022-int-1-unix
Modeline indicator: INT-1. A type iso2022 7-bit coding system
with ascii (G0) and korean-ksc5601 (G1) initially
designated. ISO-2022-INT-1.
iso-2022-jp-1978-irv
iso-2022-jp-1978-irv-dos
iso-2022-jp-1978-irv-mac
iso-2022-jp-1978-irv-unix
Modeline indicator: Ja-78/7bit. A type iso2022 7-bit coding
system. For compatibility with old Japanese terminals; if you need to
know, look at the source.
iso-2022-jp
iso-2022-jp-2 (ISO7/SS)
iso-2022-jp-dos
iso-2022-jp-mac
iso-2022-jp-unix
iso-2022-jp-2-dos
iso-2022-jp-2-mac
iso-2022-jp-2-unix
Modeline indicator: MULE/7bit. A type iso2022 7-bit coding
system with ascii (G0) initially designated, and complex
specifications to ensure backward compatibility with old Japanese
systems. Used for communication with mail and news in Japan. The "-2"
versions also use SS2 to invoke a 96-charset one character at a time.
iso-2022-kr
Modeline indicator: Ko/7bit. A type iso2022 7-bit coding
system with ascii (G0) and korean-ksc5601 (G1) initially
designated. Used for e-mail in Korea.
iso-2022-lock
iso-2022-lock-dos
iso-2022-lock-mac
iso-2022-lock-unix
Modeline indicator: ISO7/Lock. A type iso2022 7-bit coding
system with ascii (G0) initially designated, using Locking-Shift
to invoke a 96-charset.
iso-8859-1
iso-8859-1-dos
iso-8859-1-mac
iso-8859-1-unix
Due to implementation, this is not a type iso2022 coding system,
but rather an alias for the raw-text coding system.
iso-8859-2
iso-8859-2-dos
iso-8859-2-mac
iso-8859-2-unix
Modeline indicator: MIME/Ltn-2. A type iso2022 coding
system with ascii (G0) and latin-iso8859-2 (G1) initially
invoked.
iso-8859-3
iso-8859-3-dos
iso-8859-3-mac
iso-8859-3-unix
Modeline indicator: MIME/Ltn-3. A type iso2022 coding system
with ascii (G0) and latin-iso8859-3 (G1) initially
invoked.
iso-8859-4
iso-8859-4-dos
iso-8859-4-mac
iso-8859-4-unix
Modeline indicator: MIME/Ltn-4. A type iso2022 coding system
with ascii (G0) and latin-iso8859-4 (G1) initially
invoked.
iso-8859-5
iso-8859-5-dos
iso-8859-5-mac
iso-8859-5-unix
Modeline indicator: ISO8/Cyr. A type iso2022 coding system with
ascii (G0) and cyrillic-iso8859-5 (G1) initially invoked.
iso-8859-7
iso-8859-7-dos
iso-8859-7-mac
iso-8859-7-unix
Modeline indicator: Grk. A type iso2022 coding system with
ascii (G0) and greek-iso8859-7 (G1) initially invoked.
iso-8859-8
iso-8859-8-dos
iso-8859-8-mac
iso-8859-8-unix
Modeline indicator: MIME/Hbrw. A type iso2022 coding system with
ascii (G0) and hebrew-iso8859-8 (G1) initially invoked.
iso-8859-9
iso-8859-9-dos
iso-8859-9-mac
iso-8859-9-unix
Modeline indicator: MIME/Ltn-5. A type iso2022 coding system
with ascii (G0) and latin-iso8859-9 (G1) initially
invoked.
koi8-r
koi8-r-dos
koi8-r-mac
koi8-r-unix
Modeline indicator: KOI8. A type ccl coding-system used for
KOI8-R, an encoding of the Cyrillic alphabet.
shift_jis
shift_jis-dos
shift_jis-mac
shift_jis-unix
Modeline indicator: Ja/SJIS. A type shift-jis coding-system
implementing the Shift-JIS encoding for Japanese. The underscore is to
conform to the MIME charset implementing this encoding.
tis-620
tis-620-dos
tis-620-mac
tis-620-unix
Modeline indicator: TIS620. A type ccl encoding for Thai. The
external encoding is defined by TIS620, the internal encoding is
peculiar to MULE, and called thai-xtis.
viqr
Modeline indicator: VIQR. A type no-conversion coding
system with Unix EOL convention (ie, no conversion) using
post-read-decode and pre-write-encode functions to translate the VIQR
coding system for Vietnamese.
viscii
viscii-dos
viscii-mac
viscii-unix
Modeline indicator: VISCII. A type ccl coding-system used
for VISCII 1.1 for Vietnamese. Differs slightly from VSCII; VISCII is
given priority by XEmacs.
vscii
vscii-dos
vscii-mac
vscii-unix
Modeline indicator: VSCII. A type ccl coding-system used
for VSCII 1.1 for Vietnamese. Differs slightly from VISCII, which is
given priority by XEmacs. Use
(prefer-coding-system 'vietnamese-vscii) to give priority to VSCII.
CCL (Code Conversion Language) is a simple structured programming
language designed for character coding conversions. A CCL program is
compiled to CCL code (represented by a vector of integers) and executed
by the CCL interpreter embedded in Emacs. The CCL interpreter
implements a virtual machine with 8 registers called r0, ...,
r7, a number of control structures, and some I/O operators. Take
care when using registers r0 (used in implicit set
statements) and especially r7 (used internally by several
statements and operations, especially for multiple return values and I/O
operations).
CCL is used for code conversion during process I/O and file I/O for non-ISO2022 coding systems. (It is the only way for a user to specify a code conversion function.) It is also used for calculating the code point of an X11 font from a character code. However, since CCL is designed as a powerful programming language, it can be used for more generic calculation where efficiency is demanded. A combination of three or more arithmetic operations can be calculated faster by CCL than by Emacs Lisp.
Warning: The code in `src/mule-ccl.c' and `$packages/lisp/mule-base/mule-ccl.el' is the definitive description of CCL's semantics. The previous version of this section contained several typos and obsolete names left from earlier versions of MULE, and many may remain. (I am not an experienced CCL programmer; the few who know CCL well find writing English painful.)
A CCL program transforms an input data stream into an output data stream. The input stream, held in a buffer of constant bytes, is left unchanged. The buffer may be filled by an external input operation, taken from an Emacs buffer, or taken from a Lisp string. The output buffer is a dynamic array of bytes, which can be written by an external output operation, inserted into an Emacs buffer, or returned as a Lisp string.
A CCL program is a (Lisp) list containing two or three members. The first member is the buffer magnification, which indicates the required minimum size of the output buffer as a multiple of the input buffer. It is followed by the main block which executes while there is input remaining, and an optional EOF block which is executed when the input is exhausted. Both the main block and the EOF block are CCL blocks.
A CCL block is either a CCL statement or a list of CCL statements. A CCL statement is either a set statement (either an integer or an assignment, which is a list of a register to receive the assignment, an assignment operator, and an expression) or a control statement (a list starting with a keyword, whose allowable syntax depends on the keyword).
| 63.7.1 CCL Syntax | CCL program syntax in BNF notation. | |
| 63.7.2 CCL Statements | Semantics of CCL statements. | |
| 63.7.3 CCL Expressions | Operators and expressions in CCL. | |
| 63.7.4 Calling CCL | Running CCL programs. | |
| 63.7.5 CCL Example | A trivial program to transform the Web's URL encoding. |
The full syntax of a CCL program in BNF notation:
CCL_PROGRAM :=
        (BUFFER_MAGNIFICATION
         CCL_MAIN_BLOCK
         [ CCL_EOF_BLOCK ])
BUFFER_MAGNIFICATION := integer
CCL_MAIN_BLOCK := CCL_BLOCK
CCL_EOF_BLOCK := CCL_BLOCK
CCL_BLOCK :=
        STATEMENT | (STATEMENT [STATEMENT ...])
STATEMENT :=
        SET | IF | BRANCH | LOOP | REPEAT | BREAK | READ | WRITE | CALL
        | TRANSLATE | MAP | END
SET :=
        (REG = EXPRESSION)
        | (REG ASSIGNMENT_OPERATOR EXPRESSION)
        | INT-OR-CHAR
EXPRESSION := ARG | (EXPRESSION OPERATOR ARG)
IF := (if EXPRESSION CCL_BLOCK [CCL_BLOCK])
BRANCH := (branch EXPRESSION CCL_BLOCK [CCL_BLOCK ...])
LOOP := (loop STATEMENT [STATEMENT ...])
BREAK := (break)
REPEAT :=
        (repeat)
        | (write-repeat [REG | INT-OR-CHAR | string])
        | (write-read-repeat REG [INT-OR-CHAR | ARRAY])
READ :=
        (read REG ...)
        | (read-if (REG OPERATOR ARG) CCL_BLOCK [CCL_BLOCK])
        | (read-branch REG CCL_BLOCK [CCL_BLOCK ...])
WRITE :=
        (write REG ...)
        | (write EXPRESSION)
        | (write INT-OR-CHAR) | (write string) | (write REG ARRAY)
        | string
CALL := (call ccl-program-name)
TRANSLATE := ;; Not implemented under XEmacs, except mule-to-unicode and
             ;; unicode-to-mule.
        (translate-character REG(table) REG(charset) REG(codepoint))
        | (translate-character SYMBOL REG(charset) REG(codepoint))
        | (mule-to-unicode REG(charset) REG(codepoint))
        | (unicode-to-mule REG(unicode,code) REG(CHARSET))
END := (end)
REG := r0 | r1 | r2 | r3 | r4 | r5 | r6 | r7
ARG := REG | INT-OR-CHAR
OPERATOR :=
        + | - | * | / | % | & | '|' | ^ | << | >> | <8 | >8 | //
        | < | > | == | <= | >= | != | de-sjis | en-sjis
ASSIGNMENT_OPERATOR :=
        += | -= | *= | /= | %= | &= | '|=' | ^= | <<= | >>=
ARRAY := '[' INT-OR-CHAR ... ']'
INT-OR-CHAR := integer | character
The Emacs Code Conversion Language provides the following statement types: set, if, branch, loop, repeat, break, read, write, call, translate and end.
The set statement has three variants with the syntaxes
`(reg = expression)',
`(reg assignment_operator expression)', and
`integer'. The assignment operator variation of the
set statement works the same way as the corresponding C expression
statement does. The assignment operators are +=, -=,
*=, /=, %=, &=, |=, ^=,
<<=, and >>=, and they have the same meanings as in C. A
"naked integer" integer is equivalent to a set statement of
the form (r0 = integer).
The read statement takes one or more registers as arguments. It reads one byte (a C char) from the input into each register in turn.
The write statement takes several forms. In the form `(write reg ...)' it takes one or more registers as arguments and writes each in turn to the output. The integer in a register (interpreted as an Ichar) is encoded to multibyte form (i.e., Ibytes) and written to the current output buffer. If it is less than 256, it is written as is. The forms `(write expression)' and `(write integer)' are treated analogously. The form `(write string)' writes the constant string to the output. A "naked string" `string' is equivalent to the statement `(write string)'. The form `(write reg array)' writes the reg-th element of the array to the output.
The if statement takes an expression, a CCL block, and an optional second CCL block as arguments. If the expression evaluates to non-zero, the first CCL block is executed. Otherwise, if there is a second CCL block, it is executed.
The read-if variant of the if statement takes an
expression, a CCL block, and an optional second CCL
block as arguments. The expression must have the form
(reg operator operand) (where operand is
a register or an integer). The read-if statement first reads
from the input into the first register operand in the expression,
then conditionally executes a CCL block just as the if statement
does.
The branch statement takes an expression and one or more CCL
blocks as arguments. The CCL blocks are treated as a zero-indexed
array, and the branch statement uses the expression as the
index of the CCL block to execute. Null CCL blocks may be used as
no-ops, continuing execution with the statement following the
branch statement in the containing CCL block. Out-of-range
values for the expression are also treated as no-ops.
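The dispatch rule of the branch statement can be modelled in Python as follows. This is a sketch of the semantics described above, not of the CCL interpreter: blocks are represented as callables, a null block as `None`, and the name `ccl_branch` is hypothetical. Treating negative values as out of range is an assumption.

```python
def ccl_branch(index, blocks):
    """Run blocks[index] if it exists; otherwise fall through as a no-op."""
    if 0 <= index < len(blocks) and blocks[index] is not None:
        blocks[index]()
    # Null blocks and out-of-range indexes do nothing; execution simply
    # continues with whatever follows the branch.


out = []
blocks = [lambda: out.append("zero"), None, lambda: out.append("two")]
for value in (0, 1, 2, 7, -1):
    ccl_branch(value, blocks)
print(out)
```

Only the indexes 0 and 2 select a non-null block, so only those two calls produce output.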
The read-branch variant of the branch statement takes a
register and one or more CCL blocks as arguments. The read-branch
statement first reads from the input into the register, then uses the
value read as the index of the CCL block to execute, just as the
branch statement does.
The loop statement creates a block with an implied jump from the
end of the block back to its head. The loop is exited on a break
statement, and continued without executing the tail by a repeat
statement.
The break statement, written `(break)', terminates the current loop and continues with the next statement in the current block.
The repeat statement has three variants, repeat,
write-repeat, and write-read-repeat. Each continues the
current loop from its head, possibly after performing I/O.
repeat takes no arguments and does no I/O before jumping.
write-repeat takes a single argument (a register, an
integer, or a string), writes it to the output, then jumps.
write-read-repeat takes one or two arguments. The first must
be a register. The second may be an integer or an array; if absent, it
is implicitly set to the first (register) argument.
write-read-repeat writes its second argument to the output, then
reads from the input into the register, and finally jumps. See the
write and read statements for the semantics of the I/O
operations for each type of argument.
The call statement, written `(call ccl-program-name)', executes a CCL program as a subroutine. It does not return a value to the caller, but can modify the register status.
The mule-to-unicode statement translates an XEmacs character into a UCS code point, using U+FFFD REPLACEMENT CHARACTER if the given XEmacs character has no known corresponding code point. It takes two arguments. The first is a register that holds the character set ID of the character to be translated, and into which the UCS code is stored. The second is a register that holds the XEmacs code of the character in question; if the character is from a multidimensional character set, like most of the East Asian national sets, the code is stored as `((c1 << 8) | c2)', where `c1' is the first position code and `c2' the second. (That is, as a single integer, the high-order eight bits of which encode the first position code, and the low-order bits of which encode the second.)
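The packed form described in the parenthetical (first position code in the high-order eight bits, second in the low-order bits) can be sketched in Python. The helper names `pack_code` and `unpack_code` are hypothetical, and the sketch assumes both position codes fit in eight bits.

```python
def pack_code(c1: int, c2: int) -> int:
    """Combine two position codes into one integer, c1 in the high byte."""
    return (c1 << 8) | c2


def unpack_code(code: int):
    """Split a packed code back into its two position codes."""
    return code >> 8, code & 0xFF


packed = pack_code(0x46, 0x7C)
print(hex(packed), unpack_code(packed))
```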
The unicode-to-mule statement translates a Unicode code point (an integer) into an XEmacs character. Its first argument is a register containing the UCS code point; the code for the corresponding character will be written into this register, in the same format as for `mule-to-unicode'. The second argument is a register into which will be written the character set ID of the converted character.
The end statement, written `(end)', terminates the CCL program successfully, and returns to the caller (which may be a CCL program). It does not alter the status of the registers.
CCL, unlike Lisp, uses infix expressions. The simplest CCL expressions
consist of a single operand, either a register (one of r0,
..., r7) or an integer. Complex expressions are lists of the
form ( expression operator operand ). Unlike
C, assignments are not expressions.
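The left-nested shape `(expression operator operand)` is easy to evaluate recursively. The Python sketch below models CCL expressions as nested tuples and registers as a dictionary; both representations and the name `ccl_eval` are assumptions for illustration. Only a subset of the operators is included, and division is left out because Python and C differ on negative operands.

```python
import operator

OPS = {
    "+": operator.add, "-": operator.sub, "*": operator.mul,
    "&": operator.and_, "|": operator.or_, "^": operator.xor,
    "<<": operator.lshift, ">>": operator.rshift,
    "<8": lambda y, z: (y << 8) | z,
}


def ccl_eval(expr, regs):
    """Evaluate a nested (expression, operator, operand) tuple."""
    if isinstance(expr, int):
        return expr                 # integer operand
    if isinstance(expr, str):
        return regs[expr]           # register name such as "r1"
    left, op, right = expr
    return OPS[op](ccl_eval(left, regs), ccl_eval(right, regs))


regs = {"r0": 3, "r1": 0x46, "r2": 0x7C}
value = ccl_eval((("r1", "<8", "r2"), "+", "r0"), regs)
print(hex(value))
```

The expression corresponds to the CCL form `((r1 <8 r2) + r0)`: the inner list is evaluated first and its result becomes the left operand of the outer operator.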
In the following table, X is the target register for a set.
In subexpressions, this is implicitly r7. This means that
>8, //, de-sjis, and en-sjis cannot be used
freely in subexpressions, since they return parts of their values in
r7. Y may be an expression, register, or integer, while
Z must be a register or an integer.
| Name | Operator | Code | C-like Description |
| CCL_PLUS | + | 0x00 | X = Y + Z |
| CCL_MINUS | - | 0x01 | X = Y - Z |
| CCL_MUL | * | 0x02 | X = Y * Z |
| CCL_DIV | / | 0x03 | X = Y / Z |
| CCL_MOD | % | 0x04 | X = Y % Z |
| CCL_AND | & | 0x05 | X = Y & Z |
| CCL_OR | | | 0x06 | X = Y | Z |
| CCL_XOR | ^ | 0x07 | X = Y ^ Z |
| CCL_LSH | << | 0x08 | X = Y << Z |
| CCL_RSH | >> | 0x09 | X = Y >> Z |
| CCL_LSH8 | <8 | 0x0A | X = (Y << 8) | Z |
| CCL_RSH8 | >8 | 0x0B | X = Y >> 8, r[7] = Y & 0xFF |
| CCL_DIVMOD | // | 0x0C | X = Y / Z, r[7] = Y % Z |
| CCL_LS | < | 0x10 | X = (Y < Z) |
| CCL_GT | > | 0x11 | X = (Y > Z) |
| CCL_EQ | == | 0x12 | X = (Y == Z) |
| CCL_LE | <= | 0x13 | X = (Y <= Z) |
| CCL_GE | >= | 0x14 | X = (Y >= Z) |
| CCL_NE | != | 0x15 | X = (Y != Z) |
| CCL_ENCODE_SJIS | en-sjis | 0x16 | X = HIGHER_BYTE (SJIS (Y, Z)), r[7] = LOWER_BYTE (SJIS (Y, Z)) |
| CCL_DECODE_SJIS | de-sjis | 0x17 | X = HI |