Roger Firth's IF pages

Inform 6: Frequently Asked Questions

Could you explain how character sets are handled?

More information in:
the DM (§1.11, §36 and Table 2)
the Technical Manual (§8.3 and §12.3)
the Z-Machine Standards 1.0 document (§3)
the FAQ topic on Writing a French game

The Z-machine's character handling is one of those topics which is simple until you start thinking about it, whereupon it takes a turn for the trickier (or maybe it's just me). Although most of the necessary information is available, it's scattered across several locations; this is an attempt to pull together the basics in one place. Bear in mind that most of what follows is specific to the Z-machine; Glulx handles characters quite differently (and currently doesn't offer any support for extended character sets). We'll set the scene with a trip down memory lane.

Character encoding

Historically, most computers have stored arbitrary character sequences -- 'strings' -- with one character per byte of storage. An eight-bit byte supports 256 different character codes, more than enough for the letters, digits, punctuation marks and control codes which were common currency among the English-speaking fathers of modern-day computing. Fairly early on, the obvious advantages of standardisation resulted in the American Standard Code for Information Interchange (ASCII), defined in 1968 as ANSI X3.4, and later adopted as ISO 646. This encoding allocated characters in a reasonably logical manner to the first 128 possible values (hex $00..$7F) -- so that for example $21 is an exclamation mark, $31 is the digit '1', $41 is the letter 'A' and $61 is the letter 'a' -- and is still almost universally used today (only IBM's EBCDIC, oriented towards mainframe punched cards, survives as an alternative).

ASCII provides exactly 26 letter forms in upper and lower case, sufficient only to encode text in English, Hawaiian, Latin and Swahili. As the need to handle other languages grew, a wide and incompatible range of alternative allocations was devised for the infrequently-used standard characters '#$@[\]^`{|}~', and for the remaining 128 values (hex $80..$FF) which ASCII didn't specify.

Around the mid-1980s, in an attempt to escape total chaos, ISO 8859 defined a series of nine ways of encoding all 256 byte values. Values $00..$7F (the ASCII encoding) and $80..$9F (additional control codes) are identical in each of the nine encodings; the differences lie in the 96 characters with values $A0..$FF. For example, ISO 8859-1 allocates those values to characters commonly used in Western European text, ISO 8859-4 is appropriate for North European text, ISO 8859-7 is for Greek text, and so on.

More information can be found in the excellent sites of Roman Czyborra (offline?) and Alan Wood, and at unicode.org

Although ISO 8859 was a step in the right direction, it still left an awful lot of languages out in the cold; basically, it just isn't possible to squeeze all of those incompatible character sets into an eight-bit encoding scheme. And so, in 1991 we get to ISO 10646 -- Unicode -- which uses a sixteen-bit encoding (with a twenty-bit extension zone for when that runs out). Sixteen bits support 65,536 different character codes, of which roughly 50,000 have currently been allocated. Conveniently, the first 256 Unicode characters are the same as ISO 8859-1, so that $0041 and $0061 are still 'A' and 'a', while $00C1 and $00E1 remain consistently 'Á' and 'á'.

        

TECHNICAL NOTE: The rest of this page assumes that you're using version 6.30 (or higher) of the Inform compiler. There were significant problems with character set handling in earlier versions of the compiler, and I strongly advise you to upgrade before experimenting with these techniques.

The ZSCII character set

Character strings in the Z-machine use the ZSCII encoding ('Z', unsurprisingly, stands for 'Zork'). ZSCII is an eight-bit character set, leading to a repertoire of 256 character codes.

        

TECHNICAL NOTE: The Z-Machine Standard 1.1 Proposal includes a mechanism to extend this, offering direct access to Unicode strings.

Basic ZSCII

In the ZSCII character set, some of the 256 values are reserved for control purposes and others are unused; these values are shown in grey. The values in the range 32..126 -- shown in white -- are standard encodings which cannot be changed (and are the same as the ASCII/ISO 8859/Unicode characters in that range). The values in the range 155..251 -- shown in yellow -- are 'extra characters' which you can allocate. There are two ways of populating this ZSCII range with characters of your choice: the C switch and the Zcharacter directive.


Allocating extra characters using the C switch

At compile-time, you can choose to populate the yellow range with characters taken from one of the nine ISO 8859 series. You can set the switch either on the command line (-Cn) or as a special comment at the very start of your source file (!% -Cn).
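As a minimal sketch of the source-file form of the switch, the following stand-alone program (no library, just a Main routine) selects ISO 8859-2 from its first line. The greeting text is invented for illustration, and the accented letters are given as Unicode escapes so that the example doesn't depend on how the file itself is encoded:

```inform6
!% -C2
! The header comment above must be the very first thing in the file.
! It asks the compiler to load the ISO 8859-2 letters into ZSCII.

[ Main;
    ! U+0142, U+00F3 and U+017A are among the letters loaded by C2.
    print "Witaj, @{0142}@{00F3}d@{017A}!^";
];
```

Compiling the same file with -C2 on the command line, and without the header comment, should have the same effect.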

ZSCII with C1

If you specify C0 or C1, or if you omit the C switch altogether, then the compiler takes letters and symbols from ISO 8859-1. Not all characters are taken, and their ZSCII order is not the same as in ISO 8859-1. The result is Infocom's version of ZSCII, appropriate for Western European games: English, French, German, Spanish, Swedish, etc. This is what happens by default.


 

ZSCII with C2

If you specify C2, the compiler takes all of the letter characters, in order, from ISO 8859-2. The result is a version of ZSCII appropriate for Eastern European games: Croatian, Hungarian, Polish, etc.


 

ZSCII with C3

If you specify C3, the compiler takes all of the letter characters, in order, from ISO 8859-3. The result is a version of ZSCII appropriate for Southern European games: Maltese, also Esperanto.


 

ZSCII with C4

If you specify C4, the compiler takes all of the letter characters, in order, from ISO 8859-4. The result is a version of ZSCII appropriate for Northern European games: Estonian, Latvian, Lithuanian, etc.


 

ZSCII with C5

If you specify C5, the compiler takes all of the letter characters, in order, from ISO 8859-5. The result is a version of ZSCII appropriate for Cyrillic games: Bulgarian, Russian, Serbian, etc.


 

ZSCII with C6

If you specify C6, the compiler takes all of the letter characters, in order, from ISO 8859-6. The result is a version of ZSCII appropriate for Arabic games.


 

ZSCII with C7

If you specify C7, the compiler takes all of the letter characters, in order, from ISO 8859-7. The result is a version of ZSCII appropriate for Greek games.


 

ZSCII with C8

If you specify C8, the compiler takes all of the letter characters, in order, from ISO 8859-8. The result is a version of ZSCII appropriate for Hebrew games.


 

ZSCII with C9

If you specify C9, the compiler takes all of the letter characters, in order, from ISO 8859-9. The result is a version of ZSCII appropriate for Turkish games.


 

Note that in all nine cases, some of the yellow range remains free. You can allocate additional characters here, as explained next.

Allocating extra characters using the Zcharacter directive

Having first populated the yellow range with a selection of ISO 8859 characters, you can then choose either to supplement those characters with others of your own choosing, or to replace them completely by up to 96 characters which you specify within the source file.

Adding to C1

For example, you could start with the default allocation, and then add the three characters '¢°±', by using this Zcharacter directive at the very start of your source file:

  Zcharacter table + '@{00A2}' '@{00B0}' '@{00B1}';

 

Replacing C1

Alternatively, you could replace the initial allocation by just those three characters, using this (very similar) Zcharacter directive:

  Zcharacter table '@{00A2}' '@{00B0}' '@{00B1}';

Both these Zcharacter directives expect a list of character values, each given in single quotes. The construct @{00A2} specifies a Unicode character value in hexadecimal -- 00A2 is the value for the cent sign '¢', 00B0 is the degree sign '°', and 00B1 is the plus-minus sign '±'.
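To show the supplementing form in context, here is a small stand-alone sketch (no library; the printed sentence is made up for the example). The directive comes first, so that the three extra characters are already in the ZSCII table when the compiler meets them in the string:

```inform6
! Add the cent, degree and plus-minus signs to the default allocation.
Zcharacter table + '@{00A2}' '@{00B0}' '@{00B1}';

[ Main;
    ! Each @{...} escape is accepted only because the directive above
    ! has placed that character in the game's ZSCII set.
    print "Postage 50@{00A2}; oven at 180@{00B0}, @{00B1}5.^";
];
```

Without the directive, the compiler would be expected to reject the string, since none of the three characters is part of the default set.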


 

        

TECHNICAL NOTE: Every Z-code interpreter requires a table to convert ZSCII values back into printable Unicode characters. A default table -- matching the default ZSCII character set -- is built into the interpreter. However, if you supply a switch C2..C9, or if you use either of these Zcharacter directives, then the default table becomes inadequate, and the compiler must include a replacement table within the game file. This is the 'Unicode translation table', whose address is given in the Z-machine Header Extension area (whose address is itself given in the game's header at address HDR_EXTENSION-->0). You can display the table by including Roger Firth's dump.h library extension, and typing DUMP UCODE.

Just for completeness, here are Zcharacter directives which set up tables equivalent to those created by the switches C1-C9:

  Zcharacter table       ! C0/C1 - Latin 1 (West European)
      '@{00E4}' '@{00F6}' '@{00FC}' '@{00C4}' '@{00D6}' '@{00DC}' '@{00DF}' '@{00BB}'
      '@{00AB}' '@{00EB}' '@{00EF}' '@{00FF}' '@{00CB}' '@{00CF}' '@{00E1}' '@{00E9}'
      '@{00ED}' '@{00F3}' '@{00FA}' '@{00FD}' '@{00C1}' '@{00C9}' '@{00CD}' '@{00D3}'
      '@{00DA}' '@{00DD}' '@{00E0}' '@{00E8}' '@{00EC}' '@{00F2}' '@{00F9}' '@{00C0}'
      '@{00C8}' '@{00CC}' '@{00D2}' '@{00D9}' '@{00E2}' '@{00EA}' '@{00EE}' '@{00F4}'
      '@{00FB}' '@{00C2}' '@{00CA}' '@{00CE}' '@{00D4}' '@{00DB}' '@{00E5}' '@{00C5}'
      '@{00F8}' '@{00D8}' '@{00E3}' '@{00F1}' '@{00F5}' '@{00C3}' '@{00D1}' '@{00D5}'
      '@{00E6}' '@{00C6}' '@{00E7}' '@{00C7}' '@{00FE}' '@{00F0}' '@{00DE}' '@{00D0}'
      '@{00A3}' '@{0153}' '@{0152}' '@{00A1}' '@{00BF}';
  Zcharacter table       ! C2 - Latin 2 (East European)
      '@{0104}' '@{0141}' '@{013D}' '@{015A}' '@{0160}' '@{015E}' '@{0164}' '@{0179}'
      '@{017D}' '@{017B}' '@{0154}' '@{00C1}' '@{00C2}' '@{0102}' '@{00C4}' '@{0139}'
      '@{0106}' '@{00C7}' '@{010C}' '@{00C9}' '@{0118}' '@{00CB}' '@{011A}' '@{00CD}'
      '@{00CE}' '@{010E}' '@{0110}' '@{0143}' '@{0147}' '@{00D3}' '@{00D4}' '@{0150}'
      '@{00D6}' '@{0158}' '@{016E}' '@{00DA}' '@{0170}' '@{00DC}' '@{00DD}' '@{0162}'
      '@{0105}' '@{0142}' '@{013E}' '@{015B}' '@{0161}' '@{015F}' '@{0165}' '@{017A}'
      '@{017E}' '@{017C}' '@{00DF}' '@{0155}' '@{00E1}' '@{00E2}' '@{0103}' '@{00E4}'
      '@{013A}' '@{0107}' '@{00E7}' '@{010D}' '@{00E9}' '@{0119}' '@{00EB}' '@{011B}'
      '@{00ED}' '@{00EE}' '@{010F}' '@{0111}' '@{0144}' '@{0148}' '@{00F3}' '@{00F4}'
      '@{0151}' '@{00F6}' '@{0159}' '@{016F}' '@{00FA}' '@{0171}' '@{00FC}' '@{00FD}'
      '@{0163}';
  Zcharacter table       ! C3 - Latin 3 (South European)
      '@{0126}' '@{0124}' '@{0130}' '@{015E}' '@{011E}' '@{0134}' '@{017B}' '@{0127}'
      '@{0125}' '@{0131}' '@{015F}' '@{011F}' '@{0135}' '@{017C}' '@{00C0}' '@{00C1}'
      '@{00C2}' '@{00C4}' '@{010A}' '@{0108}' '@{00C7}' '@{00C8}' '@{00C9}' '@{00CA}'
      '@{00CB}' '@{00CC}' '@{00CD}' '@{00CE}' '@{00CF}' '@{00D1}' '@{00D2}' '@{00D3}'
      '@{00D4}' '@{0120}' '@{00D6}' '@{011C}' '@{00D9}' '@{00DA}' '@{00DB}' '@{00DC}'
      '@{016C}' '@{015C}' '@{00DF}' '@{00E0}' '@{00E1}' '@{00E2}' '@{00E4}' '@{010B}'
      '@{0109}' '@{00E7}' '@{00E8}' '@{00E9}' '@{00EA}' '@{00EB}' '@{00EC}' '@{00ED}'
      '@{00EE}' '@{00EF}' '@{00F1}' '@{00F2}' '@{00F3}' '@{00F4}' '@{0121}' '@{00F6}'
      '@{011D}' '@{00F9}' '@{00FA}' '@{00FB}' '@{00FC}' '@{016D}' '@{015D}';
  Zcharacter table       ! C4 - Latin 4 (North European)
      '@{0104}' '@{0138}' '@{0156}' '@{0128}' '@{013B}' '@{0160}' '@{0112}' '@{0122}'
      '@{0166}' '@{017D}' '@{0105}' '@{0157}' '@{0129}' '@{013C}' '@{0161}' '@{0113}'
      '@{0123}' '@{0167}' '@{014A}' '@{017E}' '@{014B}' '@{0100}' '@{00C1}' '@{00C2}'
      '@{00C3}' '@{00C4}' '@{00C5}' '@{00C6}' '@{012E}' '@{010C}' '@{00C9}' '@{0118}'
      '@{00CB}' '@{0116}' '@{00CD}' '@{00CE}' '@{012A}' '@{0110}' '@{0145}' '@{014C}'
      '@{0136}' '@{00D4}' '@{00D5}' '@{00D6}' '@{00D8}' '@{0172}' '@{00DA}' '@{00DB}'
      '@{00DC}' '@{0168}' '@{016A}' '@{00DF}' '@{0101}' '@{00E1}' '@{00E2}' '@{00E3}'
      '@{00E4}' '@{00E5}' '@{00E6}' '@{012F}' '@{010D}' '@{00E9}' '@{0119}' '@{00EB}'
      '@{0117}' '@{00ED}' '@{00EE}' '@{012B}' '@{0111}' '@{0146}' '@{014D}' '@{0137}'
      '@{00F4}' '@{00F5}' '@{00F6}' '@{00F8}' '@{0173}' '@{00FA}' '@{00FB}' '@{00FC}'
      '@{0169}' '@{016B}';
  Zcharacter table       ! C5 - Cyrillic
      '@{0401}' '@{0402}' '@{0403}' '@{0404}' '@{0405}' '@{0406}' '@{0407}' '@{0408}'
      '@{0409}' '@{040A}' '@{040B}' '@{040C}' '@{040E}' '@{040F}' '@{0410}' '@{0411}'
      '@{0412}' '@{0413}' '@{0414}' '@{0415}' '@{0416}' '@{0417}' '@{0418}' '@{0419}'
      '@{041A}' '@{041B}' '@{041C}' '@{041D}' '@{041E}' '@{041F}' '@{0420}' '@{0421}'
      '@{0422}' '@{0423}' '@{0424}' '@{0425}' '@{0426}' '@{0427}' '@{0428}' '@{0429}'
      '@{042A}' '@{042B}' '@{042C}' '@{042D}' '@{042E}' '@{042F}' '@{0430}' '@{0431}'
      '@{0432}' '@{0433}' '@{0434}' '@{0435}' '@{0436}' '@{0437}' '@{0438}' '@{0439}'
      '@{043A}' '@{043B}' '@{043C}' '@{043D}' '@{043E}' '@{043F}' '@{0440}' '@{0441}'
      '@{0442}' '@{0443}' '@{0444}' '@{0445}' '@{0446}' '@{0447}' '@{0448}' '@{0449}'
      '@{044A}' '@{044B}' '@{044C}' '@{044D}' '@{044E}' '@{044F}' '@{0451}' '@{0452}'
      '@{0453}' '@{0454}' '@{0455}' '@{0456}' '@{0457}' '@{0458}' '@{0459}' '@{045A}'
      '@{045B}' '@{045C}' '@{045E}' '@{045F}';
  Zcharacter table       ! C6 - Arabic
      '@{060C}' '@{061B}' '@{061F}' '@{0621}' '@{0622}' '@{0623}' '@{0624}' '@{0625}'
      '@{0626}' '@{0627}' '@{0628}' '@{0629}' '@{062A}' '@{062B}' '@{062C}' '@{062D}'
      '@{062E}' '@{062F}' '@{0630}' '@{0631}' '@{0632}' '@{0633}' '@{0634}' '@{0635}'
      '@{0636}' '@{0637}' '@{0638}' '@{0639}' '@{063A}' '@{0640}' '@{0641}' '@{0642}'
      '@{0643}' '@{0644}' '@{0645}' '@{0646}' '@{0647}' '@{0648}' '@{0649}' '@{064A}'
      '@{064B}' '@{064C}' '@{064D}' '@{064E}' '@{064F}' '@{0650}' '@{0651}' '@{0652}';
  Zcharacter table       ! C7 - Greek
      '@{0384}' '@{0385}' '@{0386}' '@{0388}' '@{0389}' '@{038A}' '@{038C}' '@{038E}'
      '@{038F}' '@{0390}' '@{0391}' '@{0392}' '@{0393}' '@{0394}' '@{0395}' '@{0396}'
      '@{0397}' '@{0398}' '@{0399}' '@{039A}' '@{039B}' '@{039C}' '@{039D}' '@{039E}'
      '@{039F}' '@{03A0}' '@{03A1}' '@{03A3}' '@{03A4}' '@{03A5}' '@{03A6}' '@{03A7}'
      '@{03A8}' '@{03A9}' '@{03AA}' '@{03AB}' '@{03AC}' '@{03AD}' '@{03AE}' '@{03AF}'
      '@{03B0}' '@{03B1}' '@{03B2}' '@{03B3}' '@{03B4}' '@{03B5}' '@{03B6}' '@{03B7}'
      '@{03B8}' '@{03B9}' '@{03BA}' '@{03BB}' '@{03BC}' '@{03BD}' '@{03BE}' '@{03BF}'
      '@{03C0}' '@{03C1}' '@{03C2}' '@{03C3}' '@{03C4}' '@{03C5}' '@{03C6}' '@{03C7}'
      '@{03C8}' '@{03C9}' '@{03CA}' '@{03CB}' '@{03CC}' '@{03CD}' '@{03CE}';
  Zcharacter table       ! C8 - Hebrew
      '@{05D0}' '@{05D1}' '@{05D2}' '@{05D3}' '@{05D4}' '@{05D5}' '@{05D6}' '@{05D7}'
      '@{05D8}' '@{05D9}' '@{05DA}' '@{05DB}' '@{05DC}' '@{05DD}' '@{05DE}' '@{05DF}'
      '@{05E0}' '@{05E1}' '@{05E2}' '@{05E3}' '@{05E4}' '@{05E5}' '@{05E6}' '@{05E7}'
      '@{05E8}' '@{05E9}' '@{05EA}';
  Zcharacter table       ! C9 - Latin 5 (Turkish)
      '@{00C0}' '@{00C1}' '@{00C2}' '@{00C3}' '@{00C4}' '@{00C5}' '@{00C6}' '@{00C7}'
      '@{00C8}' '@{00C9}' '@{00CA}' '@{00CB}' '@{00CC}' '@{00CD}' '@{00CE}' '@{00CF}'
      '@{011E}' '@{00D1}' '@{00D2}' '@{00D3}' '@{00D4}' '@{00D5}' '@{00D6}' '@{00D8}'
      '@{00D9}' '@{00DA}' '@{00DB}' '@{00DC}' '@{0130}' '@{015E}' '@{00DF}' '@{00E0}'
      '@{00E1}' '@{00E2}' '@{00E3}' '@{00E4}' '@{00E5}' '@{00E6}' '@{00E7}' '@{00E8}'
      '@{00E9}' '@{00EA}' '@{00EB}' '@{00EC}' '@{00ED}' '@{00EE}' '@{00EF}' '@{011F}'
      '@{00F1}' '@{00F2}' '@{00F3}' '@{00F4}' '@{00F5}' '@{00F6}' '@{00F8}' '@{00F9}'
      '@{00FA}' '@{00FB}' '@{00FC}' '@{0131}' '@{015F}' '@{00FF}';

Reading your source file

More information about ISO 8859 and Microsoft's code pages can be found here

As well as populating the ZSCII table, the Cn switches also perform another function; they define the character set which applies to your source file. For example, if you specify C8, the compiler expects to find a mixture of English text (Inform words like Constant, if, description and Initialise) and Hebrew text (the strings which can be displayed and the dictionary words which can be typed while the game is being played), encoded using ISO 8859-8. Remember we said that values $20..$7F (English letters, digits and punctuation) are identical in each of the nine encodings; the international differences lie in the 96 characters with values $A0..$FF.

This can cause problems if operating systems don't stick to the ISO 8859 encodings. For example, while the Windows Latin 1 code page CP1252 is effectively identical to ISO 8859-1 (which is why games can be written in English, French, German and so on without great difficulty), this doesn't always hold true for other code pages. Thus, the Windows Central European CP1250 is significantly different from ISO 8859-2, and the high values in Windows Cyrillic CP1251 don't match ISO 8859-5 at any point; there may also be discrepancies with other code pages.

If you encounter the problem, you can get round it by creating a mapping file which transforms the character set used by your source file into the ISO 8859 character set which the compiler is expecting. Here are two we prepared earlier: win1250.map for use with C2:

  ! Windows Central Europe (code page 1250) to ISO 8859-2
  C2
    0, 63, 63, 63, 63, 63, 63, 63, 63, 32, 10, 63, 10, 10, 63, 63
   63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63
   32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
   48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63
   64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79
   80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95
   96, 97, 98, 99,100,101,102,103,104,105,106,107,108,109,110,111
  112,113,114,115,116,117,118,119,120,121,122,123,124,125,126, 63
   63,129, 44,131, 34, 46, 63, 63,136, 63,169, 60,166,171,174,172
  144, 39, 39, 34, 34, 46, 45, 45,152, 84,185, 62,182,187,190,188
   32,183,162,163,164,161,124,167,168, 67,170, 60, 63, 45, 82,175
  176, 63,178,179,180, 63, 63, 46,184,177,186, 62,165,189,181,191
  192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207
  208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223
  224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239
  240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255

and win1251.map for use with C5:

  ! Windows Cyrillic (code page 1251) to ISO 8859-5
  C5
    0, 63, 63, 63, 63, 63, 63, 63, 63, 32, 10, 63, 10, 10, 63, 63
   63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63
   32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47
   48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63
   64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79
   80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95
   96, 97, 98, 99,100,101,102,103,104,105,106,107,108,109,110,111
  112,113,114,115,116,117,118,119,120,121,122,123,124,125,126, 63
  162,163, 44,243, 34, 46, 63, 63, 63, 63,169, 60,170,172,171,175
  242, 39, 39, 34, 34, 46, 45, 45,152, 84,249, 62,250,252,251,255
   32,174,254,168, 36, 63,124,253,161, 67,164, 60, 63, 45, 82,167
   63, 63,166,246, 63, 63, 63, 46,241,240,244, 62,248,165,245,247
  176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191
  192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207
  208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223
  224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239

Lines starting with "!" are treated as comments. The line beginning with "C" defines the ISO set to map to, and means that you don't then need to provide a -Cn command line switch. To use the mapping, Inform treats each character in the source file as a number between 0 and 255, and uses that number as an index into the mapping table. For example, suppose that the character read in from a Russian Windows source file is the Cyrillic small letter "yu", represented in CP1251 by the number 254. Inform takes that entry in the mapping, which is 238 (highlighted in the win1251.map table above); therefore the "yu" character is regarded as being 238 in ISO 8859-5. Contrast this with a Polish source file, where the character in CP1250 represented by the number 254 -- a small "t" with a cedilla -- is also at position 254 in ISO 8859-2, so the character number remains unchanged.
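The lookup itself happens inside the compiler, but the arithmetic can be imitated in a few lines of Inform. This purely illustrative sketch copies the final row of win1251.map (the entries for source values 240..255) into a byte array whose name is invented here, then indexes it with the CP1251 code for "yu":

```inform6
! Last row of win1251.map: the entries for source values 240..255.
Array cp1251_row_F0 -> 224 225 226 227 228 229 230 231
                       232 233 234 235 236 237 238 239;

[ Main c;
    c = 254;                              ! CP1251 code for Cyrillic small "yu"
    ! Subtract 240 to get the offset within this row (here 14).
    print cp1251_row_F0->(c - 240), "^";  ! prints 238, the ISO 8859-5 code
];
```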

The name of the mapping file is specified by a compiler path variable +charset_map. For example, a Russian game could be compiled with a command line of the form:

  inform +charset_map=win1251.map +language_name=Russian mygame.inf

Printing ZSCII characters

Let's suppose that you've compiled your game without supplying any C switch, and that you've used the Zcharacter directive to add the three characters '¢°±'; these are now in the ZSCII table at positions 224..226, as illustrated in the previous section. You can output them in six ways:

1. Print a string which includes the literal character values:

  print "¢°±";

2. Print individual character constants using the literal values:

  print (char) '¢', (char) '°', (char) '±';

The advantage of forms 1 and 2 is readability. The disadvantage is their lack of source portability; the game compiled on the PC will run differently from the same game compiled on the Mac, because the internal character sets of those two machines are not the same. (However, the game compiled on the PC will run the same on both PC and Mac, and vice versa; Inform game files are always portable.)

3. Print a string which includes the Unicode character values:

  print "@{00A2}@{00B0}@{00B1}";

4. Print individual character constants using the Unicode values:

  print (char) '@{00A2}',
    (char) '@{00B0}', (char) '@{00B1}';

The advantage of forms 3 and 4 is that your source code remains portable -- you can for example write the game on a PC, then copy the source to a Mac. The game will compile identically in both environments.

5. Print a string which includes the ZSCII character values:

  print "@@224@@225@@226";

6. Print individual character constants using the ZSCII values:

  print (char) 224, (char) 225, (char) 226;

The disadvantage of forms 5 and 6 is their dependence on the physical ordering of the ZSCII table. Since your additional characters are appended to those taken from ISO 8859, their position varies depending on which C switch is used.

Note that forms 1-4 won't compile if you mention a character -- literally or by Unicode value -- which doesn't exist in the game's ZSCII character set. Forms 5 and 6 will compile regardless of whether or not any character has been allocated to the ZSCII value; if there isn't a character at that position, the interpreter prints a question mark.

Remember that you can use constructs like @^a to represent 'â', and like @ss to represent 'ß'. The good news is that these constructs are independent of the C switch setting; if 'ß' appears somewhere in the ZSCII table, then @ss will represent it. The bad news is that these constructs aren't extensible. For example, one of the characters loaded by switch C3 is 'j' with a circumflex, but you can't use @^j to represent it.

        

TECHNICAL NOTE: Due to related bugs in the 6.21 compiler and the 0.2 version of the Z-Machine Standards Document, you may encounter problems with printing the left and right guillemet characters. @{00AB}, @@163 and @<< should all output '«', while @{00BB}, @@162 and @>> should all output '»'. Although the compiler now gets this right, the result may vary depending on the age of the interpreter which the player is using.

The bottom line is: using these techniques, an Inform game can output a maximum of 191 different characters: the 95 standard values with ZSCII codes of 32..126, and anywhere up to 96 extra characters with ZSCII codes of 155 upwards.

Printing Unicode characters

There's another technique available which bypasses the ZSCII character set (and its limit of 191 characters) altogether -- printing Unicode directly. Version 1.0 of the Z-Machine Standards Document introduced the check_unicode (test if interpreter can handle a given Unicode character) and print_unicode (output a given Unicode character) opcodes. Create this routine:

  [ Unicode c exist;
      if (HDR_TERPSTANDARD->0 < 1) { @print_char '?'; return; }
      @check_unicode c -> exist;
      if (exist & $0001) @print_unicode c;
      else               @print_char '?';
  ];

        

TECHNICAL NOTE: HDR_TERPSTANDARD -- bytes $0032 and $0033 of the game header -- reflects the version of the standard to which the interpreter conforms; we can't use the new opcodes on a pre-1.0 interpreter. If check_unicode returns a flag bit of 1 then it's safe to output the character using print_unicode.

We can then use this routine either by calling it directly, or using it as a print rule. Its argument is a simple Unicode character number, best given in hexadecimal. For instance, we could display a small Greek 'pi' symbol with either of:

  Unicode($03C0);
  print (Unicode) $03C0;

As we said, this method is independent of ZSCII; you can output any characters irrespective of whether they exist in the game's ZSCII character set.
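If you need more than the odd character, one possible approach is to hold the Unicode values in a word array and print them in a loop. This is only a sketch: it relies on the Unicode routine defined above (and therefore on the standard library for HDR_TERPSTANDARD), while the array and the PrintUnicodeWord routine are names made up for the example:

```inform6
! Hypothetical data: the Unicode values for Greek small pi and iota.
Array greek_pi --> $03C0 $03B9;

! Print len Unicode values from the word array arr, one at a time,
! using the Unicode routine defined earlier on this page.
[ PrintUnicodeWord arr len i;
    for (i=0 : i<len : i++) Unicode(arr-->i);
];
```

A call such as PrintUnicodeWord(greek_pi, 2); would then output both letters, or a question mark for each one that the interpreter can't display.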

String packing

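Although we've been talking about ZSCII as an eight-bit encoding system, the Z-machine doesn't actually hold each ZSCII character in a separate byte. Rather, it stores characters using five-bit units, in which (slightly simplified):

a lower case letter occupies one unit;
an upper case letter, a digit or a common punctuation mark occupies two units;
any other character, including every accented letter, occupies four units.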

The major advantage of this scheme is that lower case text -- the largest component of many games -- is stored fairly economically at three characters to a sixteen-bit word (which is 50% more effective than storing two one-byte characters to a word). The downside is that each upper case and punctuation character costs ten bits rather than eight, but such characters are relatively infrequent. The major disadvantage -- which doesn't really impinge if you're writing a game in English -- is that each accented character occupies four units; this hits hardest in the dictionary, where only nine storage units are available for each dictionary entry. Entries are stored in lower case, so an entry can comprise up to: nine unaccented letters, one accented letter plus five unaccented letters, or two accented letters plus a single unaccented letter. In a heavily accented language like Swedish, this is a real issue.
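The unit arithmetic can be sketched as a pair of routines. Both names are invented for this example, and the costing is deliberately rough: lower case letters count as one unit, the 'extra' ZSCII characters (155..251) as four, and everything else is assumed to cost two, which isn't true of every character.

```inform6
! Approximate cost, in five-bit units, of one ZSCII character.
[ UnitCost c;
    if (c >= 'a' && c <= 'z') return 1;   ! lower case letter
    if (c >= 155 && c <= 251) return 4;   ! extra (e.g. accented) character
    return 2;                             ! simplification: all other characters
];

! Total cost of the len characters held in the byte array at addr.
[ WordCost addr len i total;
    for (i=0 : i<len : i++) total = total + UnitCost(addr->i);
    return total;
];
```

On this reckoning, a lower case word is fully distinguished in the dictionary only if WordCost returns 9 or less: two accented letters cost eight units, leaving room for just one more letter.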

There's a way round the limitation, using two more forms of the Zcharacter directive in order to manipulate the 'Alphabet table' -- the Z-machine's list of 26 characters which occupy a single storage unit and 51 characters which occupy two units. Here's the default table:

  abcdefghijklmnopqrstuvwxyz
  ABCDEFGHIJKLMNOPQRSTUVWXYZ
  ¤^0123456789.,!?_#'"/\-:()

The character shown here as ¤ is a non-printing escape code, while ^ represents the newline character. Using a directive like:

  Zcharacter '@`e';

near the start of your source file places the specified character 'è' into the Alphabet table. It does this not by extending the table, but rather by replacing an existing character there which hasn't yet been used; the search for an unused character starts at '0' and moves rightwards along the bottom row. You can provide several such Zcharacter directives, each one swopping a single character into the table. Alternatively, you can define a completely new Alphabet table by using the fourth and final form of the Zcharacter directive; for example:

  Zcharacter
      "abcdefghijklmnopqrstuvwxyz"
      "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
      "0123456789-',.;:@'e@`a@`e@`u@^e@^i@^u";

In this format, the first and second strings are each exactly 26 characters long, while the third contains only 23 characters -- ¤^" for the escape, the newline and the double quotes are included automatically. The advantage of this method would seem to be that you can choose to retain the digits and the common punctuation, losing instead much rarer characters like '#' and '\'. The resulting table is:

  abcdefghijklmnopqrstuvwxyz
  ABCDEFGHIJKLMNOPQRSTUVWXYZ
  ¤^"0123456789-',.;:éàèùêîû

On a sample of French text, the overhead of using accented characters was reduced from 6-7% (using the standard table) to 2-3% (using this table), against the same text without accents.

        

TECHNICAL NOTE: A default Alphabet table is built into the interpreter. However, if you use either of these Zcharacter directives, then the default table becomes inadequate, and the compiler must include a replacement table within the game file. The table's address is given in the game's header at address HDR_ALPHABET-->0. You can display the table by including Roger Firth's dump.h library extension, and typing DUMP ALPHA.

        

TECHNICAL NOTE: Because the fourth form automatically supplies ¤^" at the start of the Alphabet table's third line, it's not possible to use the Zcharacter directive to build an exact replica of the default table (in which " falls between ' and /). The closest you can get is either:

  Zcharacter
      "abcdefghijklmnopqrstuvwxyz"
      "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
      "0123456789.,!?_#'/@{005C}-:()";

(in which, compared to the default, all characters in the range 0 to ' are shifted one to the right), or:

  Zcharacter
      "abcdefghijklmnopqrstuvwxyz"
      "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
      "123456789.,!?_#'0/@{005C}-:()";

(in which, compared to the default, 0 and " are interchanged). Note the need to represent \ by its Unicode form @{005C}; neither plain \ nor the usual @@92 are accepted by the compiler here.

Remember that, prior to version 6.30 of the Inform compiler, neither of these Zcharacter forms worked properly on most platforms.

        

TECHNICAL NOTE: An alternative approach, rather than adding accents to dictionary words, is to remove them from the player's input. For example, you might include code like this in your LanguageToInformese routine:

  for (i=0 : i<buffer->1 : i++)
      switch (buffer->(i+2)) {
        '@`a','@'a','@^a','@:a','@~a','@oa': buffer->(i+2) = 'a';
        '@ae':     buffer->(i+2) = 'a'; i++; LTI_Insert(i+2,'e');
        '@cc':                               buffer->(i+2) = 'c';
        '@`e','@'e','@^e','@:e':             buffer->(i+2) = 'e';
        '@`i','@'i','@^i','@:i':             buffer->(i+2) = 'i';
        '@~n':                               buffer->(i+2) = 'n';
        '@`o','@'o','@^o','@:o','@~o','@/o': buffer->(i+2) = 'o';
        '@oe':     buffer->(i+2) = 'o'; i++; LTI_Insert(i+2,'e');
        '@ss':     buffer->(i+2) = 's'; i++; LTI_Insert(i+2,'s');
        '@`u','@'u','@^u','@:u':             buffer->(i+2) = 'u';
        '@'y':                               buffer->(i+2) = 'y';
      }
  @tokenise buffer parse;

Runtime issues

Access to the full range of 50,000 Unicode characters opens up all sorts of enticing possibilities... but not for long. It's important to remember that the majority of English computer fonts have fairly limited support for characters beyond the ISO 8859-1 repertoire. Sure, if you're based in Algeria, or Bulgaria, or Cambodia, then you'll have appropriate local fonts on your machine. Elsewhere in the world, though, those fonts are unlikely to be available, and therefore a game written to use those fonts will become unplayable. Until such time as fonts with reasonably complete Unicode support become the norm rather than the exception, you'll either have to be modest in your creativity, or your players will have to be prepared to obtain the fonts for themselves.

A small number of Unicode fonts are readily available. You might consider (see also Alan Wood's site, mentioned earlier, for other possibilities):

It's not just the lack of appropriate fonts which can be a stumbling block; many Z-machine interpreters themselves are weak in Unicode support. The 1.0 standard requires only that the interpreter be able to display characters up to $00FF -- effectively those defined by the C1 switch -- which means that, even if the font provides them, you can't rely on characters above that value being displayed. Thus, for example, most ports of the Frotz interpreter assume the use of ISO 8859-1 at all times; they also display '?' for all Unicode characters in the range $0100..$FFFF, thus precluding the C2..C9 character sets, let alone anything more esoteric. Of the 70 or so Z-machine interpreters in the archive, the only ones offering full support for extended Unicode characters are believed to be Windows Frotz 2002, Zip2000 (RISC OS) and Zoom (X-Windows and Mac OS X); this is alas another argument against being too adventurous in your character displays.
