Introduction
============
As successful as PHP has proven to be in the past several years, it is still the only member of the P-trinity of scripting languages - Perl and Python being the other two - that remains blithely ignorant of the multilingual and multinational environment around it. The software development community has been moving towards the Unicode Standard for some time now, and PHP can no longer afford to be outside of this movement. Admittedly, some steps have been taken recently to allow for easier processing of multibyte data with the mbstring extension, but it is not enabled in PHP by default and is not as intuitive or transparent as it could be.
The basic goal of this document is to describe how PHP 6 will support the Unicode Standard natively. Since the full implementation of the Unicode Standard is very involved, the idea is to use the already existing, well-tested, full-featured, and freely available ICU (International Components for Unicode) library. This will allow us to concentrate on the details of PHP integration and speed up the implementation.
General Remarks
===============
Backwards Compatibility
-----------------------
Throughout the design and implementation of Unicode support, backwards compatibility must be of paramount concern. PHP is used on an enormous number of sites and the upgrade to Unicode-enabled PHP has to be transparent. This means that the existing data types and functions must work as they have always done. However, the speed of certain operations may be affected, due to increased complexity of the code overall.
Unicode Encoding
----------------
The initial version will not support Byte Order Mark. Text processing will generally perform better if the characters are in Normalization Form C.
Implementation Approach
=======================
The implementation is done in phases. This allows for more basic and low-level implementation issues to be ironed out and tested before proceeding to more advanced topics.
Legend:
- TODO
+ finished
* in progress
Phase I
-------
+ Basic Unicode string support, including instantiation, concatenation, indexing
+ Simple output of Unicode strings via 'print' and 'echo' statements with appropriate output encoding conversion
+ Conversion of Unicode strings to/from various encodings via encode() and decode() functions
+ Determining length of Unicode strings via strlen() function, some simple string functions ported (substr).
Phase II
--------
* HTTP input request decoding
+ Fixing remaining string-aware operators (assignment to [] etc)
+ Support for Unicode and binary strings in PHP streams
+ Support for Unicode identifiers
+ Configurable handling of conversion failures
+ \C{} escape sequence in strings
Phase III
---------
* Exposing ICU API
* Porting all remaining functions to support Unicode and/or binary strings
Encoding Names
==============
All the encoding settings discussed in this document accept any valid encoding name supported by ICU. See the ICU online documentation for the full list of encodings.
Unicode Semantics Switch
========================
Obviously, PHP cannot simply impose new Unicode support on everyone. There are many applications that do not care about Unicode and do not need it. Consequently, there is a switch that enables certain fundamental language changes related to Unicode. This switch is available only as a site-wide (per virtual server) INI setting.
Note that having the switch turned off does not imply that PHP is unaware of Unicode or that no Unicode strings can exist. It only affects certain aspects of the language, and Unicode strings can always be created programmatically. All the functions and operators will still support Unicode strings and work appropriately.
unicode.semantics = On
Internal Encoding
=================
UTF-16 is the internal encoding used for Unicode strings. UTF-16 consumes two bytes for any Unicode character in the Basic Multilingual Plane, which is where most of the current world's languages are represented. While being less memory efficient for basic ASCII text it simplifies the processing and makes interfacing with ICU easier, since ICU uses UTF-16 for its internal processing as well.
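To make the storage cost concrete, the following is a minimal sketch of how a single code point maps to UTF-16 code units. The helper name utf16_encode is hypothetical and is not part of PHP or ICU; it only illustrates that code points in the Basic Multilingual Plane take one 16-bit unit while supplementary code points take a surrogate pair.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a PHP or ICU function): encodes one code point
 * as UTF-16. Returns the number of 16-bit code units written (1 or 2),
 * or 0 if cp is a surrogate value or lies above U+10FFFF. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;                      /* surrogate values are not characters */
    }
    if (cp > 0x10FFFF) {
        return 0;                      /* outside the Unicode code space */
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;         /* BMP: one code unit, two bytes */
        return 1;
    }
    cp -= 0x10000;                     /* supplementary: split into 10 + 10 bits */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 2;
}
```

Under this scheme ASCII 'a' (U+0061) occupies two bytes instead of one, which is the memory trade-off mentioned above.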
Fallback Encoding
=================
This setting specifies the "fallback" encoding for all the other ones. So if a specific encoding setting is not set, PHP defaults it to the fallback encoding. If the fallback_encoding is not specified either, it is set to UTF-8.
unicode.fallback_encoding = "iso-8859-1"
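The resolution order described above can be sketched as follows. This is only an illustration of the rule, assuming that an unset setting is represented by NULL or an empty string; the function name effective_encoding is hypothetical and does not correspond to any actual engine function.

```c
#include <stddef.h>

/* Hypothetical helper: picks the effective encoding for one setting.
 * Assumption: an unset INI value is NULL or the empty string. */
static const char *effective_encoding(const char *specific, const char *fallback)
{
    if (specific != NULL && specific[0] != '\0') {
        return specific;               /* the specific setting wins */
    }
    if (fallback != NULL && fallback[0] != '\0') {
        return fallback;               /* otherwise unicode.fallback_encoding */
    }
    return "UTF-8";                    /* neither is set */
}
```

With the INI line shown above and no unicode.runtime_encoding, the runtime encoding would resolve to "iso-8859-1".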
Runtime Encoding
================
Currently PHP neither specifies nor cares what the encoding of its strings is. However, the Unicode implementation needs to know what this encoding is for several reasons, including explicit (casting) and implicit (concatenation, comparison, parameter passing) type coercions. This setting specifies the runtime encoding.
unicode.runtime_encoding = "iso-8859-1"
Output Encoding
===============
Automatic output encoding conversion is supported on the standard output stream. Therefore, commands such as 'print' and 'echo' automatically convert their arguments to the specified encoding. No automatic output encoding is performed for anything else. Therefore, when writing to files or external resources, the developer has to manually encode the data using functions provided by the unicode extension or rely on stream encoding features.
The existing default_charset setting so far has been used only for specifying the charset portion of the Content-Type MIME header. For several reasons, this setting is deprecated. Now it is only used when the Unicode semantics switch is disabled and does not affect the actual transcoding of the output stream. The output encoding setting takes precedence in all other cases. If the output encoding is set, PHP will automatically add the 'charset' portion to the Content-Type header.
unicode.output_encoding = "utf-8"
HTTP Input Encoding
===================
There will be no explicit input encoding setting. Instead, PHP will rely on a couple of heuristics to determine what encoding the incoming request might be in. Firstly, PHP will attempt to decode the input using the value of the unicode.output_encoding setting, because that is the most logical choice if we assume that the clients send the data back in the encoding that the page with the form was in. If that is unsuccessful, PHP could fall back on the "_charset_" form parameter, if present. This parameter is sent by IE (and possibly Firefox) along with the form data and indicates the encoding of the request. Note that this parameter will be present only if the form contains a hidden field named "_charset_".
The variables that are decoded successfully will be put into the request arrays as Unicode strings; those that fail will be stored as binary strings. PHP will set a flag (probably in the $_SERVER array) indicating that there were problems during the conversion. In case of failure, the user will have access to the raw input via the input filter extension and can access the request parameters via the input_get_arg() function. The input filter extension always looks in the raw input data and not in the request arrays, and input_get_arg() has a 'charset' parameter that can be specified to tell PHP what charset the incoming data is in. This kills two birds with one stone: users have access to request arrays data on successful decoding as well as a standard and secure way to get at the data in case of failed decoding.
Script Encoding
===============
PHP scripts may be written in any encoding supported by ICU. The encoding of the scripts can be specified site-wide via an INI directive, or with a 'declare' pragma at the beginning of the script. The reason for pragma is that an application written in Shift-JIS, for example, should be executable on a system where the INI directive cannot be changed by the application itself. The pragma setting is valid only for the script it occurs in, and does not propagate to the included files.
pragma: <?php declare(encoding = 'utf-8'); ?>
INI setting: unicode.script_encoding = utf-8
Conversion Semantics
====================
Not all characters can be converted between Unicode and legacy encodings. Normally, when downconverting from Unicode, the default behavior of ICU converters is to substitute the missing sequence with the appropriate substitution sequence for that codepage, such as 0x1A (Control-Z) in ISO-8859-1. When upconverting to Unicode, if an encoding has a character which cannot be converted into Unicode, that sequence is replaced by the Unicode substitution character (U+FFFD).
The conversion error behavior can be customized:
- stop the conversion and return an empty string
- skip any invalid characters
- substitute invalid characters with a custom substitution character
- escape the invalid character in various formats
The global conversion error settings can be controlled with these two functions:
  unicode_set_error_mode(int direction, int mode)
  unicode_set_subst_char(unicode char)
Where direction is either FROM_UNICODE or TO_UNICODE, and mode is one of these constants:
  U_CONV_ERROR_STOP
  U_CONV_ERROR_SKIP
  U_CONV_ERROR_SUBST
  U_CONV_ERROR_ESCAPE_UNICODE
  U_CONV_ERROR_ESCAPE_ICU
  U_CONV_ERROR_ESCAPE_JAVA
  U_CONV_ERROR_ESCAPE_XML_DEC
  U_CONV_ERROR_ESCAPE_XML_HEX
Substitution character can be set only for FROM_UNICODE direction and has to exist in the target character set.
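The first three error behaviors can be illustrated with a small self-contained sketch that downconverts code points to ISO-8859-1. It does not use ICU, and the names conv_error_mode, MODE_STOP, MODE_SKIP, MODE_SUBST, and to_latin1 are hypothetical stand-ins chosen for this example, not the constants or functions listed above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the stop / skip / substitute behaviors. */
enum conv_error_mode { MODE_STOP, MODE_SKIP, MODE_SUBST };

/* Converts n code points to ISO-8859-1 bytes; out must hold at least n bytes.
 * Returns the number of bytes written. In MODE_STOP the function returns 0
 * at the first unconvertible code point, mirroring "return an empty string". */
static size_t to_latin1(const uint32_t *cps, size_t n, unsigned char *out,
                        enum conv_error_mode mode, unsigned char subst)
{
    size_t written = 0;
    for (size_t i = 0; i < n; i++) {
        if (cps[i] <= 0xFF) {
            out[written++] = (unsigned char)cps[i];   /* direct mapping */
        } else if (mode == MODE_STOP) {
            return 0;                                  /* give up entirely */
        } else if (mode == MODE_SUBST) {
            out[written++] = subst;                    /* e.g. 0x1A */
        }
        /* MODE_SKIP: the unconvertible code point is simply dropped */
    }
    return written;
}
```

Converting U+0048, U+4E2D, U+0069 this way yields three bytes with 0x1A in the middle under substitution, two bytes when skipping, and nothing when stopping.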
Unicode String Type
===================
Unicode string type (IS_UNICODE) is supposed to contain text data encoded in UTF-16 format. It is the main string type in PHP when Unicode semantics switch is turned on. Unicode strings can exist when the switch is off, but they have to be produced programmatically, via calls to functions that return Unicode type.
The operational unit when working with Unicode strings is a code point, not code unit or byte. One code point in UTF-16 may be comprised of 1 or 2 code units, each of which is a 16-bit word. Working on the code point level is necessary because doing otherwise would mean offloading the processing of surrogate pairs onto PHP users, and that is less than desirable.
The repercussions are that one cannot expect code point N to be at offset N in the Unicode string. Instead, one has to iterate from the beginning of the string, using the U16_FWD_1() or U16_FWD_N() macros from ICU, until the desired codepoint is reached. This will be transparent to the end user who will work only with "character" offsets.
The codepoint access is one of the primary areas targeted for optimization.
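The forward iteration described above can be sketched without ICU as follows. The function cp_index_to_offset is a hypothetical illustration, not engine code; it walks the string one code point at a time, advancing by two code units whenever it meets a well-formed surrogate pair.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the code unit offset at which code point
 * number `index` starts. If the string has fewer code points than that,
 * the function returns len. */
static size_t cp_index_to_offset(const uint16_t *s, size_t len, size_t index)
{
    size_t off = 0;
    while (index > 0 && off < len) {
        int is_pair = s[off] >= 0xD800 && s[off] <= 0xDBFF &&
                      off + 1 < len &&
                      s[off + 1] >= 0xDC00 && s[off + 1] <= 0xDFFF;
        off += is_pair ? 2 : 1;        /* a surrogate pair is one code point */
        index--;
    }
    return off;
}
```

The cost is linear in the index, which is why code point access is a natural target for optimization.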
Binary String Type
==================
Binary string type (IS_STRING) serves two purposes: backwards compatibility and representing non-Unicode strings and binary data. When Unicode semantics switch is off, it is used for all strings in PHP, the same as in previous versions. When the switch is on, this type will be used to store text in other encodings as well as true binary data such as images, PDFs, etc.
Printing binary data to the standard output passes it through as-is, independent of the output encoding.
Zval Structure Changes
======================
PHP is a type-agnostic language. Its data values are encapsulated in a zval (Zend value) structure that can change as necessary to accommodate various types.
  struct _zval_struct {
      /* Variable information */
      union {
          long lval;                /* long value */
          double dval;              /* double value */
          struct {
              char *val;
              int len;
          } str;                    /* string value */
          HashTable *ht;            /* hash table value */
          zend_object_value obj;    /* object value */
      } value;
      zend_uint refcount;
      zend_uchar type;              /* active type */
      zend_uchar is_ref;
  };
The type field determines what is stored in the union, IS_STRING being the only data type pertinent to this discussion. In the current version, the strings are binary-safe, but, for all intents and purposes, are assumed to be comprised of 8-bit characters. It is possible to treat the string value as an opaque type containing arbitrary binary data, and in fact that is how mbstring extension uses it, in order to store multibyte strings. However, many extensions and the Zend engine itself manipulate the string value directly without regard to its internals. Needless to say, this can lead to problems.
For IS_UNICODE type, we need to add another structure to the union:
  union {
      ....
      struct {
          UChar *val;               /* Unicode string value */
          int len;                  /* number of UChar's */
      } ustr;
      ....
  } value;
This cleanly separates the two types of strings and helps preserve backwards compatibility.
To optimize access to IS_STRING and IS_UNICODE storage at runtime, we need yet another structure:
  union {
      ....
      struct {                      /* Universal string type */
          zstr val;
          int len;
      } uni;
      ....
  } value;
Where zstr is a union of char*, UChar*, and void*.
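A minimal sketch of such a union is shown below. The member names s, u, and v, the 16-bit UChar stand-in, the SKETCH_* constants, and the helper zstr_byte_size are all assumptions made for this illustration; the document itself only states which pointer types the union holds.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UChar;     /* stand-in for ICU's UChar, assumed 16 bits wide */

/* Sketch of the universal string pointer; member names are assumptions. */
typedef union {
    char  *s;               /* view as binary string (IS_STRING)    */
    UChar *u;               /* view as Unicode string (IS_UNICODE)  */
    void  *v;               /* type-agnostic view                   */
} zstr;

/* Placeholder type tags for this sketch only. */
enum { SKETCH_IS_STRING, SKETCH_IS_UNICODE };

/* Hypothetical helper: storage size in bytes for a string of `len` units,
 * where a unit is a byte for binary strings and a UChar for Unicode ones. */
static size_t zstr_byte_size(int type, int len)
{
    size_t unit = (type == SKETCH_IS_UNICODE) ? sizeof(UChar) : sizeof(char);
    return (size_t)len * unit;
}
```

Code that only needs to copy or free the buffer can go through the void* member and a byte size, without branching on the string type at every call site.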
Language Modifications
======================
If the Unicode switch is turned on, PHP string literals - single-quoted, double-quoted, and heredocs - become Unicode strings (IS_UNICODE type). They support all the same escape sequences and variable interpolations as previously, with the addition of some new escape sequences.
The contents of the strings are interpreted as follows:
- all non-escaped characters are interpreted as a corresponding Unicode codepoint based on the current script encoding, e.g. ASCII 'a' (0x61) => U+0061, Shift-JIS (0x92 0x86) => U+4E2D
- existing PHP escape sequences are also interpreted as Unicode codepoints, including \xXX (hex) and \OOO (octal) numbers, e.g. "\x20" => U+0020
- two new escape sequences, \uXXXX and \UXXXXXX are interpreted as a 4 or 6-hex Unicode codepoint value, e.g. \u0221 => U+0221, \U010410 => U+10410
- a new escape sequence allows specifying a character by its full Unicode name, e.g. \C{THAI CHARACTER PHO SAMPHAO} => U+0E20
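As an illustration of the \uXXXX and \UXXXXXX rules in the list above, here is a small sketch of how such an escape could be decoded. The functions hex_value and parse_unicode_escape are hypothetical and are not taken from the PHP scanner; the sketch handles only these two escapes.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: parses \uXXXX (4 hex digits) or \UXXXXXX (6 hex
 * digits) at the start of the NUL-terminated string s. On success stores
 * the code point in *cp and returns the number of characters consumed
 * (6 or 8); returns 0 if s does not start with a complete, valid escape. */
static size_t parse_unicode_escape(const char *s, uint32_t *cp)
{
    size_t digits, i;
    uint32_t value = 0;

    if (s[0] != '\\') return 0;
    if (s[1] == 'u') digits = 4;
    else if (s[1] == 'U') digits = 6;
    else return 0;

    for (i = 0; i < digits; i++) {
        int h = hex_value(s[2 + i]);   /* stops at the terminating NUL too */
        if (h < 0) return 0;
        value = (value << 4) | (uint32_t)h;
    }
    if (value > 0x10FFFF) return 0;    /* beyond the Unicode code space */
    *cp = value;
    return 2 + digits;
}
```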
The single-quoted string is more restrictive than the other two types: so far the only escape sequence allowed inside of it was \', which specifies a literal single quote. However, single quoted strings now support the new Unicode character escape sequences as well.
PHP allows variable interpolation inside the double-quoted and heredoc strings. However, the parser separates the string into literal and variable chunks during compilation, e.g. "abc $var def" -> "abc" . $var . "def". This means that the literal chunks can be handled in the normal way as far as Unicode support is concerned.
Since all string literals become Unicode by default, one loses the ability to specify byte-oriented or binary strings. In order to create binary string literals, a new syntax is necessary: prefixing a string literal with letter 'b' creates a binary string.
  $var = b'abc\001';
  $var = b"abc\001";
  $var = b<<<EOD
abc\001
EOD;
The binary string literals support the same escape sequences as the current PHP strings. If the Unicode switch is turned off, then the binary string literals generate normal string (IS_STRING) type internally, without any effect on the application.
The string operators have been changed to accommodate both the new IS_UNICODE type and the binary IS_STRING type. In more detail:
- The concatenation (.) operator has been changed to automatically coerce IS_STRING type to the more precise IS_UNICODE if its operands are of two different string types.
- The concatenation assignment operator (.=) has been changed similarly.
- The string indexing operator [] has been changed to accommodate IS_UNICODE type strings and extract the specified character. Note that the index specifies a code point, not a byte or a code unit, thus supporting supplementary characters.
- Both Unicode and binary string types can be used as array keys. If the Unicode switch is on, the binary keys are converted to Unicode.
- Bitwise operators and increment/decrement operators do not work on Unicode strings. They do work on binary strings.
- Two new casting operators are introduced, (unicode) and (binary). The (string) operator will cast to Unicode type if the Unicode semantics switch is on, and to binary type otherwise.
- The comparison operators, when applied to Unicode strings, perform comparison in binary code point order. They also do appropriate coercion if the strings are of differing types.
- The arithmetic operators use the same semantics as today for converting strings to numbers. A Unicode string is considered numeric if it represents a long or a double number in en_US_POSIX locale.
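Code point order in the comparison rule above is not the same as plain UTF-16 code unit order: surrogate code units (0xD800..0xDFFF) encode supplementary characters, which must sort after BMP characters in the range U+E000..U+FFFF. The sketch below shows one well-known way to get code point order by remapping code units before comparing. The function names are hypothetical, and well-formed UTF-16 input is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Remaps a code unit so that unsigned comparison yields code point order:
 * 0xE000..0xFFFF moves down by 0x800, surrogates move up to the top. */
static uint16_t cp_order_key(uint16_t u)
{
    if (u >= 0xE000) return (uint16_t)(u - 0x800);
    if (u >= 0xD800) return (uint16_t)(u + 0x2000);
    return u;
}

/* Hypothetical helper: compares two well-formed UTF-16 strings in code
 * point order. Returns -1, 0, or 1. */
static int compare_code_point_order(const uint16_t *a, size_t alen,
                                    const uint16_t *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i]) {
            return cp_order_key(a[i]) < cp_order_key(b[i]) ? -1 : 1;
        }
    }
    if (alen == blen) return 0;
    return alen < blen ? -1 : 1;       /* the shorter string is a prefix */
}
```

For example, U+FF5E (one code unit, 0xFF5E) compares as less than U+10000 (0xD800 0xDC00), even though 0xFF5E is numerically greater than 0xD800.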
Inline HTML
===========
Because inline HTML blocks are intermixed with PHP ones, they are also written in the script encoding. PHP transcodes the HTML blocks to the output encoding as needed, resulting in direct passthrough if the script encoding matches output encoding.
Identifiers
===========
Considering that scripts may be written in various encodings, we do not restrict identifiers to be ASCII-only. PHP allows any valid identifier based on the Unicode Standard Annex #31. The identifiers are case folded when necessary (class and function names) and converted to normalization form NFKC, so that two identifiers written in two compatible ways refer to the same thing.
Numbers
=======
Unlike identifiers, we restrict numbers to consist only of ASCII digits and do not interpret them as written in a specific locale. The numbers are expected to adhere to en_US_POSIX or C locale, i.e. having no thousands separator and fractional separator being (.) "full stop". Numeric strings are supposed to adhere to the same rules, i.e. "10,3" is not interpreted as a number even if the current locale's fractional separator is comma.
Parameter Parsing API Modifications
===================================
Internal PHP functions largely use the zend_parse_parameters() API in order to obtain the parameters passed to them by the user. For example:
  char *str;
  int len;

  if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &str, &len) == FAILURE) {
      return;
  }
This forces the input parameter to be a string, and its value and length are stored in the variables specified by the caller.
There are now five new specifiers: 'u', 't', 'T', 'U', and 'S'.
't' specifier
-------------
This specifier indicates that the caller requires the incoming parameter to be string data (IS_STRING, IS_UNICODE). The caller has to provide the storage for string value, length, and type.
  void *str;
  int len;
  zend_uchar type;

  if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "t", &str, &len, &type) == FAILURE) {
      return;
  }
  if (type == IS_UNICODE) {
      /* process Unicode string */
  } else {
      /* process binary string */
  }
For IS_STRING type, the length represents the number of bytes, and for IS_UNICODE the number of UChar's. When converting other types (numbers, booleans, etc) to strings, the exact behavior depends on the Unicode semantics switch: if on, they are converted to IS_UNICODE, otherwise to IS_STRING.
'u' specifier
-------------
This specifier indicates that the caller requires the incoming parameter to be a Unicode encoded string. If a non-Unicode string is passed, the engine creates a copy of the string and automatically converts it to Unicode type before passing it to the internal function. No such conversion is necessary for Unicode strings, obviously.
  UChar *str;
  int len;

  if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "u", &str, &len) == FAILURE) {
      return;
  }
  /* process Unicode string */
'T' specifier
-------------
This specifier is useful when the function takes two or more strings and operates on them. Using 't' specifier for each one would be somewhat problematic if the passed-in strings are of mixed types, and multiple checks need to be performed in order to do anything. All parameters marked by the 'T' specifier are promoted to the same type. If at least one of the 'T' parameters is of Unicode type, then the rest of them are converted to IS_UNICODE. Otherwise all 'T' parameters are converted to IS_STRING type.
  void *str1, *str2;
  int len1, len2;
  zend_uchar type1, type2;

  if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "TT", &str1, &len1, &type1, &str2, &len2, &type2) == FAILURE) {
      return;
  }
  if (type1 == IS_UNICODE) {
      /* process as Unicode, str2 is guaranteed to be Unicode as well */
  } else {
      /* process as binary string, str2 is guaranteed to be the same */
  }
The existing 's' specifier has been modified as well. If a Unicode string is passed in, it automatically copies and converts the string to the runtime encoding, and issues a warning. If a binary type is passed in, no conversion is necessary.
The 'U' and 'S' specifiers are similar to 'u' and 's' but they are more strict about the type of the passed-in parameter. If 'U' is specified and the binary string is passed in, the engine will issue a warning instead of doing automatic conversion. The converse applies to the 'S' specifier.
Upgrading Existing Functions
============================
Upgrading functions to work with new data types will be a deliberate and involved process, because one needs to consider not only the mechanisms for processing Unicode characters, for example, but also the semantics of the function.
The main tenet of the upgrade process should be that when processing Unicode strings, the unit of operation is a code point, not a code unit or a byte. For example, strlen() returns the number of code points in the string.
  strlen('abc') = 3
  strlen('ab\U010000') = 3
  strlen('ab\uD800\uDC00') = 3  /* not 4 */
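The strlen() results above follow from counting code points rather than UTF-16 code units. A minimal sketch of that counting rule is given below; count_code_points is a hypothetical helper written for this document and is not the actual strlen() implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts code points in a UTF-16 buffer of `len`
 * code units. A trail surrogate that directly follows a lead surrogate
 * is part of the pair that was already counted. */
static size_t count_code_points(const uint16_t *s, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        int completes_pair = i > 0 &&
                             s[i] >= 0xDC00 && s[i] <= 0xDFFF &&
                             s[i - 1] >= 0xD800 && s[i - 1] <= 0xDBFF;
        if (!completes_pair) {
            count++;
        }
    }
    return count;
}
```

The string 'ab\U010000' is stored as the four code units 0x0061 0x0062 0xD800 0xDC00, and this rule counts them as three code points.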
Function upgrade guidelines are available in a separate document.
Document TODO
=============
- Streams support for Unicode - What stream filters will be provided?
- User conversion error handler
- INI files encoding - UTF-8? Do we support BOMs?
- There are likely to be other issues which are missing from this document
Build System
============
Unicode support in PHP is always enabled. The only configuration option during development should be the location of the ICU headers and libraries.
  --with-icu-dir=<dir>
      <dir> parameter specifies the location of ICU header and library files.
After the initial development we have to repackage ICU library for our needs and bundle it with PHP.
Document History
================
0.6: Remove notion of native encoding string, only 2 string types are used now. Update conversion error behavior section and parameter parsing. Bring the document up-to-date with reality in general.
0.5: Updated per latest discussions. Removed tentative language in several places, since we have decided on everything described here already. Clarified details according to Phase II progress.
0.4: Updated to include all the latest discussions. Updated development phases.
0.3: Updated to include all the latest discussions.
0.2: Updated Phase I design proposal per discussion on unicode@php.net. Modified Internal Encoding section to contain only UTF-16 info. Expanded Script Encoding section. Added Binary Data Type section. Amended Language Modifications section to describe string literals behavior. Amended Build System section.
0.1: Phase I design proposal
References
==========
Unicode http://www.unicode.org
Unicode Glossary http://www.unicode.org/glossary/
UTF-8 http://www.utf-8.com/
UTF-16 http://www.ietf.org/rfc/rfc2781.txt
ICU Homepage http://www.ibm.com/software/globalization/icu/
ICU User Guide and API Reference http://icu.sourceforge.net/
Unicode Annex #31 http://www.unicode.org/reports/tr31/
PHP Parameter Parsing API http://www.php.net/manual/en/zend.arguments.retrieval.php
Authors
=======
Andrei Zmievski <andrei@gravitonic.com>
vim: set et :