  1. Introduction
  2. ============
  3. As successful as PHP has proven to be in the past several years, it is still
  4. the only remaining member of the P-trinity of scripting languages - Perl and
  5. Python being the other two - that remains blithely ignorant of the
  6. multilingual and multinational environment around it. The software
  7. development community has been moving towards the Unicode Standard for some time
  8. now, and PHP can no longer afford to be outside of this movement. Admittedly,
  9. some steps have been taken recently to allow for easier processing of
  10. multibyte data with the mbstring extension, but it is not enabled in PHP by
  11. default and is not as intuitive or transparent as it could be.
  12. The basic goal of this document is to describe how PHP 6 will support the
  13. Unicode Standard natively. Since the full implementation of the Unicode
  14. Standard is very involved, the idea is to use the already existing,
  15. well-tested, full-featured, and freely available ICU (International
  16. Components for Unicode) library. This will allow us to concentrate on the
  17. details of PHP integration and speed up the implementation.
  18. General Remarks
  19. ===============
  20. Backwards Compatibility
  21. -----------------------
  22. Throughout the design and implementation of Unicode support, backwards
  23. compatibility must be of paramount concern. PHP is used on an enormous number of
  24. sites and the upgrade to Unicode-enabled PHP has to be transparent. This means
  25. that the existing data types and functions must work as they have always
  26. done. However, the speed of certain operations may be affected, due to
  27. increased complexity of the code overall.
  28. Unicode Encoding
  29. ----------------
  30. The initial version will not support Byte Order Mark. Text processing will
  31. generally perform better if the characters are in Normalization Form C.
  32. Implementation Approach
  33. =======================
  34. The implementation is done in phases. This allows for more basic and
  35. low-level implementation issues to be ironed out and tested before
  36. proceeding to more advanced topics.
  37. Legend:
  38. - TODO
  39. + finished
  40. * in progress
  41. Phase I
  42. -------
  43. + Basic Unicode string support, including instantiation, concatenation,
  44. indexing
  45. + Simple output of Unicode strings via 'print' and 'echo' statements
  46. with appropriate output encoding conversion
  47. + Conversion of Unicode strings to/from various encodings via encode() and
  48. decode() functions
  49. + Determining length of Unicode strings via strlen() function, some
  50. simple string functions ported (substr).
  51. Phase II
  52. --------
  53. * HTTP input request decoding
  54. + Fixing remaining string-aware operators (assignment to [] etc)
  55. + Support for Unicode and binary strings in PHP streams
  56. + Support for Unicode identifiers
  57. + Configurable handling of conversion failures
  58. + \C{} escape sequence in strings
  59. Phase III
  60. ---------
  61. * Exposing ICU API
  62. * Porting all remaining functions to support Unicode and/or binary
  63. strings
  64. Encoding Names
  65. ==============
  66. All the encoding settings discussed in this document accept any valid
  67. encoding name supported by ICU. See ICU online documentation for the full
  68. list of encodings.
  69. Unicode Semantics Switch
  70. ========================
  71. Obviously, PHP cannot simply impose new Unicode support on everyone. There
  72. are many applications that do not care about Unicode and do not need it.
  73. Consequently, there is a switch that enables certain fundamental language
  74. changes related to Unicode. This switch is available only as a site-wide (per
  75. virtual server) INI setting.
  76. Note that having the switch turned off does not imply that PHP is unaware of Unicode
  77. at all and that no Unicode strings can exist. It only affects certain aspects of
  78. the language, and Unicode strings can always be created programmatically. All
  79. the functions and operators will still support Unicode strings and work
  80. appropriately.
  81. unicode.semantics = On
  82. Internal Encoding
  83. =================
  84. UTF-16 is the internal encoding used for Unicode strings. UTF-16 consumes
  85. two bytes for any Unicode character in the Basic Multilingual Plane, which
  86. is where most of the world's current languages are represented. Although it is
  87. less memory efficient for basic ASCII text, it simplifies the processing and
  88. makes interfacing with ICU easier, since ICU uses UTF-16 for its internal
  89. processing as well.
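
The two-bytes-per-BMP-character claim can be illustrated with a minimal sketch of UTF-16 encoding in plain C. The function name utf16_encode is an assumption made for this example and is not part of PHP or ICU; in the real implementation this work is left to ICU.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 code units.
 * Returns the number of units written (1 or 2), or 0 if cp is not
 * encodable (a surrogate value or above U+10FFFF).
 * utf16_encode is a hypothetical helper name, not a PHP or ICU function. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    if (cp < 0x10000) {              /* BMP: one 16-bit unit, i.e. two bytes */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                   /* supplementary plane: surrogate pair */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));
    return 2;
}
```

With this sketch, U+0061 occupies one code unit, while U+10410 becomes the pair 0xD801 0xDC10.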
  90. Fallback Encoding
  91. =================
  92. This setting specifies the "fallback" encoding for all the other ones. So if
  93. a specific encoding setting is not set, PHP defaults it to the fallback
  94. encoding. If the fallback_encoding is not specified either, it is set to
  95. UTF-8.
  96. unicode.fallback_encoding = "iso-8859-1"
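
The lookup order described above could be sketched roughly as follows. The function effective_encoding and its treatment of unset values (NULL or an empty string) are assumptions made for illustration, not the engine's actual code.

```c
#include <stddef.h>

/* Resolve the encoding for one setting: the specific setting wins,
 * then the fallback encoding, then UTF-8.
 * effective_encoding is a hypothetical helper; "unset" is modelled
 * here as NULL or an empty string. */
const char *effective_encoding(const char *specific, const char *fallback)
{
    if (specific != NULL && specific[0] != '\0') {
        return specific;
    }
    if (fallback != NULL && fallback[0] != '\0') {
        return fallback;
    }
    return "UTF-8";
}
```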
  97. Runtime Encoding
  98. ================
  99. Currently PHP neither specifies nor cares what the encoding of its strings
  100. is. However, the Unicode implementation needs to know what this encoding is
  101. for several reasons, including explicit (casting) and implicit (concatenation,
  102. comparison, parameter passing) type coercions. This setting specifies the
  103. runtime encoding.
  104. unicode.runtime_encoding = "iso-8859-1"
  105. Output Encoding
  106. ===============
  107. Automatic output encoding conversion is supported on the standard output
  108. stream. Therefore, commands such as 'print' and 'echo' automatically convert
  109. their arguments to the specified encoding. No automatic output encoding is
  110. performed for anything else. Therefore, when writing to files or external
  111. resources, the developer has to manually encode the data using functions
  112. provided by the unicode extension or rely on stream encoding features.
  113. The existing default_charset setting so far has been used only for
  114. specifying the charset portion of the Content-Type MIME header. For several
  115. reasons, this setting is deprecated. Now it is only used when the Unicode
  116. semantics switch is disabled and does not affect the actual transcoding of
  117. the output stream. The output encoding setting takes precedence in all other
  118. cases. If the output encoding is set, PHP will automatically add the 'charset'
  119. portion to the Content-Type header.
  120. unicode.output_encoding = "utf-8"
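
As a rough illustration of the header behavior, the sketch below appends a charset portion only when an output encoding is set. The helper name format_content_type and the MIME type passed by the caller are assumptions for this example; the document does not specify how the header is built internally.

```c
#include <stddef.h>
#include <stdio.h>

/* Build a Content-Type header line into buf, adding "; charset=..."
 * only if an output encoding is configured.
 * format_content_type is a hypothetical helper, not a PHP API.
 * Returns what snprintf returns (the length the full header needs). */
int format_content_type(char *buf, size_t cap,
                        const char *mime, const char *output_encoding)
{
    if (output_encoding != NULL && output_encoding[0] != '\0') {
        return snprintf(buf, cap, "Content-Type: %s; charset=%s",
                        mime, output_encoding);
    }
    return snprintf(buf, cap, "Content-Type: %s", mime);
}
```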
  121. HTTP Input Encoding
  122. ===================
  123. There will be no explicit input encoding setting. Instead, PHP will rely on a
  124. couple of heuristics to determine what encoding the incoming request might be
  125. in. Firstly, PHP will attempt to decode the input using the value of the
  126. unicode.output_encoding setting, because that is the most logical choice if we
  127. assume that the clients send the data back in the encoding that the page with
  128. the form was in. If that is unsuccessful, we could fall back on the "_charset_"
  129. form parameter, if present. This parameter is sent by IE (and possibly Firefox)
  130. along with the form data and indicates the encoding of the request. Note that
  131. this parameter will be present only if the form contains a hidden field named
  132. "_charset_".
  133. The variables that are decoded successfully will be put into the request arrays
  134. as Unicode strings, those that fail -- as binary strings. PHP will set a
  135. flag (probably in the $_SERVER array) indicating that there were problems during
  136. the conversion. The user will have access to the raw input in case of
  137. failure via the input filter extension and can access the request parameters
  138. via the input_get_arg() function. The input filter extension always looks in
  139. the raw input data and not in the request arrays, and input_get_arg() has a
  140. 'charset' parameter that can be specified to tell PHP what charset the incoming
  141. data is in. This kills two birds with one stone: users have access to request
  142. arrays data on successful decoding as well as a standard and secure way to get
  143. at the data in case of failed decoding.
  144. Script Encoding
  145. ===============
  146. PHP scripts may be written in any encoding supported by ICU. The encoding
  147. of the scripts can be specified site-wide via an INI directive, or with a
  148. 'declare' pragma at the beginning of the script. The reason for pragma is that
  149. an application written in Shift-JIS, for example, should be executable on a
  150. system where the INI directive cannot be changed by the application itself. The
  151. pragma setting is valid only for the script it occurs in, and does not propagate
  152. to the included files.
  153. pragma:
  154. <?php declare(encoding = 'utf-8'); ?>
  155. INI setting:
  156. unicode.script_encoding = utf-8
  157. Conversion Semantics
  158. ====================
  159. Not all characters can be converted between Unicode and legacy encodings.
  160. Normally, when downconverting from Unicode, the default behavior of ICU
  161. converters is to substitute the missing sequence with the appropriate
  162. substitution sequence for that codepage, such as 0x1A (Control-Z) in
  163. ISO-8859-1. When upconverting to Unicode, if an encoding has a character
  164. which cannot be converted into Unicode, that sequence is replaced by the
  165. Unicode substitution character (U+FFFD).
  166. The conversion error behavior can be customized:
  167. - stop the conversion and return an empty string
  168. - skip any invalid characters
  169. - substitute invalid characters with a custom substitution character
  170. - escape the invalid character in various formats
  171. The global conversion error settings can be controlled with these two functions:
  172. unicode_set_error_mode(int direction, int mode)
  173. unicode_set_subst_char(unicode char)
  174. Where direction is either FROM_UNICODE or TO_UNICODE, and mode is one of these
  175. constants:
  176. U_CONV_ERROR_STOP
  177. U_CONV_ERROR_SKIP
  178. U_CONV_ERROR_SUBST
  179. U_CONV_ERROR_ESCAPE_UNICODE
  180. U_CONV_ERROR_ESCAPE_ICU
  181. U_CONV_ERROR_ESCAPE_JAVA
  182. U_CONV_ERROR_ESCAPE_XML_DEC
  183. U_CONV_ERROR_ESCAPE_XML_HEX
  184. The substitution character can be set only for the FROM_UNICODE direction and has to
  185. exist in the target character set.
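
To make the error modes concrete, here is a minimal sketch of a down-conversion to ISO-8859-1 with four of the behaviors listed above. It is plain C and does not use ICU's converters; the enum constants and the to_latin1 function are hypothetical stand-ins for the U_CONV_ERROR_* modes, and the input is taken as an array of code points to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for some of the U_CONV_ERROR_* modes. */
enum conv_error_mode { CONV_STOP, CONV_SKIP, CONV_SUBST, CONV_ESCAPE_XML_DEC };

/* Convert code points to NUL-terminated ISO-8859-1 bytes.
 * Returns the number of bytes written, excluding the NUL.
 * out must hold at least n * 13 + 1 bytes (worst case: one decimal
 * XML escape per input value). */
size_t to_latin1(const uint32_t *cps, size_t n, enum conv_error_mode mode,
                 char subst, char *out)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        if (cps[i] <= 0xFF) {        /* Latin-1 maps 1:1 onto U+0000..U+00FF */
            out[len++] = (char)(unsigned char)cps[i];
            continue;
        }
        switch (mode) {
        case CONV_STOP:              /* stop and return an empty string */
            out[0] = '\0';
            return 0;
        case CONV_SKIP:              /* drop the unconvertible character */
            break;
        case CONV_SUBST:             /* emit the substitution character */
            out[len++] = subst;
            break;
        case CONV_ESCAPE_XML_DEC:    /* emit a decimal XML escape */
            len += (size_t)sprintf(out + len, "&#%lu;", (unsigned long)cps[i]);
            break;
        }
    }
    out[len] = '\0';
    return len;
}
```

For the input U+0041, U+4E2D, U+0042, this sketch yields "AB" when skipping, "A?B" when substituting '?', and "A&#20013;B" with decimal XML escapes.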
  186. Unicode String Type
  187. ===================
  188. Unicode string type (IS_UNICODE) is supposed to contain text data encoded in
  189. UTF-16 format. It is the main string type in PHP when Unicode semantics
  190. switch is turned on. Unicode strings can exist when the switch is off, but
  191. they have to be produced programmatically, via calls to functions that
  192. return Unicode type.
  193. The operational unit when working with Unicode strings is a code point, not
  194. code unit or byte. One code point in UTF-16 may be comprised of 1 or 2 code
  195. units, each of which is a 16-bit word. Working on the code point level is
  196. necessary because doing otherwise would mean offloading the processing of
  197. surrogate pairs onto PHP users, and that is less than desirable.
  198. The repercussions are that one cannot expect code point N to be at offset N in
  199. the Unicode string. Instead, one has to iterate from the beginning of the
  200. string using the U16_FWD() macro until the desired codepoint is reached. This will
  201. be transparent to the end user who will work only with "character" offsets.
  202. The codepoint access is one of the primary areas targeted for optimization.
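
The forward iteration described above can be sketched without ICU as follows. The function cp_index_to_offset is an assumption made for this example; it mimics what stepping with ICU's forward macros achieves, but it is not the engine's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the code unit offset at which code point number `index`
 * (0-based) starts, or `len` if the string has fewer code points.
 * cp_index_to_offset is a hypothetical helper name. */
size_t cp_index_to_offset(const uint16_t *s, size_t len, size_t index)
{
    size_t off = 0;
    while (index > 0 && off < len) {
        /* a lead surrogate followed by a trail surrogate is one code point */
        if (s[off] >= 0xD800 && s[off] <= 0xDBFF && off + 1 < len &&
            s[off + 1] >= 0xDC00 && s[off + 1] <= 0xDFFF) {
            off += 2;
        } else {
            off += 1;
        }
        index--;
    }
    return off;
}
```

The cost is linear in the index, which is why code point access is singled out as an optimization target.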
  203. Binary String Type
  204. ==================
  205. Binary string type (IS_STRING) serves two purposes: backwards compatibility and
  206. representing non-Unicode strings and binary data. When Unicode semantics switch
  207. is off, it is used for all strings in PHP, same as in previous versions. When the
  208. switch is on, this type will be used to store text in other encodings as well as
  209. true binary data such as images, PDFs, etc.
  210. Printing binary data to the standard output passes it through as-is, independent
  211. of the output encoding.
  212. Zval Structure Changes
  213. ======================
  214. PHP is a type-agnostic language. Its data values are encapsulated in a zval
  215. (Zend value) structure that can change as necessary to accommodate various types.
  216. struct _zval_struct {
  217. /* Variable information */
  218. union {
  219. long lval; /* long value */
  220. double dval; /* double value */
  221. struct {
  222. char *val;
  223. int len;
  224. } str; /* string value */
  225. HashTable *ht; /* hash table value */
  226. zend_object_value obj; /* object value */
  227. } value;
  228. zend_uint refcount;
  229. zend_uchar type; /* active type */
  230. zend_uchar is_ref;
  231. };
  232. The type field determines what is stored in the union, IS_STRING being the only
  233. data type pertinent to this discussion. In the current version, the strings
  234. are binary-safe, but, for all intents and purposes, are assumed to be
  235. comprised of 8-bit characters. It is possible to treat the string value as
  236. an opaque type containing arbitrary binary data, and in fact that is how
  237. mbstring extension uses it, in order to store multibyte strings. However,
  238. many extensions and the Zend engine itself manipulate the string value
  239. directly without regard to its internals. Needless to say, this can lead to
  240. problems.
  241. For IS_UNICODE type, we need to add another structure to the union:
  242. union {
  243. ....
  244. struct {
  245. UChar *val; /* Unicode string value */
  246. int len; /* number of UChar's */
  247. } ustr;
  248. ....
  249. } value;
  250. This cleanly separates the two types of strings and helps preserve backwards
  251. compatibility.
  252. To optimize access to IS_STRING and IS_UNICODE storage at runtime, we need yet
  253. another structure:
  254. union {
  255. ....
  256. struct { /* Universal string type */
  257. zstr val;
  258. int len;
  259. } uni;
  260. ....
  261. } value;
  262. Where zstr is a union of char*, UChar*, and void*.
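
A minimal sketch of such a union is shown below. The member names (s, u, v), the type constants, and the helper zstr_byte_size are assumptions for illustration; the constants are placeholders and not the engine's real values, and UChar is modelled as a 16-bit unit as in ICU.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t UChar;          /* stand-in for ICU's 16-bit code unit */

/* Placeholder type tags; the real engine values may differ. */
enum { SKETCH_IS_STRING = 1, SKETCH_IS_UNICODE = 2 };

/* One pointer, three views; member names are assumed for this sketch. */
typedef union {
    char  *s;                    /* binary string data */
    UChar *u;                    /* Unicode (UTF-16) string data */
    void  *v;                    /* type-agnostic view */
} zstr;

/* Storage size in bytes for `len` units of the given string type. */
size_t zstr_byte_size(int type, int len)
{
    size_t unit = (type == SKETCH_IS_UNICODE) ? sizeof(UChar) : sizeof(char);
    return (size_t)len * unit;
}
```

Code that only copies or frees the buffer can go through the void* view and the shared length, and branch on the type only where the unit size matters.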
  263. Language Modifications
  264. ======================
  265. If the Unicode switch is turned on, PHP string literals - single-quoted,
  266. double-quoted, and heredocs - become Unicode strings (IS_UNICODE type).
  267. They support all the same escape sequences and variable interpolations as
  268. previously, with the addition of some new escape sequences.
  269. The contents of the strings are interpreted as follows:
  270. - all non-escaped characters are interpreted as a corresponding Unicode
  271. codepoint based on the current script encoding, e.g. ASCII 'a' (0x61) =>
  272. U+0061, Shift-JIS (0x92 0x86) => U+4E2D
  273. - existing PHP escape sequences are also interpreted as Unicode codepoints,
  274. including \xXX (hex) and \OOO (octal) numbers, e.g. "\x20" => U+0020
  275. - two new escape sequences, \uXXXX and \UXXXXXX are interpreted as a 4 or
  276. 6-hex Unicode codepoint value, e.g. \u0221 => U+0221, \U010410 =>
  277. U+10410
  278. - a new escape sequence allows specifying a character by its full
  279. Unicode name, e.g. \C{THAI CHARACTER PHO SAMPHAO} => U+0E20
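
The two numeric escape forms above could be recognized along these lines. This is a sketch only: parse_unicode_escape and hex_value are hypothetical names, and the real scanner handles these sequences as part of lexing string literals.

```c
#include <stddef.h>
#include <stdint.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse "\uXXXX" (4 hex digits) or "\UXXXXXX" (6 hex digits) at s.
 * On success stores the code point in *cp and returns the number of
 * characters consumed; returns 0 if s does not start with such an escape. */
size_t parse_unicode_escape(const char *s, uint32_t *cp)
{
    size_t digits, i;
    uint32_t value = 0;

    if (s[0] != '\\') return 0;
    if (s[1] == 'u') digits = 4;
    else if (s[1] == 'U') digits = 6;
    else return 0;

    for (i = 0; i < digits; i++) {
        int h = hex_value(s[2 + i]);
        if (h < 0) return 0;     /* also stops at the terminating NUL */
        value = (value << 4) | (uint32_t)h;
    }
    if (value > 0x10FFFF) return 0;
    *cp = value;
    return 2 + digits;
}
```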
  280. The single-quoted string is more restrictive than the other two types: so
  281. far the only escape sequence allowed inside of it was \', which specifies
  282. a literal single quote. However, single quoted strings now support the new
  283. Unicode character escape sequences as well.
  284. PHP allows variable interpolation inside the double-quoted and heredoc strings.
  285. However, the parser separates the string into literal and variable chunks during
  286. compilation, e.g. "abc $var def" -> "abc" . $var . "def". This means that the
  287. literal chunks can be handled in the normal way as far as Unicode
  288. support is concerned.
  289. Since all string literals become Unicode by default, one loses the ability
  290. to specify byte-oriented or binary strings. In order to create binary string
  291. literals, a new syntax is necessary: prefixing a string literal with letter
  292. 'b' creates a binary string.
  293. $var = b'abc\001';
  294. $var = b"abc\001";
  295. $var = b<<<EOD
  296. abc\001
  297. EOD;
  298. The binary string literals support the same escape sequences as the current
  299. PHP strings. If the Unicode switch is turned off, then the binary string
  300. literals generate normal string (IS_STRING) type internally, without any
  301. effect on the application.
  302. The string operators have been changed to accommodate the new IS_UNICODE and
  303. binary (IS_STRING) types. In more detail:
  304. - The concatenation (.) operator has been changed to automatically coerce
  305. IS_STRING type to the more precise IS_UNICODE if its operands are of two
  306. different string types.
  307. - The concatenation assignment operator (.=) has been changed similarly.
  308. - The string indexing operator [] has been changed to accommodate IS_UNICODE
  309. type strings and extract the specified character. Note that the index
  310. specifies a code point, not a byte, or a code unit, thus supporting
  311. supplementary characters.
  312. - Both Unicode and binary string types can be used as array keys. If the
  313. Unicode switch is on, the binary keys are converted to Unicode.
  314. - Bitwise operators and increment/decrement operators do not work on
  315. Unicode strings. They do work on binary strings.
  316. - Two new casting operators are introduced, (unicode) and (binary). The
  317. (string) operator will cast to Unicode type if the Unicode semantics switch is
  318. on, and to binary type otherwise.
  319. - The comparison operators, when applied to Unicode strings, perform
  320. comparison in binary code point order. They also do appropriate coercion
  321. if the strings are of differing types.
  322. - The arithmetic operators use the same semantics as today for converting
  323. strings to numbers. A Unicode string is considered numeric if it
  324. represents a long or a double number in en_US_POSIX locale.
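
Code point order differs from raw UTF-16 code unit order once supplementary characters are involved, because surrogate units (0xD800..0xDFFF) sort below BMP characters such as U+FF61. The sketch below compares two UTF-16 strings in code point order; the function names are assumptions for this example, and a real implementation would more likely rely on ICU.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the code point starting at s[*i] and advance *i past it. */
static uint32_t next_cp(const uint16_t *s, size_t len, size_t *i)
{
    uint32_t c = s[(*i)++];
    if (c >= 0xD800 && c <= 0xDBFF && *i < len &&
        s[*i] >= 0xDC00 && s[*i] <= 0xDFFF) {
        uint32_t trail = s[(*i)++];
        c = 0x10000 + ((c - 0xD800) << 10) + (trail - 0xDC00);
    }
    return c;
}

/* Compare two UTF-16 strings in code point order.
 * Returns -1, 0, or 1, like a three-way comparison. */
int compare_code_point_order(const uint16_t *a, size_t alen,
                             const uint16_t *b, size_t blen)
{
    size_t i = 0, j = 0;
    while (i < alen && j < blen) {
        uint32_t ca = next_cp(a, alen, &i);
        uint32_t cb = next_cp(b, blen, &j);
        if (ca != cb) {
            return ca < cb ? -1 : 1;
        }
    }
    if (i < alen) return 1;      /* a is longer, so it sorts after b */
    if (j < blen) return -1;
    return 0;
}
```

For example, U+FF61 sorts before U+10000 under this comparison, although its single code unit 0xFF61 is numerically larger than the lead surrogate 0xD800.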
  325. Inline HTML
  326. ===========
  327. Because inline HTML blocks are intermixed with PHP ones, they are also
  328. written in the script encoding. PHP transcodes the HTML blocks to the output
  329. encoding as needed, resulting in direct passthrough if the script encoding
  330. matches output encoding.
  331. Identifiers
  332. ===========
  333. Considering that scripts may be written in various encodings, we do not
  334. restrict identifiers to be ASCII-only. PHP allows any valid identifier based
  335. on the Unicode Standard Annex #31. The identifiers are case folded when
  336. necessary (class and function names) and converted to normalization form
  337. NFKC, so that two identifiers written in two compatible ways refer to the
  338. same thing.
  339. Numbers
  340. =======
  341. Unlike identifiers, we restrict numbers to consist only of ASCII digits and
  342. do not interpret them as written in a specific locale. The numbers are
  343. expected to adhere to en_US_POSIX or C locale, i.e. having no thousands
  344. separator and fractional separator being (.) "full stop". Numeric strings
  345. are supposed to adhere to the same rules, i.e. "10,3" is not interpreted as
  346. a number even if the current locale's fractional separator is comma.
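
A locale-independent numeric check in the spirit of this rule might look like the sketch below. It is deliberately simplified (optional sign, ASCII digits, optional full-stop fraction, optional exponent) and is not PHP's actual numeric-string algorithm; is_numeric_posix is a hypothetical name. It avoids strtod() because that function honors the current locale.

```c
#include <stdbool.h>

static bool is_ascii_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* Simplified check: [sign] digits [. digits] [e [sign] digits],
 * using only ASCII digits and '.' as the fractional separator. */
bool is_numeric_posix(const char *s)
{
    int digits = 0;

    if (*s == '+' || *s == '-') s++;
    while (is_ascii_digit(*s)) { s++; digits++; }
    if (*s == '.') {
        s++;
        while (is_ascii_digit(*s)) { s++; digits++; }
    }
    if (digits == 0) return false;       /* "", "." and "+" are not numbers */

    if (*s == 'e' || *s == 'E') {
        s++;
        if (*s == '+' || *s == '-') s++;
        if (!is_ascii_digit(*s)) return false;
        while (is_ascii_digit(*s)) s++;
    }
    return *s == '\0';                   /* no trailing characters allowed */
}
```

Under this sketch "10.3" is numeric and "10,3" is not, regardless of the process locale.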
  347. Parameter Parsing API Modifications
  348. ===================================
  349. Internal PHP functions largely use the zend_parse_parameters() API in order to
  350. obtain the parameters passed to them by the user. For example:
  351. char *str;
  352. int len;
  353. if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "s", &str, &len) == FAILURE) {
  354. return;
  355. }
  356. This forces the input parameter to be a string, and its value and length are
  357. stored in the variables specified by the caller.
  358. There are now five new specifiers: 'u', 't', 'T', 'U', and 'S'.
  359. 't' specifier
  360. -------------
  361. This specifier indicates that the caller requires the incoming parameter to be
  362. string data (IS_STRING, IS_UNICODE). The caller has to provide the storage for
  363. string value, length, and type.
  364. void *str;
  365. int len;
  366. zend_uchar type;
  367. if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "t", &str, &len, &type) == FAILURE) {
  368. return;
  369. }
  370. if (type == IS_UNICODE) {
  371. /* process Unicode string */
  372. } else {
  373. /* process binary string */
  374. }
  375. For IS_STRING type, the length represents the number of bytes, and for
  376. IS_UNICODE the number of UChar's. When converting other types (numbers,
  377. booleans, etc) to strings, the exact behavior depends on the Unicode semantics
  378. switch: if on, they are converted to IS_UNICODE, otherwise to IS_STRING.
  379. 'u' specifier
  380. -------------
  381. This specifier indicates that the caller requires the incoming parameter
  382. to be a Unicode encoded string. If a non-Unicode string is passed, the engine
  383. creates a copy of the string and automatically converts it to Unicode type before
  384. passing it to the internal function. No such conversion is necessary for Unicode
  385. strings, obviously.
  386. UChar *str;
  387. int len;
  388. if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "u", &str, &len) == FAILURE) {
  389. return;
  390. }
  391. /* process Unicode string */
  392. 'T' specifier
  393. -------------
  394. This specifier is useful when the function takes two or more strings and
  395. operates on them. Using 't' specifier for each one would be somewhat
  396. problematic if the passed-in strings are of mixed types, and multiple
  397. checks need to be performed in order to do anything. All parameters
  398. marked by the 'T' specifier are promoted to the same type.
  399. If at least one of the 'T' parameters is of Unicode type, then the rest of
  400. them are converted to IS_UNICODE. Otherwise all 'T' parameters are converted to
  401. IS_STRING type.
  402. void *str1, *str2;
  403. int len1, len2;
  404. zend_uchar type1, type2;
  405. if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "TT", &str1, &len1,
  406. &type1, &str2, &len2, &type2) == FAILURE) {
  407. return;
  408. }
  409. if (type1 == IS_UNICODE) {
  410. /* process as Unicode, str2 is guaranteed to be Unicode as well */
  411. } else {
  412. /* process as binary string, str2 is guaranteed to be the same */
  413. }
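
The promotion rule for 'T' parameters reduces to a small decision, sketched here outside the engine. The enum and the function promoted_string_type are hypothetical stand-ins for IS_STRING and IS_UNICODE handling, used only to show the rule.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the IS_STRING and IS_UNICODE type tags. */
enum str_type { STR_BINARY, STR_UNICODE };

/* Decide the common type for all 'T' parameters: Unicode if at least
 * one of them is Unicode, binary otherwise. */
enum str_type promoted_string_type(const enum str_type *types, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (types[i] == STR_UNICODE) {
            return STR_UNICODE;
        }
    }
    return STR_BINARY;
}
```

Because every 'T' parameter ends up with this one type, the caller only needs to inspect the type of the first parameter.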
  414. The existing 's' specifier has been modified as well. If a Unicode string is
  415. passed in, it automatically copies and converts the string to the runtime
  416. encoding, and issues a warning. If a binary type is passed-in, no conversion
  417. is necessary.
  418. The 'U' and 'S' specifiers are similar to 'u' and 's' but they are more strict
  419. about the type of the passed-in parameter. If 'U' is specified and a binary
  420. string is passed in, the engine will issue a warning instead of doing automatic
  421. conversion. The converse applies to the 'S' specifier.
  422. Upgrading Existing Functions
  423. ============================
  424. Upgrading functions to work with new data types will be a deliberate and
  425. involved process, because one needs to consider not only the mechanisms for
  426. processing Unicode characters, for example, but also the semantics of
  427. the function.
  428. The main tenet of the upgrade process should be that when processing Unicode
  429. strings, the unit of operation is a code point, not a code unit or a byte.
  430. For example, strlen() returns the number of code points in the string.
  431. strlen('abc') = 3
  432. strlen('ab\U010000') = 3
  433. strlen('ab\uD800\uDC00') = 3 /* not 4 */
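
The counting rule behind these strlen() results can be sketched in plain C as below. The name count_code_points is an assumption for this example; it shows the idea of counting a surrogate pair once rather than reproducing the actual strlen() implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Count code points in a UTF-16 string of `len` code units.
 * A trail surrogate that directly follows a lead surrogate belongs to
 * the same code point and is not counted again. */
size_t count_code_points(const uint16_t *s, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        bool trail = s[i] >= 0xDC00 && s[i] <= 0xDFFF;
        bool after_lead = i > 0 && s[i - 1] >= 0xD800 && s[i - 1] <= 0xDBFF;
        if (!(trail && after_lead)) {
            count++;
        }
    }
    return count;
}
```

Applied to the four code units 'a', 'b', 0xD800, 0xDC00, the sketch returns 3, matching the last example above.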
  434. Function upgrade guidelines are available in a separate document.
  435. Document TODO
  436. =============
  437. - Streams support for Unicode - What stream filters will be provided?
  438. - User conversion error handler
  439. - INI files encoding - UTF-8? Do we support BOMs?
  440. - There are likely to be other issues which are missing from this document
  441. Build System
  442. ============
  443. Unicode support in PHP is always enabled. The only configuration option
  444. during development should be the location of the ICU headers and libraries.
  445. --with-icu-dir=<dir> <dir> parameter specifies the location of ICU
  446. header and library files.
  447. After the initial development we have to repackage ICU library for our needs
  448. and bundle it with PHP.
  449. Document History
  450. ================
  451. 0.6: Remove notion of native encoding string, only 2 string types are used
  452. now. Update conversion error behavior section and parameter parsing.
  453. Bring the document up-to-date with reality in general.
  454. 0.5: Updated per latest discussions. Removed tentative language in several
  455. places, since we have decided on everything described here already.
  456. Clarified details according to Phase II progress.
  457. 0.4: Updated to include all the latest discussions. Updated development
  458. phases.
  459. 0.3: Updated to include all the latest discussions.
  460. 0.2: Updated Phase I design proposal per discussion on unicode@php.net.
  461. Modified Internal Encoding section to contain only UTF-16 info.
  462. Expanded Script Encoding section.
  463. Added Binary Data Type section.
  464. Amended Language Modifications section to describe string literals
  465. behavior.
  466. Amended Build System section.
  467. 0.1: Phase I design proposal
  468. References
  469. ==========
  470. Unicode
  471. http://www.unicode.org
  472. Unicode Glossary
  473. http://www.unicode.org/glossary/
  474. UTF-8
  475. http://www.utf-8.com/
  476. UTF-16
  477. http://www.ietf.org/rfc/rfc2781.txt
  478. ICU Homepage
  479. http://www.ibm.com/software/globalization/icu/
  480. ICU User Guide and API Reference
  481. http://icu.sourceforge.net/
  482. Unicode Annex #31
  483. http://www.unicode.org/reports/tr31/
  484. PHP Parameter Parsing API
  485. http://www.php.net/manual/en/zend.arguments.retrieval.php
  486. Authors
  487. =======
  488. Andrei Zmievski <andrei@gravitonic.com>
  489. vim: set et :