#
# (re)generate unicode property and type databases
#
# this script converts a unicode 3.2 database file to
# Modules/unicodedata_db.h, Modules/unicodename_db.h,
# and Objects/unicodetype_db.h
#
# history:
# 2000-09-24 fl created (based on bits and pieces from unidb)
# 2000-09-25 fl merged tim's splitbin fixes, separate decomposition table
# 2000-09-25 fl added character type table
# 2000-09-26 fl added LINEBREAK, DECIMAL, and DIGIT flags/fields (2.0)
# 2000-11-03 fl expand first/last ranges
# 2001-01-19 fl added character name tables (2.1)
# 2001-01-21 fl added decomp compression; dynamic phrasebook threshold
# 2002-09-11 wd use string methods
# 2002-10-18 mvl update to Unicode 3.2
# 2002-10-22 mvl generate NFC tables
# 2002-11-24 mvl expand all ranges, sort names version-independently
# 2002-11-25 mvl add UNIDATA_VERSION
# 2004-05-29 perky add east asian width information
# 2006-03-10 mvl update to Unicode 4.1; add UCD 3.2 delta
# 2008-06-11 gb add PRINTABLE_MASK for Atsuo Ishimoto's ascii() patch
# 2011-10-21 ezio add support for name aliases and named sequences
# 2012-01 benjamin add full case mappings
#
# written by Fredrik Lundh (fredrik@pythonware.com)
#

import os
import sys
import zipfile

from textwrap import dedent
from operator import itemgetter

SCRIPT = sys.argv[0]
VERSION = "3.2"

# The Unicode Database
UNIDATA_VERSION = "6.0.0"
UNICODE_DATA = "UnicodeData%s.txt"
COMPOSITION_EXCLUSIONS = "CompositionExclusions%s.txt"
EASTASIAN_WIDTH = "EastAsianWidth%s.txt"
UNIHAN = "Unihan%s.zip"
DERIVED_CORE_PROPERTIES = "DerivedCoreProperties%s.txt"
DERIVEDNORMALIZATION_PROPS = "DerivedNormalizationProps%s.txt"
LINE_BREAK = "LineBreak%s.txt"
NAME_ALIASES = "NameAliases%s.txt"
NAMED_SEQUENCES = "NamedSequences%s.txt"
SPECIAL_CASING = "SpecialCasing%s.txt"

# Private Use Areas -- in planes 0 (the BMP), 15, 16
PUA_1 = range(0xE000, 0xF900)
PUA_15 = range(0xF0000, 0xFFFFE)
PUA_16 = range(0x100000, 0x10FFFE)

# we use these ranges of PUA_15 to store name aliases and named sequences
NAME_ALIASES_START = 0xF0000
NAMED_SEQUENCES_START = 0xF0100

old_versions = ["3.2.0"]

CATEGORY_NAMES = [ "Cn", "Lu", "Ll", "Lt", "Mn", "Mc", "Me", "Nd",
    "Nl", "No", "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn", "Lm",
    "Lo", "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc", "Sk",
    "So" ]

BIDIRECTIONAL_NAMES = [ "", "L", "LRE", "LRO", "R", "AL", "RLE", "RLO",
    "PDF", "EN", "ES", "ET", "AN", "CS", "NSM", "BN", "B", "S", "WS",
    "ON" ]

EASTASIANWIDTH_NAMES = [ "F", "H", "W", "Na", "A", "N" ]

MANDATORY_LINE_BREAKS = [ "BK", "CR", "LF", "NL" ]

# note: should match definitions in Objects/unicodectype.c
ALPHA_MASK = 0x01
DECIMAL_MASK = 0x02
DIGIT_MASK = 0x04
LOWER_MASK = 0x08
LINEBREAK_MASK = 0x10
SPACE_MASK = 0x20
TITLE_MASK = 0x40
UPPER_MASK = 0x80
XID_START_MASK = 0x100
XID_CONTINUE_MASK = 0x200
PRINTABLE_MASK = 0x400
NUMERIC_MASK = 0x800
CASE_IGNORABLE_MASK = 0x1000
CASED_MASK = 0x2000
EXTENDED_CASE_MASK = 0x4000
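
# Illustrative sketch, not part of the original script: each *_MASK value
# above is a single bit, and the type-table code ORs the applicable bits into
# one integer per character. The helper name _example_flag_names and its
# `masks` argument are hypothetical; a caller would pass a mapping such as
# {"ALPHA": ALPHA_MASK, "LOWER": LOWER_MASK}.
def _example_flag_names(flags, masks):
    """Return the sorted names in `masks` whose bit is set in `flags`."""
    return sorted(name for name, bit in masks.items() if flags & bit)
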

# these ranges need to match unicodedata.c:is_unified_ideograph
cjk_ranges = [
    ('3400', '4DB5'),
    ('4E00', '9FCB'),
    ('20000', '2A6D6'),
    ('2A700', '2B734'),
    ('2B740', '2B81D')
]
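
# Illustrative sketch, not part of the original script: cjk_ranges holds
# (first, last) pairs as hex strings. Assuming both endpoints are inclusive,
# a membership test could look like the hypothetical helper below, which
# takes the list of pairs as an argument rather than reading the global.
def _example_in_hex_ranges(code, ranges):
    """Return True if `code` lies inside any inclusive (first, last) pair."""
    return any(int(first, 16) <= code <= int(last, 16)
               for first, last in ranges)
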

def maketables(trace=0):

    print("--- Reading", UNICODE_DATA % "", "...")

    version = ""
    unicode = UnicodeData(UNIDATA_VERSION)

    print(len(list(filter(None, unicode.table))), "characters")

    for version in old_versions:
        print("--- Reading", UNICODE_DATA % ("-"+version), "...")
        old_unicode = UnicodeData(version, cjk_check=False)
        print(len(list(filter(None, old_unicode.table))), "characters")
        merge_old_version(version, unicode, old_unicode)

    makeunicodename(unicode, trace)
    makeunicodedata(unicode, trace)
    makeunicodetype(unicode, trace)
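
# Illustrative sketch, not part of the original script: the table builders
# called by maketables() store each distinct property record once and keep a
# per-character index into that list. The hypothetical helper below shows the
# same cache/table/index pattern on an arbitrary sequence of hashable records;
# unlike the real code it does not reserve slot 0 for a dummy record.
def _example_intern_records(records):
    """Return (table, index): unique records plus one table slot per input."""
    table = []
    cache = {}
    index = []
    for item in records:
        i = cache.get(item)
        if i is None:
            # first time this record is seen: give it the next free slot
            cache[item] = i = len(table)
            table.append(item)
        index.append(i)
    return table, index
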

# --------------------------------------------------------------------
# unicode character properties

def makeunicodedata(unicode, trace):

    dummy = (0, 0, 0, 0, 0, 0)
    table = [dummy]
    cache = {0: dummy}
    index = [0] * len(unicode.chars)

    FILE = "Modules/unicodedata_db.h"

    print("--- Preparing", FILE, "...")

    # 1) database properties

    for char in unicode.chars:
        record = unicode.table[char]
        if record:
            # extract database properties
            category = CATEGORY_NAMES.index(record[2])
            combining = int(record[3])
            bidirectional = BIDIRECTIONAL_NAMES.index(record[4])
            mirrored = record[9] == "Y"
            eastasianwidth = EASTASIANWIDTH_NAMES.index(record[15])
            normalizationquickcheck = record[17]
            item = (
                category, combining, bidirectional, mirrored, eastasianwidth,
                normalizationquickcheck
                )
            # add entry to index and item tables
            i = cache.get(item)
            if i is None:
                cache[item] = i = len(table)
                table.append(item)
            index[char] = i

    # 2) decomposition data

    decomp_data = [0]
    decomp_prefix = [""]
    decomp_index = [0] * len(unicode.chars)
    decomp_size = 0

    comp_pairs = []
    comp_first = [None] * len(unicode.chars)
    comp_last = [None] * len(unicode.chars)

    for char in unicode.chars:
        record = unicode.table[char]
        if record:
            if record[5]:
                decomp = record[5].split()
                if len(decomp) > 19:
                    raise Exception("character %x has a decomposition too large for nfd_nfkd" % char)
                # prefix
                if decomp[0][0] == "<":
                    prefix = decomp.pop(0)
                else:
                    prefix = ""
                try:
                    i = decomp_prefix.index(prefix)
                except ValueError:
                    i = len(decomp_prefix)
                    decomp_prefix.append(prefix)
                prefix = i
                assert prefix < 256
                # content
                decomp = [prefix + (len(decomp)<<8)] + [int(s, 16) for s in decomp]
                # Collect NFC pairs
                if not prefix and len(decomp) == 3 and \
                   char not in unicode.exclusions and \
                   unicode.table[decomp[1]][3] == "0":
                    p, l, r = decomp
                    comp_first[l] = 1
                    comp_last[r] = 1
                    comp_pairs.append((l,r,char))
                try:
                    i = decomp_data.index(decomp)
                except ValueError:
                    i = len(decomp_data)
                    decomp_data.extend(decomp)
                    decomp_size = decomp_size + len(decomp) * 2
            else:
                i = 0
            decomp_index[char] = i

    f = l = 0
    comp_first_ranges = []
    comp_last_ranges = []
    prev_f = prev_l = None
    for i in unicode.chars:
        if comp_first[i] is not None:
            comp_first[i] = f
            f += 1
            if prev_f is None:
                prev_f = (i,i)
            elif prev_f[1]+1 == i:
                prev_f = prev_f[0],i
            else:
                comp_first_ranges.append(prev_f)
                prev_f = (i,i)
        if comp_last[i] is not None:
            comp_last[i] = l
            l += 1
            if prev_l is None:
                prev_l = (i,i)
            elif prev_l[1]+1 == i:
                prev_l = prev_l[0],i
            else:
                comp_last_ranges.append(prev_l)
                prev_l = (i,i)
    comp_first_ranges.append(prev_f)
    comp_last_ranges.append(prev_l)
    total_first = f
    total_last = l

    comp_data = [0]*(total_first*total_last)
    for f,l,char in comp_pairs:
        f = comp_first[f]
        l = comp_last[l]
        comp_data[f*total_last+l] = char

    print(len(table), "unique properties")
    print(len(decomp_prefix), "unique decomposition prefixes")
    print(len(decomp_data), "unique decomposition entries:", end=' ')
    print(decomp_size, "bytes")
    print(total_first, "first characters in NFC")
    print(total_last, "last characters in NFC")
    print(len(comp_pairs), "NFC pairs")

    print("--- Writing", FILE, "...")

    fp = open(FILE, "w")
    print("/* this file was generated by %s %s */" % (SCRIPT, VERSION), file=fp)
    print(file=fp)
    print('#define UNIDATA_VERSION "%s"' % UNIDATA_VERSION, file=fp)
    print("/* a list of unique database records */", file=fp)
    print("const _PyUnicode_DatabaseRecord _PyUnicode_Database_Records[] = {", file=fp)
    for item in table:
        print(" {%d, %d, %d, %d, %d, %d}," % item, file=fp)
    print("};", file=fp)
    print(file=fp)

    print("/* Reindexing of NFC first characters. */", file=fp)
    print("#define TOTAL_FIRST",total_first, file=fp)
    print("#define TOTAL_LAST",total_last, file=fp)
    print("struct reindex{int start;short count,index;};", file=fp)
    print("static struct reindex nfc_first[] = {", file=fp)
    for start,end in comp_first_ranges:
        print(" { %d, %d, %d}," % (start,end-start,comp_first[start]), file=fp)
    print(" {0,0,0}", file=fp)
    print("};\n", file=fp)
    print("static struct reindex nfc_last[] = {", file=fp)
    for start,end in comp_last_ranges:
        print(" { %d, %d, %d}," % (start,end-start,comp_last[start]), file=fp)
    print(" {0,0,0}", file=fp)
    print("};\n", file=fp)

    # FIXME: <fl> the following tables could be made static, and
    # the support code moved into unicodedatabase.c

    print("/* string literals */", file=fp)
    print("const char *_PyUnicode_CategoryNames[] = {", file=fp)
    for name in CATEGORY_NAMES:
        print(" \"%s\"," % name, file=fp)
    print(" NULL", file=fp)
    print("};", file=fp)

    print("const char *_PyUnicode_BidirectionalNames[] = {", file=fp)
    for name in BIDIRECTIONAL_NAMES:
        print(" \"%s\"," % name, file=fp)
    print(" NULL", file=fp)
    print("};", file=fp)

    print("const char *_PyUnicode_EastAsianWidthNames[] = {", file=fp)
    for name in EASTASIANWIDTH_NAMES:
        print(" \"%s\"," % name, file=fp)
    print(" NULL", file=fp)
    print("};", file=fp)

    print("static const char *decomp_prefix[] = {", file=fp)
    for name in decomp_prefix:
        print(" \"%s\"," % name, file=fp)
    print(" NULL", file=fp)
    print("};", file=fp)

    # split record index table
    index1, index2, shift = splitbins(index, trace)

    print("/* index tables for the database records */", file=fp)
    print("#define SHIFT", shift, file=fp)
    Array("index1", index1).dump(fp, trace)
    Array("index2", index2).dump(fp, trace)

    # split decomposition index table
    index1, index2, shift = splitbins(decomp_index, trace)

    print("/* decomposition data */", file=fp)
    Array("decomp_data", decomp_data).dump(fp, trace)

    print("/* index tables for the decomposition data */", file=fp)
    print("#define DECOMP_SHIFT", shift, file=fp)
    Array("decomp_index1", index1).dump(fp, trace)
    Array("decomp_index2", index2).dump(fp, trace)

    index, index2, shift = splitbins(comp_data, trace)
    print("/* NFC pairs */", file=fp)
    print("#define COMP_SHIFT", shift, file=fp)
    Array("comp_index", index).dump(fp, trace)
    Array("comp_data", index2).dump(fp, trace)

    # Generate delta tables for old versions
    for version, table, normalization in unicode.changed:
        cversion = version.replace(".","_")
        records = [table[0]]
        cache = {table[0]:0}
        index = [0] * len(table)
        for i, record in enumerate(table):
            try:
                index[i] = cache[record]
            except KeyError:
                index[i] = cache[record] = len(records)
                records.append(record)
        index1, index2, shift = splitbins(index, trace)
        print("static const change_record change_records_%s[] = {" % cversion, file=fp)
        for record in records:
            print("\t{ %s }," % ", ".join(map(str,record)), file=fp)
        print("};", file=fp)
        Array("changes_%s_index" % cversion, index1).dump(fp, trace)
        Array("changes_%s_data" % cversion, index2).dump(fp, trace)
        print("static const change_record* get_change_%s(Py_UCS4 n)" % cversion, file=fp)
        print("{", file=fp)
        print("\tint index;", file=fp)
        print("\tif (n >= 0x110000) index = 0;", file=fp)
        print("\telse {", file=fp)
        print("\t\tindex = changes_%s_index[n>>%d];" % (cversion, shift), file=fp)
        print("\t\tindex = changes_%s_data[(index<<%d)+(n & %d)];" % \
              (cversion, shift, ((1<<shift)-1)), file=fp)
        print("\t}", file=fp)
        print("\treturn change_records_%s+index;" % cversion, file=fp)
        print("}\n", file=fp)
        print("static Py_UCS4 normalization_%s(Py_UCS4 n)" % cversion, file=fp)
        print("{", file=fp)
        print("\tswitch(n) {", file=fp)
        for k, v in normalization:
            print("\tcase %s: return 0x%s;" % (hex(k), v), file=fp)
        print("\tdefault: return 0;", file=fp)
        print("\t}\n}\n", file=fp)

    fp.close()
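
# Illustrative sketch, not part of the original script: makeunicodedata()
# collapses the code points that can start (or end) an NFC pair into
# contiguous runs before writing the nfc_first / nfc_last reindex tables.
# The hypothetical helper below shows that run-coalescing step in isolation.
# It assumes `codes` is sorted in ascending order, and unlike the original
# loop it does not append a final entry when the input is empty.
def _example_coalesce_runs(codes):
    """Collapse sorted integers into inclusive (start, end) runs."""
    runs = []
    prev = None
    for i in codes:
        if prev is None:
            prev = (i, i)
        elif prev[1] + 1 == i:
            # extends the current run by one code point
            prev = (prev[0], i)
        else:
            runs.append(prev)
            prev = (i, i)
    if prev is not None:
        runs.append(prev)
    return runs
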

# --------------------------------------------------------------------
# unicode character type tables

def makeunicodetype(unicode, trace):

    FILE = "Objects/unicodetype_db.h"

    print("--- Preparing", FILE, "...")

    # extract unicode types
    dummy = (0, 0, 0, 0, 0, 0)
    table = [dummy]
    cache = {0: dummy}
    index = [0] * len(unicode.chars)
    numeric = {}
    spaces = []
    linebreaks = []
    extra_casing = []

    for char in unicode.chars:
        record = unicode.table[char]
        if record:
            # extract database properties
            category = record[2]
            bidirectional = record[4]
            properties = record[16]
            flags = 0
            delta = True
            if category in ["Lm", "Lt", "Lu", "Ll", "Lo"]:
                flags |= ALPHA_MASK
            if "Lowercase" in properties:
                flags |= LOWER_MASK
            if 'Line_Break' in properties or bidirectional == "B":
                flags |= LINEBREAK_MASK
                linebreaks.append(char)
            if category == "Zs" or bidirectional in ("WS", "B", "S"):
                flags |= SPACE_MASK
                spaces.append(char)
            if category == "Lt":
                flags |= TITLE_MASK
            if "Uppercase" in properties:
                flags |= UPPER_MASK
            if char == ord(" ") or category[0] not in ("C", "Z"):
                flags |= PRINTABLE_MASK
            if "XID_Start" in properties:
                flags |= XID_START_MASK
            if "XID_Continue" in properties:
                flags |= XID_CONTINUE_MASK
            if "Cased" in properties:
                flags |= CASED_MASK
            if "Case_Ignorable" in properties:
                flags |= CASE_IGNORABLE_MASK
            sc = unicode.special_casing.get(char)
            if sc is None:
                if record[12]:
                    upper = int(record[12], 16)
                else:
                    upper = char
                if record[13]:
                    lower = int(record[13], 16)
                else:
                    lower = char
                if record[14]:
                    title = int(record[14], 16)
                else:
                    title = upper
                if upper == lower == title:
                    upper = lower = title = 0
            else:
                # This happens when some character maps to more than one
                # character in uppercase, lowercase, or titlecase. The extra
                # characters are stored in a different array.
                flags |= EXTENDED_CASE_MASK
                lower = len(extra_casing) | (len(sc[0]) << 24)
                extra_casing.extend(sc[0])
                upper = len(extra_casing) | (len(sc[2]) << 24)
                extra_casing.extend(sc[2])
                # Title is probably equal to upper.
                if sc[1] == sc[2]:
                    title = upper
                else:
                    title = len(extra_casing) | (len(sc[1]) << 24)
                    extra_casing.extend(sc[1])
            # decimal digit, integer digit
            decimal = 0
            if record[6]:
                flags |= DECIMAL_MASK
                decimal = int(record[6])
            digit = 0
            if record[7]:
                flags |= DIGIT_MASK
                digit = int(record[7])
            if record[8]:
                flags |= NUMERIC_MASK
                numeric.setdefault(record[8], []).append(char)
            item = (
                upper, lower, title, decimal, digit, flags
                )
            # add entry to index and item tables
            i = cache.get(item)
            if i is None:
                cache[item] = i = len(table)
                table.append(item)
            index[char] = i

    print(len(table), "unique character type entries")
    print(sum(map(len, numeric.values())), "numeric code points")
    print(len(spaces), "whitespace code points")
    print(len(linebreaks), "linebreak code points")
    print(len(extra_casing), "extended case array")

    print("--- Writing", FILE, "...")

    fp = open(FILE, "w")
    print("/* this file was generated by %s %s */" % (SCRIPT, VERSION), file=fp)
    print(file=fp)
    print("/* a list of unique character type descriptors */", file=fp)
    print("const _PyUnicode_TypeRecord _PyUnicode_TypeRecords[] = {", file=fp)
    for item in table:
        print("    {%d, %d, %d, %d, %d, %d}," % item, file=fp)
    print("};", file=fp)
    print(file=fp)

    print("/* extended case mappings */", file=fp)
    print(file=fp)
    print("const Py_UCS4 _PyUnicode_ExtendedCase[] = {", file=fp)
    for c in extra_casing:
        print("    %d," % c, file=fp)
    print("};", file=fp)
    print(file=fp)

    # split type index table
    index1, index2, shift = splitbins(index, trace)

    print("/* type indexes */", file=fp)
    print("#define SHIFT", shift, file=fp)
    Array("index1", index1).dump(fp, trace)
    Array("index2", index2).dump(fp, trace)

    # Generate code for _PyUnicode_ToNumeric()
    numeric_items = sorted(numeric.items())
    print('/* Returns the numeric value as double for Unicode characters', file=fp)
    print(' * having this property, -1.0 otherwise.', file=fp)
    print(' */', file=fp)
    print('double _PyUnicode_ToNumeric(Py_UCS4 ch)', file=fp)
    print('{', file=fp)
    print('    switch (ch) {', file=fp)
    for value, codepoints in numeric_items:
        # Turn text into float literals
        parts = value.split('/')
        parts = [repr(float(part)) for part in parts]
        value = '/'.join(parts)

        codepoints.sort()
        for codepoint in codepoints:
            print('    case 0x%04X:' % (codepoint,), file=fp)
        print('        return (double) %s;' % (value,), file=fp)
    print('    }', file=fp)
    print('    return -1.0;', file=fp)
    print('}', file=fp)
    print(file=fp)

    # Generate code for _PyUnicode_IsWhitespace()
    print("/* Returns 1 for Unicode characters having the bidirectional", file=fp)
    print(" * type 'WS', 'B' or 'S' or the category 'Zs', 0 otherwise.", file=fp)
    print(" */", file=fp)
    print('int _PyUnicode_IsWhitespace(register const Py_UCS4 ch)', file=fp)
    print('{', file=fp)
    print('    switch (ch) {', file=fp)

    for codepoint in sorted(spaces):
        print('    case 0x%04X:' % (codepoint,), file=fp)
    print('        return 1;', file=fp)

    print('    }', file=fp)
    print('    return 0;', file=fp)
    print('}', file=fp)
    print(file=fp)

    # Generate code for _PyUnicode_IsLinebreak()
    print("/* Returns 1 for Unicode characters having the line break", file=fp)
    print(" * property 'BK', 'CR', 'LF' or 'NL' or having bidirectional", file=fp)
    print(" * type 'B', 0 otherwise.", file=fp)
    print(" */", file=fp)
    print('int _PyUnicode_IsLinebreak(register const Py_UCS4 ch)', file=fp)
    print('{', file=fp)
    print('    switch (ch) {', file=fp)
    for codepoint in sorted(linebreaks):
        print('    case 0x%04X:' % (codepoint,), file=fp)
    print('        return 1;', file=fp)

    print('    }', file=fp)
    print('    return 0;', file=fp)
    print('}', file=fp)
    print(file=fp)

    fp.close()

# --------------------------------------------------------------------
# unicode name database

def makeunicodename(unicode, trace):

    FILE = "Modules/unicodename_db.h"

    print("--- Preparing", FILE, "...")

    # collect names
    names = [None] * len(unicode.chars)

    for char in unicode.chars:
        record = unicode.table[char]
        if record:
            name = record[1].strip()
            if name and name[0] != "<":
                names[char] = name + chr(0)

    print(len(list(n for n in names if n is not None)), "distinct names")

    # collect unique words from names (note that we distinguish between
    # words inside a sentence and words ending a sentence; the latter
    # include the trailing null byte)

    words = {}
    n = b = 0
    for char in unicode.chars:
        name = names[char]
        if name:
            w = name.split()
            b = b + len(name)
            n = n + len(w)
            for word in w:
                l = words.get(word)
                if l:
                    l.append(None)
                else:
                    words[word] = [len(words)]

    print(n, "words in text;", b, "bytes")

    wordlist = list(words.items())

    # sort on falling frequency, then by name
    def word_key(a):
        aword, alist = a
        return -len(alist), aword
    wordlist.sort(key=word_key)

    # figure out how many phrasebook escapes we need
    escapes = 0
    while escapes * 256 < len(wordlist):
        escapes = escapes + 1
    print(escapes, "escapes")

    short = 256 - escapes

    assert short > 0

    print(short, "short indexes in lexicon")

    # statistics
    n = 0
    for i in range(short):
        n = n + len(wordlist[i][1])
    print(n, "short indexes in phrasebook")

    # pick the most commonly used words, and sort the rest on falling
    # length (to maximize overlap)
    # NOTE: the key below actually orders the tail by the word text in
    # descending order, not by length

    wordlist, wordtail = wordlist[:short], wordlist[short:]
    wordtail.sort(key=lambda a: a[0], reverse=True)
    wordlist.extend(wordtail)

    # generate lexicon from words

    lexicon_offset = [0]
    lexicon = ""
    words = {}

    # build a lexicon string
    offset = 0
    for w, x in wordlist:
        # encoding: bit 7 indicates last character in word (chr(128)
        # indicates the last character in an entire string)
        ww = w[:-1] + chr(ord(w[-1])+128)
        # reuse string tails, when possible
        o = lexicon.find(ww)
        if o < 0:
            o = offset
            lexicon = lexicon + ww
            offset = offset + len(w)
        words[w] = len(lexicon_offset)
        lexicon_offset.append(o)

    lexicon = list(map(ord, lexicon))

    # generate phrasebook from names and lexicon
    phrasebook = [0]
    phrasebook_offset = [0] * len(unicode.chars)
    for char in unicode.chars:
        name = names[char]
        if name:
            w = name.split()
            phrasebook_offset[char] = len(phrasebook)
            for word in w:
                i = words[word]
                if i < short:
                    phrasebook.append(i)
                else:
                    # store as two bytes
                    phrasebook.append((i>>8) + short)
                    phrasebook.append(i&255)

    assert getsize(phrasebook) == 1

    #
    # unicode name hash table

    # extract names
    data = []
    for char in unicode.chars:
        record = unicode.table[char]
        if record:
            name = record[1].strip()
            if name and name[0] != "<":
                data.append((name, char))

    # the magic number 47 was chosen to minimize the number of
    # collisions on the current data set.  if you like, change it
    # and see what happens...

    codehash = Hash("code", data, 47)

    print("--- Writing", FILE, "...")

    fp = open(FILE, "w")
    print("/* this file was generated by %s %s */" % (SCRIPT, VERSION), file=fp)
    print(file=fp)
    print("#define NAME_MAXLEN", 256, file=fp)
    print(file=fp)
    print("/* lexicon */", file=fp)
    Array("lexicon", lexicon).dump(fp, trace)
    Array("lexicon_offset", lexicon_offset).dump(fp, trace)

    # split phrasebook offset table
    offset1, offset2, shift = splitbins(phrasebook_offset, trace)

    print("/* code->name phrasebook */", file=fp)
    print("#define phrasebook_shift", shift, file=fp)
    print("#define phrasebook_short", short, file=fp)

    Array("phrasebook", phrasebook).dump(fp, trace)
    Array("phrasebook_offset1", offset1).dump(fp, trace)
    Array("phrasebook_offset2", offset2).dump(fp, trace)

    print("/* name->code dictionary */", file=fp)
    codehash.dump(fp, trace)

    print(file=fp)
    print('static const unsigned int aliases_start = %#x;' %
          NAME_ALIASES_START, file=fp)
    print('static const unsigned int aliases_end = %#x;' %
          (NAME_ALIASES_START + len(unicode.aliases)), file=fp)

    print('static const unsigned int name_aliases[] = {', file=fp)
    for name, codepoint in unicode.aliases:
        print('    0x%04X,' % codepoint, file=fp)
    print('};', file=fp)

    # In Unicode 6.0.0, the sequences contain at most 4 BMP chars,
    # so we are using Py_UCS2 seq[4].  This needs to be updated if longer
    # sequences or sequences with non-BMP chars are added.
    # unicodedata_lookup should be adapted too.
    print(dedent("""
        typedef struct NamedSequence {
            int seqlen;
            Py_UCS2 seq[4];
        } named_sequence;
        """), file=fp)

    print('static const unsigned int named_sequences_start = %#x;' %
          NAMED_SEQUENCES_START, file=fp)
    print('static const unsigned int named_sequences_end = %#x;' %
          (NAMED_SEQUENCES_START + len(unicode.named_sequences)), file=fp)

    print('static const named_sequence named_sequences[] = {', file=fp)
    for name, sequence in unicode.named_sequences:
        seq_str = ', '.join('0x%04X' % cp for cp in sequence)
        print('    {%d, {%s}},' % (len(sequence), seq_str), file=fp)
    print('};', file=fp)

    fp.close()

def merge_old_version(version, new, old):
    # Changes to exclusion file not implemented yet
    if old.exclusions != new.exclusions:
        raise NotImplementedError("exclusions differ")

    # In these change records, 0xFF means "no change"
    bidir_changes = [0xFF]*0x110000
    category_changes = [0xFF]*0x110000
    decimal_changes = [0xFF]*0x110000
    mirrored_changes = [0xFF]*0x110000
    # In numeric data, 0 means "no change",
    # -1 means "did not have a numeric value"
    numeric_changes = [0] * 0x110000
    # normalization_changes is a list of key-value pairs
    normalization_changes = []
    for i in range(0x110000):
        if new.table[i] is None:
            # Characters unassigned in the new version ought to
            # be unassigned in the old one
            assert old.table[i] is None
            continue
        # check characters unassigned in the old version
        if old.table[i] is None:
            # category 0 is "unassigned"
            category_changes[i] = 0
            continue
        # check characters that differ
        if old.table[i] != new.table[i]:
            for k in range(len(old.table[i])):
                if old.table[i][k] != new.table[i][k]:
                    value = old.table[i][k]
                    if k == 1 and i in PUA_15:
                        # the name is not set in the old.table, but in the
                        # new.table we are using it for aliases and named seq
                        assert value == ''
                    elif k == 2:
                        #print("CATEGORY", hex(i), old.table[i][k], new.table[i][k])
                        category_changes[i] = CATEGORY_NAMES.index(value)
                    elif k == 4:
                        #print("BIDIR", hex(i), old.table[i][k], new.table[i][k])
                        bidir_changes[i] = BIDIRECTIONAL_NAMES.index(value)
                    elif k == 5:
                        #print("DECOMP", hex(i), old.table[i][k], new.table[i][k])
                        # We assume that all normalization changes are in 1:1 mappings
                        assert " " not in value
                        normalization_changes.append((i, value))
                    elif k == 6:
                        #print("DECIMAL", hex(i), old.table[i][k], new.table[i][k])
                        # we only support changes where the old value is a single digit
                        assert value in "0123456789"
                        decimal_changes[i] = int(value)
                    elif k == 8:
                        #print("NUMERIC", hex(i), repr(old.table[i][k]), new.table[i][k])
                        # Since 0 encodes "no change", the old value is better not 0
                        if not value:
                            numeric_changes[i] = -1
                        else:
                            numeric_changes[i] = float(value)
                            assert numeric_changes[i] not in (0, -1)
                    elif k == 9:
                        if value == 'Y':
                            mirrored_changes[i] = '1'
                        else:
                            mirrored_changes[i] = '0'
                    elif k == 11:
                        # change to ISO comment, ignore
                        pass
                    elif k == 12:
                        # change to simple uppercase mapping; ignore
                        pass
                    elif k == 13:
                        # change to simple lowercase mapping; ignore
                        pass
                    elif k == 14:
                        # change to simple titlecase mapping; ignore
                        pass
                    elif k == 16:
                        # derived property changes; not yet
                        pass
                    elif k == 17:
                        # normalization quickchecks are not performed
                        # for older versions
                        pass
                    else:
                        class Difference(Exception):
                            pass
                        raise Difference(hex(i), k, old.table[i], new.table[i])
    new.changed.append((version, list(zip(bidir_changes, category_changes,
                                          decimal_changes, mirrored_changes,
                                          numeric_changes)),
                        normalization_changes))

def open_data(template, version):
    local = template % ('-'+version,)
    if not os.path.exists(local):
        import urllib.request
        if version == '3.2.0':
            # irregular url structure
            url = 'http://www.unicode.org/Public/3.2-Update/' + local
        else:
            url = ('http://www.unicode.org/Public/%s/ucd/'+template) % (version, '')
        urllib.request.urlretrieve(url, filename=local)
    if local.endswith('.txt'):
        return open(local, encoding='utf-8')
    else:
        # Unihan.zip
        return open(local, 'rb')

# --------------------------------------------------------------------
# the following support code is taken from the unidb utilities
# Copyright (c) 1999-2000 by Secret Labs AB

# load a unicode-data file from disk

class UnicodeData:
    # Record structure:
    # [ID, name, category, combining, bidi, decomp,  (6)
    #  decimal, digit, numeric, bidi-mirrored, Unicode-1-name, (11)
    #  ISO-comment, uppercase, lowercase, titlecase, ea-width, (16)
    #  derived-props] (17)

    def __init__(self, version,
                 linebreakprops=False,
                 expand=1,
                 cjk_check=True):
        self.changed = []
        table = [None] * 0x110000
        with open_data(UNICODE_DATA, version) as file:
            while 1:
                s = file.readline()
                if not s:
                    break
                s = s.strip().split(";")
                char = int(s[0], 16)
                table[char] = s

        cjk_ranges_found = []

        # expand first-last ranges
        if expand:
            field = None
            for i in range(0, 0x110000):
                s = table[i]
                if s:
                    if s[1][-6:] == "First>":
                        s[1] = ""
                        field = s
                    elif s[1][-5:] == "Last>":
                        if s[1].startswith("<CJK Ideograph"):
                            cjk_ranges_found.append((field[0],
                                                     s[0]))
                        s[1] = ""
                        field = None
                elif field:
                    f2 = field[:]
                    f2[0] = "%X" % i
                    table[i] = f2
            if cjk_check and cjk_ranges != cjk_ranges_found:
                raise ValueError("CJK ranges deviate: have %r" % cjk_ranges_found)

        # public attributes
        self.filename = UNICODE_DATA % ''
        self.table = table
        self.chars = list(range(0x110000))  # unicode 3.2

        # check for name aliases and named sequences, see #12753
        # aliases and named sequences are not in 3.2.0
        if version != '3.2.0':
            self.aliases = []
            # store aliases in the Private Use Area 15, in range U+F0000..U+F00FF,
            # in order to take advantage of the compression and lookup
            # algorithms used for the other characters
            pua_index = NAME_ALIASES_START
            with open_data(NAME_ALIASES, version) as file:
                for s in file:
                    s = s.strip()
                    if not s or s.startswith('#'):
                        continue
                    char, name = s.split(';')
                    char = int(char, 16)
                    self.aliases.append((name, char))
                    # also store the name in the PUA 15
                    self.table[pua_index][1] = name
                    pua_index += 1
            assert pua_index - NAME_ALIASES_START == len(self.aliases)

            self.named_sequences = []
            # store named sequences in the PUA 15, in range U+F0100..,
            # in order to take advantage of the compression and lookup
            # algorithms used for the other characters.

            pua_index = NAMED_SEQUENCES_START
            with open_data(NAMED_SEQUENCES, version) as file:
                for s in file:
                    s = s.strip()
                    if not s or s.startswith('#'):
                        continue
                    name, chars = s.split(';')
                    chars = tuple(int(char, 16) for char in chars.split())
                    # check that the structure defined in makeunicodename is OK
                    assert 2 <= len(chars) <= 4, "change the Py_UCS2 array size"
                    assert all(c <= 0xFFFF for c in chars), ("use Py_UCS4 in "
                        "the NamedSequence struct and in unicodedata_lookup")
                    self.named_sequences.append((name, chars))
                    # also store these in the PUA 15
                    self.table[pua_index][1] = name
                    pua_index += 1
            assert pua_index - NAMED_SEQUENCES_START == len(self.named_sequences)

        self.exclusions = {}
        with open_data(COMPOSITION_EXCLUSIONS, version) as file:
            for s in file:
                s = s.strip()
                if not s:
                    continue
                if s[0] == '#':
                    continue
                char = int(s.split()[0],16)
                self.exclusions[char] = 1

        widths = [None] * 0x110000
        with open_data(EASTASIAN_WIDTH, version) as file:
            for s in file:
                s = s.strip()
                if not s:
                    continue
                if s[0] == '#':
                    continue
                s = s.split()[0].split(';')
                if '..' in s[0]:
                    first, last = [int(c, 16) for c in s[0].split('..')]
                    chars = list(range(first, last+1))
                else:
                    chars = [int(s[0], 16)]
                for char in chars:
                    widths[char] = s[1]

        for i in range(0, 0x110000):
            if table[i] is not None:
                table[i].append(widths[i])

        for i in range(0, 0x110000):
            if table[i] is not None:
                table[i].append(set())

        with open_data(DERIVED_CORE_PROPERTIES, version) as file:
            for s in file:
                s = s.split('#', 1)[0].strip()
                if not s:
                    continue

                r, p = s.split(";")
                r = r.strip()
                p = p.strip()
                if ".." in r:
                    first, last = [int(c, 16) for c in r.split('..')]
                    chars = list(range(first, last+1))
                else:
                    chars = [int(r, 16)]
                for char in chars:
                    if table[char]:
                        # Some properties (e.g. Default_Ignorable_Code_Point)
                        # apply to unassigned code points; ignore them
                        table[char][-1].add(p)

        with open_data(LINE_BREAK, version) as file:
            for s in file:
                s = s.partition('#')[0]
                s = [i.strip() for i in s.split(';')]
                if len(s) < 2 or s[1] not in MANDATORY_LINE_BREAKS:
                    continue
                if '..' not in s[0]:
                    first = last = int(s[0], 16)
                else:
                    first, last = [int(c, 16) for c in s[0].split('..')]
                for char in range(first, last+1):
                    table[char][-1].add('Line_Break')

        # We only want the quickcheck properties
        # Format: NF?_QC; Y(es)/N(o)/M(aybe)
        # Yes is the default, hence only N and M occur
        # In 3.2.0, the format was different (NF?_NO)
        # The parsing will incorrectly determine these as
        # "yes", however, unicodedata.c will not perform quickchecks
        # for older versions, and no delta records will be created.
        quickchecks = [0] * 0x110000
        qc_order = 'NFD_QC NFKD_QC NFC_QC NFKC_QC'.split()
        with open_data(DERIVEDNORMALIZATION_PROPS, version) as file:
            for s in file:
                if '#' in s:
                    s = s[:s.index('#')]
                s = [i.strip() for i in s.split(';')]
                if len(s) < 2 or s[1] not in qc_order:
                    continue
                quickcheck = 'MN'.index(s[2]) + 1  # Maybe or No
                quickcheck_shift = qc_order.index(s[1])*2
                quickcheck <<= quickcheck_shift
                if '..' not in s[0]:
                    first = last = int(s[0], 16)
                else:
                    first, last = [int(c, 16) for c in s[0].split('..')]
                for char in range(first, last+1):
                    assert not (quickchecks[char]>>quickcheck_shift)&3
                    quickchecks[char] |= quickcheck
        for i in range(0, 0x110000):
            if table[i] is not None:
                table[i].append(quickchecks[i])

        with open_data(UNIHAN, version) as file:
            zip = zipfile.ZipFile(file)
            if version == '3.2.0':
                data = zip.open('Unihan-3.2.0.txt').read()
            else:
                data = zip.open('Unihan_NumericValues.txt').read()
        for line in data.decode("utf-8").splitlines():
            if not line.startswith('U+'):
                continue
            code, tag, value = line.split(None, 3)[:3]
            if tag not in ('kAccountingNumeric', 'kPrimaryNumeric',
                           'kOtherNumeric'):
                continue
            value = value.strip().replace(',', '')
            i = int(code[2:], 16)
            # Patch the numeric field
            if table[i] is not None:
                table[i][8] = value
        sc = self.special_casing = {}
        with open_data(SPECIAL_CASING, version) as file:
            for s in file:
                s = s[:-1].split('#', 1)[0]
                if not s:
                    continue
                data = s.split("; ")
                if data[4]:
                    # We ignore all conditionals (since they depend on
                    # languages) except for one, which is hardcoded. See
                    # handle_capital_sigma in unicodeobject.c.
                    continue
                c = int(data[0], 16)
                lower = [int(char, 16) for char in data[1].split()]
                title = [int(char, 16) for char in data[2].split()]
                upper = [int(char, 16) for char in data[3].split()]
                sc[c] = (lower, title, upper)

    def uselatin1(self):
        # restrict character range to ISO Latin 1
        self.chars = list(range(256))

# hash table tools

# this is a straightforward reimplementation of Python's built-in
# dictionary type, using a static data structure and a custom string
# hash algorithm.

def myhash(s, magic):
    h = 0
    for c in map(ord, s.upper()):
        h = (h * magic) + c
        ix = h & 0xff000000
        if ix:
            h = (h ^ ((ix>>24) & 0xff)) & 0x00ffffff
    return h

SIZES = [
    (4,3), (8,3), (16,3), (32,5), (64,3), (128,3), (256,29), (512,17),
    (1024,9), (2048,5), (4096,83), (8192,27), (16384,43), (32768,3),
    (65536,45), (131072,9), (262144,39), (524288,39), (1048576,9),
    (2097152,5), (4194304,3), (8388608,33), (16777216,27)
]

class Hash:
    def __init__(self, name, data, magic):
        # turn a (key, value) list into a static hash table structure

        # determine table size
        for size, poly in SIZES:
            if size > len(data):
                poly = size + poly
                break
        else:
            raise AssertionError("ran out of polynomials")

        print(size, "slots in hash table")

        table = [None] * size

        mask = size-1

        n = 0

        hash = myhash

        # initialize hash table
        for key, value in data:
            h = hash(key, magic)
            i = (~h) & mask
            v = table[i]
            if v is None:
                table[i] = value
                continue
            incr = (h ^ (h >> 3)) & mask
            if not incr:
                incr = mask
            while 1:
                n = n + 1
                i = (i + incr) & mask
                v = table[i]
                if v is None:
                    table[i] = value
                    break
                incr = incr << 1
                if incr > mask:
                    incr = incr ^ poly

        print(n, "collisions")
        self.collisions = n

        for i in range(len(table)):
            if table[i] is None:
                table[i] = 0

        self.data = Array(name + "_hash", table)
        self.magic = magic
        self.name = name
        self.size = size
        self.poly = poly

    def dump(self, file, trace):
        # write data to file, as a C array
        self.data.dump(file, trace)
        file.write("#define %s_magic %d\n" % (self.name, self.magic))
        file.write("#define %s_size %d\n" % (self.name, self.size))
        file.write("#define %s_poly %d\n" % (self.name, self.poly))

# stuff to deal with arrays of unsigned integers

class Array:

    def __init__(self, name, data):
        self.name = name
        self.data = data

    def dump(self, file, trace=0):
        # write data to file, as a C array
        size = getsize(self.data)
        if trace:
            print(self.name+":", size*len(self.data), "bytes", file=sys.stderr)
        file.write("static ")
        if size == 1:
            file.write("unsigned char")
        elif size == 2:
            file.write("unsigned short")
        else:
            file.write("unsigned int")
        file.write(" " + self.name + "[] = {\n")
        if self.data:
            s = "    "
            for item in self.data:
                i = str(item) + ", "
                if len(s) + len(i) > 78:
                    file.write(s + "\n")
                    s = "    " + i
                else:
                    s = s + i
            if s.strip():
                file.write(s + "\n")
        file.write("};\n\n")

def getsize(data):
    # return smallest possible integer size for the given array
    maxdata = max(data)
    if maxdata < 256:
        return 1
    elif maxdata < 65536:
        return 2
    else:
        return 4

def splitbins(t, trace=0):
    """t, trace=0 -> (t1, t2, shift).  Split a table to save space.

    t is a sequence of ints.  This function can be useful to save space if
    many of the ints are the same.  t1 and t2 are lists of ints, and shift
    is an int, chosen to minimize the combined size of t1 and t2 (in C
    code), and where for each i in range(len(t)),
        t[i] == t2[(t1[i >> shift] << shift) + (i & mask)]
    where mask is a bitmask isolating the last "shift" bits.

    If optional arg trace is non-zero (default zero), progress info
    is printed to sys.stderr.  The higher the value, the more info
    you'll get.
    """

    if trace:
        def dump(t1, t2, shift, bytes):
            print("%d+%d bins at shift %d; %d bytes" % (
                len(t1), len(t2), shift, bytes), file=sys.stderr)
        print("Size of original table:", len(t)*getsize(t),
              "bytes", file=sys.stderr)
    n = len(t)-1    # last valid index
    maxshift = 0    # the most we can shift n and still have something left
    if n > 0:
        while n >> 1:
            n >>= 1
            maxshift += 1
    del n
    bytes = sys.maxsize  # smallest total size so far
    t = tuple(t)    # so slices can be dict keys
    for shift in range(maxshift + 1):
        t1 = []
        t2 = []
        size = 2**shift
        bincache = {}
        for i in range(0, len(t), size):
            bin = t[i:i+size]
            index = bincache.get(bin)
            if index is None:
                index = len(t2)
                bincache[bin] = index
                t2.extend(bin)
            t1.append(index >> shift)
        # determine memory size
        b = len(t1)*getsize(t1) + len(t2)*getsize(t2)
        if trace > 1:
            dump(t1, t2, shift, b)
        if b < bytes:
            best = t1, t2, shift
            bytes = b
    t1, t2, shift = best
    if trace:
        print("Best:", end=' ', file=sys.stderr)
        dump(t1, t2, shift, bytes)
    if __debug__:
        # exhaustively verify that the decomposition is correct
        mask = ~((~0) << shift)  # i.e., low-bit mask of shift bits
        for i in range(len(t)):
            assert t[i] == t2[(t1[i >> shift] << shift) + (i & mask)]
    return best

if __name__ == "__main__":
    maketables(1)