"""A simple non-validating parser for C99.

The functions and regex patterns here are not entirely suitable for
validating C syntax. Please rely on a proper compiler for that.
Instead, our goal here is merely to match and extract information from
valid C code.

Furthermore, the grammar rules for the C syntax (particularly as
described in the K&R book) actually describe a superset of which the
full C language is a proper subset. Here are some of the extra
conditions that must be applied when parsing C code:

* ...

(see: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf)

We have taken advantage of the elements of the C grammar that are used
only in a few limited contexts, mostly as delimiters. They allow us to
focus the regex patterns confidently. Here are the relevant tokens and
the grammar rules in which they are used:

separators:
* ";"
   + (decl) struct/union: at end of each member decl
   + (decl) declaration: at end of each (non-compound) decl
   + (stmt) expr stmt: at end of each stmt
   + (stmt) for: between exprs in "header"
   + (stmt) goto: at end
   + (stmt) continue: at end
   + (stmt) break: at end
   + (stmt) return: at end
* ","
   + (decl) struct/union: between member declators
   + (decl) param-list: between params
   + (decl) enum: between enumerators
   + (decl) initializer (compound): between initializers
   + (expr) postfix: between func call args
   + (expr) expression: between "assignment" exprs
* ":"
   + (decl) struct/union: in member declators
   + (stmt) label: between label and stmt
   + (stmt) case: between expression and stmt
   + (stmt) default: between "default" and stmt
* "="
   + (decl) declaration: between decl and initializer
   + (decl) enumerator: between identifier and "initializer"
   + (expr) assignment: between "var" and expr

wrappers:
* "(...)"
   + (decl) declarator (func ptr): to wrap ptr/name
   + (decl) declarator (func ptr): around params
   + (decl) declarator: around sub-declarator (for readability)
   + (expr) postfix (func call): around args
   + (expr) primary: around sub-expr
   + (stmt) if: around condition
   + (stmt) switch: around source expr
   + (stmt) while: around condition
   + (stmt) do-while: around condition
   + (stmt) for: around "header"
* "{...}"
   + (decl) enum: around enumerators
   + (decl) func: around body
   + (stmt) compound: around stmts
* "[...]"
   + (decl) declarator: for arrays
   + (expr) postfix: array access

other:
* "*"
   + (decl) declarator: for pointer types
   + (expr) unary: for pointer deref
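
As an illustration of why the separators and wrappers listed above are
useful anchors, here is a minimal sketch (not how this module works; it
uses regex patterns instead) that splits text on a separator only where
no wrapper is open. The name split_top_level() is hypothetical, and the
sketch assumes the text has no comments, string literals, or character
literals containing delimiters.

```python
# Hypothetical helper for illustration only; not part of this module.
OPENERS = {'(': ')', '{': '}', '[': ']'}
CLOSERS = set(OPENERS.values())


def split_top_level(text, sep=','):
    # Split "text" on "sep", ignoring any separator that sits inside
    # a (...), {...}, or [...] wrapper.
    parts = []
    depth = 0
    start = 0
    for pos, char in enumerate(text):
        if char in OPENERS:
            depth += 1
        elif char in CLOSERS:
            depth -= 1
        elif char == sep and depth == 0:
            parts.append(text[start:pos].strip())
            start = pos + 1
    parts.append(text[start:].strip())
    return parts


params = split_top_level('int a, void (*cb)(int, char), int b[2]')
# params == ['int a', 'void (*cb)(int, char)', 'int b[2]']
```

The comma inside the function pointer's parameter list is skipped
because it is nested one level deep, which is the property the token
tables above rely on.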

To simplify the regular expressions used here, we've taken some
shortcuts and made certain assumptions about the code we are parsing.
Some of these allow us to skip context-sensitive matching (e.g. braces)
or otherwise still match arbitrary C code unambiguously. In a few
corner cases the patterns are ambiguous relative to arbitrary C code,
but they are still unambiguous in the specific code we are parsing.

Here are the cases where we've taken shortcuts or made assumptions:

* there is no overlap syntactically between the local context (func
  bodies) and the global context (other than variable decls), so we
  do not need to worry about ambiguity due to the overlap:
   + the global context has no expressions or statements
   + the local context has no function definitions or type decls
* no "inline" type declarations (struct, union, enum) in function
  parameters ~(including function pointers)~
* no "inline" type decls in function return types
* no superfluous parentheses in declarators
* var decls in for loops are always "simple" (e.g. no inline types)
* only inline struct/union/enum decls may be anonymous (without a name)
* no function pointers in function pointer parameters
* for loop "headers" do not have curly braces (e.g. compound init)
* syntactically, variable decls do not overlap with stmts/exprs, except
  in the following case:
    spam (*eggs) (...)
  This could be either a function pointer variable named "eggs"
  or a call to a function named "spam", which returns a function
  pointer that gets called. The only differentiator is the
  syntax used in the "..." part. It will be comma-separated
  parameters for the former and comma-separated expressions for
  the latter. Thus, if we expect such decls or calls then we must
  parse the decl params.
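
The "spam (*eggs) (...)" case above can be sketched as follows. This is
a rough heuristic under stated assumptions, not the approach taken by
this module: classify_args() and TYPE_WORDS are hypothetical names, only
C keywords are recognized as the start of a parameter (so a typedef name
such as "size_t" would be misread as an expression), and an empty "..."
is arbitrarily treated as a call.

```python
# Hypothetical names throughout; for illustration only.
TYPE_WORDS = frozenset({
    'void', 'char', 'short', 'int', 'long', 'float', 'double',
    'signed', 'unsigned', 'const', 'volatile',
    'struct', 'union', 'enum',
})


def classify_args(args):
    # Return 'decl' if every comma-separated item starts with a type
    # keyword (i.e. looks like a parameter), otherwise 'call'.
    for item in args.split(','):
        # Pad "*" so that "char*name" still yields "char" first.
        words = item.replace('*', ' * ').split()
        if not words or words[0] not in TYPE_WORDS:
            return 'call'
    return 'decl'


kind_of_decl = classify_args('int, const char*name')
kind_of_call = classify_args('1, x + 2')
# kind_of_decl == 'decl' and kind_of_call == 'call'
```

Splitting on every comma is good enough here only because of the
assumption that function pointer parameters contain no nested function
pointers.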
"""

"""
TODO:
* extract CPython-specific code
* drop include injection (or only add when needed)
* track position instead of slicing "text"
* Parser class instead of the _iter_source() mess
* alt impl using a state machine (& tokenizer or split on delimiters)
"""

from ..info import ParsedItem
from ._info import SourceInfo


def parse(srclines):
    if isinstance(srclines, str):  # a filename
        raise NotImplementedError

    anon_name = anonymous_names()
    for result in _parse(srclines, anon_name):
        yield ParsedItem.from_raw(result)


# XXX Later: Add a separate function to deal with preprocessor directives
# parsed out of raw source.


def anonymous_names():
    # Return a callable that produces "anon-1", "anon-2", ... on
    # successive calls (with the default prefix).
    counter = 1
    def anon_name(prefix='anon-'):
        nonlocal counter
        name = f'{prefix}{counter}'
        counter += 1
        return name
    return anon_name


#############################
# internal impl

import logging


_logger = logging.getLogger(__name__)


def _parse(srclines, anon_name):
    from ._global import parse_globals

    source = _iter_source(srclines)
    #source = _iter_source(srclines, showtext=True)
    for result in parse_globals(source, anon_name):
        # XXX Handle blocks here instead of in parse_globals().
        yield result


def _iter_source(lines, *, maxtext=20_000, maxlines=700, showtext=False):
    # A falsy or non-positive limit means "no limit".
    maxtext = maxtext if maxtext and maxtext > 0 else None
    maxlines = maxlines if maxlines and maxlines > 0 else None
    filestack = []
    allinfo = {}
    # "lines" should be (fileinfo, data), as produced by the preprocessor code.
    for fileinfo, line in lines:
        if fileinfo.filename in filestack:
            # Back in a file seen earlier: unwind the files above it.
            while fileinfo.filename != filestack[-1]:
                filename = filestack.pop()
                del allinfo[filename]
            filename = fileinfo.filename
            srcinfo = allinfo[filename]
        else:
            # A file we have not seen yet: push it onto the stack.
            filename = fileinfo.filename
            srcinfo = SourceInfo(filename)
            filestack.append(filename)
            allinfo[filename] = srcinfo

        _logger.debug(f'-> {line}')
        srcinfo._add_line(line, fileinfo.lno)
        if srcinfo.too_much(maxtext, maxlines):
            break
        while srcinfo._used():
            yield srcinfo
            if showtext:
                _logger.debug(f'=> {srcinfo.text}')
    else:
        # The input was exhausted without hitting a limit.
        if not filestack:
            srcinfo = SourceInfo('???')
        else:
            filename = filestack[-1]
            srcinfo = allinfo[filename]
            while srcinfo._used():
                yield srcinfo
                if showtext:
                    _logger.debug(f'=> {srcinfo.text}')
        yield srcinfo
        if showtext:
            _logger.debug(f'=> {srcinfo.text}')
        if not srcinfo._ready:
            return
    # At this point either the file ended prematurely
    # or there's "too much" text.
    filename, lno, text = srcinfo.filename, srcinfo._start, srcinfo.text
    if len(text) > 500:
        text = text[:500] + '...'
    raise Exception(f'unmatched text ({filename} starting at line {lno}):\n{text}')
  176. # or there's "too much" text.
  177. filename, lno, text = srcinfo.filename, srcinfo._start, srcinfo.text
  178. if len(text) > 500:
  179. text = text[:500] + '...'
  180. raise Exception(f'unmatched text ({filename} starting at line {lno}):\n{text}')