<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<!-- Created by GNU Texinfo 5.2, http://www.gnu.org/software/texinfo/ -->
<head>
<title>The GNU C Preprocessor Internals: Lexer</title>
<meta name="description" content="The GNU C Preprocessor Internals: Lexer">
<meta name="keywords" content="The GNU C Preprocessor Internals: Lexer">
<meta name="resource-type" content="document">
<meta name="distribution" content="global">
<meta name="Generator" content="makeinfo">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link href="index.html#Top" rel="start" title="Top">
<link href="Concept-Index.html#Concept-Index" rel="index" title="Concept Index">
<link href="index.html#SEC_Contents" rel="contents" title="Table of Contents">
<link href="index.html#Top" rel="up" title="Top">
<link href="Hash-Nodes.html#Hash-Nodes" rel="next" title="Hash Nodes">
<link href="Conventions.html#Conventions" rel="prev" title="Conventions">
<style type="text/css">
<!--
a.summary-letter {text-decoration: none}
blockquote.smallquotation {font-size: smaller}
div.display {margin-left: 3.2em}
div.example {margin-left: 3.2em}
div.indentedblock {margin-left: 3.2em}
div.lisp {margin-left: 3.2em}
div.smalldisplay {margin-left: 3.2em}
div.smallexample {margin-left: 3.2em}
div.smallindentedblock {margin-left: 3.2em; font-size: smaller}
div.smalllisp {margin-left: 3.2em}
kbd {font-style:oblique}
pre.display {font-family: inherit}
pre.format {font-family: inherit}
pre.menu-comment {font-family: serif}
pre.menu-preformatted {font-family: serif}
pre.smalldisplay {font-family: inherit; font-size: smaller}
pre.smallexample {font-size: smaller}
pre.smallformat {font-family: inherit; font-size: smaller}
pre.smalllisp {font-size: smaller}
span.nocodebreak {white-space:nowrap}
span.nolinebreak {white-space:nowrap}
span.roman {font-family:serif; font-weight:normal}
span.sansserif {font-family:sans-serif; font-weight:normal}
ul.no-bullet {list-style: none}
-->
</style>
</head>
<body lang="en" bgcolor="#FFFFFF" text="#000000" link="#0000FF" vlink="#800080" alink="#FF0000">
<a name="Lexer"></a>
<div class="header">
<p>
Next: <a href="Hash-Nodes.html#Hash-Nodes" accesskey="n" rel="next">Hash Nodes</a>, Previous: <a href="Conventions.html#Conventions" accesskey="p" rel="prev">Conventions</a>, Up: <a href="index.html#Top" accesskey="u" rel="up">Top</a> &nbsp; [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Concept-Index.html#Concept-Index" title="Index" rel="index">Index</a>]</p>
</div>
<hr>
<a name="The-Lexer"></a>
<h2 class="unnumbered">The Lexer</h2>
<a name="index-lexer"></a>
<a name="index-newlines"></a>
<a name="index-escaped-newlines"></a>
<a name="Overview"></a>
<h3 class="section">Overview</h3>
<p>The lexer is contained in the file <samp>lex.c</samp>. It is a hand-coded
lexer, and not implemented as a state machine. It can understand C, C++
and Objective-C source code, and has been extended to allow reasonably
successful preprocessing of assembly language. The lexer does not make
an initial pass to strip out trigraphs and escaped newlines, but handles
them as they are encountered in a single pass of the input file. It
returns preprocessing tokens individually, not a line at a time.
</p>
<p>It is mostly transparent to users of the library, since the library&rsquo;s
interface for obtaining the next token, <code>cpp_get_token</code>, takes care
of lexing new tokens, handling directives, and expanding macros as
necessary. However, the lexer does expose some functionality so that
clients of the library can easily spell a given token, such as
<code>cpp_spell_token</code> and <code>cpp_token_len</code>. These functions are
useful when generating diagnostics, and for emitting the preprocessed
output.
</p>
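<p>As an illustration of how a client might use these routines when
emitting preprocessed output, here is a minimal sketch. It assumes the
three-argument form of <code>cpp_spell_token</code> (returning a pointer just
past the spelled token) and an already-initialized <code>cpp_reader</code>;
the exact signatures vary between cpplib versions, so treat it as an
outline rather than the definitive interface.
</p>
<div class="smallexample">
<pre class="smallexample">#include &lt;stdio.h&gt;
#include "cpplib.h"

/* Sketch only: write the preprocessed token stream to stdout.  */
static void
emit_tokens (cpp_reader *pfile)
{
  const cpp_token *token;

  while ((token = cpp_get_token (pfile))-&gt;type != CPP_EOF)
    {
      unsigned char buf[256];

      if (token-&gt;flags &amp; PREV_WHITE)           /* keep the separating space */
        putc (' ', stdout);
      if (cpp_token_len (token) &lt; sizeof buf)  /* spelling fits the buffer */
        {
          unsigned char *end = cpp_spell_token (pfile, token, buf);
          fwrite (buf, 1, end - buf, stdout);
        }
    }
}
</pre></div>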
<a name="Lexing-a-token"></a>
<h3 class="section">Lexing a token</h3>
<p>Lexing of an individual token is handled by <code>_cpp_lex_direct</code> and
its subroutines. In its current form the code is quite complicated,
with read-ahead characters and such-like, since it strives not to step
back in the character stream in preparation for handling non-ASCII file
encodings. The current plan is to convert any such files to UTF-8
before processing them. This complexity is therefore unnecessary and
will be removed, so I&rsquo;ll not discuss it further here.
</p>
<p>The job of <code>_cpp_lex_direct</code> is simply to lex a token. It is not
responsible for issues like directive handling, returning lookahead
tokens directly, multiple-include optimization, or conditional block
skipping. It necessarily has a minor r&ocirc;le to play in memory
management of lexed lines. I discuss these issues in a separate section
(see <a href="#Lexing-a-line">Lexing a line</a>).
</p>
<p>The lexer places the token it lexes into storage pointed to by the
variable <code>cur_token</code>, and then increments it. This variable is
important for correct diagnostic positioning. Unless a specific line
and column are passed to the diagnostic routines, they will examine the
<code>line</code> and <code>col</code> values of the token just before the location
that <code>cur_token</code> points to, and use that location to report the
diagnostic.
</p>
<p>The lexer does not consider whitespace to be a token in its own right.
If whitespace (other than a new line) precedes a token, it sets the
<code>PREV_WHITE</code> bit in the token&rsquo;s flags. Each token has its
<code>line</code> and <code>col</code> variables set to the line and column of the
first character of the token. This line number is the line number in
the translation unit, and can be converted to a source (file, line) pair
using the line map code.
</p>
<p>The first token on a logical, i.e. unescaped, line has the flag
<code>BOL</code> set for beginning-of-line. This flag is intended for
internal use, both to distinguish a &lsquo;<samp>#</samp>&rsquo; that begins a directive
from one that doesn&rsquo;t, and to generate a call-back to clients that want
to be notified about the start of every non-directive line with tokens
on it. Clients cannot reliably determine this for themselves: the first
token might be a macro, and the tokens of a macro expansion do not have
the <code>BOL</code> flag set. The macro expansion may even be empty, and the
next token on the line certainly won&rsquo;t have the <code>BOL</code> flag set.
</p>
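<p>For example, consider a macro whose expansion is empty:
</p>
<div class="smallexample">
<pre class="smallexample">#define EMPTY
EMPTY int x;
</pre></div>
<p>The expansion of <code>EMPTY</code> produces no tokens, so the first token
the client receives for the second line is <code>int</code>; it was not the
first token lexed on that line, and therefore does not carry the
<code>BOL</code> flag.
</p>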
<p>New lines are treated specially; exactly how the lexer handles them is
context-dependent. The C standard mandates that directives are
terminated by the first unescaped newline character, even if it appears
in the middle of a macro expansion. Therefore, if the state variable
<code>in_directive</code> is set, the lexer returns a <code>CPP_EOF</code> token,
which is normally used to indicate end-of-file, to indicate
end-of-directive. In a directive a <code>CPP_EOF</code> token never means
end-of-file. Conveniently, if the caller was <code>collect_args</code>, it
already handles <code>CPP_EOF</code> as if it were end-of-file, and reports an
error about an unterminated macro argument list.
</p>
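<p>For example, in
</p>
<div class="smallexample">
<pre class="smallexample">#define f(x) x
#if f(1
#endif
</pre></div>
<p>the newline after &lsquo;<samp>f(1</samp>&rsquo; terminates the <code>#if</code> directive, so
<code>collect_args</code> sees a <code>CPP_EOF</code> token while still looking for
the closing parenthesis, and reports an unterminated argument list for
<code>f</code>.
</p>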
<p>The C standard also specifies that a new line in the middle of the
arguments to a macro is treated as whitespace. This white space is
important in case the macro argument is stringized. The state variable
<code>parsing_args</code> is nonzero when the preprocessor is collecting the
arguments to a macro call. It is set to 1 when looking for the opening
parenthesis to a function-like macro, and 2 when collecting the actual
arguments up to the closing parenthesis, since these two cases need to
be distinguished sometimes. One such time is here: the lexer sets the
<code>PREV_WHITE</code> flag of a token if it meets a new line when
<code>parsing_args</code> is set to 2. It doesn&rsquo;t set it if it meets a new
line when <code>parsing_args</code> is 1, since then code like
</p>
<div class="smallexample">
<pre class="smallexample">#define foo() bar
foo
baz
</pre></div>
<p>would be output with an erroneous space before &lsquo;<samp>baz</samp>&rsquo;:
</p>
<div class="smallexample">
<pre class="smallexample">foo
 baz
</pre></div>
<p>This is a good example of the subtlety of getting token spacing correct
in the preprocessor; there are plenty of tests in the testsuite for
corner cases like this.
</p>
<p>The lexer is written to treat each of &lsquo;<samp>\r</samp>&rsquo;, &lsquo;<samp>\n</samp>&rsquo;, &lsquo;<samp>\r\n</samp>&rsquo;
and &lsquo;<samp>\n\r</samp>&rsquo; as a single new line indicator. This allows it to
transparently preprocess MS-DOS, Macintosh and Unix files without their
needing to pass through a special filter beforehand.
</p>
<p>We also decided to treat a backslash, either &lsquo;<samp>\</samp>&rsquo; or the trigraph
&lsquo;<samp>??/</samp>&rsquo;, separated from one of the above newline indicators by
non-comment whitespace only, as intending to escape the newline. It
tends to be a typing mistake, and cannot reasonably be mistaken for
anything else in any of the C-family grammars. Since handling it this
way is not strictly conforming to the ISO standard, the library issues a
warning wherever it encounters it.
</p>
<p>Handling newlines like this is made simpler by doing it in one place
only. The function <code>handle_newline</code> takes care of all newline
characters, and <code>skip_escaped_newlines</code> takes care of arbitrarily
long sequences of escaped newlines, deferring to <code>handle_newline</code>
to handle the newlines themselves.
</p>
<p>The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines. Trigraphs are processed before
any interpretation of the meaning of a character is made, and unfortunately
there is a trigraph representation for a backslash, so it is possible for
the trigraph &lsquo;<samp>??/</samp>&rsquo; to introduce an escaped newline.
</p>
<p>Escaped newlines are tedious because theoretically they can occur
anywhere&mdash;between the &lsquo;<samp>+</samp>&rsquo; and &lsquo;<samp>=</samp>&rsquo; of the &lsquo;<samp>+=</samp>&rsquo; token,
within the characters of an identifier, and even between the &lsquo;<samp>*</samp>&rsquo;
and &lsquo;<samp>/</samp>&rsquo; that terminate a comment. Moreover, you cannot be sure
there is just one&mdash;there might be an arbitrarily long sequence of them.
</p>
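<p>For instance, the following still lexes as the single token
&lsquo;<samp>+=</samp>&rsquo; followed by &lsquo;<samp>1</samp>&rsquo;, despite the escaped newline in the
middle of the operator:
</p>
<div class="smallexample">
<pre class="smallexample">a +\
= 1;
</pre></div>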
<p>So, for example, the routine that lexes a number, <code>parse_number</code>,
cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the &lsquo;<samp>\</samp>&rsquo;
introducing an escaped newline, or the &lsquo;<samp>?</samp>&rsquo; introducing the trigraph
sequence that represents the &lsquo;<samp>\</samp>&rsquo; of an escaped newline. If it
encounters a &lsquo;<samp>?</samp>&rsquo; or &lsquo;<samp>\</samp>&rsquo;, it calls <code>skip_escaped_newlines</code>
to skip over any potential escaped newlines before checking whether the
number has been finished.
</p>
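<p>The following toy program (not cpplib&rsquo;s actual code) illustrates the
shape of that loop: collect digits, and whenever a &lsquo;<samp>\</samp>&rsquo; is met,
skip any escaped newlines before deciding whether the number has really
ended. Trigraphs and the full numeric syntax are omitted for brevity.
</p>
<div class="smallexample">
<pre class="smallexample">#include &lt;stdio.h&gt;

/* Toy skip_escaped_newlines: step past backslash-newline pairs and
   return a pointer to the first effective character.  */
static const char *
skip_escaped_newlines (const char *p)
{
  while (p[0] == '\\' &amp;&amp; (p[1] == '\n' || p[1] == '\r'))
    p += 2;
  return p;
}

/* Toy parse_number: copy the digits of a number that may be split
   across escaped newlines into OUT.  */
static void
parse_number (const char *p, char *out)
{
  for (;;)
    {
      if (*p == '\\')
        {
          const char *q = skip_escaped_newlines (p);
          if (q == p)
            break;              /* a lone backslash ends the number */
          p = q;
          continue;
        }
      if (*p &lt; '0' || *p &gt; '9')
        break;                  /* first real non-digit: we are done */
      *out++ = *p++;
    }
  *out = '\0';
}

int
main (void)
{
  char buf[32];
  parse_number ("12\\\n34;", buf);
  printf ("%s\n", buf);         /* prints 1234 */
  return 0;
}
</pre></div>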
<p>Similarly, code in the main body of <code>_cpp_lex_direct</code> cannot simply
check for a &lsquo;<samp>=</samp>&rsquo; after a &lsquo;<samp>+</samp>&rsquo; character to determine whether it
has a &lsquo;<samp>+=</samp>&rsquo; token; it needs to be prepared for an escaped newline of
some sort. Such cases use the function <code>get_effective_char</code>, which
returns the first character after any intervening escaped newlines.
</p>
<p>The lexer needs to keep track of the correct column position, including
counting tabs as specified by the <samp>-ftabstop=</samp> option. This
should be done even within C-style comments; they can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.
</p>
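<p>The arithmetic involved is ordinary tab-stop rounding. A minimal
sketch, assuming 0-based columns and a caller-supplied tab width (this
is the idea, not cpplib&rsquo;s exact bookkeeping):
</p>
<div class="smallexample">
<pre class="smallexample">/* Advance a 0-based column count over character C, honouring the tab
   width given by -ftabstop=.  */
static unsigned int
advance_column (unsigned int col, int c, unsigned int tabstop)
{
  if (c == '\t')
    return (col / tabstop + 1) * tabstop;   /* next multiple of tabstop */
  return col + 1;
}
</pre></div>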
<a name="Invalid-identifiers"></a><p>Some identifiers, such as <code>__VA_ARGS__</code> and poisoned identifiers,
may be invalid and require a diagnostic. However, if they appear in a
macro expansion we don&rsquo;t want to complain with each use of the macro.
It is therefore best to catch them during the lexing stage, in
<code>parse_identifier</code>. In both cases, whether a diagnostic is needed
or not is dependent upon the lexer&rsquo;s state. For example, we don&rsquo;t want
to issue a diagnostic for re-poisoning a poisoned identifier, or for
using <code>__VA_ARGS__</code> in the expansion of a variable-argument macro.
Therefore <code>parse_identifier</code> makes use of state flags to determine
whether a diagnostic is appropriate. Since we change state on a
per-token basis, and don&rsquo;t lex whole lines at a time, this is not a
problem.
</p>
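<p>Poisoned identifiers are created with the <code>#pragma GCC poison</code>
directive; any later use of one outside a directive is diagnosed when
<code>parse_identifier</code> lexes it:
</p>
<div class="smallexample">
<pre class="smallexample">#pragma GCC poison printf
printf ("hello");   /* error: attempt to use poisoned "printf" */
</pre></div>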
<p>Another place where state flags are used to change behavior is whilst
lexing header names. Normally, a &lsquo;<samp>&lt;</samp>&rsquo; would be lexed as a single
token. After a <code>#include</code> directive, though, it should be lexed as
a single token as far as the nearest &lsquo;<samp>&gt;</samp>&rsquo; character. Note that we
don&rsquo;t allow the terminators of header names to be escaped; the first
&lsquo;<samp>&quot;</samp>&rsquo; or &lsquo;<samp>&gt;</samp>&rsquo; terminates the header name.
</p>
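<p>For example:
</p>
<div class="smallexample">
<pre class="smallexample">#include &lt;sys/stat.h&gt;   /* one header-name token: &lt;sys/stat.h&gt; */
int a = b &lt; c;          /* here '&lt;' is an ordinary single token */
</pre></div>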
<p>Interpretation of some character sequences depends upon whether we are
lexing C, C++ or Objective-C, and on the revision of the standard in
force. For example, &lsquo;<samp>::</samp>&rsquo; is a single token in C++, but in C it is
two separate &lsquo;<samp>:</samp>&rsquo; tokens and almost certainly a syntax error. Such
cases are handled by <code>_cpp_lex_direct</code> based upon command-line
flags stored in the <code>cpp_options</code> structure.
</p>
<p>Once a token has been lexed, it leads an independent existence. The
spelling of numbers, identifiers and strings is copied to permanent
storage from the original input buffer, so a token remains valid and
correct even if its source buffer is freed with <code>_cpp_pop_buffer</code>.
The storage holding the spellings of such tokens remains until the
client program calls <code>cpp_destroy</code>, probably at the end of the
translation unit.
</p>
<a name="Lexing-a-line"></a><a name="Lexing-a-line-1"></a>
<h3 class="section">Lexing a line</h3>
<a name="index-token-run"></a>
<p>When the preprocessor was changed to return pointers to tokens, one
feature I wanted was some sort of guarantee regarding how long a
returned pointer remains valid. This is important to the stand-alone
preprocessor, the future direction of the C family front ends, and even
to cpplib itself internally.
</p>
<p>Occasionally the preprocessor wants to be able to peek ahead in the
token stream. For example, after the name of a function-like macro, it
wants to check the next token to see if it is an opening parenthesis.
Another example is that, after reading the first few tokens of a
<code>#pragma</code> directive and not recognizing it as a registered pragma,
it wants to backtrack and allow the user-defined handler for unknown
pragmas to access the full <code>#pragma</code> token stream. The stand-alone
preprocessor wants to be able to test the current token with the
previous one to see if a space needs to be inserted to preserve their
separate tokenization upon re-lexing (paste avoidance), so it needs to
be sure the pointer to the previous token is still valid. The
recursive-descent C++ parser wants to be able to perform tentative
parsing arbitrarily far ahead in the token stream, and then to be able
to jump back to a prior position in that stream if necessary.
</p>
<p>The rule I chose, which is fairly natural, is to arrange that the
preprocessor lex all tokens on a line consecutively into a token buffer,
which I call a <em>token run</em>, and when meeting an unescaped new line
(newlines within comments do not count either), to start lexing back at
the beginning of the run. Note that we do <em>not</em> lex a line of
tokens at once; if we did that <code>parse_identifier</code> would not have
state flags available to warn about invalid identifiers (see <a href="#Invalid-identifiers">Invalid identifiers</a>).
</p>
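<p>Concretely, a token run can be pictured as a block of token slots plus
a link to the next run. The declaration below is a simplified sketch of
that idea, not cpplib&rsquo;s exact definition (which lives in its internal
headers):
</p>
<div class="smallexample">
<pre class="smallexample">/* Simplified sketch of a token run: a block of lexed tokens that can
   be chained to further runs when a long line overflows it.  */
typedef struct tokenrun tokenrun;
struct tokenrun
{
  tokenrun *next;       /* next run in the chain, if any */
  cpp_token *base;      /* first token slot in this run */
  cpp_token *limit;     /* one past the last slot */
};
</pre></div>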
<p>In other words, accessing tokens that appeared earlier in the current
line is valid, but since each logical line overwrites the tokens of the
previous line, tokens from prior lines are unavailable. In particular,
since a directive only occupies a single logical line, this means that
the directive handlers like the <code>#pragma</code> handler can jump around
in the directive&rsquo;s tokens if necessary.
</p>
<p>Two issues remain: what about tokens that arise from macro expansions,
and what happens when we have a long line that overflows the token run?
</p>
<p>Since we promise clients that we preserve the validity of pointers that
we have already returned for tokens that appeared earlier in the line,
we cannot reallocate the run. Instead, on overflow it is expanded by
chaining a new token run on to the end of the existing one.
</p>
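<p>Using the simplified <code>tokenrun</code> sketched above, the overflow step
looks roughly like this (again an illustration, not cpplib&rsquo;s code); note
that the existing run is never moved, so previously returned token
pointers stay valid:
</p>
<div class="smallexample">
<pre class="smallexample">#include &lt;stdlib.h&gt;

/* Return the run following RUN, chaining a freshly allocated run onto
   the end the first time we overflow.  Error handling omitted.  */
static tokenrun *
next_tokenrun (tokenrun *run)
{
  if (run-&gt;next == NULL)
    {
      tokenrun *fresh = (tokenrun *) malloc (sizeof (tokenrun));
      fresh-&gt;base = (cpp_token *) malloc (256 * sizeof (cpp_token));
      fresh-&gt;limit = fresh-&gt;base + 256;
      fresh-&gt;next = NULL;
      run-&gt;next = fresh;
    }
  return run-&gt;next;
}
</pre></div>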
<p>The tokens forming a macro&rsquo;s replacement list are collected by the
<code>#define</code> handler, and placed in storage that is only freed by
<code>cpp_destroy</code>. So if a macro is expanded in the line of tokens,
the pointers to the tokens of its expansion that are returned will always
remain valid. However, macros are a little trickier than that, since
they give rise to three sources of fresh tokens. They are the built-in
macros like <code>__LINE__</code>, and the &lsquo;<samp>#</samp>&rsquo; and &lsquo;<samp>##</samp>&rsquo; operators
for stringizing and token pasting. I handled this by allocating
space for these tokens from the lexer&rsquo;s token run chain. This means
they automatically receive the same lifetime guarantees as lexed tokens,
and we don&rsquo;t need to concern ourselves with freeing them.
</p>
<p>Lexing into a line of tokens solves some of the token memory management
issues, but not all. The opening parenthesis after a function-like
macro name might lie on a different line, and the front ends definitely
want the ability to look ahead past the end of the current line. So
cpplib only moves back to the start of the token run at the end of a
line if the variable <code>keep_tokens</code> is zero. Line-buffering is
quite natural for the preprocessor, and as a result the only time cpplib
needs to increment this variable is whilst looking for the opening
parenthesis to, and reading the arguments of, a function-like macro. In
the near future cpplib will export an interface to increment and
decrement this variable, so that clients can share full control over the
lifetime of token pointers too.
</p>
<p>The routine <code>_cpp_lex_token</code> handles moving to new token runs,
calling <code>_cpp_lex_direct</code> to lex new tokens, or returning
previously-lexed tokens if we stepped back in the token stream. It also
checks each token for the <code>BOL</code> flag, which might indicate a
directive that needs to be handled, or require a start-of-line call-back
to be made. <code>_cpp_lex_token</code> also handles skipping over tokens in
failed conditional blocks, and invalidates the control macro of the
multiple-include optimization if a token was successfully lexed outside
a directive. In other words, its callers do not need to concern
themselves with such issues.
</p>
<hr>
<div class="header">
<p>
Next: <a href="Hash-Nodes.html#Hash-Nodes" accesskey="n" rel="next">Hash Nodes</a>, Previous: <a href="Conventions.html#Conventions" accesskey="p" rel="prev">Conventions</a>, Up: <a href="index.html#Top" accesskey="u" rel="up">Top</a> &nbsp; [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Concept-Index.html#Concept-Index" title="Index" rel="index">Index</a>]</p>
</div>
</body>
</html>