tokens
This documents the internal structure of the tokens used by pwa/p2js. It is of no help whatsoever
to anyone simply using pwa/p2js, only to those seeking to improve or extend it, or fix some bug in it.
The structure of an individual token is quite unlikely to change significantly ever again. At the
moment, however, p2js_tok creates a complete table of tokens for the entire source code, and it is
reasonably likely that this might change to a one-at-a-time, on-demand mechanism, should that offer
any performance or memory savings. In that case the seven constants below will probably be replaced
by seven equivalent lower-case global variables, with obviously far less subscripting.
All token fields are integer-only and must be handled both by p2js.exw:syntax_colour() and pwa\src\p2js_parse.e:parse().
Each token can/should be accessed using the following constants (as defined in pwa\src\p2js_basics.e):

    global enum TOKTYPE, TOKSTART, TOKFINISH, TOKLINE, TOKCOL, TOKTTIDX, TOKENDLINE=$   -- (one token)
    -- TOKTYPE as per table below, use tok_name() to get a human-readable string
    -- TOKTTIDX is only set on LETTER tokens, "", and can be compared to T_integer, etc.
    -- TOKENDLINE only on '`' (aka """) and BLK_CMT (no other tokens span lines)
    -- TOKALTYPE = TOKCOL is only/also used on parse tree LETTER tokens.

The source code is retrieved as a single string with embedded '\n'. We identify, for instance, a block comment using {start,finish} as opposed to {start_line,start_col,finish_line,finish_col}. Since we find line and col independently useful, we keep those as well, so there is no real saving of the suggested 2 vs. 4 variety; there is, however, a gain from reading the source in a single go rather than splitting it into lines. Most tokens have 5 elements, except LETTER, BLK_CMT, and '`' (used for both backtick and triple quote strings), all three of which have 6 elements.
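As a minimal sketch of how those constants are meant to be used, the following subscripts a single hand-made token. The enum is copied from above (minus `global`); `DIGIT`, `src` and `tok` are sample values made up for illustration and are not taken from p2js itself.

```phix
-- Sketch only: src and tok are made-up sample data, not real p2js_tok output.
enum TOKTYPE, TOKSTART, TOKFINISH, TOKLINE, TOKCOL, TOKTTIDX, TOKENDLINE=$

constant DIGIT = 3
string src = "count += 123;"
sequence tok = {DIGIT, 10, 12, 1, 9}    -- the 123: bytes 10..12, line 1, 0-based column 9

-- TOKSTART and TOKFINISH are 1-based, inclusive offsets into the single source string
string text = src[tok[TOKSTART]..tok[TOKFINISH]]

-- TOKCOL is 0-based, so add 1 for a human-readable column
printf(1, "type %d, text %s, line %d, col %d\n",
       {tok[TOKTYPE], text, tok[TOKLINE], tok[TOKCOL]+1})

-- TOKENDLINE=$ gives it the same value as TOKTTIDX: both name element 6
printf(1, "TOKTTIDX=%d, TOKENDLINE=%d\n", {TOKTTIDX, TOKENDLINE})
```

Because TOKTTIDX and TOKENDLINE share slot 6, which of the two names applies depends on the token type.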
Just to keep you on your toes, TOKLINE is 1-based but TOKCOL is 0-based, simply because the main use of TOKCOL is as a repeat(' ',tok[TOKCOL]) prefix when reporting an error, and obviously we want 0 spaces when pointing at column 1.
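A rough sketch of that error-reporting use follows. `caret_line()` is a hypothetical helper written for this example, not a p2js routine, and the token values are assumed sample data.

```phix
-- Sketch only: caret_line() is hypothetical, not part of p2js.
enum TOKTYPE, TOKSTART, TOKFINISH, TOKLINE, TOKCOL

function caret_line(sequence tok, string msg)
    -- TOKCOL is 0-based, so it is exactly the number of spaces to emit
    return repeat(' ', tok[TOKCOL]) & "^ " & msg
end function

string line_text = `integer count = 21; count += 123; ?{count,"abc"}`
sequence tok = {3, 430, 432, 10, 29}    -- assumed token for the 123 in column 30

printf(1, "%s\n", {line_text})
printf(1, "%s\n", {caret_line(tok, sprintf("line %d: unexpected number", tok[TOKLINE]))})
```

A token in column 1 has TOKCOL 0, so the caret is printed with no leading spaces at all.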
To clarify the examples below, imagine the following is line 10 in the source code, starting on byte 401:

    integer count = 21; count += 123; ?{count,"abc"}
    12345678901234567890123456789012345678901234567890
             10        20        30        40        50

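To check offsets against this sample line, one option is to build a dummy source string in which it really is line 10 and really does start on byte 401, then slice it. The sketch below does that with padding (391 spaces plus 9 newlines, purely an artefact of the example); the three tokens are the kind of values p2js_tok might produce and are assumptions, not captured output.

```phix
-- Sketch only: the padding and the token values are made up for illustration.
enum TOKTYPE, TOKSTART, TOKFINISH, TOKLINE, TOKCOL

constant line_start = 401
string line10 = `integer count = 21; count += 123; ?{count,"abc"}`
-- 391 spaces + 9 newlines = 400 bytes, so line10 is line 10 and begins on byte 401
string src = repeat(' ',391) & repeat('\n',9) & line10 & "\n"

sequence toks = {{3,   430, 432, 10, 29},   -- 123
                 {61,  415, 415, 10, 14},   -- =   (61 is '=')
                 {131, 427, 428, 10, 26}}   -- +=  (PLUSEQ)

for i=1 to length(toks) do
    sequence tok = toks[i]
    string text = src[tok[TOKSTART]..tok[TOKFINISH]]
    -- the 1-based start offset is the line's first byte plus the 0-based column
    integer expected = line_start + tok[TOKCOL]
    printf(1, "%s: bytes %d..%d, start expected at %d\n",
           {text, tok[TOKSTART], tok[TOKFINISH], expected})
end for
```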
TOKTYPE | Meaning |
---|---|
DIGIT (3), LETTER (4), COMMENT (5), BLK_CMT (6) | For example, 123 on line 10 column 30 might be {3,430,432,10,29}, where src[430] is '1' and src[432] is '3', and a known identifier such as integer might be {4,401,407,10,0,T_integer}. |
33..126 | Single character operators, one of `!"#$%&'()*+,-./:;<=>?[\]{\|}~` or '`'. In the case of quotes and #, the token is of course the whole literal string or hex constant, eg '=' might be {61,415,415,10,14} and "abc" might be {34,443,447,10,42}. |
odd numbers 129..201 | Multiple character operators, eg `+=` might be {PLUSEQ(=131),427,428,10,26}. Note that the definition of PLUSEQ in p2js_basics.e is mixed in with values for node rather than token use. For instance, a token of '{' may lead to a node with that type when on the right side of a statement, but a node with MASS(=201) when on the left, ie a multiple assignment statement. |
As you can see, tokens really are simplicity personified, but it won’t necessarily feel like that when you’re staring at a wall of numbers.
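When faced with such a wall of numbers, the TOKTYPE ranges in the table can be turned into a quick classifier. This is only a sketch: `tok_class()` is a hypothetical helper invented here (p2js itself provides tok_name() for human-readable names), and the values it does not recognise are simply reported as not covered by the table.

```phix
-- Sketch only: tok_class() is hypothetical, not part of p2js.
function tok_class(integer toktype)
    if toktype>=3 and toktype<=6 then
        return "named type"             -- DIGIT, LETTER, COMMENT, BLK_CMT
    elsif toktype>=33 and toktype<=126 then
        return "single character"       -- the type is the character code itself
    elsif toktype>=129 and toktype<=201 and remainder(toktype,2)=1 then
        return "multiple character"     -- eg PLUSEQ(=131); range shared with node-only values
    end if
    return "not covered by the table"
end function

sequence samples = {3, '=', '"', 131}
for i=1 to length(samples) do
    printf(1, "%d: %s\n", {samples[i], tok_class(samples[i])})
end for
```

Note that the odd 129..201 range also holds node-only values such as MASS(=201), so a match there does not by itself prove the number came from a token.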