parse tree

This documents the internal structure of the parse tree used by pwa/p2js. It is of no help to anyone merely using pwa/p2js, only to those seeking to improve it, extend it, or fix some bug in it. The structure is by no means set in stone and may change significantly between releases.

Aside: This is the "abstract syntax tree", or rather the definition of it. The software uses a variable named ast, but that is a misnomer, since it actually holds a concrete parse tree. Apart from that one rogue identifier, I prefer the term parse tree, as per the button, and apart from the slightly subtle distinction between abstract and concrete you can assume I mean the same thing.

The top level of a parse tree is always {"program",children}, where children is a list of nested nodes as described below. It is generated from the tokens by parse() in pwa\src\p2js_parse.e and must be handled by both [1] treeify() in p2js.exw, which converts it into a form suitable for IupTreeView() and IupTreeAddNodes(), and [2] generate_source() in pwa\src\p2js_emit.e, which creates the transpiled output.
It is quite unlikely that the parse tree will be modified to cater for [1], and quite likely that it will be modified to cater for [2].
When p2js_parse was written, a fair bit of the work involved going back to p2js_tok and fixing that. The emphasis was very much on getting through a successful parse, not on whatever gibberish was being created in place of a decent parse tree. When p2js_emit was written, very nearly half the work was going back and fixing p2js_parse, and that will quite probably remain true for all future work on p2js_emit.
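
As a rough illustration of the top-level shape described above, here is a minimal Python sketch. It is not p2js code: nodes are modelled as Python lists standing in for Phix sequences, the helper names is_program and top_level_types are hypothetical, and the payloads in the example tree are abbreviated placeholders.

```python
# Nodes are modelled as Python lists; the real trees are Phix sequences.
# Both helper names below are hypothetical and not part of p2js.

COMMENT = 5  # value quoted in the statements table; subject to change

def is_program(tree):
    """True if tree has the documented top-level shape {"program",children}."""
    return (isinstance(tree, list) and len(tree) == 2
            and tree[0] == "program" and isinstance(tree[1], list))

def top_level_types(tree):
    """Return the node type of each top-level child, in source order."""
    if not is_program(tree):
        raise ValueError('expected {"program",children}')
    return [child[0] for child in tree[1]]

# Illustrative tree only: a comment token, then a "MASS" node with
# placeholder contents.
example = ["program", [[COMMENT, 1, 12, 1, 0],
                       ["MASS", [201, []], ["{", []]]]]
print(top_level_types(example))
```

Any consumer, whether it builds a display tree or emits source, would start from the same check and then dispatch on the first element of each child.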

The following table lists the permitted elements of children. Numerical node types can be looked up in pwa\src\p2js_basics.e or pwa\src\p2js_keywords.e, or programmatically converted to human-readable form by tok_name(), which is defined in p2js_basics.e (not that you should need to know that). The numerical values are subject to change, so much so in fact that p2js.exw will (prompt and) overwrite p2js_keywords.e before automatically restarting itself, should any keywords be added, edited, or removed.

Obviously pwa/p2js honours any parentheses when parsing expressions, but does not actually keep them.
Instead, it reconstitutes any required/desired parentheses as needed, with the aim of preventing subtle differences in operator precedence between Phix and JavaScript from triggering any misbehaviour.
In other words, the parentheses needed for Phix are not necessarily the same as those needed for JavaScript, so they are always completely discarded and recreated. As a bonus, for "if condition then" <===> "if (condition) {" we can omit the enclosing () in Phix by using an outer precedence of 0, and force them in JavaScript by using an outer precedence of 12. In practice, we only ever omit unnecessary "common sense" parentheses on +-/* and put them on any other nested operator, making subtle differences in (other) operator precedences totally moot.
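
The discard-and-recreate approach can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the logic in p2js_emit.e: expressions are modelled as (op, lhs, rhs) tuples with strings as leaves, and the ARITH levels plus the NESTED and LEAF values are invented for the example. Only the outer precedences 0 and 12 come from the text above. The sketch also takes the cautious reading that anything other than arithmetic nested directly in arithmetic gets parentheses.

```python
# Assumed precedence levels, chosen only for this sketch.
ARITH = {"+": 1, "-": 1, "*": 2, "/": 2}
NESTED = 3    # assumed: above every arithmetic level, below LEAF
LEAF = 11     # assumed: a bare operand is only wrapped by FORCE
FORCE = 12    # outer precedence the text quotes for JavaScript conditions

def emit(expr, outer=0):
    """Emit expr, wrapping it in () whenever outer exceeds its own level."""
    if isinstance(expr, str):
        prec, text = LEAF, expr
    else:
        op, lhs, rhs = expr
        if op in ARITH:
            prec = ARITH[op]
            # The right operand is wrapped at equal precedence: a - (b - c).
            text = f"{emit(lhs, prec)} {op} {emit(rhs, prec + 1)}"
        else:
            prec = 0
            # Operands of any other operator are wrapped if they are operators.
            text = f"{emit(lhs, NESTED)} {op} {emit(rhs, NESTED)}"
    return f"({text})" if outer > prec else text

cond = ("<", ("+", "a", "b"), "c")
print("if " + emit(cond, 0) + " then")    # Phix side, outer precedence 0
print("if " + emit(cond, FORCE) + " {")   # JavaScript side, outer precedence 12
```

Because the tree carries no parentheses of its own, the same expression node yields both forms, and over-parenthesising anything outside plain arithmetic is always safe.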

One thing I am looking out for in particular is anywhere I can squidge in [,comment] and not spanner everything...

Expressions

Node type Contents
DIGIT (3), LETTER (4), '"' (34), '#' (35), '$' (36), `'` (39), '`' (96), '~' (126) The node is a token, eg {3,407,408,10,16} might be "21" on line 10 column 17.
(~ is a shorthand/prefix operator for length(), and obviously a "PROC"/T_length node is created instead when the longhand is used)
'{' (123) {'{',children}. A sequence (constructor), children will be {} to represent {}. See also MASS below, which is also/actually handled by expr() in p2js_emit.e even though it is technically not an expression.
??? TBC...
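
To make the token example in the table above concrete, here is a small Python sketch that decodes {3,407,408,10,16}. The field names and the indexing conventions are assumptions inferred from that single example (1-based inclusive start and finish offsets, and a 0-based column shown as column 17). They are not taken from p2js_tok.e, so treat the layout as unconfirmed.

```python
from typing import NamedTuple

class Token(NamedTuple):
    # Field names are guesses based on the one example in the table.
    toktype: int
    start: int    # assumed 1-based offset of the first character
    finish: int   # assumed 1-based offset of the last character
    line: int
    col: int      # assumed 0-based, hence 16 is shown as column 17

def describe(tok, src):
    """Return a human-readable description of a token within src."""
    t = Token(*tok)
    text = src[t.start - 1:t.finish]   # convert to Python's 0-based slicing
    return f'"{text}" on line {t.line} column {t.col + 1}'

# Hypothetical source text with "21" at offsets 407..408.
src = " " * 406 + "21"
print(describe([3, 407, 408, 10, 16], src))
```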

Statements/Declarations

Node type Contents
COMMENT (5), BLK_CMT (6) The node is a token. Note that consecutive comments are herded into a single "comments" node by treeify() for display in the Parse Tree window, but that is not the case for the actual parse tree, as generated by parse() and passed to generate_source().
T_exit, T_return, T_break, T_fallthrough {T_xxx,Line_number}. It would make sense to just use a token instead, so that may yet happen, for one or all of them.
"MASS" {"MASS",{MASS (201),children},expression}. A multiple assignment statement such as {a,b,c} = d.
Note that eg {string s, integer i} triggers a "let" prefix.
T_switch {T_switch,{expr,{T_case,block},{T_default|T_else,block}}}. A single expression followed by pairs of {T_case,block} and at most one default/else.
T_for {T_for,{ctrl,bPreDef,lim,step,{T_block,block}}} where ctrl is {"vardef",{{LETTER,...T_for},{LETTER,...<ctrlvar>},expr}},
bPreDef is a plain bool indicating whether it is predefined, so we can emit "for(let i=..." or "for (i=..." accordingly, lim is a normal expression, step can/will be {} if omitted, and the standard T_block pairing allows >1 statement in the loop body (block). Note that negative steps require an explicit '-' so that the emitter can determine whether <= or >= should be used. In contrast, desktop/Phix emits generic code to determine that at runtime when presented with a runtime-computed step. Obviously, should p be -2 then "by -p" will go horribly wrong - see the "for" examples in mappings.

Alternatively, in the case of "for [i,]e in expr do":
{T_for,{ctrl,{T_block,block}}} where ctrl is {"vardef",{{LETTER,...T_in},i,e,predefined[,from][,to]},expr}, where
e is {LETTER,...<e>}, i is similar or 0, and predefined is 0..3, ie e:0b01 + i:0b10, and of course 0b10 can only get set in the i,e form.
T_include {T_include,{string file, integer srcdx[, integer line]}} where if file is "" it is a restore point and line is not present.
An include statement will have everything present, and the srcdx will be shown in the parse tree so that subsequent restore points can be made sense of. The sequence of tokens emitted by p2js_tok.e all pertain to the same source file, however those in the parse tree may refer to different files. To handle that, dummy T_include nodes are placed at the top level in the parse tree as necessary, recognisable by having the empty string instead of a file, along with the needed index. The restore operation is very cheap, so there is no point optimising away unnecessary back-to-back restore points.
T_enum {T_enum,{COMMENT|BLK_CMT|tok|{'=',{tok,'$'|expr}}}} where tok is the identifier, '$' is a token and expr is expected to be a DIGIT token - some further tightening of what parse() leaves for emit() to deal with may be in order. Since JavaScript has no enum type, they are converted to explicit const statements, and no attempt is made to convert such back into enums. There is (as yet?) no attempt to deal with "by 2", etc.
T_try {T_try,{T_block,try_body,{T_catch,ename},{T_block,catch_clause}} Note that I have made no attempt to emit catch (let e) instead of catch (e), but it appears to work. [DEV check this]
??? TBC...
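
The T_for entry above says the comparison operator is chosen from an explicit '-' on the step. A minimal Python sketch of that decision follows. The function for_header and its string-based parameters are hypothetical, and the exact text p2js emits may well differ. Only the "let" prefix controlled by bPreDef and the <= versus >= choice are taken from the table.

```python
def for_header(name, first, limit, step="", predefined=False):
    """Build a JavaScript for-loop header from already-emitted expression text.

    step is the source text of the step expression, "" when omitted.
    The direction is decided purely from a leading '-', as the table notes.
    """
    step = step.strip()
    descending = step.startswith("-")
    compare = ">=" if descending else "<="
    increment = step if step else "1"
    declare = "" if predefined else "let "
    return (f"for ({declare}{name}={first}; "
            f"{name}{compare}{limit}; {name}+={increment}) {{")

print(for_header("i", "1", "10"))
print(for_header("i", "10", "1", "-1", predefined=True))
```

Since only the text of the step is inspected, a step such as -p with p holding -2 would select >= for a loop that actually counts upwards, which is exactly the failure the table warns about.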

Note also that the list of reserved words must be extended to cover both Phix and JavaScript, see here.