
xml

The file builtins\xml.e (not an autoinclude) allows conversion of xml (text) <--> DOM (nested structure).

Deliberately kept as simple as possible, to simplify modification. (I fully expect problems the first time this is used in anger!)

Does not use/validate against DTDs (Document Type Definitions) or XSLs (eXtensible Stylesheet Language).

Comments are only supported after the XML declaration (if present) and either before or after the top-level element, not within it.

Unicode handling: via utf-8. I have tested this on some fairly outlandish-looking samples without any problems.
However, it contains very little code that actually deals with unicode, relying instead on utf-8 never embedding any critical control characters (such as '<') within a multibyte encoding (I even wrote a quick test ditty to confirm that).
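The property being relied on can be checked with a short loop. This is only a sketch, not the test ditty mentioned above, and it assumes the utf32_to_utf8() builtin is available as an autoinclude:

```phix
-- Encode every non-surrogate code point above ASCII and confirm that none of
-- the resulting bytes is below #80, so that '<', '>', '&' and the quotes can
-- never occur inside a multibyte sequence.
integer checked = 0
for cp=#80 to #10FFFF do
    if cp<#D800 or cp>#DFFF then    -- surrogates are not valid scalar values
        string utf8 = utf32_to_utf8({cp})
        for i=1 to length(utf8) do
            if utf8[i]<#80 then
                crash("ascii byte inside the encoding of #%x", {cp})
            end if
        end for
        checked += 1
    end if
end for
printf(1,"%d code points checked, no embedded ascii bytes\n",{checked})
```

If the loop completes without crashing, a parser that only ever looks for ascii markup characters can safely treat utf-8 text as opaque bytes.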

Should you need to process utf-16 (or utf-32), it must be converted to utf-8 beforehand, and the output possibly converted back afterwards.
One thing it does actually do is skip a utf-8 BOM at the start of the xml input (string/file), however there is nothing in here to help with writing one back, not that prefixing one on the output [externally] should in any way prove difficult.
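A minimal sketch of both points, assuming the utf16_to_utf8() builtin and the "\xHH" string escape behave as I expect; UTF8_BOM is a name invented for this example, not something xml.e provides:

```phix
include xml.e
constant UTF8_BOM = "\xEF\xBB\xBF"  -- (hypothetical helper constant)

-- utf-16 input, here as a sequence of 16-bit code units for <a>hi</a>,
-- is converted to utf-8 before parsing
sequence utf16 = {'<','a','>','h','i','<','/','a','>'}
sequence doc = xml_parse(utf16_to_utf8(utf16))

-- a leading utf-8 BOM on the input should simply be skipped by xml_parse()
sequence doc2 = xml_parse(UTF8_BOM & "<a>hi</a>")
?(doc==doc2)

-- writing a BOM back is left to the caller
string out = UTF8_BOM & xml_sprint(doc)
```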

Note: json is widely considered a better choice for data transfer.
It is of course more efficient, but also less descriptive and does not support comments or any form of self-validation, and may prove more brittle, unless the provider has the common sense to include a field that adequately specifies the precise version/format being sent (but in my experience they rarely do). The bottom line is you should use xml in cases where you really benefit from it, which is not everywhere, eg: use xml for config-type-data, but json for bulk data.

Technically these routines are fully supported by pwa/p2js, however that may be of little practical concern since the likes of libcurl and sockets are not.

Example:

include xml.e
constant eg1 = """
<?xml version="1.0" ?>
<root>
  <element>Some text here</element>
</root>
"""
pp(xml_parse(eg1),{pp_Nest,5,pp_Pause,0})
-- output:
--          {"document",                    -- XML_DOCUMENT
--           {`<?xml version="1.0" ?>`},    -- XML_PROLOGUE
--           {"root",                       -- XML_CONTENTS[XML_TAGNAME]
--            {},                           --  XML_ATTRIBUTES
--            {{"element",                  --  XML_CONTENTS[XML_TAGNAME]
--              {},                         --   XML_ATTRIBUTES
--              "Some text here"}}},        --   XML_CONTENTS
--           {}}                            -- XML_EPILOGUE
Note the three uses of XML_CONTENTS. The first is the one and only top-level element, the second is a sequence of elements, which happens to be one long, and the third is a string of the "string, or sequence of nested tags" fame. The difference between the first two of those cannot be stressed enough: top-level has precisely one '{' before it, whereas any and all more deeply nested elements always have two, ie "{{", except of course like in the third use above, where it is actually just the lowest-level string contents, rather than a further nested element.
Obviously in the above XML_CONTENTS[XML_TAGNAME] means that XML_CONTENTS is a sequence of length 3 starting at that point, and XML_TAGNAME is the first element of that.
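To make the three uses concrete, the following sketch walks the structure printed above using only the XML_XXX constants. The expected values in the comments are read off that printed output:

```phix
include xml.e
constant eg1 = """
<?xml version="1.0" ?>
<root>
  <element>Some text here</element>
</root>
"""
sequence doc = xml_parse(eg1)
sequence root = doc[XML_CONTENTS]       -- the single top-level element
?root[XML_TAGNAME]                      -- "root"
sequence children = root[XML_CONTENTS]  -- a sequence of elements, here of length 1
?children[1][XML_TAGNAME]               -- "element"
?children[1][XML_CONTENTS]              -- "Some text here" (lowest-level string)
```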

Another:

include xml.e
constant eg2 = """
<Address>
  <Number Flat="b">2</Number>
  <Street>Erdzinderand Beat</Street>
  <District>Stooingder</District>
  <City>Bush</City>
</Address>
"""
pp(xml_parse(eg2),{pp_Nest,5,pp_Pause,0})
-- output:
--          {"document",                    -- XML_DOCUMENT
--           {},                            -- XML_PROLOGUE
--           {"Address",                    -- XML_CONTENTS[XML_TAGNAME]
--            {},                           --  XML_ATTRIBUTES
--            {{"Number",                   --  XML_CONTENTS[XML_TAGNAME]
--              {{"Flat"},                  --   XML_ATTRIBUTES[XML_ATTRNAMES]
--               {"b"}},                    --    XML_ATTRVALUES
--              "2"},                       --   XML_CONTENTS
--             {"Street",                   --  XML_CONTENTS[XML_TAGNAME]
--              {},                         --   XML_ATTRIBUTES
--              "Erdzinderand Beat"},       --   XML_CONTENTS
--             {"District",                 --  XML_CONTENTS[XML_TAGNAME]
--              {},                         --   XML_ATTRIBUTES
--              "Stooingder"},              --   XML_CONTENTS
--             {"City",                     --  XML_CONTENTS[XML_TAGNAME]
--              {},                         --   XML_ATTRIBUTES
--              "Bush"}}},                  --   XML_CONTENTS
--           {}}                            -- XML_EPILOGUE
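Attributes can be reached either by indexing with XML_ATTRNAMES/XML_ATTRVALUES or via xml_get_attribute(). A short sketch on a trimmed version of the above, where num_elem and the "Floor" attribute are just illustrative names:

```phix
include xml.e
constant eg2 = """
<Address>
  <Number Flat="b">2</Number>
  <City>Bush</City>
</Address>
"""
sequence doc = xml_parse(eg2)
sequence num_elem = doc[XML_CONTENTS][XML_CONTENTS][1]  -- first child of <Address>
?num_elem[XML_ATTRIBUTES][XML_ATTRNAMES]     -- {"Flat"}
?num_elem[XML_ATTRIBUTES][XML_ATTRVALUES]    -- {"b"}
?xml_get_attribute(num_elem, "Flat")         -- "b"
?xml_get_attribute(num_elem, "Floor", "n/a") -- no such attribute, so the default
```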


constants

Note the precise content of the resulting xml structure is not documented beyond these constants; the programmer is expected to examine the output from increasingly more complex, but still valid xml, until they understand the structure and how to use the XML_XXX constants, all quite straightforward really, once you get used to it.
The examples above should get you started.  At this point in time the structure is quite likely to change with each new release as more functionality is added, and of course more constants and routines are also quite likely to be added with each new release.

global enum XML_DOCUMENT,  -- must be "document"
XML_PROLOGUE,  -- {} or eg {doctype,comments}
XML_CONTENTS,  -- (must be a single element)
XML_EPILOGUE,  -- {} or {final_comments}
XML_DOCLEN  = $ -- 4

global enum XML_TAGNAME,  -- eg "Students"
XML_ATTRIBUTES,  -- {XML_ATTRNAMES,XML_ATTRVALUES}, or {}
-- XML_CONTENTS,  -- (string, or sequence of nested tags)
XML_ELEMLEN  -- 3

global enum XML_ATTRNAMES,  -- eg {"Name","Gender",...}
XML_ATTRVALUES  -- eg {"Alan","M",...}

global constant XML_DECODE  = #0001, -- convert eg &gt; to '>' in attribute values
XML_ENCODE  = #0002  -- the reverse, eg '>' to &gt; (in xml_sprint)

routines

string s = xml_decode(string s) -- convert all eg &lt; to '<', but leaving any CDATA as-is.
string s = xml_encode(string s) -- inverse of xml_decode
No re-coding of anything except the five critical entities (<>&'").
No CDATA handling: obviously there is no attempt to preserve CDATA on a round trip.
(The above two routines are really internal routines that are sometimes useful directly.)
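A quick sketch of the pair in use. The first expected value follows from the description above; the exact entity spelling that xml_encode() emits is not shown here, so it is simply printed:

```phix
include xml.e
?xml_decode("1 &lt; 2 &amp;&amp; 3 &gt; 2")   -- "1 < 2 && 3 > 2"
string plain = `a<b & "c"`
string enc = xml_encode(plain)
?enc                            -- the five critical entities, re-coded
?(xml_decode(enc)==plain)       -- round trip (no CDATA involved)
```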
sequence res = xml_parse(string xml, integer options=NULL) -- convert an xml string into a nested structure.
options may be XML_DECODE
Returns {-1,"message",...} if xml could not be parsed.
Success can be determined by checking whether res[1] is a string rather than -1, or better yet whether res[1]=="document".
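In practice that check might look as follows. The input is deliberately mismatched and is assumed (not verified) to be rejected; only the first two slots of the failure result are relied on:

```phix
include xml.e
sequence res = xml_parse("<root><element>text</root>")  -- </element> is missing
if res[1]=="document" then
    ?res[XML_CONTENTS][XML_TAGNAME]
else
    -- res is {-1,"message",...}
    printf(1,"parse failed: %s\n",{res[2]})
end if
```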
string res = xml_sprint(sequence xml, integer options=NULL) -- convert xml structure to a string. options may be XML_ENCODE
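A round-trip sketch using both options. That the attribute comes back as "Tom & Jerry" is my reading of the XML_DECODE comment, and the exact layout xml_sprint() produces is not asserted:

```phix
include xml.e
sequence doc = xml_parse(`<note to="Tom &amp; Jerry">hi</note>`, XML_DECODE)
?xml_get_attribute(doc[XML_CONTENTS], "to")  -- decoded attribute value
puts(1, xml_sprint(doc, XML_ENCODE))         -- entities re-encoded on output
```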
sequence res = xml_new_doc(sequence contents={}, prolog=std_prolog, epilog={}) -- create a new xml structure.
note: the default contents is not legal until res[XML_CONTENTS] gets an xml_new_element().
sequence elem = xml_new_element(string tagname, sequence contents) -- returns {tagname,{},contents}, where {} represents an empty set of attributes
contents should be a string or a sequence of nested elements
string res = xml_get_attribute(sequence elem, string name, dflt="") -- returns attribute value or dflt if it does not exist
sequence elem = xml_set_attribute(sequence elem, string attrib_name, attrib_value) -- set an attribute, or remove it if attrib_value is "".
sequence res = xml_get_nodes(sequence xml, string tagname) -- return a sequence of all nodes matching tagname
xml can be an entire document or an individual element (but not a sequence of elements)
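For instance (a sketch; the tag names are made up, and the comments show what I would expect rather than verified output):

```phix
include xml.e
sequence doc = xml_parse("<list><item>a</item><other>x</other><item>b</item></list>")
sequence items = xml_get_nodes(doc, "item")     -- search the entire document
for i=1 to length(items) do
    ?items[i][XML_CONTENTS]                     -- expected: "a" then "b"
end for
-- an individual element can be searched in the same way
?length(xml_get_nodes(doc[XML_CONTENTS], "item"))
```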
sequence xml = xml_add_comment(sequence xml, string comment, bool as_prolog=true) -- add a comment to the prolog or epilog
note that comments on individual elements are not supported; xml must be the entire top-level document.
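Putting the construction routines together, a small document can be built from scratch and printed. This is a sketch using only the signatures listed above; the tag names and comment text are invented:

```phix
include xml.e
sequence name_elem = xml_new_element("Name", "Alan")
sequence student = xml_new_element("Student", {name_elem})  -- nested element
student = xml_set_attribute(student, "Gender", "M")
sequence doc = xml_new_doc(student)             -- default prolog and epilog
doc = xml_add_comment(doc, "generated example") -- goes in the prolog by default
puts(1, xml_sprint(doc))
```

Note that xml_set_attribute() and xml_add_comment() return the modified structure, so the result must be assigned back.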
Everything apart from xml_parse() and xml_sprint() is pretty trivial and could easily be accomplished directly.

Note that none of these routines have yet undergone any significant real-world testing, but should be easy to fix/enhance as needed.

html (experimental)

For html handling, a few additional constants have been defined:

global enum HTML_TAGNAME,  -- 1, eg "html", "body", "div", etc.
HTML_ATTRIBS,  -- 2, can be accessed using XML_ATTRNAMES and XML_ATTRVALUES
HTML_CONTENTS   -- 3, can be plaintext, or nested elements

global constant HTML_INPUT  = #0004, -- input is html, provided automatically by strict_html_parse()
CRASHFATAL  = #1000  -- can help or hinder debugging


sequence res = strict_html_parse(string html, integer options=NULL) -- parse perfectly balanced html
options can be CRASHFATAL, which can sometimes help and sometimes hinder development.
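A minimal usage sketch on a perfectly balanced fragment. The shape of the result is not documented here, so it is simply pretty-printed for inspection rather than indexed:

```phix
include xml.e
constant html = `<html><body><div id="main"><p>Hello</p></div></body></html>`
sequence res = strict_html_parse(html)
pp(res,{pp_Nest,5,pp_Pause,0})
```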


strict_html_parse() has found use in pwa/p2js on machine-generated html,
and quite some time ago during the (partial/incomplete) docs->pmwiki effort,
in both cases on input previously verified thoroughly by the Edita/Edit Re-Indent tool and by makephix.exw,
each of which would often complain bitterly over (eg) unbalanced tags.

Be advised that the error handling (of strict_html_parse) is not exactly slick, and attempting to use this on any old/unsanitised/unbalanced crud from some random website is quite unlikely to go as smoothly as you might hope.

The biggest challenge in writing a non-strict "html_parse()" is how to close unbalanced tags: at some point you have to pretend an open tag was in fact self-closing, and reposition any would-be children into siblings, and cope with things like "<b><i>hey</b></i>". I imagine that most browsers just use start/end indexes, rather than nested/structured trees like I’m trying to do here.
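To give a feel for the problem, here is one simple stack-based tactic, sketched on pre-split tokens. It is not part of xml.e and is not the "treat as self-closing" approach described above: a close tag that matches an open one also closes everything opened after it, and a close tag with no matching open is dropped. The balance() routine and its token format are invented for this illustration:

```phix
function balance(sequence tokens)
    sequence stack = {},    -- names of currently open tags
             res = {}       -- balanced output tokens
    for i=1 to length(tokens) do
        string t = tokens[i]
        if length(t)>2 and t[1..2]="</" then
            string name = t[3..$-1]
            integer k = rfind(name, stack)
            if k!=0 then
                -- close everything opened since the matching open tag
                for j=length(stack) to k by -1 do
                    res = append(res, "</"&stack[j]&">")
                end for
                stack = stack[1..k-1]
            end if          -- (else a stray close tag: drop it)
        elsif length(t)>1 and t[1]='<' then
            stack = append(stack, t[2..$-1])
            res = append(res, t)
        else
            res = append(res, t)    -- plain text
        end if
    end for
    for j=length(stack) to 1 by -1 do   -- close anything left open
        res = append(res, "</"&stack[j]&">")
    end for
    return res
end function

?join(balance({"<b>","<i>","hey","</b>","</i>"}),"")   -- "<b><i>hey</i></b>"
```

Real-world html needs far more than this (attributes, void elements such as br, implied end tags), which is precisely why only the strict variant is offered.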

Finally, the obligatory quote:

    “XML is crap. Really. There are no excuses.
    “XML is nasty to parse for humans, and it’s a disaster to parse even for computers.
     There’s just no reason for that horrible crap to exist.”
                     - Linus Torvalds