xml
The file builtins\xml.e (not an autoinclude) allows conversion of xml (text) <--> DOM (nested structure).
Deliberately kept as simple as possible, to simplify modification. (I fully expect problems the first time this is used in anger!)
Does not use/validate against DTDs (Document Type Definitions) or XSLs (eXtensible Stylesheet Language).
Comments are only supported after the XML declaration (if present) and either before or after the top-level element, not within it.
Unicode handling: via utf-8. I have tested this on some fairly outlandish-looking samples without any problems.
However it contains very little code to actually deal with unicode, instead relying on utf8 to not embed any critical control characters (such as '<') within any multibyte encodings (and even wrote a quick test ditty).
Should you need to process utf-16 (or utf-32) then it must be converted to utf-8 beforehand, and possibly the output back.
One thing it does actually do is skip a utf-8 BOM at the start of the xml input (string/file), however there is nothing in here to help with writing one back, not that prefixing one on the output [externally] should in any way prove difficult.
Note: json is widely considered a better choice for data transfer.
It is of course more efficient, but also less descriptive and does not support comments or any form of self-validation, and may prove more brittle, unless the provider has the common sense to include a field that adequately specifies the precise version/format being sent (but in my experience they rarely do). The bottom line is you should use xml in cases where you really benefit from it, which is not everywhere, eg: use xml for config-type-data, but json for bulk data.
Technically these routines are fully supported by pwa/p2js, however that may be of little practical concern since the likes of libcurl and sockets are not.
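For instance, a minimal round-trip looks like this (an untested sketch; the exact whitespace that xml_sprint emits is not specified here):
include xml.e
constant s = "<a><b>some text</b></a>"
sequence dom = xml_parse(s) -- xml text --> nested structure
puts(1,xml_sprint(dom))     -- nested structure --> xml text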
Note the three uses of XML_CONTENTS in the first example output below. The first is the one and only top-level element, the second is a sequence of elements, which happens to be one long, and the third is a string of the "string, or sequence of nested tags" fame.
The difference between the first two of those cannot be stressed enough: the top-level element has precisely one '{' before it, whereas any and all more deeply nested elements always have two, ie "{{", except of course in the third use, where it is actually just the lowest-level string contents, rather than a further nested element.
Everything apart from xml_parse() and xml_sprint() is pretty trivial and could easily be accomplished directly.
Example:
include xml.e
constant eg1 = """
<?xml version="1.0" ?>
<root>
<element>Some text here</element>
</root>
"""
pp(xml_parse(eg1),{pp_Nest,5,pp_Pause,0})
-- output:
-- {"document",                 -- XML_DOCUMENT
--  {`<?xml version="1.0" ?>`}, -- XML_PROLOGUE
--  {"root",                    -- XML_CONTENTS[XML_TAGNAME]
--   {},                        -- XML_ATTRIBUTES
--   {{"element",               -- XML_CONTENTS[XML_TAGNAME]
--     {},                      -- XML_ATTRIBUTES
--     "Some text here"}}},     -- XML_CONTENTS
--  {}}                         -- XML_EPILOGUE
Obviously in the above XML_CONTENTS[XML_TAGNAME] means that XML_CONTENTS is a sequence of length 3 starting at that point, and XML_TAGNAME is the first element of that.
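The structure can then be navigated directly using the XML_XXX constants, for instance (an untested sketch, assuming eg1 and its output as above):
sequence doc = xml_parse(eg1)
sequence root = doc[XML_CONTENTS]  -- the one and only top-level element
sequence kids = root[XML_CONTENTS] -- a sequence of nested elements (here of length 1)
?root[XML_TAGNAME]      -- "root"
?kids[1][XML_TAGNAME]   -- "element"
?kids[1][XML_CONTENTS]  -- "Some text here" (the lowest-level string contents)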
Another:
include xml.e
constant eg2 = """
<Address>
<Number Flat="b">2</Number>
<Street>Erdzinderand Beat</Street>
<District>Stooingder</District>
<City>Bush</City>
</Address>
"""
pp(xml_parse(eg2),{pp_Nest,5,pp_Pause,0})
-- output:
-- {"document",              -- XML_DOCUMENT
--  {},                      -- XML_PROLOGUE
--  {"Address",              -- XML_CONTENTS[XML_TAGNAME]
--   {},                     -- XML_ATTRIBUTES
--   {{"Number",             -- XML_CONTENTS[XML_TAGNAME]
--     {{"Flat"},            -- XML_ATTRIBUTES[XML_ATTRNAMES]
--      {"b"}},              -- XML_ATTRVALUES
--     "2"},                 -- XML_CONTENTS
--    {"Street",             -- XML_CONTENTS[XML_TAGNAME]
--     {},                   -- XML_ATTRIBUTES
--     "Erdzinderand Beat"}, -- XML_CONTENTS
--    {"District",           -- XML_CONTENTS[XML_TAGNAME]
--     {},                   -- XML_ATTRIBUTES
--     "Stooingder"},        -- XML_CONTENTS
--    {"City",               -- XML_CONTENTS[XML_TAGNAME]
--     {},                   -- XML_ATTRIBUTES
--     "Bush"}}},            -- XML_CONTENTS
--  {}}                      -- XML_EPILOGUE
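Attribute names and values line up index-wise, as can be seen by picking the above apart (an untested sketch, assuming eg2 as defined above):
sequence doc = xml_parse(eg2)
sequence number = doc[XML_CONTENTS][XML_CONTENTS][1] -- the <Number> element
?number[XML_ATTRIBUTES][XML_ATTRNAMES]  -- {"Flat"}
?number[XML_ATTRIBUTES][XML_ATTRVALUES] -- {"b"}
?number[XML_CONTENTS]                   -- "2"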
constants
Note: the precise content of the resulting xml structure is not documented beyond these constants; the programmer is expected to examine the output from increasingly more complex, but still valid, xml until they understand the structure and how to use the XML_XXX constants, all quite straightforward really, once you get used to it.
At this point in time the structure is quite likely to change with each new release as more functionality is added, and of course more constants and routines are also quite likely to be added.
global enum XML_DOCUMENT,   -- must be "document"
            XML_PROLOGUE,   -- {} or eg {doctype,comments}
            XML_CONTENTS,   -- (must be a single element)
            XML_EPILOGUE,   -- {} or {final_comments}
            XML_DOCLEN = $  -- 4
global enum XML_TAGNAME,    -- eg "Students"
            XML_ATTRIBUTES, -- {XML_ATTRNAMES,XML_ATTRVALUES}, or {}
         -- XML_CONTENTS,   -- (string, or sequence of nested tags)
            XML_ELEMLEN     -- 3
global enum XML_ATTRNAMES,  -- eg {"Name","Gender",...}
            XML_ATTRVALUES  -- eg {"Alan","M",...}
global constant XML_DECODE = #0001, -- convert eg &gt; to '>' in attribute values
                XML_ENCODE = #0002  -- reverse (in xml_sprint)
routines
string s      = xml_decode(string s) -- convert all eg &lt; to '<', but leaving any CDATA as-is.
string res    = xml_sprint(sequence xml) -- convert a nested structure back into xml text
sequence res  = xml_parse(string xml) -- convert xml text into the nested structure described above
string res    = xml_get_attribute(sequence elem, string name, dflt="") -- returns the attribute value, or dflt if it does not exist
sequence elem = xml_set_attribute(sequence elem, string attrib_name, attrib_value) -- set an attribute, or remove it if attrib_value is "".
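The attribute helpers can be used like so (an untested sketch, again on eg2 from above; "Floor" is simply a made-up attribute name, used here to show the default being returned):
sequence doc = xml_parse(eg2)
sequence number = doc[XML_CONTENTS][XML_CONTENTS][1] -- the <Number> element
?xml_get_attribute(number,"Flat")         -- "b"
?xml_get_attribute(number,"Floor","none") -- "none" (no such attribute, so the default)
number = xml_set_attribute(number,"Flat","2b") -- replace the "b"
number = xml_set_attribute(number,"Flat","")   -- remove it altogether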
Note that none of these routines have yet undergone any significant real-world testing, but should be easy to fix/enhance as needed.
html (experimental)
For html handling, a few additional constants have been defined:
global enum HTML_TAGNAME,  -- 1, eg "html", "body", "div", etc.
            HTML_ATTRIBS,  -- 2, can be accessed using XML_ATTRNAMES and XML_ATTRVALUES
            HTML_CONTENTS  -- 3, can be plaintext, or nested elements
global constant HTML_INPUT = #0004, -- input is html, provided automatically by strict_html_parse()
                CRASHFATAL = #1000  -- can help or hinder debugging
strict_html_parse() has found use in pwa/p2js on machine-generated html, and quite some time ago during the (partial/incomplete) docs->pmwiki effort, in both cases on input previously thoroughly verified by the Edita/Edit Re-Indent tool and makephix.exw, both of which would often complain bitterly over (eg) unbalanced tags.
Be advised that the error handling (of strict_html_parse) is not exactly slick, and attempting to use this on any old/unsanitised/unbalanced crud from some random website is quite unlikely to go as smoothly as you might hope.
The biggest challenge in writing a non-strict "html_parse()" is how to close unbalanced tags: at some point you have to pretend an open tag was in fact self-closing, and reposition any would-be children into siblings, and cope with things like "<b><i>hey</b></i>". I imagine that most browsers just use start/end indexes, rather than nested/structured trees like I’m trying to do here.
Finally, the obligatory quote:
“XML is crap. Really. There are no excuses.
XML is nasty to parse for humans, and it’s a disaster to parse even for computers.
There’s just no reason for that horrible crap to exist.”
- Linus Torvalds