"Markup" in XML and HTML refers to the use of "tags", built with less-than and greater-than signs (also called angle brackets), to add structure and meaning to the text of a document. With release 8.0, CMarkup was changed slightly to support forms of markup other than well-formed XML. This allows CMarkup to navigate generic markup documents such as HTML pages.
HTML has always been used very loosely, and browsers are built to tolerate inconsistencies in the way HTML markup is used in web pages. For example, an HTML page is expected to begin with an <HTML> start tag and end with an </HTML> end tag, but browsers do not complain if either tag is missing. More significantly, if a TABLE element is not ended, the browser assumes that it was meant to end at the end of the document; but if a table row TR element is not ended, the browser assumes that it was meant to end at the next row or at the end of the table, not at the end of the document.
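The two rules just described can be modeled in a few lines of C++. This is only a simplified sketch of the behavior, not how any browser or CMarkup implements it; the function name addImpliedEndTags and the token format (bare tag names such as "tr" and "/table") are assumptions made for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of CMarkup or any browser).
// Takes tag tokens such as "table", "tr", "/table" and returns the
// sequence with the implied TR and TABLE end tags filled in.
std::vector<std::string> addImpliedEndTags(const std::vector<std::string>& tokens) {
    std::vector<std::string> out;
    std::vector<std::string> open;  // stack of open "table" / "tr" elements

    // Emit an end tag for the innermost open element if it is a row.
    auto closeOpenRow = [&]() {
        if (!open.empty() && open.back() == "tr") {
            open.pop_back();
            out.push_back("/tr");
        }
    };

    for (const std::string& tok : tokens) {
        if (tok == "tr") {
            closeOpenRow();            // a new row ends the previous row
            open.push_back("tr");
            out.push_back("tr");
        } else if (tok == "/tr") {
            closeOpenRow();            // explicit end tag; a stray one is dropped
        } else if (tok == "table") {
            open.push_back("table");
            out.push_back("table");
        } else if (tok == "/table") {
            closeOpenRow();            // the end of the table ends the open row
            if (!open.empty() && open.back() == "table") {
                open.pop_back();
                out.push_back("/table");
            }
        } else {
            out.push_back(tok);        // other tags pass through unchanged
        }
    }

    // End of document: anything still open, including a TABLE, ends here.
    while (!open.empty()) {
        out.push_back("/" + open.back());
        open.pop_back();
    }
    return out;
}
```

Given the tokens table, tr, tr with no end tags at all, the sketch ends the first row when the second row starts, and ends the second row and the table at the end of the document.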
XML, on the other hand, is intended to adhere strictly to rules about how tags are used, in order to eliminate potential differences in interpretation. After populating a CMarkup object with Load(filename) or SetDoc(docstring), the IsWellFormed method tells whether those tags are arranged correctly. For an HTML document it will generally return false unless the document is nested properly and has no non-ended tags (as in XHTML, which is an XML format designed to work in HTML viewers). Whether or not it is well-formed, the tags of the markup document can be navigated with the same methods used to navigate XML.
While existing support for XML has not been compromised, the CMarkup parser (see Inside the CMarkup Parser) has been changed to forge on even after discovering ill-formed XML, only recording the first error encountered. And even though the CMarkup object maintains a Containment Hierarchy of the elements in the ill-formed document, the SetDoc and IsWellFormed methods return false.
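The idea of forging on while recording only the first error can be illustrated with a small stand-alone scanner. This is a minimal sketch and not the CMarkup parser itself: the names ScanResult and scanMarkup are hypothetical, and it assumes tags without attributes, comments, or CDATA sections.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result type (not part of CMarkup).
struct ScanResult {
    bool wellFormed = true;
    std::string firstError;         // stays empty while wellFormed is true
    std::vector<std::string> tags;  // every element name found, in document order
};

// Hypothetical scanner: checks tag nesting but does not stop at the first
// problem, so all elements remain available even in an ill-formed document.
ScanResult scanMarkup(const std::string& doc) {
    ScanResult result;
    std::vector<std::string> open;  // stack of currently open element names

    // Record an error only if none has been recorded yet.
    auto fail = [&](const std::string& message) {
        if (result.wellFormed) {
            result.wellFormed = false;
            result.firstError = message;
        }
    };

    std::size_t pos = 0;
    while ((pos = doc.find('<', pos)) != std::string::npos) {
        std::size_t end = doc.find('>', pos);
        if (end == std::string::npos) {
            fail("unterminated tag");
            break;
        }
        std::string tag = doc.substr(pos + 1, end - pos - 1);
        pos = end + 1;
        if (tag.empty())
            continue;

        if (tag[0] == '/') {
            // End tag: must match the innermost open element.
            std::string name = tag.substr(1);
            if (!open.empty() && open.back() == name)
                open.pop_back();
            else
                fail("unexpected end tag </" + name + ">");  // then keep going
        } else if (tag.back() == '/') {
            // Empty element such as <br/>: nothing to push on the stack.
            result.tags.push_back(tag.substr(0, tag.size() - 1));
        } else {
            // Start tag.
            result.tags.push_back(tag);
            open.push_back(tag);
        }
    }

    if (!open.empty())
        fail("element <" + open.back() + "> not ended");
    return result;
}
```

For a document such as <table><tr><td>a</td><tr><td>b</td></table>, the sketch reports the document as not well-formed and keeps only the first error, yet still collects all five elements, which mirrors the behavior described above in spirit.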