Rants: Mark up that text! -- Web Writing that Works

Markup is at least two millennia old. Text itself started out as a set of marks on a surface, so as the surfaces and tools changed, people naturally added extra marks that told the reader something useful about the text.

A look back

This meta-information, at first, was fairly crude. For instance, in Roman days, most scrolling text was an unending stream of characters, like this:

armavirumquecanotroiaequiprimusaboris
Isingofarmsandthemanwhofirstfromthe
shoresofTroy

Then someone invented white space, marking the division between words with an empty slot.

I sing of arms and the man who first from the shores of Troy

Someone else thought of marking the beginning and end of sentences. Soon, capital letters and periods set off sentences.

Easy is the path leading down to hell, but long and hard the climb out.

Then came a triumph of markup--paragraphing.

Easy is the path leading down to hell, but long and hard the climb out.
The ghost led me down past the caves, and the smoke, to the river Styx. Charon, the boatman of the dead, poled out of the mist.

Headings, too, came along as another way of adding information about a chunk of text, saying, "Hey, this is what this section is about."

The Song of Solomon
Book One

In fact, many of the formats we take for granted were invented to give information about the passage. For instance, bulleted lists could be considered a kind of markup, indicating that the items all belong together.

• The commons belong to the people.
• They may run their cattle on the green.
• They may use the green for their dances.
• No man may fence in the commons.
• All souls tend the green equally.

Printing demands tags

When printing became established, a printing shop could plan changes in emphasis, weight, width, size, slant. To signal these to the people selecting type from the trays, someone had to mark up the text, indicating what formats to use where, using editorial markup symbols and abbreviations.

I have learned to look on nature not as in the hour of thoughtless youth, but hearing oftentimes the still, sad music of humanity, nor harsh nor grating, though of ample power to chasten and subdue.

When 19th-century printers came up with typesetting machines that poured lead into molds, generating hot chunks of metal that could be dropped into a tray for printing, the operators used levers to shift from boldface to plain text and back, following the editor's markup.

From hot type to cold type

In the late 1960s and early 1970s, when programmers were inventing word processing, they looked at those levers, and decided to insert corresponding tags into the stream of characters.

Each tag corresponded to a lever that the operators had pulled, when they were generating hot lead. Of course, the result was now cold type.

I have learned to look on <bf>nature</bf> not as in the hour of thoughtless youth, but hearing oftentimes the still, sad music of <i>humanity</i>, nor harsh nor grating, though of ample power to <bf><i>chasten and subdue</i></bf>.

Each word processing machine and each program had its own proprietary set of tags, with different tags for each element of markup.

Boldface might be indicated by [b] on a document coming from one vendor, {bold} in a document from another vendor, and #--emph--# in a third.

But organizations such as airplane and tank manufacturers that had to bring together hundreds or thousands of documents from different word processors ran into a problem. They had to pay to have some programmer convert all the different tags into a single tag system, or vocabulary, so that a single printer could run the job. Eventually, the Pentagon and its suppliers asked computer vendors, publishers, and librarians to come up with a solution.
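
To make the conversion problem concrete, here is a minimal Python sketch of the kind of translation those programmers had to write. The three opening tags come from the examples above. The closing forms ([/b], {/bold}, #--/emph--#) and the shared target tags are assumptions made for illustration, since real systems varied.

```python
# Hypothetical vendor tags mapped onto one shared vocabulary.
# The closing forms are assumed for illustration only.
VENDOR_TO_COMMON = {
    "[b]": "<bf>", "[/b]": "</bf>",
    "{bold}": "<bf>", "{/bold}": "</bf>",
    "#--emph--#": "<bf>", "#--/emph--#": "</bf>",
}


def normalize(text: str) -> str:
    """Replace every known vendor tag with the common tag."""
    for vendor_tag, common_tag in VENDOR_TO_COMMON.items():
        text = text.replace(vendor_tag, common_tag)
    return text


samples = [
    "ample power to [b]chasten[/b]",
    "ample power to {bold}chasten{/bold}",
    "ample power to #--emph--#chasten#--/emph--#",
]
normalized = [normalize(s) for s in samples]
```

All three samples come out in the same vocabulary, which is what a single printer needed. Multiply the table by every formatting code from every vendor and you can see why the bill grew.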

Coming up with a solution--SGML

Working at IBM, a team came up with a new approach.

They saw that book designers typically analyzed a manuscript looking for the major components and diagnosing the structural relationships between them (this is a major heading, that is a minor heading, and this is just running text).

Assuming that most documents had some kind of hierarchical structure, the team decided to use tags to indicate the structural role of each component (H1, H2, P).

To record the abstract structure, the team invented another document, the Document Type Definition, which simply described all the content tags and showed how those elements related to each other, in the larger structure.

Then, in a stroke of genius, the team suggested separating format from content. In a classic mainframe approach, they moved all formatting rules into their own file. So now the team had three different documents:

The original document, with tags indicating the role of each item in the larger structure (this is a major heading, this is a caption).
A file full of formatting instructions, arranged in a series of equations. If the tag indicates this text is a major heading, then format it in 24 points blue.
A file describing the abstract structure that all documents of this type must follow, laying out all the official tags in a kind of vocabulary.
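
The three-way split can be sketched in a few lines of Python. This is only an illustration of the idea, not SGML itself: the tag names, the rule table, and the describe_format helper are all hypothetical, and only the 24-point blue heading rule comes from the text above.

```python
# 1. The document: content labelled by structural role (hypothetical tags).
document = [
    ("h1", "Printing demands tags"),
    ("p", "When printing became established..."),
]

# 2. The formatting rules, kept apart from the content.
#    Only the h1 rule is from the article; the p rule is assumed.
format_rules = {
    "h1": {"size_pt": 24, "color": "blue"},
    "p": {"size_pt": 12, "color": "black"},
}

# 3. The structure definition: which tags this document type allows.
allowed_tags = {"h1", "h2", "p"}


def describe_format(doc, rules, allowed):
    """Pair each piece of content with the format its tag calls for."""
    lines = []
    for tag, text in doc:
        if tag not in allowed:
            raise ValueError(f"tag {tag!r} is not in the vocabulary")
        rule = rules[tag]
        lines.append(f"{text} [{rule['size_pt']}pt {rule['color']}]")
    return lines


formatted = describe_format(document, format_rules, allowed_tags)
```

Changing the look of every major heading now means editing one rule, not every document.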

The Generalized Markup Language (GML) of 1969 grew into the Standard Generalized Markup Language (SGML) of 1974, and became an international standard in 1986. It is called a language, but really it is a machine for producing vocabularies--sets of standard tags that define the structure and content of a particular type of document.

SGML had a lot of pluses:

It aided large-scale publishing (in areas such as aircraft manufacturing and maintenance, nuclear power, telecommunications, space) because it let corporations, industries, and governmental agencies create a standard set of tags that could be inserted into ASCII documents instead of proprietary formatting codes. A printer could then accept thousands of documents from hundreds of different subcontractors and blend them into a single document set, without paying for extensive translation of different formatting tags.
It worked fine with long-lasting documents, particularly those requiring many revisions, updates, and cross references, because the structural model remained the same, despite all the tinkering.
It ensured consistency across many pages, and across many books.
It gave you a single source of raw text that could be traded back and forth among many vendors, because it was just ASCII text.

But, yielding to the influence of its requesters, SGML soon acquired so many options, exceptions, and tangled syntax that it became too complicated for most mortals to use.

Corporations had to have specialists work on a project for a few years to get the right set of tags, and then they had to convert all the old documents, at great expense.
Also, the language was aimed at publishing a lot of paper. No one had heard of the Web back then.
Because the SGML software ran on large computers, the code sprawled and grew heavy--not an easy applet to download over the Internet.

Enter hypertext and HTML

In the 1980s, people began to see that you could not only create text electronically, but you could deliver it that way--in e-mail, help files, CD-ROMs.

And the invention of hypertext gave users a way to move around in that electronic text. Click here and go there.

Taking off from these ideas, Tim Berners-Lee used SGML to come up with a Document Type Definition--a specific set of tags--for hypertext. With the Hypertext Markup Language (HTML), the Web was made possible.

HTML is just a bunch of tags. H1 means Level One Heading, and P stands for paragraph. Despite the term "language," HTML is really a vocabulary list--it is not a factory for creating new tags, like its generator, SGML.

New software, called a browser, downloaded the HTML page, read the tags, and, using the Document Type Definition that Berners-Lee had created, applied its own format to each element, and then displayed the result on-screen.

HTML encouraged the explosion of Web pages because its tags were:

Flexible enough to allow you to format the text in a lot of different ways
Easy to write in a text editor or word processor, and (relatively) easy for human beings to read
Just part of regular text files (rather than proprietary binary code), making for faster transmission, low overhead, and portability from one application to another.

Problems with HTML

When the Web took off, and individual sites grew to millions of pages, HTML revealed some limitations, such as:

HTML tags do not identify meaningful content. Yes, it is a heading, but what is it about?
HTML tags can appear in almost any order, because the Document Type Definition is purposely loose about structure (to encourage almost anyone to put almost anything up on the Web). But that lack of definition means that software has only a crude model of the structure of a document, and so, instead of zipping right to a structural element such as price or author, must read every character, hoping for a clue to indicate which is which. Searching and manipulation of the contents were therefore tedious and unreliable.
E-commerce demanded database exchanges, and HTML took a lot of massaging just to get its content in and out of databases.

All that HTML tells the software is general information such as:

This is the header part of the document.
This is the body part of the document.
This is a heading.
This is a paragraph.
This is a table.
This is the anchor for a link; if someone clicks it, the browser should follow the path to this other page.

HTML does not tell the software that this element is a book title, and this one over here is an author name, and that both live within an element called Book Description. In HTML all of those elements might be tagged as paragraphs. In those circumstances, it is hard to tell software how to look within a Book Description to find the particular paragraph that has a book title, as opposed to an author name.
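
The difficulty can be sketched in Python with the standard library's xml.etree.ElementTree module. Both fragments below are invented for illustration: the book, the author, and the tag names book_description, title, and author are assumptions. The first fragment is written as well-formed markup so that the same parser can read both.

```python
import xml.etree.ElementTree as ET

# HTML-style markup: every item is just a paragraph.
html_like = """
<div>
  <p>An Example Title</p>
  <p>A. N. Author</p>
</div>
"""

# Content markup: each item says what it is.
content_tagged = """
<book_description>
  <title>An Example Title</title>
  <author>A. N. Author</author>
</book_description>
"""

# With paragraphs, software can only guess from position.
paragraphs = [p.text for p in ET.fromstring(html_like).findall("p")]
guessed_title = paragraphs[0]  # breaks if the order ever changes

# With content tags, software asks for the element by name.
book = ET.fromstring(content_tagged)
title = book.findtext("title")
author = book.findtext("author")
```

The guess works only as long as every page puts the title first. The named lookup keeps working however the elements are ordered.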

For ordinary consumers, this failing seems trivial. But if a company wants to be able to order products, and sell them over the Web, tags that distinguish one type of content from another become critical.

Also, over the years customers drove the vendors to expand the HTML tag set, to include more tags that could help format the page more attractively. But because users were so sloppy, the browser makers had to write a lot of code handling exceptions, mistakes, and problems, so the code became bloated and slow.

XML as a solution

To resolve these problems, the eXtensible Markup Language (XML) was created in 1996, becoming a standard in 1998.

XML is really just SGML lite.

Like SGML, XML lets you create your own vocabularies of tags, for different purposes. Because it is a subset of SGML, any old SGML processor can read XML. Because it is still fairly new, only the latest Web browsers can handle XML in its pure form, so web servers often have to transform the XML document into an HTML one.

But XML is well on its way to outshining HTML as an all-purpose markup machine, a way of generating your own tags for your own purposes.

XML tags identify the content, not the format. XML says things like:

This document is a catalog.
This is a book description.
This is the book's title.
This is the book's subtitle, if any.
This is the book's page count.
This is the book's price, in U.S. dollars.
This is the book's ISBN.

Example of XML

Here's an example of the XML markup of the body of a marketing article. This genre is called featuresandbenefits, and it consists of a challenge followed by a solution. Over and out.

<featuresandbenefits>

<challenge> Today, even a small business needs a web site. To make it easy for customers to find out about your products, see testimonials from other satisfied customers, get a map to your store, you need a web site.</challenge>
<solution>The Pr Express offers a fast and easy way to build a web site that expresses your business case.</solution>
</featuresandbenefits>

The tags are defined in a document called the Document Type Definition, which declares each element, shows what components should appear inside it, in what order, and how often. For instance, to define the tags used in the marketing article, the team declares an element called featuresandbenefits, which contains a challenge and a solution. Then the DTD declares that the challenge consists of ordinary text (grandly known as data made up of characters that will be checked by software called a parser, i.e., parsed character data). Ditto for the solution. That's it.

<!ELEMENT featuresandbenefits (challenge, solution)>

<!ELEMENT challenge (#PCDATA)>
<!ELEMENT solution (#PCDATA)>
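
Python's standard library parses XML but does not validate it against a DTD, so what follows is a hand-rolled sketch of what a validating parser checks for this particular DTD. The CONTENT_MODEL table and the conforms function are hypothetical names, and the check covers only the simple "these children, in this order" and "text only" declarations used above.

```python
import xml.etree.ElementTree as ET

# The three declarations above, restated as data (hypothetical encoding):
# a list means "these children, in this order, once each";
# None means "text only" (#PCDATA).
CONTENT_MODEL = {
    "featuresandbenefits": ["challenge", "solution"],
    "challenge": None,
    "solution": None,
}


def conforms(element) -> bool:
    """Check one element, and its children, against CONTENT_MODEL."""
    if element.tag not in CONTENT_MODEL:
        return False
    expected = CONTENT_MODEL[element.tag]
    children = list(element)
    if expected is None:
        return not children  # text-only elements hold no child elements
    if [child.tag for child in children] != expected:
        return False
    return all(conforms(child) for child in children)


good = ET.fromstring(
    "<featuresandbenefits><challenge>You need a web site.</challenge>"
    "<solution>Build one quickly.</solution></featuresandbenefits>"
)
bad = ET.fromstring(
    "<featuresandbenefits><solution>Build one quickly.</solution>"
    "<challenge>You need a web site.</challenge></featuresandbenefits>"
)
results = (conforms(good), conforms(bad))
```

The second document uses the right tags in the wrong order, so it fails. That is the kind of discipline the DTD imposes.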

Content tags appeal to software

Content tags resemble the names of records and fields, or rows and columns, in a relational database.

Looked at another way, the tagged elements resemble objects in an object-oriented database.

Either way, using XML tags, you have created what programmers call "structured" information (as opposed to the sloppy, irregularly organized, and unpredictable documents you usually produce), so a programmer can easily grab your content and send it to a database, or pull facts out of the database and plug them into your document, without inadvertently putting the address where the phone number belongs. In effect, your document becomes a little database, capable of talking to other databases.

In this way XML makes it easy to write programs that manipulate the content of XML documents.
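
Here is a minimal sketch of that hand-off, using Python's standard xml.etree.ElementTree and sqlite3 modules. The catalog, its tag names, the book table, and the prices are all invented for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# A hypothetical catalog; titles and prices are made up.
catalog_xml = """
<catalog>
  <book><title>First Example</title><price>29.95</price></book>
  <book><title>Second Example</title><price>12.50</price></book>
</catalog>
"""

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE book (title TEXT, price REAL)")

# Each tag name tells the program which column the value belongs in.
for book in ET.fromstring(catalog_xml).findall("book"):
    connection.execute(
        "INSERT INTO book (title, price) VALUES (?, ?)",
        (book.findtext("title"), float(book.findtext("price"))),
    )

rows = connection.execute(
    "SELECT title, price FROM book ORDER BY price"
).fetchall()
connection.close()
```

Because the program asks for title and price by name, the title cannot land in the price column, whatever order the elements arrive in.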

Also, because it enforces a strict model of the content, and insists that writers observe a rigorous discipline on tagging (no exceptions, no mistakes, no craziness), XML limits the number of possible tags and organizational models, so processing software like parsers and browsers can be lightweight. Any vocabulary generated in XML is methodical and precise, and therefore unambiguous (at least from the point of view of software).

Benefits of XML--for humans

Amazingly, with a little training, anyone--even you--can read an XML document.

If the original tags are written well, even a newcomer can figure out what each element is, and where it belongs.

And XML files are text files using standard codes for every character, such as the American Standard Code for Information Interchange (ASCII), or one of the two forms of Unicode (UTF-8 or UTF-16). Therefore almost any software in the world can read these documents.

Plus, the markup adds metadata, making the file "self describing," and therefore much easier for software to digest. Total strangers, with software you never heard of, can pick up your XML document, read it, copy from it, insert data into it, and generally manhandle it. Where HTML focuses on formatting content in a browser, XML enables that, but offers software a chance to perform further processing of the contents, such as searching, sorting, reorganizing, inserting, and deleting at the behest of the user, and your own little applet downloaded along with the file.

XML is democratic, too. Anyone can create a vocabulary of tags, using XML, defining what the content elements will be, and how they will be organized, in a schema or a Document Type Definition (DTD). Once displayed on the Web in an XML document with its accompanying schema or DTD, those tags and structures are open to anyone else to read and manipulate. They are not proprietary or hidden, so other people can disassemble the document, reuse parts of it, translate the tags into their own vocabulary. Industries can adopt standard vocabularies. Businesses can have their own variations, and easily translate content back and forth.
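
Translating between vocabularies can be as simple as renaming tags. The sketch below assumes a second, entirely hypothetical vocabulary (pitch, problem, answer) and maps the featuresandbenefits tags onto it. Real translations often need to restructure the content as well, which this does not attempt.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from one company's tags to another's.
TAG_TRANSLATION = {
    "featuresandbenefits": "pitch",
    "challenge": "problem",
    "solution": "answer",
}


def translate(element):
    """Rename every element in place; unknown tags are left alone."""
    for node in element.iter():
        node.tag = TAG_TRANSLATION.get(node.tag, node.tag)
    return element


source = ET.fromstring(
    "<featuresandbenefits><challenge>You need a web site.</challenge>"
    "<solution>Build one quickly.</solution></featuresandbenefits>"
)
translated = ET.tostring(translate(source), encoding="unicode")
```

The words stay the same. Only the labels change, so the other company's software can read the document as if it had been written in-house.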

How XML works

Fundamentally, XML separates the structural model, the actual content, and the rules for formatting the content.

The Document Type Definition or schema defines the abstract structure for this type of document, creating a vocabulary and organization of tags.
The document itself uses those tags to label the actual content.
One or more stylesheets tell the browser what format to apply to the content, using those tags, in different circumstances (old browser, new browser, handheld device).
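
To see the division of labor in miniature, here is a Python sketch in which one document is shown two ways. The two "stylesheets" are just hypothetical lookup tables invented for this example. In practice the job is usually done with a stylesheet language such as CSS or XSLT.

```python
import xml.etree.ElementTree as ET

# The content, labelled with tags from the vocabulary.
article = ET.fromstring(
    "<featuresandbenefits><challenge>You need a web site.</challenge>"
    "<solution>Build one quickly.</solution></featuresandbenefits>"
)

# Two hypothetical "stylesheets": each maps a tag to a display template.
BROWSER_STYLE = {
    "challenge": "<h2>{text}</h2>",
    "solution": "<p>{text}</p>",
}
HANDHELD_STYLE = {
    "challenge": "* {text}",
    "solution": "  {text}",
}


def render(document, stylesheet):
    """Apply one stylesheet to the content; the content never changes."""
    return [
        stylesheet[child.tag].format(text=child.text)
        for child in document
    ]


browser_view = render(article, BROWSER_STYLE)
handheld_view = render(article, HANDHELD_STYLE)
```

The same two elements come out as HTML for a browser and as plain indented text for a small screen, and nobody had to touch the article.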

That division of labor often means that you don't get to see exactly what your text will look like--a step back to the awful days of Unix editors like vi, or worse.

But you can have your software display the text in a way that imitates one or another of the stylesheets, while an inset window displays the standard structure, and the software beeps at you if you get out of line and try to insert the wrong element next, or delete a required element. You are still writing, but at your elbow is a structure monitor.
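
A structure monitor of this kind can be sketched in Python. The sketch assumes the simplest possible structure, a fixed sequence of required elements, and the names REQUIRED_SEQUENCE, allowed_next, and try_insert are hypothetical. A real editor would read the rules from the DTD or schema.

```python
# Hypothetical content model: required children, in order.
REQUIRED_SEQUENCE = ["challenge", "solution"]


def allowed_next(elements_so_far):
    """Return the one element the writer may insert next, or None."""
    expected_prefix = REQUIRED_SEQUENCE[: len(elements_so_far)]
    if elements_so_far != expected_prefix:
        raise ValueError("the draft already breaks the structure")
    if len(elements_so_far) == len(REQUIRED_SEQUENCE):
        return None  # the structure is complete
    return REQUIRED_SEQUENCE[len(elements_so_far)]


def try_insert(elements_so_far, new_element):
    """Accept the element if it is the expected one; otherwise 'beep'."""
    if new_element == allowed_next(elements_so_far):
        return elements_so_far + [new_element], "ok"
    return elements_so_far, "beep"


draft, first = try_insert([], "solution")       # wrong element first
draft, second = try_insert(draft, "challenge")  # the expected element
```

The first attempt is refused because a solution cannot come before its challenge. The second goes through, and the monitor now expects a solution.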

You have lost the independence of word processing, where you could organize and format any way you like. But your text is becoming a lot more useful for your users.

You are leaving the free-wheeling world of documents, and entering the world of carefully defined, standardized objects--steps, captions, definitions, introductions, whatnot.

Each element that has been identified with an XML tag now acts as a discrete object.

Often, it is stored as an object in an object-oriented database, but most writers do not care how the software stores the element.

What does matter is that now, instead of creating whole documents and formatting them, you have to craft a set of distinct objects without always knowing in advance all the structures into which they may be fitted together and formatted.

See: Coombs, Renear, and DeRose (1987); Goldfarb (1990); Goldfarb and Prescod (2000); Hunter et al. (2000); Kay (2000); McGrath (1998); Megginson (1998); Musciano and Kennedy (1997); Travis and Waldt (1995); Turner, Douglas, and Turner (1996); Young (2000); Yourdon (1994).

Web Writing that Works!
http://www.WebWritingThatWorks.com
© 2003 Jonathan and Lisa Price
The Communication Circle