Mark up that text!

Markup is not scary. Markup is at least two millennia old. Text itself started out as a set of marks on a surface, so as the surfaces and tools changed, people naturally added extra marks that told the reader something useful about the text. This meta-information, at first, was fairly crude. For instance, in Roman days, most scrolling text was an unending stream of characters, like this:

armavirumquecanotroiaequiprimusaboris

Then someone invented white space, marking the division between words with an empty slot.

I sing of arms and the man who first from the shores of Troy

Someone else thought of marking the beginning and end of sentences. Soon, capital letters and periods set off sentences.

Easy is the path leading down to hell, but long and hard the climb out.

Then came a triumph of markup--paragraphing.

Easy is the path leading down to hell, but long and hard the climb out.

Headings, too, came along as another way of adding information about a chunk of text, saying, "Hey, this is what this section is about."

The Song of Solomon
Book One

In fact, many of the formats we take for granted were invented to give information about the passage. For instance, bulleted lists could be considered a kind of markup, indicating that the items all belong together.

The commons belong to the people.
When printing became established, a printing shop could plan changes in emphasis, weight, width, size, slant. To signal these to the people selecting type from the trays, someone had to mark up the text, indicating what formats to use where, using editorial markup symbols and abbreviations.
When 19th-century printers came up with typesetting machines that poured lead into molds, generating hot chunks of metal that could be dropped into a tray for printing, the operators used levers to shift from boldface to plain text and back, following the editor's markup. In the late 1960s and early 1970s, when programmers were inventing word processing, they looked at those levers and decided to insert corresponding tags into the stream of characters. Each tag corresponded to a lever that the operators had pulled when they were generating hot lead. Of course, the result was now cold type.
Each word processing machine and each program had its own proprietary set of tags, with different tags for each element of markup. Boldface might be indicated by [b] in a document coming from one vendor, {bold} in a document from another vendor, and #--emph--# in a third. But organizations such as airplane and tank manufacturers that had to bring together hundreds or thousands of documents from different word processors ran into a problem. They had to pay to have some programmer convert all the different tags into a single tag system, or vocabulary, so that a single printer could run the job. Eventually, the Pentagon and its suppliers asked computer vendors, publishers, and librarians to come up with a solution.

Coming up with a solution--SGML

Working at IBM, a team came up with a new approach. They saw that book designers typically analyzed a manuscript looking for the major components and diagnosing the structural relationships between them (this is a major heading, that is a minor heading, and this is just running text). Assuming that most documents had some kind of hierarchical structure, the team decided to use tags to indicate the structural role of each component (H1, H2, P).

To record the abstract structure, the team invented another document, the Document Type Definition, which simply described all the content tags and showed how those elements related to each other in the larger structure.

Then, in a stroke of genius, the team suggested separating format from content. In a classic mainframe approach, they moved all formatting rules into their own file. So now the team had three different documents: the tagged content itself, the Document Type Definition describing its structure, and the file of formatting rules.
The Generalized Markup Language of 1969 grew into the Standard Generalized Markup Language (SGML) of 1974, and became an international standard in 1986. It is called a language, but really it is a machine for producing vocabularies--sets of standard tags that define the structure and content of a particular type of document. SGML had a lot of pluses:
But, yielding to the influence of its requesters, SGML soon acquired so many options, exceptions, and tangled syntax that it became too complicated for most mortals to use.
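To make the conversion chore described earlier concrete, here is a minimal Python sketch that folds the three vendor spellings of boldface into one shared vocabulary. Only the opening forms ([b], {bold}, #--emph--#) come from the text. The closing forms, the target tag <b>, and the function name normalize are assumptions made for illustration.

```python
import re

# Hypothetical vendor spellings of boldface. Only the opening forms come
# from the text; the closing forms and the target <b> tag are assumed.
VENDOR_BOLD = [
    (r"\[b\](.*?)\[/b\]", r"<b>\1</b>"),
    (r"\{bold\}(.*?)\{/bold\}", r"<b>\1</b>"),
    (r"#--emph--#(.*?)#--/emph--#", r"<b>\1</b>"),
]


def normalize(text: str) -> str:
    """Rewrite every known vendor bold tag into the single shared tag."""
    for pattern, replacement in VENDOR_BOLD:
        text = re.sub(pattern, replacement, text, flags=re.DOTALL)
    return text


if __name__ == "__main__":
    print(normalize("Check the [b]fuel[/b] line, then the {bold}brakes{/bold}."))
```

Every new vendor means another row in the table, which hints at why a single agreed vocabulary looked so attractive.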
Enter hypertext and HTML

In the 1980s, people began to see that you could not only create text electronically, but you could deliver it that way--in e-mail, help files, CD-ROMs. And the invention of hypertext gave users a way to move around in that electronic text. Click here and go there. Taking off from these ideas, Tim Berners-Lee used SGML to come up with a Document Type Definition--a specific set of tags--for hypertext. With the Hypertext Markup Language (HTML), the Web was made possible.

HTML is just a bunch of tags. H1 means Level One Heading, and P stands for paragraph. Despite the term "language," HTML is really a vocabulary list--it is not a factory for creating new tags, like its generator, SGML. New software, called a browser, downloaded the HTML page, read the tags, and, using the Document Type Definition that Berners-Lee had created, applied its own format to each element, and then displayed the result on-screen. HTML encouraged the explosion of Web pages because its tags were
When the Web took off, and individual sites grew to millions of pages, HTML revealed some limitations, such as:
All that HTML tells the software is general information such as:
HTML does not tell the software that this element is a book title, and this one over here is an author name, and that both live within an element called Book Description. In HTML all of those elements might be tagged as paragraphs. In those circumstances, it is hard to tell software how to look within a Book Description to find the particular paragraph that has a book title, as opposed to an author name. For ordinary consumers, this failing seems trivial. But if a company wants to be able to order products, and sell them over the Web, tags that distinguish one type of content from another become critical.

Also, over the years customers drove the vendors to expand the HTML tag set, to include more tags that could help format the page more attractively. But because users were so sloppy, the browser makers had to write a lot of code handling exceptions, mistakes, and problems, so the code became bloated and slow.

To resolve these problems, the eXtensible Markup Language (XML) was created in 1996, becoming a standard in 1998. XML is really just SGML lite. Like SGML, XML lets you create your own vocabularies of tags, for different purposes. Because it is a subset of SGML, any old SGML processor can read XML. Because it is still fairly new, only the latest Web browsers can handle XML in its pure form, so web servers often have to transform the XML document into an HTML one. But XML is well on its way to outshining HTML as an all-purpose markup machine, a way of generating your own tags for your own purposes. XML tags identify the content, not the format. XML says things like:
Here's an example of the XML markup of the body of a marketing article. This genre is called featuresandbenefits, and it consists of a challenge followed by a solution. Over and out.
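A minimal sketch of what such markup might look like, wrapped in Python so that it can be parsed and inspected. The element names featuresandbenefits, challenge, and solution come from the text; the sentences inside them are invented placeholders.

```python
import xml.etree.ElementTree as ET

# Element names come from the article; the sentences inside are placeholders.
ARTICLE = """
<featuresandbenefits>
  <challenge>Your team retypes the same product facts in every brochure.</challenge>
  <solution>Tag each fact once, and let software reuse it everywhere.</solution>
</featuresandbenefits>
"""

root = ET.fromstring(ARTICLE.strip())

# Each tag names the kind of content it holds, not how it should look.
for element in root:
    print(f"{element.tag}: {element.text}")
```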
The tags are defined in a document called the Document Type Definition, which declares each element, shows what components should appear inside it, in what order, and how often. For instance, to define the tags used in the marketing article, the team declares an element called featuresandbenefits, which contains a challenge and a solution. Then the DTD declares that the challenge consists of ordinary text (grandly known as data made up of characters that will be checked by software called a parser, i.e., parsed character data). Ditto for the solution. That's it.
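The declarations just described can be sketched as a DTD placed inside the document itself. Python's standard library reads such a DTD but does not validate against it, so the conforms function below is a hand-written stand-in for what a validating parser would check. Its name and the sample sentences are assumptions.

```python
import xml.etree.ElementTree as ET

# The internal DTD declares the three elements described in the text.
# #PCDATA is the DTD keyword for parsed character data.
DOCUMENT = """<!DOCTYPE featuresandbenefits [
  <!ELEMENT featuresandbenefits (challenge, solution)>
  <!ELEMENT challenge (#PCDATA)>
  <!ELEMENT solution (#PCDATA)>
]>
<featuresandbenefits>
  <challenge>Placeholder challenge text.</challenge>
  <solution>Placeholder solution text.</solution>
</featuresandbenefits>"""


def conforms(root: ET.Element) -> bool:
    """Hand-rolled check mirroring the DTD: one challenge, then one solution,
    neither containing child elements."""
    if root.tag != "featuresandbenefits":
        return False
    children = list(root)
    if [child.tag for child in children] != ["challenge", "solution"]:
        return False
    return all(len(child) == 0 for child in children)


if __name__ == "__main__":
    print(conforms(ET.fromstring(DOCUMENT)))
```

Swapping the order of challenge and solution, or leaving one out, makes the check fail, which is the kind of discipline a DTD imposes.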
Content tags appeal to software

Content tags resemble the names of records and fields, or rows and columns, in a relational database. Looked at another way, the tagged elements resemble objects in an object-oriented database. Either way, using XML tags, you have created what programmers call "structured" information (as opposed to the sloppy, irregularly organized, and unpredictable documents you usually produce), so a programmer can easily grab your content and send it to a database, or pull facts out of the database and plug them into your document, without inadvertently putting the address where the phone number belongs. In effect, your document becomes a little database, capable of talking to other databases.

In this way XML makes it easy to write programs that manipulate the content of XML documents. Also, because it enforces a strict model of the content, and insists that writers observe a rigorous discipline on tagging (no exceptions, no mistakes, no craziness), XML limits the number of possible tags and organizational models, so processing software like parsers and browsers can be lightweight. Any vocabulary generated in XML is methodical and precise, and therefore unambiguous (at least from the point of view of software).

Amazingly, with a little training, anyone--even you--can read an XML document. If the original tags are written well, even a newcomer can figure out what each element is, and where it belongs. And XML files are text files using standard codes for every character, such as the American Standard Code for Information Interchange (ASCII), or one of the two forms of Unicode. Therefore almost any software in the world can read these documents. Plus, the markup adds metadata, making the file "self describing," and therefore much easier for software to digest. Total strangers, with software you never heard of, can pick up your XML document, read it, copy from it, insert data into it, and generally manhandle it.
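As a rough illustration of a document acting like a little database, the sketch below copies tagged content into an in-memory SQLite table using only Python's standard library. The catalog, bookdescription, title, and author element names are hypothetical, echoing the Book Description example earlier in the article, and load_books is an illustrative name.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: these element names are not from any standard.
CATALOG = """
<catalog>
  <bookdescription>
    <title>The Aeneid</title>
    <author>Virgil</author>
  </bookdescription>
  <bookdescription>
    <title>The Odyssey</title>
    <author>Homer</author>
  </bookdescription>
</catalog>
"""


def load_books(xml_text: str) -> sqlite3.Connection:
    """Copy each bookdescription element into a row of a books table."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE books (title TEXT, author TEXT)")
    for book in ET.fromstring(xml_text.strip()).iter("bookdescription"):
        # findtext looks up a child by tag name, so a title can never
        # land in the author column by accident.
        connection.execute(
            "INSERT INTO books VALUES (?, ?)",
            (book.findtext("title"), book.findtext("author")),
        )
    return connection


if __name__ == "__main__":
    db = load_books(CATALOG)
    for row in db.execute("SELECT author, title FROM books ORDER BY author"):
        print(row)
```

Because each value is fetched by its tag name rather than by its position on the page, the mapping from element to column stays explicit.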
Where HTML focuses on formatting content in a browser, XML enables that, but offers software a chance to perform further processing of the contents, such as searching, sorting, reorganizing, inserting, and deleting at the behest of the user, and your own little applet downloaded along with the file.

XML is democratic, too. Anyone can create a vocabulary of tags, using XML, defining what the content elements will be, and how they will be organized, in a schema or a Document Type Definition (DTD). Once displayed on the Web in an XML document with its accompanying schema or DTD, those tags and structures are open to anyone else to read and manipulate. They are not proprietary or hidden, so other people can disassemble the document, reuse parts of it, translate the tags into their own vocabulary. Industries can adopt standard vocabularies. Businesses can have their own variations, and easily translate content back and forth.

Fundamentally, XML separates the structural model, the actual content, and the rules for formatting the content.
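One way to picture the separation of content from formatting rules is to keep the rules in their own table and apply them only when the page is served. In this Python sketch the HTML_RULES mapping, the render function, and the sample sentences are all illustrative assumptions; a real site would more likely use a stylesheet language such as XSLT or CSS.

```python
import xml.etree.ElementTree as ET
from html import escape

# Content: tags say what each piece is, not how it looks.
CONTENT = """
<featuresandbenefits>
  <challenge>Placeholder challenge text.</challenge>
  <solution>Placeholder solution text.</solution>
</featuresandbenefits>
"""

# Formatting rules live apart from the content (hypothetical mapping).
HTML_RULES = {"challenge": "h2", "solution": "p"}


def render(xml_text: str, rules: dict[str, str]) -> str:
    """Turn each content element into the HTML element its rule names."""
    lines = []
    for element in ET.fromstring(xml_text.strip()):
        html_tag = rules.get(element.tag, "p")  # unknown tags fall back to <p>
        lines.append(f"<{html_tag}>{escape(element.text or '')}</{html_tag}>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render(CONTENT, HTML_RULES))
```

Passing a different rules table changes the look of the page without touching the content.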
That division of labor often means that you don't get to see exactly what your text will look like--a step back to the awful days of Unix editors like vi, or worse. But you can have your software display the text in a way that imitates one or another of the stylesheets, while an inset window displays the standard structure, and the software beeps at you if you get out of line and try to insert the wrong element next, or delete a required element. You are still writing, but at your elbow is a structure monitor. You have lost the independence of word processing, where you could organize and format any way you like. But your text is becoming a lot more useful for your users. You are leaving the free-wheeling world of documents, and entering the world of carefully defined, standardized objects--steps, captions, definitions, introductions, whatnot. Each element that has been identified with an XML tag now acts as a discrete object. Often, it is stored as an object in an object-oriented database, but most writers do not care how the software stores the element. What does matter is that now, instead of creating whole documents and formatting them, you have to craft a set of distinct objects without always knowing in advance all the structures into which they may be fitted together and formatted.
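The structure monitor at your elbow can be sketched as a lookup of which element may come next. The small procedure vocabulary below (title, introduction, step) and the function names allowed_next and try_insert are hypothetical; an actual editor would derive its rules from the schema or DTD rather than from a hand-written table.

```python
# Hypothetical content model: for each element, which elements may follow it.
# "START" stands for the empty document and "END" for a legal stopping point.
FOLLOWERS = {
    "START": ["title"],
    "title": ["introduction"],
    "introduction": ["step"],
    "step": ["step", "END"],
}


def allowed_next(elements_so_far: list[str]) -> list[str]:
    """List the elements a writer may insert after the ones already present."""
    last = elements_so_far[-1] if elements_so_far else "START"
    return FOLLOWERS.get(last, [])


def try_insert(elements_so_far: list[str], new_element: str) -> bool:
    """Append the element if the model allows it; otherwise 'beep'."""
    if new_element in allowed_next(elements_so_far):
        elements_so_far.append(new_element)
        return True
    print(f"Beep: {new_element} cannot come next here.")
    return False


if __name__ == "__main__":
    draft: list[str] = []
    try_insert(draft, "title")
    try_insert(draft, "step")  # rejected: an introduction is required first
    try_insert(draft, "introduction")
    print(draft)
```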
See: Coombs, Renear, and DeRose (1987); Goldfarb (1990); Goldfarb and Prescod (2000); Hunter et al. (2000); Kay (2000); McGrath (1998); Megginson (1998); Musciano and Kennedy (1997); Travis and Waldt (1995); Turner, Douglas, and Turner (1996); Young (2000); Yourdon (1994).