Portable Document Format (PDF) was created by Adobe in 1993 to complement their PostScript language. PDF allows the user to view a document with a presentational integrity that almost resembles a scanned image of the source. This delivery of visually rich content is the most attractive use of PDF. The format is entirely concerned with keeping the document intact, and, to ensure this, allows any combination of text, graphics and images. It also has full, rich colour presentation and is therefore often used with corporate and marketing graphic arts materials. Another enticing feature, depending on the quality of the printer, is that when a PDF file is printed out, the hard copy output is an exact replication of the screen image. PDF is also desirable for its delivery strengths. Not only does the document maintain its visual integrity, but it also can be compressed. This compression eases on-line and CD-ROM transmission and assists its archiving opportunities.
HyperText Markup Language (HTML) is a non-proprietary markup system used for publishing hypertext on the World Wide Web. To date, it has appeared in four main versions (1.0, 2.0, 3.2, 4.0), with the World Wide Web Consortium (W3C) recommending 4.0 as the markup language of choice. HTML is a derivative of SGML, the Standard Generalised Markup Language. SGML will be discussed in greater detail in Chapter 5, but suffice it to say that it is an international standard metalanguage that defines a set of rules for device-independent, system-independent methods of encoding electronic texts. SGML allows you to create your own markup language but provides the rules necessary to ensure its processing and preservation. HTML is a successful implementation of the SGML concepts and, as a result, is accessible to most browsers and platforms. It is also a relatively simple markup language to learn, as it has a limited tagset. HTML is by far the most popular web-publishing language, allowing users to create online text documents that include multimedia elements (such as images, sounds, and video clips), and then put these documents in an environment that allows for instant publication and retrieval.
There are many advantages to a markup language like HTML. As mentioned above, the primary benefit is that a document encoded with HTML can be viewed in almost any browser, an extremely attractive option for a creator who wants documents that can be viewed by an audience with varied systems. However, it is important to note that while the encoding can cross platforms, page appearance consistently differs between browsers. While W3C recommends the usage of HTML 4.0, many of its features are simply not available to users with early versions of browsers. Unlike PDF, which is extremely concerned with keeping the document and its format intact, HTML has no true sense of page structure, and files can be neither saved nor printed with any precision.
Besides the benefit of a markup language that crosses platforms with ease, HTML attracts its many users by the simplicity with which it can be mastered. For users who do not want to take the time to learn the tagset, the good news is that conversion-to-HTML tools are becoming more accessible and easier to use. Those who cannot even spare the time to learn how to use HTML-creation software, of which there is a limited number, can sit down with any text creation program (Notepad, for example) and author an HTML document. Then, by using the 'Open File' tool in a browser, the author can view the document immediately. What this means for novice HTML authors is that they can sit down with a text creator and a browser and teach themselves a markup language in one session. And as David Seaman, Director of the Electronic Text Center at the University of Virginia, points out:
[this] has a real pedagogical value as a form of SGML that makes clear to newcomers the concept of standardized markup. To the novice, the mass of information that constitutes the Text Encoding Initiative Guidelines, the premier tagging scheme for most humanities documents, is not easily grasped. In contrast, the concise guidelines to HTML that are available on-line (and usually as a "help" option from the menu of a Web client) are a good introduction to some of the basic SGML concepts. (Seaman 1994).
This is of real value to the user. The notion of marking up a text is quite often an overwhelming concept. Most people do not realise that markup enters into their life every time they make a keystroke in a word processing program. So for the uninitiated, HTML provides a manageable stepping-stone into the world of more complex encoding. Once this limited tagset is mastered, many users find the jump into an extended markup language less intimidating and more liberating.
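The authoring workflow described above, a plain text file that is then opened in a browser, can be sketched in a few lines. This is only an illustration: the page content, the file name and the helper `write_page` are invented for the example, and Python simply stands in for typing the same text into an editor such as Notepad.

```python
from pathlib import Path

# A deliberately small page that uses only a handful of common HTML tags.
# The content is invented for illustration.
PAGE = """\
<html>
<head>
<title>My first page</title>
</head>
<body>
<h1>Course notes</h1>
<p>This paragraph was typed as plain text.</p>
</body>
</html>
"""


def write_page(path):
    """Save PAGE as a plain text file and return its location.

    A browser's 'Open File' command can then display the result.
    """
    target = Path(path)
    target.write_text(PAGE, encoding="utf-8")
    return target
```

Calling `write_page("first.html")` and then opening that file in a browser completes the loop; nothing beyond a text file and a Web client is involved.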
However, one of the drawbacks to this easy authoring language is that many online documents are created without a DTD. DTD stands for document type definition, the formal specification for an SGML-encoded document. Basically, a DTD is the method for spelling out the SGML rules that the document is following. It sets the standards for what markup can be used and how the elements of that markup interact with one another. So, for example, if you create an HTML document with a specific software program, say HoTMetaL PRO, the resulting text will begin with a document type declaration stating which DTD is being used. A sample declaration from a HoTMetaL creation looks like this:
<!DOCTYPE HTML PUBLIC "-//SoftQuad//DTD HoTMetaL PRO 4.0::19970714::extensions to HTML 4.0//EN" "hmpro4.dtd">
As can be seen in the above statement, the declaration states that the document will follow the HoTMetaL PRO 4.0 DTD. The markup used must therefore adhere to the rules set out in this specific DTD. If it does not, the document cannot be successfully validated, and software that enforces the DTD may refuse to process it.
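To make the parts of such a declaration concrete, the following is a minimal sketch that reads a document and reports which DTD it declares. It uses Python's standard `html.parser` module; the names `DoctypeFinder`, `find_doctype` and `DOCTYPE_RE` are invented for this example. It only extracts the public and system identifiers and does not validate the document against the DTD.

```python
import re
from html.parser import HTMLParser

# Matches: DOCTYPE <name> PUBLIC "<public identifier>" ["<system identifier>"]
DOCTYPE_RE = re.compile(
    r'DOCTYPE\s+(?P<name>\S+)\s+PUBLIC\s+"(?P<public>[^"]*)"'
    r'(?:\s+"(?P<system>[^"]*)")?',
    re.IGNORECASE,
)


class DoctypeFinder(HTMLParser):
    """Records the first document type declaration met while reading."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">" of the declaration.
        if self.doctype is None:
            match = DOCTYPE_RE.match(decl.strip())
            if match:
                self.doctype = match.groupdict()


def find_doctype(html_text):
    """Return a dict with 'name', 'public' and 'system', or None."""
    finder = DoctypeFinder()
    finder.feed(html_text)
    finder.close()
    return finder.doctype
```

Applied to a document that begins with the HoTMetaL declaration shown above, `find_doctype` would report `hmpro4.dtd` as the system identifier; applied to a document with no declaration, it returns `None`.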
As it stands now, web browsers require neither a DTD nor a document type declaration. Browsers are notoriously lax in their HTML requirements, and unless something serious is missing from the encoded document it will be successfully viewed through a Web client. The impact of this is that while HTML provides a convenient and universal markup language for a user, many of the documents floating out in cyberspace are permeated with invalid code. The focus then moves away from authoring documents that conform to a set of encoding guidelines and towards the creation of works that can be viewed in a browser (Seaman 1994). This problem will become more severe with the increased use of Extensible Markup Language, or XML as it is more commonly known. This markup language, which is being lauded as the new lingua franca, combines the visual benefits of HTML with the contextual benefits of SGML/TEI. However, while XML will have the universality of HTML, the web clients will require a more stringent adherence to markup rules. While documents that comply with the rules of an HTML DTD will find the transition relatively simple, the documents that were constructed strictly with viewing in mind will require a good deal of clean up prior to conversion.
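The contrast between a forgiving reader and a strict one can be illustrated with a short sketch. Here Python's `html.parser` stands in for a lenient Web client and `xml.etree.ElementTree` for an XML-style client; both substitutions are assumptions made for the example, as are the names `TagCollector`, `lenient_tags`, `is_well_formed_xml`, `SLOPPY` and `TIDY`. Note that the strict check tests well-formedness only, which is a weaker condition than validity against a DTD.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Lenient reader: records start tags and ignores nesting errors."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def lenient_tags(markup):
    """Return every start tag found, however badly the markup is nested."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def is_well_formed_xml(markup):
    """Strict reader: True only if the markup parses as well-formed XML."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# The <b> and <i> elements overlap instead of nesting properly.
SLOPPY = "<html><body><p><b><i>Overlapping tags</b></i></p></body></html>"
TIDY = "<html><body><p><b><i>Nested tags</i></b></p></body></html>"
```

The lenient reader lists the same five start tags for both strings, while the strict reader accepts only `TIDY`. Documents resembling `SLOPPY` are the ones that would need cleaning up before conversion.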
This is not to say that HTML is not a useful tool for creating online documents. As in the case of PostScript and PDF, the choice to use HTML should be document dependent. It is the perfect choice for static documents that will have a short shelf-life. If you are creating course pages or supplementary materials regarding specific readings that will not be necessary or available after the end of term, then HTML is an appropriate choice. If, however, you are concerned about presentational and structural integrity, the markup of document content and/or the long-term preservation of the text, then a user-definable markup language is a much better choice.
4.2.3: User-definable descriptive markup
A user-definable descriptive markup is exactly what its name implies. The content of the markup tags is established solely by the user, not by the software. As a result of SGML and its concept of a DTD, a document can have any kind of markup a creator desires. This frees the document from being married to proprietary hardware or software and from its reliance upon an appearance-based markup language. If you decide to encode the document with a non-proprietary language, which we highly recommend, then this is a good time to evaluate the project goals. While a user-definable markup language gives you control over the content of the markup, and thereby more control over the document, the markup can only be fully understood by you. Although not tied to a proprietary system, it is also not tied to any accepted standard. A markup language defined and implemented by you is simply that: a personal, non-proprietary markup system.
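As a rough illustration of descriptive markup whose tags are chosen by the user, the sketch below encodes a short text with an invented tagset and then queries it by content rather than by appearance. The tag names (`poem`, `title`, `stanza`, `line`), the sample text and the function `lines_per_stanza` are all hypothetical, and XML syntax is used as a stand-in for SGML because Python's standard library can read it directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, user-invented tagset: the names describe what the
# content is, not how it should look on screen or on paper.
SAMPLE = """\
<poem>
  <title>Untitled</title>
  <stanza>
    <line>first line of the first stanza</line>
    <line>second line of the first stanza</line>
  </stanza>
  <stanza>
    <line>only line of the second stanza</line>
  </stanza>
</poem>
"""


def lines_per_stanza(markup):
    """Count the <line> elements inside each <stanza>, in document order."""
    root = ET.fromstring(markup)
    return [len(stanza.findall("line")) for stanza in root.iter("stanza")]
```

Because the tags record structure, a question such as "how many lines does each stanza have?" can be answered mechanically. The meaning of the tags, on the other hand, is known only to the person who invented them.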
However, if the electronic texts require a language that is non-proprietary, more extensive and content-oriented than HTML, and comprehensible and acceptable to a humanities audience, then there is a solution: the Text Encoding Initiative (TEI). TEI is an international implementation of SGML, providing a non-proprietary markup language that has become the de facto standard in Humanities Computing. TEI, which is explained more fully in Chapter 5, provides 'a full set of tags, a methodology, and a set of Document Type Descriptions (DTDs) that allow the detailed (or not so detailed) description of the spatial, intellectual, structural, and typographic form of a work' (Seaman 1994).
4.3: Implications for long-term preservation and reuse
Markup is a critical, and inescapable, part of text creation and processing. Regardless of the method chosen to encode the document, some form of markup will be included in the text. Whether this markup is proprietary or non-proprietary, appearance- or content-based is up to you. Be sure to evaluate the project goals when making the encoding decisions. If the project is short-lived or necessarily software dependent, then the choices are relatively straightforward. However, if you are at all concerned about long-term preservation, cross-platform capabilities, and/or descriptive markup, then a user-definable (preferably TEI) markup language is the best choice. As Peter Shillingsburg corroborates:
...the editor with a universal encoding system developing an electronic edition with a multiplatform application has created a tool available to anyone with a computer and has ensured the longevity of the editorial work through generations to come of software and hardware. It seems worth the effort (Shillingsburg 1996, 163).