Once the project objectives and document context have been established, you can move on to an analysis of the physical object. The first step is to classify the source texts. Defining the document type is a critical part of the digitization process, as it establishes the foundation for the initial understanding of the text's structure. At this point you should have an idea of which documents are going to be digitized for the project. Even if you are not sure precisely how many texts will be in the final project, it is important to have a representative sample of the types of documents being digitized. Examine the sample documents and decide what categories they fall under. The structure and content of a letter will differ greatly from that of a novel or poem, so it is critical to make these classifications early in the process. There are structural differences not only between document types but also within a single type. One novel might consist solely of prose, another might combine prose and images, while yet another might have letters and poetry scattered throughout the prose narrative. An honest, representative sample will provide you with the structural information needed to make fundamental encoding decisions.
Deciding upon document type will give you an initial sense of the shape of the text. There are basic structural assumptions that come with classification: looking for the stanzas in poetry or the paragraphs in prose for example. Having established the document type, you can begin to assign the texts a more detailed structure. Without worrying about the actual tag names, as this comes later in the process, label all of the features you wish to encode. For example, if you are digitizing a novel, you might initially break it into large structural units: title page, table of contents, preface, body, back matter, etc. Once this is done you might move on to smaller features: titles, heads, paragraphs, catchwords, pagination, plates, annotations and so forth. One way to keep the naming in perspective is to create a structure outline. This will allow you to see how the structure of your document is developing, whether you have omitted any necessary features, or if you have labelled too much.
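For the novel just described, such a structure outline might take a shape like the following (the labels here are placeholders for discussion, not final tag names):

```
Novel
  Front matter
    Title page
    Table of contents
    Preface
  Body
    Chapters
      Heads (chapter titles)
      Paragraphs
      Catchwords, pagination, plates, annotations
  Back matter
```

Seeing the features laid out this way makes it easier to spot an omitted feature, or a level of detail that has grown beyond what the project requires.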
Once the features to be encoded have been decided upon, the relationships between them can then be examined. Establishing the hierarchical sequence of the document should not be too arduous a task, especially if you have already developed a structural outline. It should at this point be apparent, if we stick with the example of a novel, that the work divides into front matter, body matter, and back matter. Within front matter we find such things as epigraphs, prologues, and title pages. The body matter consists of chapters, which are constructed from paragraphs. Within the paragraphs can be found quotations, figures, and notes. This is an established and understandable hierarchy. There is also a sequential relationship where one element logically follows another. Using the above representation, if every body has chapters, paragraphs, and notes, then you would expect to find a sequence of <chapter> then <paragraph> then <note>, not <chapter>, <note>, then <paragraph>. Again, the more you understand about the type of text you are encoding, the easier this process will be. While the level of structural encoding will ultimately depend upon the project objectives, this is an opportune time to explore the form of the text in as much detail as possible. Having this data will influence later encoding decisions, and being able to refer to these results will be much easier than having to sift through the physical object at a later point to resolve a structural dilemma.
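The hierarchical and sequential relationships described above can be sketched in markup. In this sketch the element names are illustrative only, not a prescribed tag set; note how each element nests entirely within its parent, and how <note> follows the paragraph content it attaches to:

```xml
<novel>
  <front>
    <titlePage>...</titlePage>
    <epigraph>...</epigraph>
    <prologue>...</prologue>
  </front>
  <body>
    <chapter>
      <paragraph>...<quotation>...</quotation>...</paragraph>
      <paragraph>...<note>...</note>...</paragraph>
    </chapter>
  </body>
  <back>...</back>
</novel>
```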
The analysis also brings to light any issues or problems with the physical document. Are parts of the source missing? Has the text been water-damaged, leaving certain lines unreadable? If the document is a manuscript or letter, is the writing illegible? These are all instances that can be explored at an early stage of the project. While these problems will add a level of complexity to the encoding project, they must be dealt with in an honest fashion. If the words of a letter are illegible and you insert text that represents your best guess at the actual wording, then this needs to be encoded. The beauty of document analysis is that by examining the documents prior to digitization you stand a good chance of recognising these issues and establishing an encoding methodology. The benefit of this is threefold: firstly, having identified and dealt with this problem at the start, you will have fewer issues arise during the digitization process; secondly, there will be an added level of consistency during the encoding stage and retrospective revision won't be necessary; thirdly, the project will benefit from the thorough level of accuracy desired and expected by the scholarly community.
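As a sketch of how such a conjectural reading might be recorded: encoding schemes such as the TEI provide elements like <unclear> for a best-guess transcription of hard-to-read text and <supplied> for text supplied by the editor. The content and attribute values below are invented for illustration:

```xml
<line>I remain, your <unclear reason="illegible">devoted</unclear> friend,
  <supplied reason="illegible" resp="#editor">Charles</supplied></line>
```

The point is that the reader can always distinguish what the source actually shows from what the encoder has inferred.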
This is also a good time to examine the physical document and attempt to anticipate problems with the digitization process. Fragile spines, flaking or foxed paper, badly inked text: all will create difficulties during the scanning process and increase the likelihood of project delays if not anticipated at an early stage. This is another situation that requires examining representative samples of texts. It could be that one text was cared for in the immaculate conditions of a Special Collections facility while another was stored in a damp corner of a bookshelf. You need to be prepared for as many document contingencies as possible. Problems arise not only out of the condition of the physical object, but also out of such things as typography. OCR digitization is heavily reliant upon the quality and type of fonts used in the text. As will be discussed in greater detail in Chapter 3, OCR software is optimised for laser-quality printed text. This means that the older the printed text, the more degradation in the scanning results. These types of problems are critical to identify, as decisions will have to be made about how to deal with them, and those decisions will become a significant part of the project methodology.
2.4: Typical textual features

The final stage of document analysis is deciding which features of the text to encode. Once again, knowing the goals and objectives of the project will be of great use as you try to establish the breadth of your element definition. You have control over how much of the document you want to encode, taking into account how much time and manpower are dedicated to the project. Once you've made a decision about the level of encoding that will go into the project, you need to make the practical decision of what to tag. There are three basic categories to consider: structure, format and content.
In terms of structure there are quite a few typical elements that are encoded. This is a good time to examine the structural outline to determine what skeletal features need to be marked up. In most cases, the primary divisions of text (chapters, sections, stanzas, etc.) and the supplementary parts (paragraphs, lines, pages) are all assigned tag names. With structural markup, it is helpful to know how detailed an encoding methodology is being followed. As you will discover, you can encode almost anything in a document, so it will be important to have established what level of markup is necessary and to then adhere to those boundaries.
The second step is to analyse the format of the document. What appearance-based features need to translate between the print and electronic objects? Some of the common elements relate to attributes such as bold, italic and typeface. Then there are other aspects that take a bit more thought, such as special characters. These require special entity references, for example &AElig; for Æ. However, cases do exist of characters which cannot be encoded, and alternate provisions must be made. Format issues also include notes and annotations (items that figure heavily in scholarly texts), marginal glosses, and indentations. Elements of format are easily forgotten, so be sure to go through the representative documents and choose the visual aspects of the text that must be carried through to the electronic object.
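A sketch of how these format features might be captured, with element and attribute names that are illustrative rather than a prescribed tag set:

```xml
<paragraph>She consulted the <hi rend="italic">Encyclop&AElig;dia</hi>,
  noting the <hi rend="bold">entry heading</hi> and the marginal
  <gloss place="margin">see also vol. II</gloss>.</paragraph>
```

Here the entity reference &AElig; carries the Æ ligature through to the electronic text, while the rend and place attributes preserve appearance-based information without hard-coding a visual presentation.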
The third encoding feature concerns document content. This is where you will go through the document looking for features that are neither structural nor format-based. This is the point where you can highlight the content information necessary to the text and the user. Refer back to the decisions made about textual relationships and what themes and ideas should be highlighted. If, for example, you are creating a database of author biographies, you might want to encode such features as author's name, place of birth, written works, spouse, etc. Having a clear sense of the likely users of the project will make these decisions easier and perhaps more straightforward. This is also a good time to evaluate what the methodology will be for dealing with textual revisions, deletions, and additions, whether authorial or editorial. Again, it is not so critical here to define what element tags you are using, but rather to arrive at a listing of features that need to be encoded. Once these steps have been taken, you are ready to move on to the digitization process.
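For the author-biography example above, the content features just listed might be sketched like this (the element names are invented for illustration, not a fixed scheme):

```xml
<author>
  <name>Mary Shelley</name>
  <birthPlace>London</birthPlace>
  <works>
    <work>Frankenstein</work>
  </works>
  <spouse>Percy Bysshe Shelley</spouse>
</author>
```

None of these elements describes structure or appearance; each one labels a piece of content that the project's users are expected to search for or browse by.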