Creating and Documenting Electronic Texts

 

Chapter 3: Digitization — Scanning, OCR, and Re-keying
 

3.1: What is digitization?

Digitization is, quite simply, the creation of a computerised representation of a printed analogue. There are many methods of digitizing and many varieties of media to be digitized. However, as this guide is concerned with the creation of electronic texts, it will focus primarily on text and images, as these are the main objects of the digitization process. This chapter will address such issues as scanning and image capture, necessary hardware and software concerns, and, at greater length, the digitization of text.

For discussions of digitizing other formats, such as audio and video, there are many thorough analyses of procedure. Peter Robinson's The Digitization of Primary Textual Sources covers most aspects of the decision-making process and gives detailed explanations of all formats. 'On-Line Tutorials and Digital Archives' (or 'Digitising Wilfred'), written by Dr Stuart Lee and Paul Groves, is the final report of their JTAP Virtual Seminars project and takes you step by step through the process and the various digitization decisions that were made. They have also included many helpful worksheets to help scope and cost your own project. For a more current study of the digitization endeavour, refer to Stuart Lee's Scoping the Future of Oxford's Digital Collections at http://www.bodley.ox.ac.uk/scoping, which examined Oxford's current and future digitization projects. Appendix E of the study provides recommendations applicable to those outside the Oxford community by detailing the fundamental issues encountered in digitization projects.

While the above reports are extremely useful in laying out the steps of the digitization process, they suffer from the inescapable liability of being tied to the period in which they were written. In other words, recommendations for digitizing are constantly changing. As hardware and software develop, so does the quality of digitized output, and falling storage costs allow smaller projects to take advantage of archival imaging standards (discussed below). This in no way detracts from the importance of the studies produced by scholars such as Lee, Groves, and Robinson; it simply acknowledges that the fluctuating state of digitization must be taken into consideration when planning a project. Keeping this in mind, the following sections will attempt to cover the fundamental issues of digitization without dwelling on ephemeral discussion points.

3.2: The digitization chain

The digitization chain is a concept expounded by Peter Robinson in his aforementioned publication. The idea is based upon the fundamental concept that the best quality image will result from digitizing the original object. If this is not an attainable goal, then digitization should be attempted with as few steps removed from the original as possible. Therefore, the chain is composed of the number of intermediates that come between the original object and the digital image — the more intermediates, the more links in the chain (Robinson 1993).

This idea was then extended by Dr Lee so that the digitization chain became a circle in which every step of the project became a separate link. Each link attains a level of importance so that if one piece of the chain were to break, the entire project would fail (Groves and Lee 1999). While this is a useful concept in project development, it takes us away from the object of this chapter — digitization — so we'll lean more towards Robinson's concept of the digitization chain.

As will soon become apparent from the discussion of imaging hardware and software, having very few links in the digitization chain will make the project flow more smoothly. Regardless of the technology utilised by the project, the results will depend, first and foremost, on the quality of the image being scanned. Scanning a copy of a microfilm of an illustration originally found in a journal is acceptable if it is the only option you have, but clearly scanning the image straight from the journal itself is going to make a considerable difference in quality. This is one important reason for choosing hardware and software carefully. If you know that you are dealing with fragile manuscripts that cannot withstand the damaging light of a flatbed scanner, or a book whose binding cannot open past a certain degree, then you will probably lean towards a digital camera. If your text comes from an 18th-century book, with fading pages and uneven type, you will want the best text scanning software available. Knowing where your documents stand in the digitization chain will influence the subsequent imaging decisions you make for the project.

3.3: Scanning and image capture

The first step in digitization, both text and image, is to obtain a workable facsimile of the page. To accomplish this you will need a combination of hardware and software imaging tools. This is a somewhat difficult area to address in terms of recommending specific product brands, as what is considered industry (or at least the text creation industry) standard is subject to change as technology develops. However, this chapter will discuss some of the hardware and software frequently used by archives and digital project creators.

3.3.1: Hardware — Types of scanner and digital cameras

There are quite a few methods of image capture that are used within the humanities community. The equipment ranges from scanners (flatbed, sheetfed, drum, slide, microfilm) to high-end digital cameras. In terms of standards within the digitizing community, the results are less than satisfactory. Projects tend to choose the most available option, or the one that is affordable on limited grant funding. However, two of the most common and accessible image capture solutions are flatbed scanners and high-resolution digital cameras.

Flatbed scanners

Flatbed scanners have become the most commonplace method for capturing images or text. Their name comes from the fact that the scanner is literally a flat glass bed, quite similar to a photocopier, on which the image is placed face down and covered. The scanner then passes light-sensitive sensors over the illuminated page, breaking it into a grid of pixel-sized boxes. In the simplest case it then represents each box with a zero or a one, depending on whether the pixel is filled or empty. The importance of this becomes more apparent in the discussion of image types below.

As a result of their falling costs and widespread availability, quality flatbeds are used everywhere from professional digital archiving projects to the living rooms of home computer users. One benefit of this increased use and availability is that flatbed scanning technology is evolving continually, which has pushed purchasing decisions away from price and towards quality. In an attempt to promote the more expensive products, the marketplace tends to hype resolution and bit-depth, two aspects of scanning that are important to a project (see section 3.4) but are not the only concerns when purchasing hardware. It is not necessarily the case that you need to purchase the most expensive scanner to get the best quality digital image, but it is unlikely that entry-level flatbeds (usually under 100 pounds/dollars) will provide the image quality that you need. It used to be the case that to digitize well you needed a high-end scanner, at a price prohibitive to most projects; however, the advancing digitizing needs of users have pushed hardware developers to create mid-level scanners that approach the quality of the higher range.

As a consumer, you need to take a holistic view of the scanner's capabilities. Not only should the scanner provide you with the ability to create archival quality images (discussed in section 3.4.1), but it should also make the digitization process easier. Many low-cost scanners do not have high-grade lenses, optics, or light sources, and therefore produce images of very poor quality. The creation of superior calibre images depends upon the following hardware requirements (www.scanjet.hp.com/shopping/list.htm):

  • the quality of the lens, mirrors, and other optics hardware;
  • the mechanical stability of the optical system;
  • the focal range and stability of the optical system;
  • the quality of the scanning software and many other hardware and software features.

Also, many of the better quality scanners contain tools that allow you to automate some of the procedures. This is extremely useful for settings such as colour and contrast, where it is difficult to judge by eye the exact specification necessary for a high-quality image. Scanning hardware can provide this discernment for the user, so these intelligent automated features are a necessity if task time is to be kept down.

Digital cameras

One of the disadvantages of a flatbed scanner is that to capture the entire image the document must lie completely flat on the scanning bed. With books this poses a problem because the only way to accomplish this is to bend the spine to the breaking point. It becomes even worse when dealing with texts with very fragile pages, as the inversion and pressure can cause the pages to flake away or rip. A solution to this problem, one taken up by many digital archives and special collections departments, is to digitize with a stand-alone digital camera.

Digital cameras are by far the most dependable means of capturing quality digital images. As Robinson explains,

They can digitize direct from the original, unlike the film-based methods of microfilm scanning or Photo CD. They can work with objects of any size or shape, under many different lights, unlike flatbed scanners. They can make images of very high resolution, unlike video cameras (Robinson 1993, 39).

These benefits are most clearly seen in the digitization of manuscripts and early printed books, objects that are difficult to capture on a flatbed because of their fragile composition. The ability to digitize under varied lighting is a significant benefit, as it avoids damaging the make-up of the work, a precaution which cannot be guaranteed with flatbed scanners. The high resolution and heightened image quality allow for a level of detail you would expect only in the original. As a result of these specifications, images can be delivered at great size. A good example of this is the Early American Fiction project being developed at UVA's Electronic Text Center and Special Collections Department (http://etext.lib.virginia.edu/eaf/intro.html).

The Early American Fiction project, whose goal is the digitization of 560 volumes of American first editions held in the UVA Special Collections, is utilizing digital cameras mounted above light tables. They are working with camera backs manufactured by Phase One attached to Tarsia Technical Industries Prisma 45 4x5 cameras on TTI Reprographic Workstations. This has allowed them to create high quality images without damaging the physical objects. As they point out in their overview of the project, the workflow depends upon the text being scanned, but the results work out to close to one image every three minutes. While this might sound detrimental to the project timeline, it is relatively quick for an archival quality image. The images can be viewed at such a high resolution that the faintest pencil annotations can be read on-screen. Referring back to Robinson's digitization chain (3.2), we can see how this ability to scan directly from the source object prevents the 'degradation' found in digitizing documents with multiple links between original and computer.

3.3.2: Software

Making specific recommendations for software programs is a problematic proposition. As has been stated often in this chapter, there are no agreed 'standards' for digitization. With software, as with hardware, the choices made vary from project to project depending upon personal preference, university recommendations, and often budgetary restrictions. However, there are a few tools that are commonly seen in use with many digitization projects. Regardless of the brand of software purchased, the project will need text scanning software if text is to be digitized in-house, and an image manipulation package if imaging is to be done.

There is a wide variety of text scanning software available, with varying capabilities. The intricacies of text scanning are discussed in greater detail below, but the primary consideration with any text scanning software is how well it copes with the condition of the text being scanned. As this software is optimised for laser quality printouts, projects working with texts from earlier centuries need to find a package that can work through more complicated fonts and degraded page quality. While there is no standard, most projects work with Caere's OmniPage scanning software. In terms of image manipulation, there are more choices depending upon what needs to be done. For image-by-image manipulation, including converting TIFFs to web-deliverable JPEGs and GIFs, Adobe Photoshop is the more common selection. However, when there is a move towards batch conversion, Equilibrium's DeBabelizer Pro is known for its speed and high quality. If the conversion is being done in a UNIX environment, the xv image-manipulation utility is also a favourite amongst digitization projects.

3.4: Image capture and Optical Character Recognition (OCR)

As discussed earlier, electronic text creation primarily involves the digitization of text and images. Apart from re-keying (which is discussed in 3.5), the best method of digitizing text is Optical Character Recognition (OCR). This process is accomplished by using scanning hardware in conjunction with text scanning software: OCR takes a scanned image of a page and converts it into text. Similarly, image capture also requires image scanning software to accompany the hardware. However, unlike text scanning, image capture has more complex requirements in terms of project decisions and, like almost everything else in a digitization project, benefits from clearly thought out objectives.

3.4.1: Imaging issues

The first decision that must be made regarding image capture concerns the purpose of the images being created. Are the images simply for web delivery, or are there preservation issues that must be considered? The reason for this is simple: the higher the quality the image needs to be, the higher the settings necessary for scanning. Once this decision has been made, there are two essential image settings that must be established: what type of image will be scanned (greyscale? black and white? colour?) and at what resolution.

Image types

There are four main types of images: 1-bit black and white, 8-bit greyscale, 8-bit colour and 24-bit colour. A bit is the fundamental unit of information read by the computer, with a single bit being represented by either a '0' or a '1'. A '0' is considered an absence and a '1' is a presence, with more complex representations of information being accommodated by multiple or gathered bits (Robinson 1993, 100).

A 1-bit black and white image means that each pixel can be either black or white. This is a rarely used type and is completely unsuitable for almost all images. The only amenable material for this format would be printed text or line graphics for which poor resulting quality did not matter. Another drawback of this type is that saving it as a JPEG compressed image, one of the most prevalent image formats on the web, is not a feasible option.

8-bit greyscale images are an improvement on 1-bit as they encompass 256 shades of grey. They are often used for non-colour images (see the Wilfred Owen Archive at http://www.hcu.ox.ac.uk/jtap/) and provide a clear image rather than the fuzz that results from a 1-bit scan. While greyscale images are often considered more than adequate, there are times when non-colour originals should be scanned in colour at a higher bit-depth, because the fine detail of the hand will come through more distinctly (Robinson 1993, 28). Also, the consistent recommendation is that images intended as preservation or archival copies should be scanned as 24-bit colour.

8-bit colour is similar to 8-bit greyscale except that each pixel can be one of 256 colours. The decision to use 8-bit colour is completely project dependent, as the format is appropriate for web page images but can come out somewhat grainy. Another factor is the type of computer the viewer is using, as older machines cannot handle an image above 8-bit and will convert a 24-bit image to the lower format. However, the factor to take into consideration here is primarily storage space. An 8-bit image, while not having the quality of a higher format, will be markedly smaller.

If possible, 24-bit colour is the best scanning choice. This option provides the highest quality image, with each pixel having the potential to contain one of 16.8 million colours. The arguments against this image format are the size, cost and time necessary. Again, knowing the objectives of the project will assist in making this decision. If you are trying to create archival quality images, this is taken as the default setting. 24-bit colour makes the image look more photo-realistic, even if the original is greyscale. The thing to remember with archival quality imaging is that if you need to go back and manipulate the image in any way, it can be copied and adjusted. However, if you scan the image in a lesser format then any kind of retrospective adjustment will be impossible. While a 24-bit colour archived image can be made greyscale, an 8-bit greyscale image cannot be converted into millions of colours.

Resolution

The second concern relates to the resolution of the image. The resolution is determined by the number of dots per inch (dpi). The more dots per inch in the file, the more information is being stored about the image. Again, this choice is directly related to what is being done with the image. If the image is being archived or will need to be enlarged, then the resolution will need to be relatively higher. However, if the image is simply being placed on a web page, then the resolution drops drastically. As with the choices in image type, the dpi ranges alter the file size. The higher the dpi, the larger the file size. To illustrate the differences, I will replicate an informative table created by the Electronic Text Center, which examines an uncompressed 1" x 1" image in different types and resolutions.

Resolution (dpi)            400x400   300x300   200x200   100x100
1-bit black and white           20K       11K        5K        1K
8-bit greyscale or colour      158K       89K       39K        9K
24-bit colour                  475K      267K      118K       29K

Clearly the 400 dpi scan of a 24-bit colour image is going to be the largest file size, but is also one of the best choices for archival imaging. The 100 dpi image is attractive not only for its small size, but because screen resolution rarely exceeds this amount. Therefore, as stated earlier, the dpi choice depends on the project objectives.
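
The figures in the table follow directly from the arithmetic of uncompressed storage: pixels across, times pixels down, times bits per pixel, divided by eight to give bytes. The short sketch below, written in Python purely for illustration (it is not part of any tool discussed in this chapter), reproduces the table to within rounding and file-header differences.

    # Approximate uncompressed size of a scan: dimensions in inches,
    # resolution in dots per inch, and the bit-depth of the chosen image type.

    def uncompressed_kb(width_in, height_in, dpi, bits_per_pixel):
        """Return the approximate uncompressed image size in kilobytes (1K = 1024 bytes)."""
        pixels = (width_in * dpi) * (height_in * dpi)
        return pixels * bits_per_pixel / 8 / 1024

    # A 1" x 1" scan at the bit-depths and resolutions used in the table above.
    for bits, label in [(1, "1-bit black and white"),
                        (8, "8-bit greyscale or colour"),
                        (24, "24-bit colour")]:
        sizes = [f"{uncompressed_kb(1, 1, dpi, bits):5.0f}K" for dpi in (400, 300, 200, 100)]
        print(f"{label:26s}", *sizes)

The same arithmetic scales up quickly: a full 8" x 10" page at 600 dpi and 24-bit colour comes to roughly 80MB before compression, which is one reason archival masters are often held offline.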

File formats

If, when using an imaging software program, you click on the 'save as' function to finalise the capture, you will see that there are quite a few image formats to choose from. In terms of text creation there are three types fundamental to the process: TIFF, JPEG, and GIF. These are the most common image formats because they transfer to almost any platform or software system.

TIFF (Tagged Image File Format) files are the most widely accepted format for archival image creation and for retention as the master copy. More so than the following formats, TIFF files can be read on almost all platforms, which also makes them the best choice when transferring important images. Most digitization projects begin image scanning in the TIFF format, as it allows you to gather as much information as possible from the original and then save that data. This touches on the only disadvantage of the TIFF format: the size of the image. However, once the image is saved, it can be called up at any point and read by a computer with a completely different hardware and software system. Also, if there is any possibility that the images will be modified at some point in the future, then they should be scanned as TIFFs.

JPEG (Joint Photographic Experts Group) files are the strongest format for web viewing and for transfer through systems that have space restrictions. JPEGs are popular with image creators not only for their compression capabilities but also for their quality. While TIFF uses lossless compression, JPEG is a lossy compression format. This means that as the file is compressed, the image loses bits of information. However, this does not mean that the image will markedly decrease in quality. If the image is scanned at 24-bit, each pixel can be one of 16.8 million colours, more than the human eye can actually differentiate on the screen, so with the compression of the file the image loses the information least likely to be noticed by the eye. The disadvantage of this format is precisely what makes it so attractive: the lossy compression. Once an image is saved, the discarded information is lost. The implication of this is that the image, or certain parts of it, cannot be enlarged without a loss of quality. Additionally, the more work done to the image, requiring it to be re-saved, the more information is lost. This is why JPEGs are not recommended for archiving: there is no way to retain all of the information scanned from the source. Nevertheless, in terms of viewing capabilities and storage size, JPEGs are the best method for online viewing.

GIF (Graphics Interchange Format) files are an older format limited to 256 colours. Like TIFFs, GIFs use lossless compression, without requiring as much storage space. While they don't have the compression capabilities of a JPEG, they are strong candidates for graphic art and line drawings. They can also be made into transparent GIFs, meaning that the background of the image is rendered invisible, allowing it to blend in with the background of the web page. This is frequently used in web design but can also be put to good use in text creation. There are instances, as mentioned in Chapter 2, where a text character cannot be encoded in a form that a web browser can read, either because it is really an inline image (a head-piece, for example) or because the character is not defined by ISOLAT1 or ISOLAT2. When the UVA Electronic Text Center created an online version of the journal Studies in Bibliography, there were instances of inline special characters that simply could not be rendered through the available encoding. As the journal is a searchable full-text database, providing a readable page image was not an option. Their solution, one that did not disrupt the flow of the digitized text, was to create a transparent GIF of the image. These GIFs were made to match the size of the surrounding text and were inserted quite successfully into the digitized document.
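
As an illustration of the transparent GIF technique described above, the following is a minimal sketch using the open-source Pillow imaging library for Python. This is not the software used by the UVA project; the filenames are hypothetical, and the scan of the special character is assumed to sit on a plain white background.

    from PIL import Image

    # Hypothetical scan of an inline special character on a white background.
    char = Image.open("special_char.png").convert("RGB")

    # GIF is limited to 256 colours, so reduce the image to a palette first.
    palette_img = char.quantize(colors=256)

    # Assume the top-left pixel is background; look up its palette index
    # and declare that index transparent when the GIF is written.
    background_index = palette_img.getpixel((0, 0))
    palette_img.save("special_char.gif", transparency=background_index)

The resulting GIF can then be sized to match the surrounding text and dropped into the digitized document in the way described above.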

Referring back to the discussion of image types, the issue of file size comes up quite often in digitization. It is the lucky project or archive that has an unlimited amount of storage space, so most creators must consider how to achieve quality images that don't take up the 55MB of space needed by a 400 dpi, archival quality TIFF. However, it is easy to be led astray by the idea that the lower the bit-depth, the better the compression. Not so! Once again, the Electronic Text Center has produced a set of figures illustrating how working with 24-bit images, rather than 8-bit, will produce a smaller JPEG, along with a higher quality image file.

300 dpi 24-bit colour image: 2.65 x 3.14 inches:
uncompressed TIFF: 2188 K
'moderate loss' JPEG: 59 K

300 dpi 8-bit colour image: 2.65 x 3.14 inches:
uncompressed TIFF: 729 K
'moderate loss' JPEG: 76 K

100 dpi 24-bit colour image: 2.65 x 3.14 inches:
uncompressed TIFF: 249 K
'moderate loss' JPEG: 9 K

100 dpi 8-bit color image: 2.65 x 3.14 inches:
uncompressed TIFF: 85 K
'moderate loss' JPEG: 12 K
(http://etext.lib.virginia.edu/helpsheets/scanimage.html)

While the sizes might not appear to be that markedly different, remember that these results were calculated with an image that measures approximately 3x3 inches. Turn these images into page size, calculate the number that can go into a project, and the storage space suddenly becomes much more of an issue. Therefore, not only does 24-bit scanning provide a better image quality, but the compressed JPEG will take less of the coveted project space.

So now that the three image formats have been covered, what should you use for your project? In the best possible situation you will use a combination of all three. TIFFs would not be used for online delivery, but if you want your images to have any future use, whether for archiving, later enlargement, manipulation, or printing, or simply as a master copy, then there is no other format in which to store them. For online presentation, JPEGs and GIFs are the best methods. JPEGs will be of a better calibre and smaller filesize, but they cannot be enlarged or they will pixelate; in terms of viewing quality, however, they will almost match the TIFF. How you use GIFs will depend on what types of images are associated with the project, but if you are making thumbnail images that link to a separate page exhibiting the JPEG version, then GIFs are a popular choice for that task.
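
To make that workflow concrete, here is a minimal sketch, again using the open-source Pillow library rather than the commercial packages mentioned earlier, of deriving a JPEG viewing copy and a GIF thumbnail from a TIFF master. The filenames, the thumbnail size and the JPEG quality setting are all assumptions chosen for illustration.

    from PIL import Image

    # Hypothetical archival master: a 24-bit, high-resolution TIFF.
    master = Image.open("page042_master.tif")

    # Full-size JPEG viewing copy; quality=75 is a middling, lossy setting.
    master.convert("RGB").save("page042.jpg", "JPEG", quality=75)

    # GIF thumbnail for linking to the JPEG: shrink the longest side to
    # 150 pixels and reduce to the 256-colour palette that GIF requires.
    thumb = master.copy()
    thumb.thumbnail((150, 150))
    thumb.convert("RGB").quantize(colors=256).save("page042_thumb.gif")

The master TIFF itself is never altered; every derivative is generated from it, which is precisely why the archival copy repays its storage cost.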

In terms of archival digital image creation there seems to be some debate. As the Electronic Text Center has pointed out, there is a growing dichotomy between preservation imaging and archival imaging. Preservation imaging is defined as 'high-speed, 1-bit (simple black and white) page images shot at 600 dpi and stored as Group 4 fax-compressed files' (http://etext.lib.virginia.edu/helpsheets/specscan.html). The results of this are akin to microfilm imaging. While this does preserve the text for reading purposes, it ignores the source as a physical object. Archiving often presupposes that the objects are being digitized so that the source can be protected from constant handling, or as an international means of accessibility. However, this type of preservation annihilates any chance of presenting the object as an artefact. Archiving an object has an entirely different set of requirements. Yet, having said this, there is also a prevalent school of thought in the archiving community that the only imaging that can be considered of archival value is film imaging, which is thought to last at least ten times as long as a digital image. Nonetheless, the idea of archival imaging is still discussed amongst projects and funding bodies and cannot be overlooked.

There is no set standard for archiving, and you might find that different places and projects recommend another model. However, the following type, format and resolution are recommended:

  • 24-bit: There really is little reason to scan an archival image at anything less. Whether the source is colour or greyscale, the images are more realistic and of higher quality at this level. As the example above shows, the filesize of the subsequently compressed image does not benefit from scanning at a lower bit-depth.
  • 600 dpi: This is, once again, a problematic recommendation. Many projects assert that scanning in at 300 or 400 dpi provides sufficient quality to be considered archival. However, many of the top international digitization centres (Cornell, Oxford, Virginia) recommend 600 dpi as an archival standard — it provides excellent detail of the image and allows for quite large JPEG images to be produced. The only restrictive aspect is the filesize, but when thinking in terms of archival images you need to try and get as much storage space as possible. Remember, the master copies do not have to be held online, as offline storage on writeable CD-ROMs is another option.
  • TIFF: This should come as no surprise given the format discussion above. TIFF files, with their complete retention of scanned information and cross-platform capabilities, are really the only choice for archival imaging. The images maintain all of the information scanned from the source and are the closest digital replication available. The file, especially when scanned at 24-bit and 600 dpi, will be quite large, but well worth the storage space. You won't be placing the TIFF image online, but it is simple to make a JPEG image from the TIFF as a viewing copy.

This information is provided with the caveat that scanning technology is constantly changing for the better. It is more than likely that in the future these standards will become passé, with higher standards taking their place.

3.4.2: OCR issues

The goal of recognition technology is to re-create the text and, if desired, other elements of the page, including such things as tables and layout. Refer back to the description of the scanner and how it takes a copy of the image by replicating it as patterns of bits: dots that are either filled or unfilled. OCR technology examines the patterns of dots and turns them into characters. Depending upon the type of scanning software you are using, the resulting text can be piped into many different word processing or spreadsheet programs. Caere released OmniPage version 10.0 in the autumn of 1999, which boasts the new Predictive Optical Word Recognition Plus+ (POWR++) technology. As the OmniPage factsheet explains,

POWR++ enables OmniPage Pro to recognize standard typefaces, without training, from 4 to 72 point sizes. POWR++ recognizes 13 languages (Brazilian Portuguese, British English, Danish, Dutch, Finnish, French, German, Italian, Norwegian, Portuguese, Spanish, Swedish, and U.S English) and includes full dictionaries for each of these languages. In addition, POWR++ identifies and recognizes multiple languages on the same page (http://www.caere.com/products/omnipage/pro/factsheet.asp).

However, OCR software programs (including OmniPage) are very up-front about the fact that their technology is optimised for laser printer quality text. The reasoning behind this should be readily apparent. As scanning software attempts to examine every pixel in the object and then convert it into a filled or empty space, a laser quality printout will be easy to read: it has very clear, distinct characters on a crisp white background, a background that will not interfere with the clarity of the letters. However, once books become the object type, the software's capabilities begin to degrade. This is why the first thing you must consider, if you decide to use OCR for the text source, is the condition of the document to be scanned. If the characters in the text are not fully formed, or there are instances of broken type or damaged plates, the software will have a difficult time reading the material. The implication is that late 19th- and 20th-century texts have a much better chance of being read well by the scanning software. As you move further away from the present, with the differences in printing, OCR becomes much less dependable. The change in paper, from a bleached white to a yellowed, sometimes foxed, background creates noise that the software must sift through, and the differences in fonts wreak havoc on the recognition capabilities. The gothic and exotic type found in the hand-press period contrasts markedly with the computer-set texts of the late 20th century. It is critical that you anticipate type problems when dealing with texts that have such forms as long esses, sloping descenders, and ligatures. Taking sample scans of the source materials will help pinpoint some of these digitizing issues early on in the project.
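
OmniPage is a commercial, largely interactive package, so as a purely illustrative alternative the sketch below runs the same basic pipeline (page image in, plain text out) through the open-source Tesseract engine via the pytesseract wrapper. This is not the software discussed above; the filenames are hypothetical, and it assumes Tesseract, pytesseract and Pillow are installed.

    from PIL import Image
    import pytesseract

    # Hypothetical page image; in practice this would be the TIFF from the scanner.
    page = Image.open("page042_master.tif")

    # Converting to greyscale can help with yellowed or foxed paper, though,
    # as noted above, sample scans of the real source material are the only
    # reliable way to find out what pre-processing a project needs.
    grey = page.convert("L")

    # Recognise the page and keep the result as plain text for later encoding.
    text = pytesseract.image_to_string(grey, lang="eng")

    with open("page042.txt", "w", encoding="utf-8") as out:
        out.write(text)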

While the ability to export text in different word processing formats is quite useful if you are scanning a document to print it, or to replace an accidentally deleted file, there are a few issues that should take priority for the text creator. Assuming you are using a software program such as OmniPage, you should aim for a scan that retains some formatting but not a complete replication of the page elements. As will be explained in greater detail in Chapter 4, when text is saved with formatting that relates to a specific program (Word, WordPerfect, even RTF) it is infused with a level of hidden markup, a markup that explains to the software program what the layout of the page should look like. In terms of text creation, and the long-term preservation of the digital object, you want to be able to control this markup. If possible, scanning at a setting that will retain font and paragraph format is the best option. This will allow you to see the basic format of the text; I'll explain the reason for this in a moment. If you don't scan with this setting and opt for the choice that eliminates all formatting, the result will be text that includes nothing more than word spacing: there will be no accurate line breaks, no paragraph breaks, no page breaks, no font differentiation, and so on. Scanning at a mid-level of formatting will assist you if you have decided to use your own encoding. As you proofread the text you will be able to add the structural markup chosen for the project. Once this has been completed, the text can be saved in a text-only format. Therefore, not only will you have the digitized text saved in a way that eliminates program-added markup, but you will also have a basic level of user-dictated encoding.
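
Continuing the hypothetical filenames from the sketch above, the following shows the idea of adding a basic level of user-dictated structural markup to OCR output that has retained paragraph breaks, and then saving the result as plain text. The <p> tag here is just a placeholder for whatever scheme the project has chosen; encoding proper is the subject of Chapter 4.

    # Hypothetical post-processing of OCR output: paragraphs are assumed to be
    # separated by blank lines in page042.txt.
    with open("page042.txt", encoding="utf-8") as f:
        paragraphs = [p.strip() for p in f.read().split("\n\n") if p.strip()]

    # Wrap each paragraph in a placeholder structural tag.
    tagged = "\n\n".join(f"<p>{p}</p>" for p in paragraphs)

    # Save in a text-only format, free of program-added markup.
    with open("page042_tagged.txt", "w", encoding="utf-8") as out:
        out.write(tagged)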

3.5: Re-keying

Unfortunately for the text creator, there are still many situations where the documents or project preclude the use of OCR. If the text is of a poor or degraded quality, then it is quite possible that the time spent correcting the OCR mistakes will exceed that of simply typing in the text from scratch. The amount of information to be digitized also becomes an issue. Even if the document is of a relatively good quality, there might not be enough time to sit down with 560 volumes of texts (as with the Early American Fiction project) and process them through OCR. The general rule of thumb, and this varies from study to study, is that a best-case scenario would be three pages scanned per minute — this doesn't take into consideration the process of putting the document on the scanner, flipping pages, or the subsequent proofreading. If, when addressing these concerns, OCR is found incapable of handling the project digitization, the viable solution is re-keying the text.

Once you've made this decision, the next question to address is whether to handle the document in-house or out-source the work. Deciding to digitize the material in-house relies on having all the necessary hardware, software, staff, and time. There are a few issues that come into play with in-house digitization. The primary concern is the speed of re-keying. Most often the re-keying is done by the research assistants working on the project, or graduate students from the text creator's local department. The problem here is that paying an hourly rate to someone re-keying the text often proves more expensive than out-sourcing the material. Also, there is the concern that a single person typing in material tends to overlook keyboarding errors — and if the staff member is familiar with the source material, there is a tendency to correct automatically those things that seem incorrect. So while in-house digitization is an option, these concerns should be addressed from the outset.

The most popular choice for many digitization projects (Studies in Bibliography, the Early American Fiction project, Historical Collections for the National Digital Library and the Chadwyck-Healey databases, to name just a few) is to out-source the material to a professional keyboarding company. The fundamental benefit most often cited is the near-100% accuracy rate of the companies. One such company, Apex Data Services, Inc. (used by the University of Virginia Electronic Text Center), promises a conversion accuracy of 99.995%, along with 100% structural accuracy and reliable delivery schedules. Their ADEPT software allows the dual keyboarders to see a real-time comparison, permitting a single-entry verification cycle (http://www.apexinc.com/dcs/dcs_index.html). Also, by employing keyboarders who do not possess a subject speciality in the text being digitized (many, for that matter, do not speak the language being converted), they avoid the problem of keyboarders subconsciously modifying the text. Keyboarding companies are also able to introduce a base-level encoding scheme, established by the project creator, into the documents, thereby eliminating some of the more rudimentary tagging tasks.

Again, as with most steps in the text creation process, the answers to these questions will be project dependent. The decisions made for a project that plans to digitize a collection of works will be markedly different from those made by an academic who is creating an electronic edition. It reflects back, as always, to the importance of the document analysis stage. You must recognise what the requirements of the project will be, and what external influences (especially staff size, equipment availability, and project funding) will affect the decision-making process.

© The right of Alan Morrison, Michael Popham and Karen Wikander to be identified as the Authors of this Work has been asserted by them in accordance with the Copyright, Designs and Patents Act 1988. 

 