In recent years there has been considerable growth in the use of computers in arts research, but as yet there exists no introductory guide to the subject. This book is intended to fill that need. It is based on a series of lectures given in Oxford, mostly attended by those doing research in arts subjects. Chapters One and Two introduce many of the mysteries of the computer and may be skipped by those who are already familiar with the machine. The major areas of computer applications in literary research are covered in subsequent chapters, together with a general view of indexing, cataloguing and information retrieval for historical and bibliographical data. For those whose interest is awakened, the final chapter gives a brief guide on how to start a computer project in the academic world. The examples of computer printout were produced on the <2ICL>2 1906A computer at Oxford University Computing Service, whose help I acknowledge. Kind permission has been granted by Professor <2A.>2 Parpola to reproduce illustrations from <1Materials for the Study of the Indus Valley Script;>1 by Professor <2J.>2 Raben and Queen's College, Flushing, New York from the paper by Rubin in <1Computers and the Humanities>1 Volume 4; by Edinburgh University Press from papers by de Tollenaere, Berry-Rogghe, Ott, Waite, Kiss, Armstrong, Milroy and Piper in <1The Computer and Literary Studies;>1 by Edinburgh University Press and Minnesota University Press from papers by Berry-Rogghe and Wright in <1Computers in the Humanities;>1 by the University of Wales Press from papers by Fortier and McConnell, Crawford, Shibayev and Raben and Lieberman in <1The Computer in Literary and Linguistic Studies (Proceedings of the Third International Symposium);>1 by the University of Waterloo Press from the paper by Joyce in <1Computing in the Humanities;>1 by Mouton from papers by Cabaniss and Hirschmann in <1Computer Studies in the Humanities and Verbal Behavior>1 Volume 3; by Terence <2D.>2 Crawford and Glyn <2E.>2 Jones from the <1Bulletin of the Board of Celtic Studies,>1 Volume 3; by the American Philological Association from the paper by Packard in <1Transactions of the American Philological Association>1 Volume 104; by Harvest House from <1It's Greek to the Computer>1 by Andrew Q. Morton and Alban <2D.>2 Winspear; and by Cambridge University Press from <2R.L.>2

Widmann, "The Computer in Historical Collation: Use of the IBM 360/75 in Collating Multiple Editions of <1A Midsummer Night's Dream' in The Computer in Literary and Linguistic Research,>1 ed, <2R.A.>2 Wisbey. The page from <2T.H.>2 Howard-Hill (ed.), <1The Merchant of Venice>1 (Oxford Shakespeare Concordance), appears by permission of the Oxford University Press. Illustrations by Shaw, Chisholm, Lance and Slemons are reprinted from <1Computers and the Humanities>1 Volume 8 (1974) and 10 (1976) from the articles "Statistical Analysis of Dialectal Boundaries', "Phonological Patterning in German Verse' and "The Use of the Computer in Plotting the Geographical Distribution of Dialect Items', copyright Pergamon Press Ltd. I acknowledge the assistance of the Association for Literary and Linguistic Computing from whose <1Bulletin>1 several illustrations are reproduced. Finally, I would like to thank Richard Newnham who first suggested the idea of this book to me after attending my lectures, Paul Griffiths and Lou Burnard of Oxford University Computing Service who commented on sections of the manuscript, <2T.H.>2 Aston of the History of the University who provided some material and suggestions for Chapter Nine, Ewa Kus who assisted greatly with the typing of the manuscript, Marc Wentworth who provided the illustration of a Chinese character (Figure 2.7), and, above all, Alan Jones and my husband Martin who both read the entire manuscript and provided many valuable comments and suggestions. Oxford, 1979 <2S.H.>2

The earliest computers were designed for scientific work and were created in such a way that numerical operations were the easiest to perform. Soon computers began to be used more in commerce and industry, when it was realised that a machine could handle large files of data, such as ledgers and payrolls, much faster than the human brain. Computer applications in the humanities have some similarities to those in industry. Both involve large amounts of information compared with scientific or mathematical data. Commercial and humanities applications are characterised by relatively simple procedures such as searching, sorting and counting, whereas the scientific applications, though often using much less data, frequently involve calculations which are much more complicated. In the sciences the growth of computer usage has enabled much more research to be done, as the computer has taken over the time-consuming tasks of performing long and complex calculations. This is no less true in the humanities, where the computer is now used to perform purely mechanical operations for which cards or slips of paper were previously used. However, it must never be forgotten that the computer is only a machine. It can only do what it is told to do, nothing more. Computers rarely, if ever, go wrong and produce erroneous results through malfunction. That, of course, is not to say that erroneous results never come out of a computer. Frequently they do, but they are the fault of the person using the computer, who has given it incorrect instructions, not of the machine itself. In no way can the computer be blamed for errors; the designer of the computer system or the person using it is always the one at fault. Instructions are presented to the computer in the form of a computer <1program>1 (always spelt 'program'). The computer will then follow these instructions in logical sequence, performing them one by one. Therefore any problem which is to be solved by using a computer must be described in the form of a program. This involves breaking the problem down into very simple and completely unambiguous steps. At various stages in the problem a decision may need to be taken which depends on what is contained in the information to be processed. In such a case, the computer program will be so constructed that some parts of it will be executed if the

answer to the decision is 'yes' and others if the answer is 'no'. The program may contain many such decisions. All must be worked out beforehand in logical sequence, a process which is known as constructing an <1algorithm.>1 The same set of instructions may be executed many times over when a program is running. They are given only once in a program, together with an indication of how many times they are to be done. A computer program may be compared to a cooking recipe, which includes a series of steps which must be followed in the correct order, set out in minute detail, like:

Take some flour
Pour some flour on to the weighing scales until 4oz is registered
Put the flour in a bowl
Add a pinch of salt to the bowl
Take an egg
Break the egg into the bowl
Take some milk
Measure out the milk
If it is less than 1/2 pint add some more milk
Repeat the last instruction until there is 1/2 pint
Add the milk to the bowl
Mix thoroughly

The instructions are given in logical sequence. The repeated sequence of getting more milk or measuring the flour on the scales is exactly what might be found in a computer program. Let us now consider the instructions for counting all the times the word 'and' occurs in a text. They would be something like:

Take the next word
Is it 'and'?
If it is not 'and' go back to get the next word
If it is 'and', add 1 to a counter and go back to get the next word
When all words have been processed, print out the number of 'and's.

The computer program is written in a symbolic form using a computer programming language, a highly artificial 'language' which is totally unambiguous. The computer operates with a very basic series of instructions, such as comparing characters or adding and subtracting numbers. To add up two numbers may require three or four of these basic instructions. Obviously writing programs in the basic instructions will take a long time, as so many are required. Therefore a series of what are called <1high-level programming languages>1 were developed which require much less

human programming effort to write. One instruction in a high-level language may be equivalent to five or more in the machine's basic instruction set. Other computer programs are used to translate from these high-level languages into the machine's basic instructions. This process is called <1compilation>1 and the special programs to do this are called <1compilers.>1 The most frequently used computer languages are <2FORTRAN>2 and <2ALGOL>2 for scientific work, and <2COBOL>2 for commercial applications. Any of these three may be used in humanities work, or perhaps some of the more recently developed languages like <2PL/1, ALGOL68>2 or <2SNOBOL.>2 Of these <2SNOBOL>2 is particularly suitable for analysing information which is textual rather than numerical. Written in <2SNOBOL,>2 the program given above would become

<2        &TRIM = 1>2
<2        &ANCHOR = 1>2
<2        LETTERS = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'>2
<2        WORDPAT = BREAK(LETTERS) SPAN(LETTERS) . WORD>2
<2MORE    CARD = INPUT                  :F(PRINT)>2
<2AGAIN   CARD WORDPAT =                :F(MORE)>2
<2        IDENT(WORD,'AND')             :F(AGAIN)>2
<2        COUNT = COUNT + 1             :(AGAIN)>2
<2PRINT   OUTPUT = COUNT>2
<2END>2
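For readers who do not know <2SNOBOL,>2 the same algorithm can be sketched in a modern language such as Python. This is an illustrative rendering added here, not part of the original example; it assumes the text to be searched is held in a file called text.txt.

# A minimal sketch of the word-counting recipe given above: take each
# word in turn, compare it with 'and', and keep a running count.
import re

def count_word(filename, target="and"):
    count = 0
    with open(filename, encoding="utf-8") as f:
        for line in f:
            for word in re.findall(r"[A-Za-z]+", line):   # take the next word
                if word.lower() == target:                # is it 'and'?
                    count += 1                            # add 1 to a counter
    return count

print(count_word("text.txt"))   # print out the number of 'and's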

Learning to program a computer is not at all difficult. On the contrary, it can be an amusing intellectual exercise, though one that demands the discipline of accurate logic. If the logic of a program is wrong, the computer will execute the program exactly as it is and do what it was actually asked to do, not what the programmer intended to ask it to do. 'Garbage in, garbage out' is a saying in computing which is always true, for the machine has no power of thought. However, it is not always necessary to learn to program a computer in order to make use of it. In the early days of computing, it was soon realised that many people were writing what was essentially the same program and that effort was being duplicated. Instead they collaborated, and the result was a number of general-purpose computer programs called <1packages.>1 Each will perform a variety of related functions, and the user has only to present his data and a few simple instructions specifying which of the functions he would like to perform on the data. More and more packages have been written in recent years. For humanities applications, one such package, called <2COCOA,>2 makes word counts and concordances of texts. There are others like <2FAMULUS>2 and <2INFOL>2 which sort and catalogue the kind of information that is traditionally held on file cards, and others again which deal with some aspects of literary analysis. Each package has its own series of instructions, and it takes only a matter of an hour or so to learn to use them successfully. There are a number of different kinds of computers. Each computer manufacturer has its own series of machines, such as IBM 370 and <2ICL>2 2900. The size and power of the computers in each series or range of machines may vary considerably. Within one machine range, computers are mostly compatible with one another. That is, a program written for a small computer in the range will also work on a larger one. Programs can usually be moved from one range of computer to another if they are written in a high-level programming language, for they can be recompiled into the basic instruction set of the second machine. The most common compilers are those for <2FORTRAN>2 and <2COBOL.>2 Except for the small or mini computers, there are very few machines which do not support at least one of these. Some of the other programming languages are not always available on every kind of computer. For this reason, computer packages -- at least those used in the academic world -- are usually written in <2FORTRAN,>2 which is the most widely used academic language. The packages are not compiled every time they are used: that would be a waste of computer time. Instead they are kept in a form which is already compiled, known as <1binary.>1 Large computers, such as those found in university computer centres, can run several different programs at the same time and can also organise queues of programs waiting to run. To do this, the operation of the computer is controlled by a large program called an <1operating system.>1 This in effect organises the work flow through the machine, the aim being to utilise its resources efficiently. The faster the machine can be made to work, the more programs it can process. Some computer centres may employ two or three programmers whose job is just to look after the operating system and ensure that it is working smoothly. At the other end of the scale, there has been much growth recently in the use of <1mini-computers.>1 These are much smaller computers and are usually dedicated to performing one task only. Such machines, if they are serving only one purpose, do not need an operating system to control them, and are frequently programmed in their basic instructions, not in a high-level language. Smaller still are <1microprocessors,>1 computer memory which can be programmed to perform simple functions, rather like powerful programmable calculators. The popular image of a computer, as seen in films and on the television, is one of metal cabinets, tapes whizzing round and a very clinical atmosphere of polished floors and a few human operators looking somewhat out of place. A computer is not just one machine, but a collection of machines which are all linked together to perform the various functions necessary to computing. In fact, most of the equipment visible in a

computer room is concerned with getting information into and out of the computer rather than actually processing it. These are the various devices for computer <1input>1 and <1output>1, and they will be described in more detail in the next chapter. Information which is being processed is held in the computer's main memory, one or more of the metal cabinets. The processing itself is done in the arithmetic or logical unit and the results transferred back to the main memory before being printed. Collectively, the machinery in a computer room is called the <1hardware.>1 The programs which operate in the machinery are called the <1software.>1 Most people who use computers have no knowledge at all of the electronics which make them work. This is the domain of the engineers and hardware experts whose job it is to ensure that the machinery itself is functioning as it should. If it is miswired it will give wrong answers, but this is very unlikely to happen, as a number of special test programs are run after any changes to the hardware to ensure that it is working correctly. The ordinary user may be curious enough to discover that all information is held in the computer as patterns of binary digits which are called <1bits.>1 Conventionally they are represented by 1s and 0s, but they are really electronic pulses. The computer's memory is divided up into a large number of cells. The number of bits per cell is always constant on one machine, but it may vary from machine to machine. The cells are always called <1words>1 in computer terminology, which is confusing for those who want to analyse text with real words on the computer. Computer storage is measured in units of 1024 words, one unit of which is called a K. The actual amount of storage, of course, depends on the number of bits per word. For example a computer of 96K 24-bit words would be a medium-sized machine. Some ranges of computers, such as <2ICL>2 2900s or <2IBM>2 360s and 370s, use a different measurement of storage, which is called a <1byte.>1 One byte has eight bits and can be used to represent one character. There are 256 possible combinations of 1s and 0s in eight bits, and so these 'byte' machines can use up to 256 different characters. In practice this number is usually lower, because the machinery used to input material to the computer may not permit so many different characters. Computers whose storage is measured in words tend to use only six bits to represent characters. This means that their character set has only 64 different symbols, as there are 64 possible combinations of six bits. These would normally include upper case letters, digits and a few punctuation and mathematical symbols. By using special symbols and an ingenious program, it is possible to retain the distinction between upper and lower case letters on these machines. On the other hand, because their character set is larger, byte machines have upper and lower case letters as normal characters.
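The arithmetic behind these character sets is easy to check. The following lines of Python (an illustration added here, not part of the original text) show how many distinct symbols six-bit and eight-bit cells allow, and how one character is held as a pattern of bits:

# Number of distinct characters representable in a cell of n bits
for n in (6, 8):
    print(n, "bits allow", 2 ** n, "characters")    # 64 and 256

# A single character stored as a pattern of eight bits (one byte);
# ord() gives its numeric code and format() shows the binary digits.
for ch in "Ab?":
    print(ch, format(ord(ch), "08b"))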

Computer people are fond of jargon. They have names for every part of the computer, whether it is hardware or software. We have already seen that programming languages and packages have names which will not have been encountered elsewhere. Most of these are acronyms such as:

<2FORTRAN>2 for FORmula TRANslation
<2COBOL>2 for COmmon Business Oriented Language
<2SNOBOL>2 for StriNg Oriented symBOlic Language
<2COCOA>2 for word COunt and COncordance generation on Atlas

Others are even more naive. <2PL/1>2 is merely Programming Language 1, although it was by no means the first to be designed. The computer user must unfortunately acquire some of this jargon. Terms like 'input' and 'output' soon become commonplace to him. If he moves to work on another kind of computer he will find that a different set of terminology is in use for what are essentially the same features. The names of the programming languages and other packages do not change, but many other terms do. Computer people often call their computer 'the machine' and refer to it and its programs as 'the system'. 'The system' has come to be used also to refer to any other set of programs and is a common and ambiguous term in computing. 'The system is down' means that for some reason the computer is not working. It may be hardware problems -- that is, something wrong with the physical machinery -- or it may be software -- that is, a user or operator has given some instructions to the operating system which it cannot understand and cannot detect as an error. When the computer stops working in this way, it is frequently said to 'crash'. This does not mean that any machinery has literally crashed to the floor, merely that it has come to a sudden halt. It may take from a few minutes to a few hours to start it again, depending on the nature of the fault. The physical environment which is required for computers helps to maintain their aura of mystique. A large computer generates a lot of heat and would soon burn up if it was not kept in an air-conditioned atmosphere. It also requires a constant relative humidity, so that excess moisture does not damage the circuitry. Dust can obviously do a lot of harm to such delicate electronics, and therefore computer rooms must be very clean. Computer room floors always consist of a series of tiles, as many cables are required to connect the various pieces of machinery. These are usually laid under the floor, which needs to be easily accessible for maintenance. When several million pounds' worth of equipment is at stake,

it is essential to take such precautions. In contrast, minicomputers can often work without the need for the air-conditioned environment. In fact some are even used on North Sea oil rigs. Early computers had splendid names like <2ENIAC, EDSAC, ORION, MERCURY>2 and <2ATLAS.>2 The current fashion is for merely a series of numbers after the manufacturer's name or initials. The early computers were much less powerful than our present-day ones, but they were physically much bigger in relation to their power. Modern technology has reduced the physical size of computer storage to a fraction of what it was in the days even of <2ATLAS,>2 which was in use until 1973. To conclude our introduction to computers, we can briefly survey the beginnings of computing in the humanities. When Father Roberto Busa began to make plans for the first computer-aided project in the humanities in 1949, the possibilities must have seemed very different from now. Busa set out to compile a complete word index and concordance to the works of Thomas Aquinas, some eight million words in all. He also planned to include material from related authors, bringing the total number of words to over ten million. Initially he had thought of using file cards for what was to be his life's work. In 1951 he began to direct the transfer of texts to punched cards, which were then virtually the only means of computer input. In 1973 the first volume of his massive work appeared. Further volumes of the series have appeared in quick succession, and the 60-odd volumes are now almost complete. It may seem a long time before publication, but it would not have been possible to attempt such a mammoth undertaking without substantial mechanical help. Busa's project illustrates the oldest and still the most widely used and reliable computer application in literary research, the simple mechanical function of compiling a word index or concordance of a text. Such concordances were of course made many years before computers were invented. Many researchers devoted their lives to producing what is essentially a tool for others to use. Busa led the way, but he was soon followed by a number of scholars, who also investigated the possibility of using a computer to compile indexes of texts. One of the earliest workers in Britain was Professor <2R.A.>2 Wisbey who, when he was at Cambridge, produced word indexes to a number of Early Middle High German texts in the early 1960s. Wisbey has continued his work since moving to King's College, London, and although his early work was published in book form his latest computer concordance is published on microfiche. Across the Atlantic a number of scholars in the field of English studies began to make concordances which were published in a series by Cornell University. In Holland, Félicien de Tollenaere also made plans in the early 1960s for a word index to the Gothic Bible, which was published in 1976. In

France, Bernard Quemada and a group of others started work on the Trésor de la Langue Française, an enormous undertaking which would not be possible without the help of a computer. There was also much early interest in word indexing in Pisa. By the mid-1960s, the lone pioneer, Busa, had been followed by a number of other scholars, who found that the computer was not just a tool for the scientist but could provide much of the mechanical assistance they too needed in their work. Other applications of computers were soon realised and exploited. Cataloguing and indexing are again mechanical functions. Once it has been decided into which category an item should be placed, the sorting and merging of files is purely mechanical. One of the earliest computer applications was the recataloguing of the pre-1920 material in the Bodleian Library, Oxford. The introduction of modern methods in archaeology has included the use of computers for cataloguing and classification of artefacts, as well as statistical calculations of frequencies of objects found. Some lawyers were early computer users too. As early as 1962, they realised the possibility of using a computer to search large files of legal material to find the documents relevant to a specific case. The examination of features of an author's style, provided that those features can be defined unambiguously, was seen to be a further application. The Swede Ellegård, in his study of the <1Junius Letters,>1 was the first to use a computer for stylistic analysis; but in 1960, when he was working, machines could not cope with the large quantity of text he wanted to analyse and he used the computer only to do the calculations. Mosteller and Wallace soon followed his lead in their study of the <1Federalist Papers>1 and they used a computer to count the words as well. Meanwhile in Britain a Scottish clergyman named Andrew Morton had publicised the use of computers in authorship studies in several books and articles on Paul's epistles, and had attracted the attention of the newspapers. Everyone knew, or thought that they knew, that the computer could be used to solve problems of disputed authorship. In many authorship studies we shall see that the computer can be used to find much more evidence than could be gathered manually. Once the programs have been written, they can be run many times on different material for comparative purposes. From these early beginnings there has been a large growth in the use of computers in the humanities in the last ten years. Many more applications have now been developed in the fields of textual criticism and poetic analysis, as well as concordancing, cataloguing and stylistic analysis. Some are being performed by standard packages, such as <2COCOA,>2 which are simple to use and readily available in university computer centres. In other cases the researchers have learnt to program the computer themselves. This has allowed them much more flexibility and in addition has involved them in new disciplines which many have found stimulating. In all cases the

results are much more comprehensive than could ever have been obtained by hand.

Before any information can be analysed by computer, it must first be transcribed into a form which the computer can read, and this almost invariably means that it has to be typed out on one of the computer's input devices. These devices, like the computer itself, have only a limited character set. As we have seen, on some computers, and particularly on older machines, the character set consists of only 64 graphic symbols. This is the case on <2ICL>2 1900 and <2CDC>2 6600/7600 computers. IBM 360/370 and <2ICL>2 2900 computers are byte machines and allow a total of 256 different characters. But however many characters any particular computer has, the graphic symbols it uses to read and print these characters are not ideally suited to representing literary material. Special computer input devices do exist for representing literary material and for languages which are not written in the Roman alphabet, but these are not widely available and so we must consider the situation where only standard equipment can be used. Input is the most time-consuming part of handling a text by computer. Great care must be taken to ensure that the text is adequately represented so that the effort involved in preparing a text for computer analysis is not wasted because insufficient information has been included. It is highly unlikely that a computer whose character set includes French accents, the German umlaut or the Spanish tilde will be available; but if any of these characters appear in the text, they should also appear in the computer transcription of the text. If they are omitted, the computer will be unable to distinguish words which should be distinguished. It would, for example, treat the French preposition <1à>1 as identical with the verb <1a,>1 part of <1avoir.>1 Therefore some of the mathematical symbols must be used to denote such diacritical marks. In French an asterisk * could perhaps be used for an acute accent and a number sign # for a grave accent. A circumflex could be represented by an <1at>1 symbol @ and a cedilla by a \\ or $ sign, provided these symbols exist on the computer to be used. The French word <1à>1 would then appear to the computer as <1A#.>1 The accent may, if desired, be typed before the letter, giving <1#A.>1 It does not matter which of these is chosen, provided the same conventions are used all the way through the text. Once <1A#>1 has

been chosen to represent <1à,>1 <1A#>1 must always be given, never <1#A.>1 Special symbols must also be used to denote a change from upper to lower case if the computer only possesses a character set of 64 symbols. Preserving the distinction between upper and lower case has been considered unnecessary by some people, and indeed many texts have been prepared for computer use without it; but it may be essential in some languages, for example, to distinguish proper names. It is often advisable to use one symbol to indicate a capital letter at the start of a sentence and another to indicate a capital at the start of a proper name. Both symbols would be given when a sentence begins with a proper name. If $ was used for a proper name and % for a capital at the start of a sentence, 'Jack and Jill went up the hill' would be coded for a 64-character-set computer as follows:

<2%$JACK AND $JILL WENT UP THE HILL>2

and 'The cat sat on the mat' would appear as

<2%THE CAT SAT ON THE MAT>2

where letters which have no preceding symbol represent lower case. On a computer with 256 characters, which therefore distinguishes lower case letters, it may still be advisable to identify proper names with some special symbol. The computer could then distinguish the town Reading from the verbal form 'reading' in the following sentence

Reading is boring.

Four lines of poetry coded for an upper-case-only computer would appear as

<2%YES. %I REMEMBER $ADLESTROP \\>2
<2%THE NAME, BECAUSE ONE AFTERNOON>2
<2%OF HEAT THE EXPRESS-TRAIN DREW UP THERE>2
<2%UNWONTEDLY. %IT WAS LATE $JUNE.>2

Here we are using % to represent a capital at the beginning of a line or sentence, $ at the beginning of a proper name and \\ for a long dash, another symbol which the computer does not have. Coding texts which are written in the Roman alphabet therefore involves upper case letters, proper names, diacritics and unusual punctuation; but it is not too difficult to deal with these.
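To make the working of such a coding scheme concrete, here is a short illustrative sketch in Python (added as an example here, not taken from the original text) which turns text coded with % and $ back into ordinary upper and lower case:

# Decode the 64-character coding described above: '%' marks a capital at
# the start of a sentence or line, '$' a capital beginning a proper name;
# all other letters are treated as lower case.
def decode(coded):
    out = []
    capitalise_next = False
    for ch in coded:
        if ch in "%$":              # either flag capitalises the next letter
            capitalise_next = True
        elif ch.isalpha():
            out.append(ch.upper() if capitalise_next else ch.lower())
            capitalise_next = False
        else:
            out.append(ch)
    return "".join(out)

print(decode("%$JACK AND $JILL WENT UP THE HILL"))
# prints: Jack and Jill went up the hill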

More problems arise with languages which are not written in the Roman alphabet. They must first be transliterated into the computer's character set. In Greek, for example, the letter α is usually denoted by <2A,>2 β by <2B,>2 γ by <2G>2 etc. The Greek letter theta is conventionally transliterated as 'th' as that is the sound it represents, but in transcribing for computer input, it would be more sensible to choose a single symbol to represent each Greek letter. Q could be used for theta, Y for psi, F for phi, etc. A Greek word such as <1θεός>1 would then appear in computer Greek as <2QEOS.>2 Many Greek texts prepared for computer analysis have included symbols to mark accents, breathings and capitals, in fact all the possible diacritical marks. This may well lead to a text which appears cluttered with non-alphabetical characters, but no information will have been discarded. It is a very simple operation to ask the computer to delete all the instances of a particular character, but it takes much longer to insert them afterwards. The first four lines of the <1Iliad>1 are given below with the coding for every diacritic

<2$M5HNIN 24AEIDE, QE4A, $PHLH7I2ADEW 2$ACIL5HOS>2
<2O2ULOM4ENHN, 36H MUR4I' 2$ACAIO5IS 24ALGE' 24EQHKE,>2
<2POLL6AS D' 2IFQ4IMOUS YUC6AS 24$A7IDI PRO47IAYEN>2
<23HR4WWN, A2UTO6US D6E 3EL4WRIA TE5UCE K4UNESSIN>2

With many of the diacritics removed, the text becomes clearer to read

<2$MHNIN AEIDE, QEA, $PHLHIADEW $ACILHOS>2
<2OULOMENHN, 3H MURI' $ACAIOIS ALGE' EQHKE,>2
<2POLLAS D' IFQIMOUS YUCAS $AIDI PROIAYEN>2
<23HRWWN, AUTOUS DE 3ELWRIA TEUCE KUNESSIN>2

but in this form some of the information is lost. It takes a classical scholar only a matter of a few minutes to learn the computer transcription of Greek.
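The principle of one computer character per Greek letter can be illustrated with a small Python sketch (an added example; the table below covers only the letters mentioned here and is not the full scheme of any particular project):

import unicodedata

# One Roman character per Greek letter: Q for theta, Y for psi, F for phi, etc.
greek_to_roman = {
    "α": "A", "β": "B", "γ": "G", "δ": "D", "ε": "E",
    "θ": "Q", "ο": "O", "σ": "S", "ς": "S", "φ": "F", "ψ": "Y",
}

def transliterate(word):
    # Strip the accents first (a fuller scheme would instead code each
    # accent and breathing with its own extra symbol, as described above).
    stripped = "".join(c for c in unicodedata.normalize("NFD", word)
                       if not unicodedata.combining(c))
    return "".join(greek_to_roman.get(ch, "?") for ch in stripped)

print(transliterate("θεός"))   # prints QEOS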

The Russian alphabet has thirty-three letters in the modern script and is really too long for only one computer character to be used for each Russian letter. Characters must be reserved for the ten digits and all possible punctuation symbols and there are then very few characters left to identify such features as capital letters and foreign words. It might just be possible to fit them in, but it is quite feasible to use more than one computer character to represent a single character within the text, provided the computer is programmed to recognise this. The Russian letter ц could then be transliterated as <2TS>2 and я as <2YA.>2 Semitic languages such as Arabic and Hebrew are written from right to left, but in transliteration they are typed into the computer from left to right. There would be no sense in typing the text backwards, as special programming would then be necessary to read it backwards. The vocalisation is normally omitted from Hebrew and Arabic texts, and it can usually be omitted from the computer text. Neither of these languages has capital letters as we know them in European scripts. Hebrew is therefore a simple case because the alphabet is short, only twenty-two letters. The five letters which have a different form at the end of a word could be coded in the same way as the normal form, as they have no semantic significance. If ever it were required to print out the material in Hebrew script, the computer could be programmed to recognise that, for example, a letter N at the end of a word is always a final form. The Arabic alphabet and its derivatives have more letters, and each of these has four different forms, depending on where it occurs within a word, whether it is the first letter, last letter, one in the middle or one standing on its own. The total number of characters then required is somewhere in the region of 130 plus numerals. Fortunately there is no punctuation in classical Arabic and it is of only doubtful validity in the modern language, and so it can be ignored for computer coding. Though for some computers it would be theoretically possible to code all these letter forms as different computer symbols, the normal method adopted has been to use the same computer character to denote each of the four forms of the same letter. The rules for writing Arabic script are so rigid that the computer can be programmed to transcribe it back into the original script if necessary. Most alphabetic scripts can be dealt with in the ways that have been described. Devanagari has some forty-eight symbols, but a two-character code has been devised without much difficulty. Chinese and Japanese present more problems. One possibility for Japanese is to use the syllabic form of katakana, but this loses the visual representation of the text completely. In coding Chinese, the most sensible procedure is to allocate a number to each Chinese character and record the text digitally. The number could perhaps be accompanied by a transliteration. More than one computer project on Chinese has used the Standard Chinese Telegraph Codes. When preparing a text for the computer, it is always important to include as much information as possible. Even if it is not all required at once, it may be needed in the future, by other people as well as by the one who originally prepared the text. A small pilot study should always be carried out first. Even a few lines of text prepared in the chosen format should suffice to experiment with particular programs before a large batch is prepared. Foreign words or quotations should be considered with care, particularly if a word index or concordance is to be made. For example, the word <1his>1 in the Latin phrase <1in his moribus,>1 appearing as a quotation in an English text, should not be listed in a word index under the English word <1his.>1 This would happen unless the computer were told by some identifying symbol that the Latin was a quotation. Similarly if there are lacunae or other

omissions in a text, they should be indicated in some way. Otherwise, for example, in Latin the letters 'in' followed by a lacuna would be listed with all the occurrences of <1in,>1 whereas they may be the beginning of a totally different word. It may also be necessary to include information about changes of type faces, for example in a study of printing. A text must contain reference identifiers in some form so that any features which the computer may find can be located in the original text. This may be done either by giving the line number and other identifiers on each line of text or by using special symbols to enclose reference identifiers within the text. Each identifier would then refer to all the text which follows it until another identifier is encountered. When the latter method is used, the computer can also be programmed to keep track of the line numbers in the text. Transcribing the material into the computer's character set is only the first part of preparing a text in computer-readable form. The transcribed or transliterated text must then be typed on to a suitable computer input medium. The traditional input media of punched cards and paper tape are not now as universally used as they once were, but we should begin by describing these. A punched card is simply a piece of card 3 1/4 inches high by 7 3/8 inches wide, which is divided into eighty columns vertically and twelve rows horizontally. Special keyboard devices, called <1card punches,>1 are used to make patterns of rectangular holes in the columns. Each column is used to represent one character and normally holes are punched in up to three of the rows for each character. The example in Figure 2.1 shows the letters of the alphabet and numerals represented on a punched card. Different kinds of computers use different patterns of holes to represent each character. These are known as card codes and have names like <2BCD>2 and <2EBCDIC.>2 The letters and numerals normally have the same card code on all computers, but some of the punctuation and mathematical symbols ('special characters') may have different codes. If information is to be transferred from one computer to another on punched cards, it is advisable to enquire about card codes before starting. Most computer installations have utility programs to translate from 'foreign' card codes. Only upper case letters can be represented with any ease on punched cards. The card-punching machines usually print the characters which the holes represent along the top of each card, as you can see from the reproduction of a card in Figure 2.1. This is only for visual recognition: the input device only reads the holes. It is possible to make a copy of a deck of cards on a machine called a card <1reproducer.>1 Only the holes will be copied. If the printing is also required, the cards must then be fed through another machine called an <1interpreter>1 which reads the holes and interprets them into characters which it prints along the top.
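As an added illustration (not from the original text) of how holes in up to three rows can encode a character, the following Python sketch follows the common IBM convention of a zone punch (row 12, 11 or 0) combined with a digit punch; the details varied between card codes such as BCD and EBCDIC, so this is indicative only:

# Which rows are punched for a character, under the common IBM convention:
# digits use a single punch; letters use a zone punch plus a digit punch.
def punches(ch):
    ch = ch.upper()
    if ch.isdigit():
        return [int(ch)]
    if "A" <= ch <= "I":
        return [12, "ABCDEFGHI".index(ch) + 1]
    if "J" <= ch <= "R":
        return [11, "JKLMNOPQR".index(ch) + 1]
    if "S" <= ch <= "Z":
        return [0, "STUVWXYZ".index(ch) + 2]
    return []          # blanks and special characters are ignored here

for ch in "A5Z":
    print(ch, punches(ch))   # A -> [12, 1], 5 -> [5], Z -> [0, 9]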

Accuracy is of vital importance when typing any material for the computer. One way of checking punched cards is for a second typist to retype them using a card <1verifier.>1 Cards which have already been punched are fed into the verifier, which looks like another card punch. The information on each card is then retyped over the card, and if identical characters are typed a small incision is made into the side of the card. Those cards which have no incision can be checked manually and any discrepancy noted and corrected if necessary. This method is quite effective when two different typists are used, but if the same mistake is made twice it goes unnoticed. The chief disadvantage of punched cards is that they occupy a lot of room for storage. One card holds a maximum of eighty characters; but it is usually much less, as information is rarely typed to the end of a card. Unused columns on the right are read as blanks, and a program can be instructed to ignore them. One box of 2000 cards is 14 inches long and it would take over forty boxes to hold the complete works of Shakespeare. Error correction is however fairly simple, as usually only one card needs to be repunched and the offending card can be thrown away. Paper tape has two advantages over punched cards: it occupies less room for storage, and it can represent lower case letters as well as upper case. Most paper tape now in use is 1 inch wide. It has a maximum of eight possible holes across the tape and is therefore known as 8-hole tape or 8-track tape. A row of tiny holes between the fifth and sixth track simply holds the tape in the tape-punching or tape-reading machine. These are known as the sprocket holes. The tape-punching machine is a special kind of typewriter, sometimes called a <1flexowriter.>1 It can be used either to punch new tape or to read what is already on a tape, which it does by interpreting the pattern of holes and typing the corresponding characters on a sheet or roll of paper inserted in it. Figure 2.2 shows a piece of paper tape containing the capital letters and numerals. Only seven of the tracks are actually used to represent each character. The eighth is known as the <1parity track,>1 which is a kind of checking device. The piece of tape shown here is even parity, i.e. there is always an even number of holes for each character. A hole is punched in the eighth track when necessary to make an even number. If the tape reader is working in even parity and encounters a character with an odd number of holes, it has detected an error and stops.
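The parity rule is easy to express in code. This short Python sketch (an added illustration; the seven-bit pattern shown is just an example, not a particular tape code) adds an even-parity bit to a character and checks it again on reading:

def add_parity(bits7):
    # bits7 is a string of seven '0'/'1' values, one per data track
    return bits7 + ("1" if bits7.count("1") % 2 else "0")

def check(bits8):
    # an even number of 1s means the character read back correctly
    return bits8.count("1") % 2 == 0

row = add_parity("1000001")
print(row, check(row))   # prints: 10000010 True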

It is normally necessary to leave a foot or so of blank tape with only the sprocket holes punched at either end of a paper tape so that it can easily be fed into a reader. This blank tape is called <1runout,>1 and it is always advisable to be generous with it. As with cards, there are many different paper tape codes, each using a different pattern of holes to represent the characters. There are also other widths of paper tape, usually either 5- or 7-track. The chief disadvantage of paper tape is error correction. Using a flexowriter, the tape must be copied on to a new tape. When an error is reached, the flexowriter is stopped, the correction typed on to the new tape and the old tape repositioned at the start of the next correct section. Alternatively sections of corrected tape can be spliced or glued into the original. These tasks are very laborious and frequently lead to more errors, but they must be undertaken in order to obtain a clean error-free tape. It is for this reason that paper tape is now little used in this manner. If it is used at all, the tape complete with errors is fed into the computer and the errors are corrected, by use of a terminal, as will be described later. Cards and paper tape are read into the computer via peripheral devices, known as <1card readers>1 and <1tape readers.>1 Card readers can process up to one thousand cards per minute, and the speed of tape readers is of the order of one thousand characters per second. The computer stores up the information for later use, and after being read the cards or tape pile up in a hopper or bin and can be retrieved for further use. The most common means of obtaining information from a computer is by printing it on what is called a <1lineprinter,>1 so called because it prints one whole line at a time. An example of lineprinter output is shown in Figure 2.3. The reason why this printing appears somewhat ugly is that the machine normally prints at well over a thousand lines per minute. The type face varies from one computer to another, but all have rather inelegant characters. Most lineprinters have only a limited character set, frequently only upper case letters, numerals and punctuation symbols. The size of the paper used for printing varies from one installation to another. The width is usually 120, 132 or 160 characters across the page and there are ten characters to an inch. At the normal spacing of six lines per inch, 66 lines can be accommodated on the most frequently used size of paper. At eight lines per inch the printing becomes very squashed, as seen in Figure 2.4; but some centres now print at this spacing to save paper. The holes down the side of the paper are there merely to hold it in the lineprinter. Computer paper is always folded in a concertina fashion and the folds are perforated so that each separate document or piece of output

can be torn off. Many installations use paper which is striped across the page. The stripes are there to assist the reader. When, for example, rows of numbers are printed in columns across the page, the stripes guide the reader to see which numbers are in the same row. They are not necessary for the computer, and plain paper can be used just as well. Gas bills,

electricity bills and bank statements are now printed by computer, and special pre-printed stationery is used on the lineprinter. If several copies of some printout are required, special stationery can be used which is carbon-backed to make two, three or even four copies at one printing. Some lineprinters have upper and lower case letters and can give a more pleasing appearance to the printout than an upper-case-only machine; but even so the quality of the printout they produce is barely suitable for publication. The characters which they print will still not include accents and diacritics which are necessary to reproduce some textual material. Neither will they cater for any change of type fount -- into bold or italic, perhaps. It is possible to have special characters made for a lineprinter so that it will print accents, but these are very expensive. It will be seen later that there are other, much better, means of producing output direct from a computer when a final copy suitable for publication is required, but the lineprinter is the normal method of computer output and will certainly need to be used to obtain results at an intermediate stage. Output can also be obtained from a computer on punched cards or paper tape. This could perhaps be necessary if information is to be transferred to another computer, but it would only be feasible for small amounts of data. There are much better ways of transferring larger quantities of information. <1Magnetic tape>1 and <1disc>1 provide the means for storing large amounts of information inside a computer and, in the case of magnetic tape, for transferring information from one computer to another. Suppose that a researcher had laboriously prepared the whole of the plays of Shakespeare on to punched cards and then wanted to perform several different analyses on this material. It would be nonsensical to attempt to feed all those boxes of cards into the computer for each analysis. The obvious solution would be to feed them in once and then store the text on some magnetic medium, such as tape or disc. The computer can then access the information on the tape or disc much faster than it can from cards, and the operator is relieved of the cumbersome chore of handling cards. Magnetic tape looks very much like tape-recorder tape and behaves in a similar manner. It is normally half-an-inch wide and is wound on reels which hold up to 2400 feet of tape. A computer installation may possess many thousands of tapes, but only a few would be in use at any one time. When they are being used, the tapes are mounted on a tape deck which has special heads, which either read the information recorded on the tape, i.e. transfer it to the core store or memory of the computer, or write new information on to the tape from the store. When new information is written to a tape, it overwrites and consequently erases what was there before, just as a tape-recorder does. Therefore care must be taken not to erase or overwrite information which is required. One way of ensuring that this does not happen is to use what is called a <1write permit ring>1, a plastic ring

which can be inserted in the back of a tape reel. Information can only be written to the tape when this ring is in place. The information is recorded on the tape in a series of rows of magnetised spots across the tape. Each magnetised spot represents a bit and tapes can have seven bits across the tape or more frequently nine bits. The seven bits, or 7-track tape as it is called, can be used to represent one 6-bit character per row, with the extra spot being used to check parity, just as in the case of paper tape. In the case of 9-track tape, eight spots or bits are used to represent data (e.g. one byte) and the ninth track is a parity track. Information is transferred to and from the tape in sections known as blocks. Between each block there is a gap of 1/2 or 3/4 inch. It is therefore more economical to write large-size blocks of, say, 4000 or 8000 characters rather than to write blocks of individual 80-character <1card images,>1 as they are called. The amount of information which can be stored on a magnetic tape depends on the block size used when the material is written. It also depends on the density at which the information is recorded. On 7-track tape the information can be recorded at 200, 556 or 800 rows per inch, commonly known as bits per inch, abbreviated to <1bpi.>1 On 9-track tape the recording density is 800 or 1600 bpi, or it can now be even higher. Different tape decks are usually required to read 7- and 9-track tapes, and some tape decks may not be able to read certain densities. It is therefore essential to obtain some technical information about magnetic tape formats before attempting to transfer information from one computer to another on magnetic tape. The technical details may be difficult for the new user to understand; but before attempting to copy the material on to the tape, he should always remember to ask the people at the computer centre which is going to read the information what format of magnetic tape it can read. If a user turns up at the computer centre carrying a magnetic tape which has no technical specifications -- these are normally written on a sticky label on the outside of the reel -- the centre will not be very helpful until more information is supplied about the format of the tape. Magnetic tapes are easily damaged. They must be kept in the air-conditioned atmosphere of the computer room and will deteriorate if left in the hot atmosphere of a study or office. However, they will withstand being in the post for some time, for example crossing the Atlantic by surface mail. If a tape becomes scratched, the tape deck will be unable to read the section which is scratched and in many cases will stop dead so that the information which comes after the scratch is also inaccessible. Tapes can also deteriorate over a long period of time. It is therefore essential for at least two copies to be kept of material which is stored on magnetic tape. If one becomes damaged, or 'fails', as the technical term is, the information is preserved on the copy and can then be restored to another tape. It is

advisable to keep at least three copies of material, such as a text archive, which has taken a long time to prepare. Most computer centres have a utility program which makes copies of magnetic tapes. It is sensible to run this program every time more data has been added to a tape. Tapes are normally identified by a serial number or name. The computer centre will allocate some tapes to a user, who will know their numbers or names, so that his program may request the correct tape; but he will probably never see his tapes. The computer operators will be instructed by the computer to load the tape on to a tape deck when a program requests it. They will then take the tape out of the rack, mount it on the appropriate deck and return it to the rack when the program has finished. Loading and unloading tapes are in fact the chief occupations of operators in modern computer centres. If magnetic tape can be compared with tape-recorder tape, magnetic disc could perhaps be said to resemble gramophone records. A magnetic disc is really a series of discs stacked one on top of another with a gap of about an inch between each. They are all mounted together on a vertical spindle. Exchangeable discs can be removed from the spindle or replaced as required and they are sometimes called disc packs. Typical exchangeable discs hold thirty million, sixty million, one hundred million or two hundred million characters. The speed of access of information from magnetic disc is much faster than that from a tape. Information on a magnetic tape must be read and written serially; i.e. to read from or write to any part of the tape, all the preceding information on the tape must first be read. It takes up to five minutes for a reel of tape to be wound from one end to the other. On the other hand, it is quite feasible to access information randomly on disc. Each disc spins rapidly and a read head is positioned wherever it is required. Normally a disc would contain several or many sets of information, and these are called <1files>1. Each file may have been written serially, as if it were a tape; or it could have been set up in such a way that any part of it could be accessed without all the preceding information having to be read. Discs have many advantages over magnetic tape, but unfortunately they are not cheap. The computer centre may have what appears to be a large number of discs and disc drives, but the user will find that they are in constant use, as all the files which are most frequently used by the computer itself are kept on disc. There will not be very much space left for the user to store large quantities of text. He may well be allocated some disc space for small files of data; but once this gets large or full, he will be requested to use magnetic tapes instead. Fixed discs and drums may also be used as storage devices. The average access times for these are even faster than for exchangeable disc, but there is usually much less space available, particularly on drums. These are not

usually available to users at all, but are employed to store files which are constantly in use by the operating system. Another means of transferring information into and out of computers is by <1teletype>1 terminal. These are being used more and more frequently and are the basis of <1on-line>1 or <1interactive>1 working, where the user communicates directly with the computer from his own keyboard and it responds to him immediately, thus creating a dialogue between the two. A teletype resembles a typewriter and holds paper which allows up to eighty characters across the page. Some teletypes have upper and lower case letters and look like electric typewriters; others have a more restricted character set, like a card punch, and operate only in upper case letters. If necessary, teletypes can be equipped with a special keyboard and character set for typing foreign languages, thus solving some of the problems of transliteration or transcription of input. Although the teletype may have a non-standard character set, what is transmitted to the computer consists of only a pattern of bits. If this pattern of bits is sent to another device for output, such as the lineprinter, it may well appear as a completely different graphic symbol. The teletype is a somewhat noisy device, though it does have the advantage of providing the user with a piece of paper recording his dialogue with the computer. Another kind of computer terminal is a <1visual display unit,>1 or <2VDU.>2 A <2VDU>2 is a small video screen with a keyboard attached. The characters which the user or the computer types appear on the screen, which will hold about twenty lines of text. Some <2VDUs>2 have a tiny computer memory inside them which allows corrections to be made to a screenful of data before it is transmitted to the computer. Others simply roll the text up the screen as more information is typed at the bottom, so that what is typed gradually disappears from view, though the computer will not have forgotten it. The <2VDU>2 is quiet -- uncannily so in some respects. Like the teletype it can be fitted with a special keyboard. The shapes of the characters it displays are stored on a small electronic chip, and a chip containing any one of many alphabets could be used to display Russian or Greek letters on the screen. Combined with an appropriate keyboard, this makes input of non-Roman alphabets much simpler. Teletype terminals or <2VDUs>2 can be situated many miles away from the computer, being linked to it via a telephone line. They can be used in the home if there is another piece of equipment, called a <1modem,>1 which links the telephone to the terminal. The user dials the computer's telephone number and when it replies -- automatically, of course -- he puts the telephone handset into the modem box, presses a button and is ready to start a computer session. High-speed telephone lines are being used more and more to link one computer to another. A card reader and lineprinter, say in Cambridge, could be used as what is called a <1remote job entry>1 station (<2RJE>2

terminal) to a computer in Edinburgh. The cards would be fed through a reader in Cambridge, the program run in Edinburgh and the results printed back in Cambridge. In the <2UK,>2 at least, the university computer centres are linked up in this way in order to pool their resources and give extra benefit to their users. Such a linking of computers is called a <1network.>1 Using a teletype or <2VDU>2 terminal is the best way of making corrections to information which is already in computer-readable form, usually stored as files on disc. The computer can be instructed to make changes to files by using a special program, called the <1editor.>1 The places where the corrections are to be made are identified within a file either by the line number or by locating a unique sequence of characters. The actual instructions to be given to the editor vary from computer to computer, but all work in a similar way. Instructions like

<2T#32>2
<2R/AND/BUT/>2

would be interpreted as

move to line number 32
replace the first occurrence of the characters <2AND>2 in that line by <2BUT>2

The slashes (/) are called delimiters and serve to separate the error from the correction in the editing instruction. In this example the editor has been told to find the place to be corrected by line number. If the line number was not known, it could be requested to find the place with an instruction such as

<2TC/CAT AND/>2

which would be interpreted as 'look for the first line containing <2CAT AND>2'. Enough text must be given to identify the line uniquely. There may be many <2ANDs>2 in a file, but the particular one we want is preceded by <2CAT.>2 There are special instructions for inserting and deleting complete lines, and it is usually also possible to specify an editing instruction which is to be repeated until the end of the file or some other condition is reached. The computer makes a new version of the file with the corrections incorporated into it. The old version can then be erased by supplying the relevant instruction to the operating system. Some computer editing programs work by making one pass through the file. If the user has passed the place where a correction is to be made, that particular dialogue with the editor must be terminated and another begun which puts him back to the beginning of the file. This can be done in the same session.
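The effect of such editing instructions can be mimicked in a few lines of Python (an added sketch, not the editor itself), which moves to a numbered line and replaces the first occurrence of one string by another:

def edit(lines, line_no, old, new):
    # line_no is counted from 1, as in the T#32 instruction above
    lines[line_no - 1] = lines[line_no - 1].replace(old, new, 1)
    return lines

text = ["THE DOG SAT ON THE MAT", "THE CAT AND THE DOG"]
print(edit(text, 2, "AND", "BUT"))
# prints: ['THE DOG SAT ON THE MAT', 'THE CAT BUT THE DOG']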

the user to jump about the file making corrections at random. The correction of data is much easier using an editor than repunching cards or tape. Many editors display the corrected line on the terminal, so that the user can ensure that it is correct before sending it to the computer. If the information was originally typed on punched cards or paper tape, it can be fed into files in stages and each stage corrected before being copied on to magnetic tape for long-term storage. People frequently ask whether there are machines which can read material from a printed book and transfer it to magnetic tape or disc. This would eliminate the most laborious part of any computer-aided analysis of text. The answer is that there are machines which can read, but they cannot read directly from a book. They can, however, read typescript. This is because on a typewriter each letter occupies the same amount of space, whereas in a book the spacing is variable because of the right-hand justification. Machines which can read are called <1optical character readers>1 <2(OCR).>2 They can only read certain type founts and they are very expensive to buy. The information must first be typed on special paper using an electric typewriter fitted with a special ribbon and <2OCR>2 golf ball head. The typewriter often needs adjustment for <2OCR>2 work, which renders it unsatisfactory for normal typing. Two type faces which are commonly used for <2OCR>2 typing are <2OCR A>2 and <2OCR B,>2 as shown in Figures 2.5 and 2.6. The former is not very pleasant for the human eye and is frequently typed only in upper case. The latter is much easier on the eye, but has been found

to be prone to error in reading, particularly in the confusion of lower case <1h>1 and <1b>1. Some other manufacturers of <2OCR>2 equipment have designed their own type faces to suit their reader. One in particular has little strokes or bars underneath the characters. These are read by the machine and incorporate a parity check in the same way as magnetic or paper tape to test for errors or misreads. An experiment with this machine found that the total height of each character, including the strokes underneath, was too great for an electric typewriter ribbon, so that complete characters did not register on the page. Optical reading has had some success in the area of mark sensing, where pencil marks on a preprinted form are detected by another kind of scanning device and are transmitted to the computer's storage devices. This method is used for the Open University's computer-marked assignments and may be a possible method of input for archaeological or historical data. As far as textual material is concerned, optical character reading has not been very successful. The information still has to be typed. A trained typist types faster than a key punch operator, but now <2VDUs>2 and teletypes are being used more frequently for input, and these can be operated much faster than a card punch. For <2OCR>2 work errors can be introduced at the typing stage. They cannot be corrected using correcting fluid, as this changes the surface of the paper. Instead the whole page must be retyped or a special delete character (a question mark ? in Figure 2.5) used to instruct the <2OCR>2 machine to ignore the previous character. Once the typing is complete, errors can be introduced again at the reading stage. Characters which are misread go completely undetected until the data is proof-read. Characters which are rejected because the machine is unable to identify them are dealt with in one of two ways. Either a special character such as an @ is substituted for them, or the reader stops and displays the shape it has seen on a <2VDU>2 screen. The operator can then key in the unrecognised character. We have now dealt with all the major methods of computer input, and have seen how textual material must either be transcribed into the computer's character set or typed on to a special purpose keyboard. We can now consider how output of a publishable quality can be produced by a computer. It is clear that the standard upper-case-only lineprinter output, such as we saw on page 26, is of such a poor quality that it is unsuitable for publication purposes. An upper and lower case lineprinter produces slightly better printout, but it is not really suitable for high quality printing. Another method adopted by Wisbey at Cambridge was to produce paper tape from the computer and then feed it through a flexowriter, which was fitted with a special type face for medieval German

diacritics. Paper tape can also be used to drive a terminal which resembles an electric typewriter, and can have any golf ball head inserted in it. Again, this produces output which looks like typescript, not like printing. At a speed of 10 characters a second, it would take some time to print even a fairly short text or concordance on a flexowriter. Paper tape is also prone to tear easily or become entangled, and it is not easy to handle in large quantities. What is really needed to produce high-quality output is a photocomposing or computer-typesetting device. These machines contain a cathode ray tube <2(CRT)>2 which is photographed by a camera facing it. They are driven by a magnetic or paper tape containing instructions to display characters on the <2CRT>2 which are recorded by the camera. The images are then reproduced on film or photorecording paper. The more expensive machines can display any shape on the <2CRT,>2 whether characters or pictures. The shapes are made up entirely of straight lines. A curve is drawn by a number of very tiny straight lines joined together at the required angles. Any shape can be drawn by the computer, if the relevant instructions are programmed. Each character is first designed on an enlarged grid and is made up entirely of straight lines. The coordinates of the set of lines for each character are then stored inside the computer. Thick and thin lines, or an italic effect, can be created by drawing a series of lines parallel to each other. When the character is drawn at the small size required for printing, the parallel lines appear to be joined up, thus giving a single thicker line. Every time the computer is required, for example, to print an <2A,>2 it calls up the coordinates for the letter <2A>2 and instructs the photocomposing machine to draw the letter at the appropriate position on the cathode ray tube. As the characters are drawn rather than printed, it is theoretically possible to draw or print any character of any alphabet. The designs must first be established and the coordinates for each character stored once. By this method it is possible to print even Chinese using the computer. An example of a Chinese character drawn on a grid of 26 x 26 is shown in Figure 2.7. Some photocomposing machines which work in this way draw their characters only from vertical lines, and for this a very fine definition is required. Others draw straight lines in all directions like the Chinese character shown. More recently much cheaper computer typesetters have become available, which use metal grids containing the shapes of ready-formed characters which can be projected on to an image and photographed. The number of characters available is much lower than on a typesetter which constructs shapes out of straight lines, but several founts can be used on one page and they are available in many different type sizes. If sufficient special codes have been inserted in the input to indicate change of fount, change of type size, etc., on either kind of typesetter the

output can resemble a printed book. All that is needed is a computer program to interpret the typesetting codes and instruct the photocomposer to generate the required characters. The program can include the necessary instructions to make up a page of print correctly with page numbers, titles, footnotes, etc. As any character can be displayed anywhere on the page, the program can also introduce variable-width spacing and thus justify the text to the right-hand as well as to the left-hand margin. Such a program would also interpret codes for capitalisation (if the text was all in upper case) and deal with accents and diacritics. The example

shown in Figure 2.8 was typeset by computer and shows the range of possibilities. One obvious advantage of using a computer typesetting device for printing is that the material does not have to be retyped if it is already in computer-readable form. Suppose a complete text had been prepared for computer analysis and was now error-free. The text would then be processed by the computer in some way, perhaps to make a concordance. The resulting output could then be printed directly on to a computer typesetter without any further need for proof-reading. Once the material which goes into the computer is accurate, the output is also accurate, even if it is in a much expanded form like a concordance. The chance of further error, which would inevitably creep in during conventional typesetting, is eliminated and with it the requirement for extensive proof-reading. The output from photocomposing machines is usually in the form of photorecording paper, but some machines also allow microfilm or microfiche. Computer output microform, or <2COM>2 as it is generally known, is now becoming widely used in the commercial world. <2COM>2 devices can behave as if they were fast lineprinters and print text on to film using rather ugly lineprinter-type characters at speeds of up to 10,000 lines per minute. 35mm microfilm was the earlier type of microform to be used, but it is not so convenient for the user, as much rewinding is required. Microfiche is now the most frequently used <2COM>2 output. Normally computer microfiches are 42 or 48 magnification and thus have 270 frames or pages on one fiche. Some also have indexing systems, so that the user can soon identify which frame is required. Microfiche readers can now be purchased quite cheaply. The cost of publishing material by conventional means has risen so sharply that microfiche and microfilm are very real alternatives. Already a number of computer concordances have appeared as microfiche. They may not be as convenient to use as an ordinary book, but they are so much cheaper to produce that the choice is often to publish on microfiche or not at all. Those who are using a computer are at a distinct advantage here because their material can be printed directly from the computer on to microfiche. Microfiche can be produced by many photocomposing machines, and so their text may have many of the qualities of a printed book, with elegant characters, justification and full diacritics. In this chapter we have seen how textual material can be prepared for the computer and what methods are available for input and output. The best method of input would seem to be a <2VDU>2 or a teletype with a special purpose keyboard, which can accommodate all the extra characters not

normally in the computer's character set that a text may require, although some care may be needed in interpreting results as non-standard characters may appear as different characters when another output device, such as a lineprinter, is used for checking intermediate results. However, the non-standard terminal could be used for proof-reading and correcting the text until it is free of error. If such a terminal is not available, it is always possible to prepare material on punched cards. Many texts now in computer-readable form have been prepared in this medium. Results are best output by a photocomposing machine or computer typesetter. The lineprinter is adequate and necessary for a fast turn-round of intermediate results, but for publication the quality of its printing is not suitable. The photocomposer or <2COM>2 machine is capable of producing print equal to that achieved by conventional typesetting methods, and it should be used for any final version for publication, whether in book form or on microfiche.

3.

The production of word indexes and concordances is the most obvious application of the computer in literary research. Indeed it was the earliest, beginning with Roberto Busa's work on the <1Index Thomisticus>1. A word index is simply an alphabetical list of the words in a text, usually with the number of times each individual word occurs and some indication of where the word occurs in the text. When each occurrence of each word is presented with the words that surround it, the word index becomes a concordance. The compilation of such word indexes or concordances is a very mechanical process and therefore well suited to the computer. There has indeed been a large number of such concordances in the past fifteen years or so. The computer can count and sort words much faster and more accurately than the human brain. The days when a person devoted his life to the manual preparation of an index of a text are over, but we shall see that using a computer does not remove the need for human thought and judgment. Some examples of word indexes and concordances made by the <2COCOA>2 program appear as Figures 3.1, 3.2 and <23.3.>2 The first shows a word count, a list of all the words in the text in alphabetical order, with a count of the number of times each word occurs. References, indicating where each occurrence of each word comes in the text, are added in Figure 3.2. Figure <23.3>2 shows a concordance. Each word in the text is given in alphabetical order and the number of occurrences or 'frequency count' for each keyword appears at the top of the entries for that word -- somewhat unnecessary if the word occurs only once, but very useful in a large text where a word can occur many thousands of times. The text accompanying each occurrence of each keyword is usually referred to as the context, and the indication of where the keyword occurs in the text is known as the reference. In this concordance the keywords are aligned down the middle of the page. This is known as a <2KWIC>2 or 'keyword in context' index, and is the form adopted by David Packard in his Livy concordance, among others. The lines of context can also be aligned to the left of the page, as in Figure 3.4. The first form is generally considered easier to use, but it does take up considerably more space across the page. Many reviewers of computer-produced concordances have been critical

of them in a number of respects. In particular, they have cited the lack of lemmatisation, the failure to distinguish homographs and the indiscriminate choice of context. I hope now to demonstrate which of these criticisms, if any, are attributable to the computer itself and which are better attributed to choices made by the compiler or editor of the concordance. The examples from verse shown above give just one line of context. This is the most frequently used choice of context. It is eminently suitable for verse where the lines are of a reasonable length, but is not so applicable for verse with short lines or for prose text. It is quite possible to program the computer to supply context of a specific number of words on either side of the keyword or to ask it to extend the context up to a specific character or

characters, like full stops, commas, etc. With some programs it is also possible to fill the context out to the width of the lineprinter's line or to a specified number of characters on either side of the keyword. In this case the context may be broken off in the middle of a word, as in Figure 3.5, possibly giving a misleading interpretation of the text. If a <2KWIC>2 format is chosen and the keyword is at the very beginning or end of the context -- that is, it comes just before or after the chosen cut-off character -- the computer may be programmed to extend the context into the other half of the line, and make it appear to 'curl round'. An example of this appears in Figure 3.6. This again leads to a slightly unsatisfactory and misleading appearance and the context which has curled round could be omitted, as shown in Figure <23.7.>2 Therefore the choice of context for each keyword can be programmed to suit a variety of criteria although it is not so easy to choose the context for each word individually. The best overall criteria must be chosen, but the selection of criteria is entirely up to the editor of the concordance. The amount and choice of information to be given as the reference for each citation is another matter for the editor of the concordance to determine. When the text is fed into the computer, it can be given reference markers indicating line number, page number and title of the work, and act, scene or speaker in the case of a play. It is then possible to instruct the machine to accompany each citation with whatever combination of these references the editor thinks appropriate. The important point is that it is much easier for the computer program if all the references in the concordance have the same format. Some form of manual intervention or editing would be required if parts of a concordance were to be referenced in one format and parts in another format. References are an important part of a concordance, for they enable a word to be located in a much wider context by pointing to the actual place in a text where it occurs. They also help to show if a word occurs more frequently in any part of a text or does not occur at all in any particular part. It is usually best to specify as much detail in the references as possible, as long as too much space across the page is not occupied. Referencing the words in a play merely by line numbers does not provide much useful information, but the inclusion of act, scene and speaker for each word could indicate which characters favoured which word forms and whether the language of one particular scene or act deviated from another. Abbreviated forms of references, like the ones shown in our examples, are adequate, provided their meaning is clear. Those who have prepared manual indexes or concordances to texts have tended to omit or ignore words which occur very frequently, mostly on the grounds that the information to be gained from a list of all these words was not worth the effort involved in indexing them. It is well known that very few different words make up a high proportion of the words used in a

particular language or text. The editor of a concordance must decide whether to include these or not. For a computer-produced concordance, the extra effort is on the part of the computer. It may take a little longer for the machine to include every word, but it requires no more effort on the part of the editor. It is usually best to produce a simple word count first on the computer. This enables the editor to decide whether to include high-frequency words, since he can then calculate how many entries or pages of printout would be produced if all the words were to be concorded. In a concordance of the Roman historian Ammianus Marcellinus produced in 1972, there were over 125,000 words, of which the word 'et' occurred 4501 times. That is only one entry in a word count, but at 65 lines per page of printout, it would produce almost 70 pages of concordance. These high frequency words are often of most interest to those studying linguistic usage. I would suggest that they be produced on microfilm or microfiche if their volume was such that they could not be included in a published concordance. Alternatively they could be kept on magnetic tape, so that the occurrences of selected words could be inspected if required. With most concordance programs it is possible to omit all words which occur above a given frequency, or simply to specify a list of words which are not required as keywords. The latter ensures that words which are required are not omitted simply because they occur frequently. The omitted words would of course still appear in the context of other keywords. Most of the published concordances either omit completely the most frequently occurring words, or they present them in a condensed form, either by giving only the number of times which each occurs, or by giving their references without the context. Another problem facing the editor of a concordance is the order in which the occurrences of a particular word should be listed. The most usual method is to put them in the order in which they occur in the text, as we have seen so far. They can, however, also be listed according to the alphabetical order of what comes to the right of the keyword (Figure <23.8)>2 or to the left of the keyword (Figure 3.9). In this way all the instances of a particular phrase can be grouped together. It is also possible to place the keywords in alphabetical order of their endings. Such a concordance is known as a reverse concordance and is useful in the study of morphology, rhyme schemes and syntax. Figure 3.10 shows an example. Once the text is ready to be processed by the computer, little human effort is required to produce any or all of the four kinds of concordance which have so far been described. The editor of a concordance must consider if there are any words which are to be treated separately, for example, quotations from another author. If the text includes a sizeable amount of quoted material, the vocabulary counts will be distorted. There are several ways of dealing with quotations.

They could be marked by special characters in the text, such as square brackets, and the computer would then be instructed to omit all words with square brackets from the main concordance. Another method is to mark quotations by a special reference character and then compile additional concordances of all the quoted material, maybe even one for each author from whom the quotations are made. If necessary, the quotations could be included in the context of words which are required as keywords. A better solution still might be to retain the quoted words in the main concordance and include them in a separate appendix as well. Stage directions in a play are another example of words which require special treatment. It has been known for them to be included as ordinary words, which leads to high count of occurrences for words like "enter' and "exit'. It seems most sensible to omit them from the concordance by enclosing them in special characters or by not putting them into the text when it is typed into the computer. Foreign words could also be treated in a special way. For example, the Italian word "come' should not be included in the occurrences of the English word "come'. The most obvious way to deal with foreign words is to mark them by some special character at the beginning of the word, such as an asterisk. The computer could then be instructed to list all the words beginning with that symbol as a separate concordance, which would produce an appendix of foreign words separate from the main concordance. Proper names could also be listed separately. In some concordances of classical texts made in Oxford, all proper names are input preceded by a dollar sign, e.g. <2$CLAUDIUS.>2 The editor of the concordance can then choose whether to instruct the machine to treat the dollar as a letter of the alphabet for sorting, and thus list all the proper names together, or whether to ignore the dollar completely and thus put <2CLAUDIUS>2 in its true alphabetical position under the letter <2C.>2 Using such marker symbols when the text is prepared for the computer does not necessarily mean that they are printed in the concordance. It is easy enough to omit them from the printout if desired. A difficulty of much larger dimension is that of homographs -- words which are spelled the same but have a different meaning. A particular example in English is "lead', which is both a verb and a noun, with a different pronunciation and meaning for each one. The treatment of homographs is an area where reviews of computer-produced concordances have been most critical, and it is a problem which does not have any clear solution except extensive human intervention based on careful appraisal of the circumstances. When preparing a text for the computer, proper names are easily apparent and can be marked almost automatically. The same is true of stage directions and foreign words and to a lesser extent quoted material. But, until a complete concordance of a work has been inspected carefully, the editor may not realise that some homographs exist. To the

computer "lead' the verb and "lead' the noun appear to be instances of the same word, as they are both spelled the same. Some computer concordances group the occurrences of homographs together indiscriminately, so that, for example, Latin <1canes,>1 meaning "dogs' and <1canes,>1 meaning 'you will sing', are listed as two occurrences of the same word. It is then left to the user of the concordance to separate them and to take appropriate action when comparing word counts. Another means of dealing with homographs is to pre-edit the text in such a way that the words are listed separately or correctly. This involves considerable work and is unlikely to be error-free. It is probably only possible to discover all the homographs in a text by making a complete concordance without separating them and then looking through it carefully. Some marker symbol can then be inserted in the text before those words which need to be distinguished from their homographs, and then the complete concordance can be re-run. Homographs are recognised to be a difficulty. It is for the editor to select the method of dealing with them that is most appropriate to his needs or the needs of the scholars he serves. Three points should be borne in mind. One is that homographs can easily be deduced from the context but can cause erroneous results when only word counts are used. Secondly, in some texts, particularly poetry, words are deliberately ambiguous, and it may even be possible to have a separate entry for these. A third difficulty arises if the editor has chosen to omit some frequent words and these words are homographs of other words which are to be included. Such may be the case with the English words "will' and "might', which are both auxiliary verbs and nouns. Pre-editing seems to be the only solution here. The discussion of homographs leads to the difficult question of lemmatisation -- that is, the classification of words under their dictionary heading. Many computer-produced concordances have been criticised for their failure to do this. One dictionary defines a concordance as "an alphabetical arrangement of the chief words or subjects in a book or author, with citations of the passages concerned'. This does not necessarily imply that words are to be placed under their dictionary heading. Wisbey, like most other concordance makers, does not lemmatise, but he provides a useful look-up table of infinitives of verbs, although nouns should perhaps also be included. Father Busa decided to pre-edit his text for the <1Index Thomisticus,>1 a mammoth task which has produced what is perhaps a more useful concordance. I would say that for most concordances lemmatisation is not worth the effort involved. In most cases, words from the same lemma are grouped fairly closely together. It is only in languages where there are prefixes and infixes that real difficulties arise. Hyphenated words and words containing apostrophes or diacritics need careful consideration. Most concordance programs offer some flexibility in

the treatment of hyphens. They can either be ignored, as far as the ordering of words is concerned, or partially ignored, or treated as word separators. Treating the hyphen as a character which affects the sorting of words seems to be the best approach, so that for example all the occurrences of <2LADY->2 <2LIKE>2 would appear as a separate entry in the concordance immediately after all those of <2LADYLIKE.>2 If the hyphen was ignored completely, the occurrences of both forms would be listed as one entry. If the hyphen was treated as a word separator, the two halves would be considered as separate words thus distorting the vocabulary counts for <2LADY>2 and <2LIKE>2. Some editors have chosen to make a separate word-index of all hyphenated forms, as well as listing them in the main concordance. This would be important if the printing of the text was being studied. Whichever method is chosen, the point to remember is that the computer cannot treat one hyphenated word differently from another. Hyphens must never be inserted into the text as continuation characters at the ends of lines as they are in normal typescript. These would be treated in the same way as genuine hyphens and give rise to some curious words appearing in the concordance. Words containing apostrophes are a similar case. Obvious examples are <2I'LL>2 and <2ILL>2 or <2CAN'T>2 and <2CANT.>2 Ideally all the instances of <2I'LL>2 should come immediately after all the instances of <2ILL>2 but under a separate keyword heading. If the computer was told to treat all the apostrophes as word separators, the concordance would then no doubt include a large number of words which consisted solely of the letter <2S.>2 Inclusion of <2I'LL>2 under <2ILL>2 would distort the word counts of two words that are under no circumstances the same. A similar treatment could be given to characters which have been used to mark accents in French. <1A,>1 part of <1avoir,>1 should not be confused with <1a%,>1 the preposition. If <1a%>1 was coded for the computer as <1A#,>1 the # indicating a grave accent, the computer could be instructed to list all the occurrences of <1a%>1 immediately after but under a separate heading from all those of <1a.>1 Diacritical marks in Greek can also be dealt with in the same manner. Suppose that the letter H is used to represent the Greek eta in the computer coding. An eta can occur with a variety of diacritics in several combinations such as [[greek]] ,etc. These diacritics can be coded for the computer using different marker symbols preceding or following the <2H,>2 and the various forms with different diacritics would then appear as separate entries in the concordance. It is not true that a computer sorts words only according to the English alphabet order. It can be programmed to sort according to any alphabet order and include in its alphabet symbols, such as asterisks, which are not alphabetic. Concordances of Greek or Russian or Semitic languages which appear in the English alphabet order are the result of lazy use of programs

which assume the normal sequence of characters. The computer manufacturers' own sorting programs usually only allow the English alphabet order and have a rigid sequence of non-alphabetic characters. Most program packages which have been written for analysing literary text allow the user to specify which computer symbols constitute the alphabet to be used for sorting and also to specify which symbols are word separators, which are to have no effect on the sorting of words, and which are to function as diacritics. The manufacturers' sorting programs are unsuitable for text, as their specifications are too inflexible. Careful programming allows a bilingual text to be sorted according to two alphabet orders, e.g. GREEK and English or Russian and English. Provided that the machine was given instructions as to which language it was sorting at any one time, this would be feasible. Concordances of bilingual texts have been made and can be useful in language teaching or studying bilingual texts where one language is undeciphered or not well understood. If a text exists in two languages and is prepared for the computer in such a way that a line from one text is followed immediately by the corresponding line from the other text, and those two lines are specified as the context of any word appearing in them, a bilingual concordance can be produced. This can be used to find which occurrences of a word are translated by one form and which by others. Some computer-produced concordances have been criticised heavily for the choice of a particular edition of a text. Some reviewers have even gone so far as to blame the computer for the use of a particular edition. It is, of course, entirely up to the editor to decide which version of the text to use. There are several points which he could consider in choosing his text. If a concordance is to be published and made generally available, it becomes a tool for other people to use. It is essential for them to be aware of which text has been used. It would seem sensible not to make any alteration or emendation to a text to be concorded, and to choose an edition which is well known and generally available, such as the Oxford Classical Texts for GREEK and Latin. Even typographical errors should also be left as they are. Corrections to them can be inserted in brackets in the text, but the original should not be deleted. If the editor of a concordance does make his own emendations, his concordance is less useful, as it is based on a text which is not published and which therefore cannot be consulted by other scholars. The choice of a controversial text should always be justified and adequate information should be included about the basis of the text if it has not been previously published. We have now seen that there are a number of decisions to be made by anyone proposing to produce a concordance by computer. If a concordance is only for the benefit of an individual research project and not for

publication, the editor will obviously be aware of its shortcomings and be able to make allowances for them. He may for example be able to compile a list of all the words which he would like to see under one particular lemma without setting out the list in a form suitable for publication. He will also be able to mark those homographs which are important for his research, while ignoring the others. A concordance which is to form the basis of further research for one or two individuals only can therefore be produced with fewer clerical alterations to the computer output. It must be emphasised that for publication the editor must take considerably more care. A published concordance should have a good introduction explaining in clear terms exactly how the concordance should be used and setting out any defects which may be attributed to the computer. If no attempt has been made to separate homographs or lemmatise words, it must explain that this is the case. The introduction should describe the chosen method of referencing and relate it to the particular edition of the text with an indication of what, if any, emendations have been made. The methods of dealing with hyphenated words, apostrophes, proper names and quoted matter should all be set out clearly, so that the user of the concordance knows exactly where and how to locate entries. A simple word count of forms is useful as an appendix. It is perhaps appropriate now to consider how some published concordances have dealt with these problems. The Oxford Shakespeare Concordances, one volume per play, were published from 1969 to 1972. Professor T.H. Howard-Hill began work on them in 1965. He chose to use as his text the old-spelling edition of the plays. His concordances have been criticised on this count, a criticism which is not at all related to the use of a computer. About the same time work on another series of Shakespeare concordances was begun by Marvin Spevack, an American now working in Germany. Spevack chose to use a new text, the Riverside edition. Inevitably the two have been compared in reviews, though Howard-Hill was aiming to make concordances of individual plays in the old-spelling edition, whereas Spevack used a new edition for a complete concordance to Shakespeare. Howard-Hill puts the keyword to the left of the page (see Figure 3.11), followed by the number of times which the word occurs. In some cases, the number of occurrences of the keyword is given as two distinct figures separated by an asterisk. The introduction explains that the figure after the asterisk is the number of times the word occurs in so-called 'justified' lines, where the spellings may have been affected by the compositor's need to fit the text to his measure. It would perhaps be more meaningful to add the two together and indicate the number of justified lines within this total. The justified lines themselves are marked by an asterisk. His context is aligned to the left, and in almost all cases one line is given, a sensible choice

for lines the length of Shakespeare's; but a few entries have more than one line, and a vertical stroke is then used to indicate the end of the first line in them. The criterion for listing more than one line of context is not apparent, and those entries must have been selected manually. He includes the speakers' names as part of the context on those lines which begin a speech. They are included as keywords in the concordance but are indexed separately in all but the first few volumes. No attempt is made at lemmatisation; but for English text the various forms of one lemma are usually close to each other in alphabetical order. The introduction also explains the arrangement of the entries, indicating that the old spelling may occasionally lead the reader astray in the search for a word. In some cases cross-references have been inserted in the concordances to direct the reader to the appropriate place. High-frequency words are treated in one of two different ways. Only frequency counts are given for some very common words, including pronouns, parts of the verb 'to be' and some prepositions. A larger number of words, mostly variant spellings, have been given a fuller treatment in that the line number references of where they occur in the text are also provided. One further point can be made about Howard-Hill's choice of layout. When a word occurs more than once in a line, that line occurs only once in the contexts given for that word. Thus it is not so easy for the reader to see that a word occurs more than once in the same line. A better alternative is to list a line once for each time a word occurs in it. This may be monotonous for a line in which the same word is repeated several times, but at least all the information is there. If a <2KWIC>2-type concordance was used, each occurrence of the word would in turn be aligned in the centre of the page so that its position in a line would be easier to see. Another possibility is to underline the keywords in the context lines or to print them in italics. Howard-Hill used computer typesetting for his concordances, and the results are pleasing to the eye. The keywords stand out well and it is very easy for the reader to find the entries for a particular word, but perhaps not so easy once the individual contexts need to be inspected. Spevack's first publication of his concordances consisted of a reproduction of computer printout, upper case only, and looked very ugly. Since then a complete concordance of all Shakespeare's plays has been published by him and is much more pleasing in appearance. In his <1Index Thomisticus,>1 Busa provides a different layout. His entries are given in columns down the page with about two and a half lines of context for each word. The forms of each lemma all appear as sub-entries under the dictionary heading. For this purpose the text was pre-edited as it was input. Following the lemma, each of its forms appears in alphabetical order. The keyword and succeeding word are printed in bold type in the

contexts, and the keyword is also printed some distance above the contexts. The layout of keywords and contexts is clear, but the references are not easy to follow. They appear at the end of the context for each entry and consist of very short abbreviations and numbers. Their format is described in the introduction, which is itself a separate volume, written in six languages. The completed index will comprise over 60 volumes, but unfortunately its price puts it out of the reach of all but the most affluent libraries. Publishing on microfiche would have reduced the cost considerably and might now be a more suitable medium for such a large work. Concordances need not necessarily be made from texts made up of alphabetic words. In his decipherment of Linear <2B,>2 Ventris compiled grids of the frequencies of all the Linear B symbols by hand. The computer can perform the same task on symbols of undeciphered scripts. A coding system has to be devised to represent each symbol, and then the computer can be programmed to compile a concordance of each symbol. When the symbols are sorted by right or left context, the frequency of groups of consecutive symbols is apparent. The symbols can be drawn on a computer graphics device and the concordance output in the original script. One such concordance was that produced by Koskenniemi, Parpola and Parpola on the Indus Valley inscriptions (see Figure 3.12). A similar method of investigating sign frequencies has also been applied to Linear A and Minoan hieroglyphs. The making of concordances and word indexes is thus a largely automatic task well suited to the computer. We have seen that some manual intervention is normally required to deal with the problems posed by automation, but this is slight compared with the amount of work which the computer itself performs as it counts and sorts words. Lexicography, on the other hand, demands human effort on a much larger scale. There are three basic stages in compiling a dictionary. The first stage consists of assembling the material, usages of words which are potentially interesting, the traditional method being to record each usage of each word on a slip. The second stage is to sort the slips into alphabetical order and file them under the appropriate lemma. The third stage is to edit the examples for each lemma so as to produce the article for each word. A number of successful attempts have been made to use the computer for stage one, the collection of material. This means that all the material to be scanned must be in computer-readable form, which entails a mammoth task if the dictionary is to cover a complete language. This is the procedure adopted by the <1Tre*sor de la Langue Franc%aise.>1 In order to collect their "slips', the editors have made concordances to texts from over 1000 different authors from the nineteenth and twentieth centuries. The examples have

then been collected from these concordances. The <1Dictionary of Old English>1 based in Toronto and the <1Dictionary of the Older Scottish Tongue>1 based in Edinburgh have adopted the same method, though on much smaller corpora of material. A by-product of this method is, of course, a comprehensive archive of texts in computer-readable form, which may be used for other purposes and possibly become more important than the original work. Using such a comprehensive archive is the only way of ensuring that all possible material has been scanned and that no significant usages have been omitted. It does, however, require large resources for preparing material for the computer, processing the concordances and then scanning the concordances for interesting usages. This method is practical only if there are large resources available or the volume of material is not too large -- for example, a dictionary of a specific author, or authors, or a short period of a language, or a language for which there are few known texts. The problem of the context in computer-aided lexicography has been discussed at some length, particularly by de Tollenaere. The dictionary- maker does not wish to refer back to the original text in order to determine the meaning of a word. He must therefore have at his disposal sufficient context for this purpose, which may be up to seven or eight lines. The keyword would most probably appear somewhere in the middle of this context, which would consist of one or more complete sentences. The lineprinter could be loaded with special stationery and the machine programmed to print slips on that stationery, which could then be filed in the traditional manner. Figure <23.13>2 shows an example of computer- generated dictionary slips. In this case the keyword is offset to the left at the top and surrounded by asterisks for clarity in the context. The computer is most useful at stage two of the dictionary-making process, the filing and sorting of slips. It is no longer necessary for the dictionary-maker to have a large number of shoe-boxes containing slips filed in alphabetical order. The computer can be used to hold the information contained on the slips, and when new slips are made the machine will insert them into the correct alphabetical position. If the slips are themselves generated by computer as described above, they can be printed if necessary; but it is more likely that they will be stored on magnetic tape or disc. Accuracy and speed are the main advantages of using the computer here. In a manual system, a slip misfiled is almost always a slip lost for ever. The computer should never lose material once it is stored correctly, and the slips should always be in the correct alphabetical order. If the slips have been created manually, an indication can be given if the word is a homograph so that it will be filed with the appropriate lemma. If the computer has been used for excerption, some method of

lemmatisation must be adopted. There are three ways in which the computer can be instructed to lemmatise forms. First, it may be given rules by means of a program so that it can assign the correct lemma to each word. Such a computer program is very complicated and must allow for many exceptions and is never very satisfactory. Secondly the text can be pre-edited as it is input, so that each word is accompanied by its lemma; but this entails much extra labour at the input stage. The third method is to use a kind of "dictionary' which is already in computer-readable form. The word "dictionary' is here used in its computing sense. It does not imply a complete listing of all words in the language -- a kind of computerised Webster or <1Oxford English Dictionary.>1 Rather it is applied to a computer file which contains definitions of, or further information on, terms or words which a program may encounter. Such a dictionary is likely to be consulted or "searched' by a number of different programs, and may be built up over a long period, with new forms added to it when required. In the context of lemmatisation a dictionary would contain the lemma of each word likely to be encountered. It would be useful at this point to explain the functions involved in a search of a computer dictionary. The human mind learns: it looks a word up two or three times and then memorises it; no more searches are required. The computer cannot do this. Every word must be looked up every time that it is encountered. Therefore an efficient use of searching methods can make all the difference to whether a program runs well or not. Let us suppose that we have a computer dictionary of 1000 words. This dictionary would be stored in alphabetical order on magnetic tape or disc. If the computer were large enough, it could all be held in the machine's main memory throughout the duration of the program. Let us suppose that we are searching in the dictionary for the word "cat'. If our dictionary was in alphabetical order we might have to start at the beginning and ask if each word in turn was "cat'. As the letter "c' is near the beginning of the alphabet, "cat' would be found fairly soon, probably around the seventieth attempt. If "cat' was not in the dictionary, we could determine that by recognising when we had come to a word later in the alphabet. But consider what would happen if the word was "zebra'. If we started at the beginning and compared every word with "zebra' we would have made almost 1000 comparisons, which is very wasteful of machine time. Another method is to construct a small table, which is stored with the dictionary. This table would contain information indicating where each letter of the alphabet starts in the dictionary. We would then establish that our word begins with a <1c>1, look up <1c>1 in our table and find which position in our dictionary the <1c>1s start and also how many there are. We could then make a straightforward search through the <1c<1s to find our word. This method would reduce the number of comparisons to be made for a word at

the end of the alphabet to the same number as for any letter. When new words are added to the dictionary they will be inserted in the correct alphabetical position, usually by sorting them into alphabetical order and then merging them with the words already there. A special program would also update the table indicating where each letter begins and ends. Another more commonly used method of computer searching approximates to the way in which a reader would look up a word in an ordinary dictionary. He opens it at one page and then flips backwards and forwards from there until he isolates the page which contains the word. The computer opens its dictionary in the middle and looks to see whether the word comes before or after that point. If it finds that it comes before, it takes the middle again of this first section and asks the same question. It goes on taking the middle of each section found until it has arrived almost at the correct place. It can then do a straightforward sequential search through the last five or so items to find the exact one. This method is known as a <1binary search>1 because each section that is found is split into two before being searched again. If our 'cat' was at position 70 out of 1000, it would be found in only ten comparisons. A binary search such as this combined with the table look-up one described above would provide an even more efficient search technique. The computer would use the table to find the beginning and end of the 'c's and then do a binary search on the 'c's. Yet another method of dictionary look-up is called hashing. This requires some fairly complex programming, but in essence the method consists of converting each word to a number which indicates its position in the dictionary. The <2SNOBOL>2 programming language includes a feature for table look-up using a hashing procedure which relieves the programmer from having to write all the program for dictionary searches. (A short illustrative sketch of these look-up methods is given at the end of this chapter.) This process of collecting lexicographic material and filing it inside the computer may take several years of work. With such a lapse of time, one disadvantage of using a computer must be recognised. The life of a computer is only about eight or ten years. The time taken to collect material for a dictionary may be much longer than this and so the editor must ensure that the files he is creating can easily be transferred from one machine to another. It is all too easy to exploit the idiosyncrasies of one machine and then discover that programs have to be completely rewritten for another machine. Those who are fortunate enough to have their own computer for a dictionary project may avoid this problem completely if they can use one machine from start to finish. The third stage of lexicography, that of editing the quotations for each word to form a dictionary article, is basically a task for the human mind. The computer cannot decide whether to include specific quotations, as it has no sense of meaning. However it can be instructed to print out all the slips it has assembled for a particular lemma. The editor can then select the

quotations he requires from the printout and organise them into specific categories of meaning. By using the computer to edit the material for each word in the computer files, he can create the final article for his dictionary inside the machine. This may involve some reordering of the entries, but the bulk of the work will consist of deleting superfluous quotations and reducing the length of those required, both of which are simple tasks to perform at a computer terminal. The advantage of this method of editing is that the material remains in computer-readable form and can be typeset directly, thus eliminating any further need for extensive proof-reading. The computer can therefore completely replace only the mechanical process of filing, sorting and organising the slips. This can itself make a significant contribution to the speed and accuracy with which a dictionary is compiled. The machine cannot make any kind of judgment on what to include or select in each article. That is and always will be the work of the lexicographer.
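The dictionary look-up methods described earlier in this chapter can be illustrated by a short sketch in a modern general-purpose programming language (Python), given here purely as an illustration of the principle. The word list, the function names and the layout of the letter table are invented for the example, and the sketch does not reproduce any particular dictionary program.

    # An illustrative sketch only: a small dictionary held in alphabetical order,
    # a table recording where the entries for each initial letter begin and end,
    # and a binary search restricted to that section.
    from bisect import bisect_left

    def build_letter_table(dictionary):
        """Record, for each initial letter, where its entries begin and end."""
        table = {}
        for position, word in enumerate(dictionary):
            start, _ = table.get(word[0], (position, position))
            table[word[0]] = (start, position + 1)
        return table

    def look_up(word, dictionary, table):
        """Binary search within the section for the word's first letter."""
        if word[0] not in table:
            return None                          # no entries begin with this letter
        start, end = table[word[0]]
        position = start + bisect_left(dictionary[start:end], word)
        if position < end and dictionary[position] == word:
            return position                      # position of the word in the dictionary
        return None                              # word not in the dictionary

    dictionary = sorted(['ant', 'cat', 'cow', 'dog', 'zebra'])   # the stored word list
    table = build_letter_table(dictionary)
    print(look_up('cat', dictionary, table))     # found after a few comparisons
    print(look_up('zeal', dictionary, table))    # None: not present

    # A Python dictionary, e.g. positions = {w: i for i, w in enumerate(dictionary)},
    # is itself a hash table, corresponding to the hashing method mentioned above.

In practice the word list would be very much larger and would be held on magnetic tape or disc, but the principle of narrowing the search before comparing individual words is the same.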

In this chapter we shall explore some of the ways in which the computer can be used in the study of vocabulary, and in particular words in relation to other words, or co-occurrences or collocations as they are sometimes called. Counting occurrences of words is, of course, nothing new. <2T.C.>2 Mendenhall at the end of the last century was one of the earliest workers in this field and in 1944 George Udny Yule published <1The Statistical Study of Literary Vocabulary.>1 Yule's <1k>1 characteristic has been widely used as a measure of diversity or richness of vocabulary. It is of course now much easier and more accurate to make the word counts by computer and also to use the machine to make statistical calculations on the numbers collected. A simple word frequency count produced by the <2COCOA>2 program is shown in Figure 4.1. All words occurring once are listed in alphabetical order, followed by all those occurring twice and so on up to the most frequent. The <2COCOA>2 program can also provide what is called a frequency profile, telling us how many words occur once, how many twice, etc., and giving cumulative totals of the number of different words, or <1types,>1 and the usages of those words, sometimes called <1tokens>1. An example of such a word frequency profile appears in Figure 4.2, showing that in this text 42 words occur once, 14 occur twice, 5 occur 3 times and so on up to the most frequent word occurring 7 times. Vocabulary counts like these are used increasingly in the design of language courses. Teaching can then concentrate on those words and grammatical forms which occur most frequently in the language. This was the approach of the University of Nottingham when they undertook to design a German course for chemists. The aim was to teach chemists sufficient German to enable them to read technical articles. Word counts were made on a number of German chemical journals and a course was designed which concentrated on the most frequent forms. For example it was found that the first and second persons of the verb occurred so infrequently in the technical literature that it was decided not to include them in the course. The Nottingham German course is just one example of the many computer projects which are used in the specification of language courses.
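A minimal sketch of such a word count and frequency profile, written in a modern general-purpose language (Python) and intended only to illustrate the principle rather than the <2COCOA>2 program itself, might run as follows; the sample sentence is invented.

    # Count how often each word occurs, then report how many words occur once,
    # twice, and so on, with cumulative totals of types and tokens.
    from collections import Counter

    def frequency_profile(text):
        counts = Counter(text.lower().split())   # occurrences of each different word
        profile = Counter(counts.values())       # how many types occur n times
        types = tokens = 0
        for n in sorted(profile):
            types += profile[n]
            tokens += n * profile[n]
            print(f'{profile[n]:3} words occur {n} time(s);'
                  f'  cumulative types {types}, tokens {tokens}')

    frequency_profile('the cat sat on the mat and the dog sat by the cat')

The same tally could equally well be printed as an alphabetical word count of the kind shown in Figure 4.1.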

The fact that the computer does not lemmatise forms or classify them under their dictionary heading is of some advantage here as those grammatical forms which occur most frequently can be determined. A frequency count of words sorted in alphabetical order of their endings can also be made. This would be of particular use in inflected languages to

determine which morphological features should be taught first. In one project reported by Dudrap and Emery, a study of French nominal genders was undertaken. All nouns from the <1Nouveau Petit Larousse>1 were prepared for the computer, each noun being terminated by a numerical code indicating its gender. The words were then sorted by their endings and the distribution of nouns according to their endings noted. Another use of vocabulary counts in language teaching is to control the introduction of new vocabulary into courses. Kanocz and Wolff describe the preparation of a German language course for the British Broadcasting Corporation. The computer was used to provide frequency counts of a number of German texts from which words to be included in the course were chosen. As each course unit was prepared, its vocabulary was input to the computer and checked against the vocabulary of the previous units to ensure that new words were being introduced at a reasonable rate and that there was some repetition of words already introduced. A similar method was employed by Burnett-Hall and Stupples in their preparation of a Russian course. Large frequency counts have been made of a number of texts or languages. One of the earliest was Sture Alle*n's frequency count of present- day Swedish. This was taken from some one million words of newspaper material. In all 1387 signed articles by 569 different writers in five Swedish morning newspapers published in 1965 were selected. The articles were chosen so as to be representative of the modern language and thus excluded material written by foreigners, letters to the editor and articles which contained long quoted passages. Unsigned articles were also excluded on the grounds that they might have been written by someone whose native language was not Swedish. Alle*n has produced three volumes of frequency counts so far and more are planned. Newspapers are a good choice of material for an examination of the current use of language. They were also the choice of Alan Jones in Oxford who has been studying modern Turkish. There have been enormous

changes in the Turkish language since the alphabet was romanised in 1928. Many Arabic and Persian loan words have been replaced by newly coined words formed from old Turkish roots or by loan words from European languages, principally French. Seven newspapers and one magazine were chosen as representatives of the language, and a computer program which generates random numbers was used to select samples totalling 40,000 words from each of the eight journals, all from 1968-9. Word frequency counts were made on all of these samples and thus the proportion of Arabic, Persian and English or French loan words was noted. The process is being repeated on samples taken from the same newspapers five years later, and the changes which have taken place in the language over that period are being noted. The use of frequency counts to study loan words in a language entails a considerable amount of manual work to read through the word lists marking all the loan words. A number of attempts have therefore been made to program the computer to select loan words. The approach has been to compile a set of letter combinations which could appear in loan words but not in the native language. One such study of English loan-words in modern French is described in an article by Suzanne Hanon. Letter combinations which would designate an English loan word include the use of rare letters like <1w>1 and <1k>1 and English suffixes such as -<1ing>1 and -<1man.>1 A total of about 100 graphical entities was reduced to about 70, because some were redundant (such as -g/-ing) and some were represented in different positions in a word. The next stage was to write a program to recognise these character strings in the relevant positions in the word. If it found them, the program would indicate that the word was English. The program was first tested on a number of known loan-words, which were themselves chosen manually from newspapers. The test program found about 80% of the test data supplied to it. More criteria were devised from the words which it failed to find, but there were still <231>2 words out of a total of 881 which it was impossible to deal with automatically. They were mostly forms like "bluffeur', which consists of a French ending on an English word. The experiment concluded by trying the program on some genuine French text, about thirty pages of a novel. Again the results had to be checked manually. Out of 8952 words, 75 were identified as loan-words. Of these, only 10 were found to be genuine English loans. It was established manually that the program had not missed any genuine loans, but the high number of erroneous finds indicates that this method of selecting loan words by computer is not very satisfactory. It would of course require a different program for each borrowing language and each lending language. As in the case of the Turkish newspaper project, there could be loan words from a number of languages in any one language. It would appear to be

more sensible to approach the study of loan-words by using a word frequency count and selecting the words manually, although more work is involved. The study of the vocabulary of a specific author or authors can be assisted considerably by the use of a computer. More of this will be found in Chapter Six, but here we can describe some simple applications. Martin, a Belgian who has been studying Dutch vocabulary with the aid of a computer, indicates the sort of methods which may be successful. He describes his work on a poem of 32,235 words, a lyrical epic called <1May>1 which is rich in vocabulary and was thought to contain a number of "new words'. Martin was able to compare his text with a computer-readable version of a thesaurus of Dutch and could thus ascertain that 460 words which occurred in the poem were not in the dictionary. Of these only 16 had a frequency of more than one and only two occurred more than twice. He then went on to investigate the distribution of these words in the poem. This was done by dividing the poem into eleven roughly equal sections. He was then able to calculate the number of new words in each section which would be expected on the assumption that they were evenly distributed. By comparing the actual values with the expected values it was found that a significantly high proportion of the new words came in one section of the poem, that of the song of the god Balder, thus confirming the special character of this part of the work. Martin also investigated the richness of the vocabulary measured by the type/token ratio. This was found to be 5.69, compared with a value of just over 8 for two other Dutch texts of the same length, which confirmed that there was a relatively extensive vocabulary. A further study showed that at least half the surplus of types were concentrated on words which occurred at the ends of lines -- that is, words which were chosen to satisfy the rhyme requirements. Martin's study shows the kind of vocabulary investigation which can be made very easily using a computer, but which would not really be feasible without it. One very early computer-aided vocabulary study was that by Joseph Raben on the influence of Milton on Shelley. Raben's method could well be applied to study the influence of any one author on another. He attempted to find those lines in one work which consciously or unconsciously echo lines in another work. This does not necessarily mean that the lines are identical, but that a similarity may be recognised such as [[Whence and what art thou, execrable shape, That dar'st, though grim and terrible, advance Thy miscreated Front athwart my way To yonder gates? through them I mean to pass,]]

and [[Horrible forms, What and who are ye? Never yet there came Phantasms so foul through monster-teeming Hell From the all-miscreative brain of Jove; Whilst I behold such execrable shapes,]] Raben could have adopted several methods of dealing with this problem, such as coding his material in a phonemic transcription or attempting to put words in their semantic categories. The method he chose was quite successful. Complete word lists of the texts of both Milton and Shelley were made -- these lists were then modified so that some words, such as prepositions, were omitted. Proper names were retained. Most of the words were converted to what he calls their canonised form, a sort of lemmatisation. Suffixes and prefixes were also stripped off for this purpose, so that for example "miscreated' would be reclassified as "create', and "disobeyed', "obedience', "obedient' would all become "obey'. The texts were then recoded automatically in this canonised form and a simple concordance made. It was then very easy to see which patterns of words were echoed by Shelley from Milton. This kind of vocabulary study can provide concrete evidence to support a subjective impression that one author consciously, or unconsciously, echoes another. A straightforward concordance of one author can provide some material. By looking up words from the second author, similarities of phrases or whole lines can be found. A concordance of two texts combined can provide yet more evidence of these similarities, but would not find combinations such as "disobeyed' and "obeyed' unless prefixes were separated from the root forms. Raben's method ensures that all combinations can be found at the cost of a little extra computing. Thematic analysis can also be attempted by computer, but the machine must be told which words (and grammatical forms of words) denote a particular theme or themes. This kind of study can be based on a concordance, but a program devised by Fortier and McConnell at Manitoba has attempted to reduce the human work. A list of synonyms is supplied to the program, which will then produce frequency and distribution tables for certain sections of the text. The program also draws a histogram on the computer's lineprinter indicating which sections of the work have a high density of a particular theme. Fortier and McConnell's system contains ten synonym dictionaries of French words, and is therefore specific to French, but it could easily be adapted for use with other languages. It is possible to separate words denoting primary or strong evocation of a theme from those denoting secondary or weaker evocation.
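In outline, such a thematic count could work as sketched below. This is a rough modern illustration in Python, not the Manitoba <2THEME>2 and <2GRAPH>2 programs themselves; the synonym list, the filename and the division into chapters on the word "CHAPTER' are all assumptions made for the example.

import re

jealousy_words = {"jaloux", "jalouse", "jalousie"}   # illustrative synonym list only

def theme_histogram(chapters, theme_words):
    for number, chapter in enumerate(chapters, start=1):
        words = re.findall(r"[a-zà-ÿ']+", chapter.lower())
        hits = sum(1 for w in words if w in theme_words)
        density = hits / max(len(words), 1) * 1000    # occurrences per 1000 words
        print(f"chapter {number:3} {hits:4} {':' * round(density)}")

chapters = open("novel.txt", encoding="utf-8").read().split("CHAPTER")
theme_histogram(chapters, jealousy_words)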

Figure 4.3 shows a histogram drawn by a section of the program called <2GRAPH.>2 It uses colons for showing primary evocation, in this case of words denoting jealousy in Céline's <1Voyage au Bout de la Nuit.>1 The numbers indicate the number of times the theme occurs in each chapter. The histograms can also be used to denote secondary evocations, in which case full stops are used instead of colons. These programs are called <2THEME>2 and represent an extension of the concordance. Results have also been published on the analysis of words evoking doubt in Beckett's <1En Attendant Godot.>1 We can now move on to look at the study of collocations, a term first introduced by <2J.R.>2 Firth in 1951. Firth does not give an exact definition of collocation but rather illustrates the notion by way of examples, e.g. one of the meanings of "ass' is its habitual collocation with an immediately preceding "you silly ... ' In other words we are looking at lexical items which frequently associate with other lexical items. It is only recently that the notion of collocations has been seriously studied, largely because of the practical restrictions imposed on any large-scale investigation of collocating words. The advent of computers has remedied this drawback and led to several linguistic collocational studies. Dr Lieve Berry-Rogghe has been one of the leaders in the field of computer-aided collocation studies. She aims to compile a list of those lexical items ("collocates') which co-occur significantly with a given lexical item, called the <1node>1, within a specified distance, called the <1span>1, which is measured as a number of words. In order to obtain a comprehensive

picture of collocation relations, it would be necessary to process a very large volume of text. Berry-Rogghe's initial experiment was conducted on some 71,000 words, which she herself admits is not sufficient. A concordance-type program was written which limits the context for each keyword to the specified span. All items occurring within the span are then conflated into an alphabetical list and their number of co-occurrences with the node is counted. This list is compared with a previously compiled dictionary consisting of an alphabetical frequency list for the whole text. She then computes what she calls the <1z-score>1 for each collocate. The z-score measures the frequency of the item occurring as a collocate as against its overall frequency in the text. For example, if the word "table' were selected as the node, the word "the' would appear frequently as a collocate within a small span.
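The exact formula Berry-Rogghe used is not reproduced here, but the following hedged sketch (in Python) shows one standard way in which such a z-score can be calculated: the observed number of co-occurrences within the span is compared with the number expected if the collocate were scattered evenly through the text. The filename and the tokenisation are assumptions of the example.

import math
import re
from collections import Counter

def collocates(words, node, span):
    counts = Counter(words)                     # overall frequencies
    total = len(words)
    observed = Counter()
    for i, w in enumerate(words):
        if w == node:
            window = words[max(0, i - span): i] + words[i + 1: i + span + 1]
            observed.update(window)

    results = []
    for w, obs in observed.items():
        p = counts[w] / (total - counts[node])  # chance of w at any one position
        expected = p * counts[node] * 2 * span  # positions inside all the spans
        z = (obs - expected) / math.sqrt(expected * (1 - p))
        results.append((round(z, 2), obs, w))
    return sorted(results, reverse=True)

words = re.findall(r"[a-z']+", open("text.txt").read().lower())
for z, obs, w in collocates(words, "house", span=3)[:10]:
    print(f"{w:15} {obs:4} {z:8.2f}")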

small span. "The' would also appear frequently in many other places in the text and would therefore give a small z-score. However, a word which appears only within the span for "table' would give a very high z-score. The first keyword chosen was "house' with a span of three and as expected those items which co-occur most frequently with "house' are function words, being "the', "this', "a', "of', "in', "I'. Of these words only "this' gives a high "z' score and it is indeed fourth in the list when the collocates are listed by z score, coming after "sold', "commons' and "decorate'. The incidence of "commons' of course occurs through the phrase "House of Commons'. Figure 4.4 shows the significant collocates of "house' for span size three, four, five and six. Increasing the span size to four introduced the new words "entered' and "near' at positions nine and ten in

the z-score ordered list. More words come into the list as the span size increases, but it can be seen that the words which enter high up the list, like "fronts' for span size five and "cracks' for span size six, occur only twice in the text and both times in collocation with "house'. In another experiment Berry-Rogghe used a total of 200,000 words to study what she calls "phrasal verbs' -- that is, those occurrences of a verb followed by a particle which are idiomatic, e.g. "look after', "give in'. Phrasal verbs can then be identified by examining the second part of the phrase, that is the particle, and applying to it the same kind of calculation as was used for "house' in the earlier study, but this time with a span of zero. "In' was chosen as the first particle to be studied and a left-sorted concordance was used to find all the words which occur to the left of "in'. This would include instances of "in' used in a phrasal verb as well as its normal use as a preposition. The computation of z-scores as shown in Figure 4.5 indicates that words like "interested', "versed', "believe' are more closely associated with "in' than are items such as "walk' and "sit'. By contrast the words to the right of "in' are completely different and can broadly be broken into three categories:
1. Nouns denoting time, e.g. "summer'
2. Idiomatic phrases, such as "in spite', "in short'
3. Nouns denoting places, e.g. "town'.
Dr Berry-Rogghe's papers are both based on what she admits is a very small amount of text for collocational studies, and the distribution of the vocabulary in them is governed to some extent by the subject matter. Her first study was compiled from <1A Christmas Carol>1 by Charles Dickens, <1Each his own Wilderness>1 by Doris Lessing and <1Everything in the Garden>1 by Giles Cooper. For the study of phrasal verbs the Lessing text was used again, together with <1St. Mawr>1 by <2D.H.>2 Lawrence and <1Joseph Andrews>1 by Henry Fielding. A corpus of text which is representative of the language as a whole would have been more suitable for this kind of linguistic study. Such a collection of text has been compiled at Brown University and is known as the Brown University Corpus of Present Day American English, or more simply the Brown Corpus. It consists of a million words of American English, divided into 500 different prose sections, each approximately 2000 words long. The samples were chosen randomly from fifteen different categories of prose including newspapers, religious material, books on skills and hobbies, popular lore, government reports, learned journals and various kinds of fiction. All the material in the Corpus was first published in 1961. The different categories of prose are well marked, so that it is possible to determine whether a feature occurs significantly more often in one section than in another. This corpus therefore provides a representative sample of the

language of the time and was designed specifically for computer analysis of vocabulary and other features. The University of Lancaster has been attempting to compile a similar corpus of British English. This has now been taken over by the International Computer Archive of Modern English in Oslo and when it is complete it should also provide a valuable source of linguistic material for computer analysis. Complete word counts of the Brown Corpus have in fact been published, but the texts have been used in many other studies. In one particular collocational study carried out by Peggy Haskel, the keywords were

carefully selected. The 28 words finally chosen came initially from the published word counts of the Brown Corpus and included only words which occurred more than 200 times. Words which had several meanings were excluded. The final choice of keywords was made using Buck's <1Dictionary of Selected Synonyms in Principal Indo-European Languages>1 to ensure that there was at least one word from each of his 22 groups. A computer program then scanned the text for the keywords and extracted all collocating words for a span of up to four words on either side of the keyword. In this case function words were ignored in finding the span. Their occurrence could make a difference to the final results of such a study, and it would seem advantageous to perform another computer run including function words to see whether there is any significant difference in the results. It would require only a small change to a computer program to do this. The collocating words are conflated into an alphabetical dictionary in a similar manner to Berry-Rogghe's. The third stage of the program calculates the percentage of cases in which the word appears with each of the several keywords that it may on occasion accompany. This is another way of denoting the significance of the collocation, which Berry-Rogghe does by z-scores. A few preliminary results are reported in the article and it is interesting to compare the different categories of prose in the Corpus. In fiction the word "cut' is normally used literally and it associates with "open', "belly', "concussion', "boy'. In newspapers its figurative meaning is more common and it co-occurs more frequently with "inflate', "modest', "expenses' and "estimates'. Similar tendencies were found for the word "dead', where "issues' are dead in the press, but in fiction "dead' co-occurs with "fight', "mourned', "wounded'. "Top' gives "steeple', "head', "stairs', and "wall' in fiction, and "officials', "personnel' and "executive' in the press. The studies by Berry-Rogghe and Haskel are only two examples of an increasingly wide variety of collocational studies using a computer. The Centre de Recherche de Lexicologie Politique at St Cloud have been making a comprehensive study of the political documents circulated in Paris in May 1968. They have approached the question of collocations by making a chain from which they can construct what they call a multistorey lexicograph, where the relationships and distances between words are shown in a network-like diagram. Similar work is reported by Pecheux and Wesselius, who have made a particular study of the word "lutte'. Experimental psychologists have adopted similar methods in dealing with word associations. A number of people are asked to give their responses to a particular word, and these responses are themselves used as further stimuli. Gradually a thesaurus of associative terms can be built up and a network diagram constructed showing the word relationships. Figure 4.6 shows the network surrounding the initial word "butterfly' taken from an article by

Kiss, Armstrong, Milroy and Piper on an associative thesaurus of English. Such a thesaurus of terms can then be stored in the computer in dictionary form. A program could be instructed to start at a particular keyword, or node, and find all the words up to a certain span from the node. The dictionary can be searched interactively by sitting at a video screen. One word would then be typed in and the computer would respond by displaying a list of associated words. The vocabulary of dialects has also been studied with the computer. The dialectologist typically goes out and collects large quantities of linguistic data and then needs to classify and organise the material collected, usually by sorting and comparing items on many bases. Either the choice of vocabulary or the pronunciation, or both, may be investigated, and the results may be presented in tabular format or as isoglosses drawn on a map. One vocabulary study in dialectology, described by Shaw, adopts a technique new to language studies known as cluster analysis. It originated in the biological sciences where there are problems of grouping species into genera and genera into families. In recent years this technique has been used much more widely in the humanities, for example, in archaeology and

textual criticism as well as in the study of vocabulary. In order to group a number of items together we need to measure the similarity or dissimilarity between them according to a number of different criteria or variables. In the case of dialects, we are attempting to group together or cluster a number of villages according to the vocabulary they use for specific terms. Most computer centres have standard programs for cluster analysis which are not difficult to use. Shaw gives a useful example based on six fictitious villages and ten sets of alternative readings for lexical features. Village A has reading one in every case, Village B has reading one in the first seven cases and reading two in the last three, and so on. The figures are converted to percentages and a similarity matrix is formed showing the relationship of each village with all the others. The villages are then clustered according to their similarity. Villages D and E use the same term eight times out of ten and therefore form the nucleus of the first cluster. A and B agree with each other seven times out of ten, and so they too form another cluster. C agrees with A six times, and so it then joins the A and B cluster. Gradually all the villages join one of the two clusters, which will eventually join together into one large group.
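A small sketch of this clustering step is given below (in Python; it is an illustration, not Shaw's program or the <2CLUSTAN>2 package). The readings are invented, but they have been chosen to reproduce the agreements described above; the percentage measure and the single-linkage rule for joining clusters are assumptions of the example.

villages = {
    "A": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    "B": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
    "C": [1, 1, 1, 2, 2, 1, 1, 2, 2, 1],
    "D": [2, 2, 1, 2, 2, 2, 2, 1, 2, 2],
    "E": [2, 2, 2, 2, 2, 2, 2, 1, 2, 1],
    "F": [2, 2, 2, 2, 1, 2, 2, 2, 2, 2],
}

def similarity(a, b):
    """Percentage of features on which two villages agree."""
    return 100 * sum(x == y for x, y in zip(a, b)) / len(a)

# Single-linkage agglomeration: repeatedly merge the two most similar clusters.
clusters = [{v} for v in villages]
while len(clusters) > 1:
    i, j = max(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
               key=lambda ij: max(similarity(villages[p], villages[q])
                                  for p in clusters[ij[0]] for q in clusters[ij[1]]))
    level = max(similarity(villages[p], villages[q])
                for p in clusters[i] for q in clusters[j])
    print(f"merge {sorted(clusters[i])} and {sorted(clusters[j])} at {level:.0f}% similarity")
    clusters[i] |= clusters.pop(j)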

The clusters can either be represented graphically on a map using a computer graphics device, or they can be shown in the form of a dendrogram which illustrates the levels of similarity at which they are linked. Shaw gives an example of fourteen East Midlands villages showing the levels of similarity as a percentage (see Figure 4.7). There are several different mathematical ways of performing cluster analysis. One large computer program called <2CLUSTAN>2 is able to use many of them. It is advisable to try several methods on the same set of data to see whether similar results are produced. Shaw's initial experiment used only 60 features for 14 localities in three independent sets of data. It would still take some considerable time to perform all the calculations by hand for these numbers, but the advantage of using a computer can clearly be seen when much larger sets of data have to be analysed. It would then be impossible to perform all the calculations by hand. The study of dialects often requires the representation of particular dialect items on maps. Lance and Slemons describe a project in which the data was taken from the <1Dictionary of American Regional English>1 and consisted of vocabulary items recorded in response to questions asking the names of things. Such a question might be "What do you call a stream of water not big enough to be a river?'. Each different response (they have sixteen in all for this item) is allocated a letter as an identifier and only the letters are marked on the map. Figure 4.8 shows one of their maps, drawn on a lineprinter. This map would look much better if it were drawn on a graph plotter. No attempt has been made to draw isoglosses separating those places with different responses. If it were done on an ordinary pen plotter which uses several different colours of ink, a different colour could be used for each isogloss, thus making them much easier to see. Not all computer centres have graph plotters, and in some cases the lineprinter must be used for drawing. Dialect maps were also drawn by Rubin, who used the <1Survey of English Dialects>1 to study pronunciation, in particular the voicing of initial fricative sounds in the south west of England. The survey had reported on 75 locations in ten counties in England. Each location was identified for the computer by a four-digit number, with two digits indicating the county and two the locality, so that for example 3906 was Burley, the sixth (06) locality in county number 39 (Hampshire). Sixty-eight words were listed for each of the 75 localities, giving a total of 5100 items. A coding system was devised for the words which preserved all significant features of the phonetic transcription. An example here shows that the word "finger' (item <2VI.7.7.>2 of the <2SED)>2 is pronounced

[finger] at location 01 of county 31 (Weston in Somerset). Rubin was then able to sort his data by keyword, citation and location. Inspection of his lists showed that there was wide variation within the area from word to word and from locality to locality. It was then decided to produce dialect maps by computer. An offline plotter was used to draw first an outline map of southern England. The positions of each locality relative to the map were stored inside the computer. A map could then be drawn for each of the 65 words, indicating whether they started with a voiced or unvoiced fricative. Figure 4.9 shows the map drawn for the word "furrow', indicating the initial consonant used in each of the 75 localities. Other maps were drawn showing the proportion of voiced to unvoiced words for each place. A simpler method was adopted by the Welsh Language Research Unit at University College, Cardiff, who use concordances to investigate the phonetics of Welsh dialects. Extensive tape recordings are first taken in the selected dialect area. These recordings are then subjected to a detailed phonetic analysis and transcribed into symbols of the International Phonetic Alphabet. These transcriptions are processed by a straightforward concordance program which sorts the words by both their beginnings and their endings, using the phonetic alphabet as the key. The researchers have then been able to print their output in the phonetic alphabet using a computer microfilm recorder (Figure 4.10).
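Sorting a word list by the ends of words, as the Welsh concordances do and as was suggested earlier for ending-sorted frequency counts in inflected languages, amounts to sorting on the reversed spelling. A minimal sketch (in Python; the filename is an assumption) is:

import re
from collections import Counter

words = Counter(re.findall(r"[a-z']+", open("text.txt").read().lower()))

# Sorting on the reversed spelling brings words with the same ending together.
for word in sorted(words, key=lambda w: w[::-1]):
    print(f"{word:>20} {words[word]:5}")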

It is clear, then, that the computer has considerable possibilities in the study of vocabulary, whether of individual words or of the relationships of words with other words. From the design of language courses to the study of word relationships in political documents, a whole range of uses is apparent. Though simple vocabulary studies were carried out long before the advent of computers, the machine can clearly make much faster and more accurate counts than the human brain. Cluster analysis techniques in

the study of dialects or other vocabulary would be almost impossible without the computer. The machine has also made feasible the study of collocations, which was previously impracticable on anything but a small scale. The advent of larger storage devices could lead to much more work being done in the field of collocations and allow much larger networks of associative words to be built up.

We have seen in the discussion on concordances in Chapter Three that the computer cannot classify a word under its dictionary heading without prior instructions. This can only be done if the machine is presented with a series of rules from which it can uniquely categorise a word into its correct form, or if a computer dictionary is available which holds the lemma, or dictionary heading, for every word. Usually these two methods are combined, so that a series of rules is applied to remove a possible flectional ending and the resulting "stem' is looked up in a dictionary to ensure that it exists. It follows that a separate program and dictionary must be used for each natural language. In general, highly inflected and agglutinative languages are easier to analyse than languages like English in which words are sometimes inflected and sometimes not. A case study from Swedish lexicology described by Hellberg demonstrates the procedures required for lemmatisation. This study was made in connection with the word-frequency counts of modern Swedish mentioned in Chapter Four, which were taken from newspaper material. A dictionary of all possible endings was first stored. An alphabetical word list could then be lemmatised simply by grouping in each lemma words with the same stem but different inflectional endings. The "stem' of the lemma was defined as that part of the word that was identical in all its inflectional forms. The endings therefore did not correspond exactly with the usual linguistic ones. For example, the word "titel', whose plural is "titlar', was assigned the stem "tit-' and endings "-el', "-lar' etc., though the linguistic stem is "titl-'. Irregular or strong nouns which occurred frequently in compounds were treated as extra paradigms. The word "man', plural "män', appeared in so many compounds that it was established as a separate paradigm. The list of possible endings was shortened in one way. The ending "-s' always occupies the last position and can be added to almost every form of noun and adjective having a genitive function, or to verbs having a passive function. Including all these forms would have meant nearly doubling the ending list, so any final "s' was removed first and the remaining ending then treated as if the "s' were not there. Homographs in the alphabetical list were always treated on the assumption that a verb preceded a noun.
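The grouping step can be sketched as follows. This is a loose modern illustration in Python of gathering word forms under a common stem by stripping stored endings; it is not Hellberg's program, and the endings shown are illustrative English ones rather than his Swedish list.

from collections import defaultdict

endings = ["s", "es", "ed", "ing", "er", "est"]       # illustrative, not Hellberg's list

def possible_stems(word):
    """The word itself plus every stem obtainable by removing a stored ending."""
    stems = {word}
    for e in endings:
        if word.endswith(e) and len(word) > len(e):
            stems.add(word[:-len(e)])
    return stems

def lemmatise(word_list):
    lemmas = defaultdict(set)
    for word in sorted(word_list):
        # Attach the word to the longest candidate stem already seen;
        # failing that, let the shortest candidate start a new lemma.
        candidates = sorted(possible_stems(word), key=len, reverse=True)
        for stem in candidates:
            if stem in lemmas or stem == candidates[-1]:
                lemmas[stem].add(word)
                break
    return lemmas

for stem, forms in lemmatise(["walk", "walked", "walking", "walks", "wall", "walls"]).items():
    print(stem, sorted(forms))

As with Hellberg's procedure, the "stems' produced in this way are simply the invariant parts of the forms and need not coincide with the linguistic stems.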

In all, some 112,000 words were lemmatised in this way, and a manual test check revealed an error rate of 3.5%. These errors included fictitious lemmas which the program had derived from foreign words, such as one consisting of the English word "fair' and the French word "faire'. Meunier's system for lemmatising contemporary French follows the Swedish method closely. His text is first converted into an alphabetical list. Words are then looked up in a function-word dictionary, which includes the more common words in all parts of speech. The remaining forms are then treated in the same way as the Swedish words -- that is, each word is compared with the next to see if they could be variants of the same lemma. One of the features of this system is the automatic construction of the dictionary of possible endings, from which the computer can determine whether two endings are compatible. A similar method was adopted by David Packard for Greek, another highly inflected language. The program was written originally to assist the preparation of a textbook for teaching elementary Greek, by finding the most frequently occurring forms, but it has more general applications in the analysis of Greek text. Words are analysed in the order in which they come in the text, not from an alphabetical list. Each word is first looked up in a dictionary of what he calls "indeclinables', which consists of prepositions, adverbs and particles as well as highly irregular forms. About half the words are found in this list. Morphological rules are then applied and the program attempts to separate the ending from the stem, starting with the final letter only. If the single final letter is not found in the list of possible endings, the length of the ending is increased by one letter and the search made again. If an ending is found, the remainder of the word is looked up in a dictionary of stems. If it is found and is consistent with the ending, the program identifies this as a possible analysis. However, it does not then move on to the next word, but continues to search for longer endings in case the form is ambiguous. Contracted forms are treated as separate paradigms, though this procedure produces false stems. Nouns of the third declension are also given special treatment. The nominative GREEK and dative plural GREEK are placed in the list of indeclinables, as they cannot be reconstructed from the stem GREEK. Augmented forms which prefix an initial GREEK are found by program, but verbs with irregular augments are included in the stem dictionary, as are reduplicated perfect stems. Many Greek words are also formed by prefixing a preposition to a stem, and in some cases the final consonant of the preposition is assimilated to the following letter. The prefix GREEK may appear as GREEK or GREEK. The verbal augments come between the prefix and the stem. Packard's program attempts to remove prepositional prefixes from the beginning of a word if the word is

not found in the stem dictionary. It can also reconstruct elided forms, but crasis (the merging of two words into one) is more difficult to recognise automatically. The more common forms of it are therefore included in the dictionary of indeclinables. In many cases the program generates more than one analysis for a form. Instead of printing them all, the program is allowed to opt for the most likely version. In almost every case of ambiguity between a verbal and a nominal analysis, the nominal form is the correct one in Greek, and so the program always prefers the latter. The program can accommodate 4000 stems, 2000 endings and 1000 indeclinables. It uses a binary search for the dictionary look-up, and so the dictionaries are re-sorted into alphabetical order whenever a new entry is added. For those words which are not found in the indeclinables, several searches would be required for all possible analyses. The dictionary of endings is generated by program. From a nucleus of words, the other dictionaries are added to as required, an approach which is considered quite satisfactory. Packard's program is designed for a large computer and it can hold all the dictionaries in core. He claims that it can analyse 2000 words per second, although he does not give the error rate. Bathurst's account of a system to alphabetise Arabic words does not give sufficient indication of the success level. Arabic has an abundance of prefixed and infixed letters. Each word is built up of a root, which is normally three consonants, together with a combination of infixes, prefixes and suffixes. An added difficulty is that one or more of the radical letters may be one of three weak consonants, such as "w', which are liable to mutation or elision. Bathurst comments that there can be up to 2300 different combinations of prefixes preceding the first letter of a verb stem and more than 1600 for nouns. Infixes within the stems add more than 500 additional patterns. No examples are given, but there is an indication that there was some success in selecting the radicals of a number of words, which were then used to sort the words into alphabetical order. It seems that quadriliterals and words with weak radicals were dealt with separately by a dictionary look-up. A suffix removal program for English is described by John Dawson of the Literary and Linguistic Computing Centre, Cambridge. Like most lemmatisation programs, the Cambridge system starts by compiling a list of possible suffixes. Following a paper by Julie Beth Lovins, a list of about 250 English suffixes was constructed. This list lacked plurals and combinations of simple suffixes, and when these were added the total came to some 1200 entries. With each suffix was stored a condition code, a numeric marker indicating in which circumstances a suffix could be removed from a word. To cope with a dictionary of some 1200 items, the suffixes were stored by length, and then in reverse alphabetical order (i.e. in alphabetical order of

their final letters). The condition codes consisted of:
1. A number which represented the minimum stem length which may be left after the suffix is removed.
2. An inclusion list, indicating that the suffix may only be removed after these strings of characters.
3. An exclusion list, which gives those character strings after which the suffix may not be removed.
Several of these conditions may be combined for each suffix. To illustrate the conditions, we can look at the suffix -ITE in the words <2ERUDITE, DOLOMITE>2 and <2SPRITE.>2 The program will find the suffix -<2ITE>2 in <2ERUDITE,>2 but it has been instructed not to remove it in <2DOLOMITE,>2 because there it comes after the letter M, or in <2SPRITE,>2 because only three letters would then be left and they begin with an <2S.>2 The text is dealt with one word at a time, so that the order of words is immaterial. The program attempts to remove the longest suffix first, always allowing for a minimum stem length of two characters. All but the first two characters of the word are removed and looked up in the suffix dictionary. If this does not make a known suffix, the first letter of this hypothetical suffix is ignored and the remainder looked up again. Allied to the suffix removal process is a procedure for what is described as word conflation -- that is, grouping together all forms of the same stem. This procedure has been developed empirically and allows for the matching of such forms as <2ABSORB,>2 <2ABSORB/->2 and <2ABSORPT/-.>2 This is done by keeping sets of standard stem endings which can be considered equivalent. The conflation process is applied to blocks of words which are already in alphabetical order. The procedure for matching the groups is described by Dawson in some detail. Fifty-five stem ending sets are used, and these do not include many irregular verb inflections. A considerable degree of success was reported, as the following examples show:
<1ending set>1   <1example>1
B/BB           ROB   ROBBING
D/DD           PAD   PADDING
G/GG           TAG   TAGGING
But following the same principle, <2HIS>2 was grouped together with <2HISSED.>2 It is always important to note the percentage of errors in such a lemmatisation system. Dawson does not give this, but he does make the point that the scholar should be provided with a list of all the forms which have been classified with each stem, so that errors can be determined easily.
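A hedged sketch of suffix removal governed by such condition codes is given below (in Python). The suffix entries, the minimum stem lengths and the exclusion lists are invented for the example and are not Dawson's actual dictionary; they are merely chosen so that the -ITE behaviour described above is reproduced.

SUFFIXES = {
    "ite": {"min_stem": 4, "exclude": ("m",)},   # keeps DOLOMITE and SPRITE intact
    "ing": {"min_stem": 3},
    "ed":  {"min_stem": 3},
    "s":   {"min_stem": 3},
}

def strip_suffix(word):
    word = word.lower()
    # Try the longest suffix first, as Dawson's program does.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        rule = SUFFIXES[suffix]
        if not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)]
        if len(stem) < rule.get("min_stem", 2):
            continue                                   # stem would be too short
        if any(stem.endswith(s) for s in rule.get("exclude", ())):
            continue                                   # suffix may not follow these letters
        if "include" in rule and not any(stem.endswith(s) for s in rule["include"]):
            continue                                   # suffix only removable after these
        return stem
    return word

for w in ["ERUDITE", "DOLOMITE", "SPRITE", "ROBBING", "HIS"]:
    print(w, "->", strip_suffix(w))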

A number of other lemmatisation systems have been reported much more recently than those already discussed. It appears that after a lull of several years the possibilities of automatic lemmatisation are once again being explored. Maegaard's procedure for recognising finite verbs in French texts reports 100% success in recognising all the words that could possibly be finite verbs, although the number of erroneous forms is not given. Success with homographs was about 90%. This system again uses a dictionary of stems and a dictionary of endings. The roots were taken from a standard dictionary of French verbs and each root is accompanied by a numerical code indicating what paradigms can be applied to it. The program tests each word for endings by the same method as Packard -- that is, by removing the shortest possible ending first. When a legitimate ending and stem are found, the numerical codes are checked to see whether they are compatible. In French about 25% of the forms which are recognised as possible finite verbs are homographs, in the sense that they can be either a finite verb or something else. These are marked in the stem dictionary, which also gives an indication of which homography tests can be applied. For example, the participles ending in "-s' like <1mis>1 and <1pris>1 can only be finite verbs if preceded by a subject pronoun <1je>1 or <1tu.>1 These programs operate on basic text and appear to be quite successful in searching for this one type of verb feature. There have also been attempts at automatic lemmatisation in a number of other languages including Latin, Spanish, German and Russian. It seems clear that a success rate of about 95% can be attained for highly inflected languages. Packard's Greek system seems to be the most comprehensive, and his method of looking for a word first in a list of function words and irregular forms immediately takes care of about half the words in a text. One can either begin, as Maegaard did for the French finite verbs, by constructing a word list of stems from a standard dictionary, or the stems can be compiled from a word list of a text or texts. Whichever way the dictionaries are compiled, once production begins it seems preferable to operate on the basic text if possible rather than on an alphabetical word list. Both Packard and Meunier generate their ending dictionaries automatically merely to facilitate their compilation. Automatic generation of word endings or prefixes can also be applied to the compilation of language dictionaries or to language teaching. Shibayev's <2CARLEX>2 project was designed specifically for the teaching of Russian, but it was planned to extend it to other inflected languages. A basic Russian dictionary was compiled of the 4000 most frequent stems. Each stem was punched on a computer card with a code giving information

about its possible parts of speech and the type of inflexion for each part of speech, together with a tentative English translation. A second set of input cards consisted of a complete set of flectional endings given separately for nouns and pronouns, adjectives, verbs and all invariables with the cases they govern. The number of possible forms of these endings combined with all 4000 stems was therefore very large indeed. The computer was then programmed to generate the required inflexion in accordance with the code numbers of the stem data and to print out the entire grammatical structure of any one or more stems. The aim was to generate small amounts of output in response to specific requests rather than create a voluminous quantity of results from large runs. The ideal would be to operate such a program interactively from a terminal, and this was planned by Shibayev before his death. The student could type in either the stem or the English word and the terminal would then display the required inflection. <2CARLEX>2 operated in transliterated form, but given a video terminal with Russian characters it could be operated successfully as a language teaching system for students of Russian. Further developments to the program could allow the user to request a specific form of a word and display just that form rather than the entire paradigm. Figure 5.1 shows part of an example giving all possible forms for the adjective transliterated <2PREKRASNY1.>2 An algorithm for generating Hebrew or Arabic word forms is much more complicated, because both languages allow infixes and prefixes as well as suffixes to indicate various flectional forms. In a project described by Price each stem was stored in a dictionary together with numerical codes indicating which infixes, suffixes or prefixes can be added to the stem. Each word was then transcribed into a series of input descriptors from which the Hebrew word could be generated. This algorithm is part of a research project in machine translation from Hebrew to English, for generating Hebrew words was considered a useful way of examining the procedures required to parse them. Assigning the correct grammatical form or part of speech to a word in order to determine syntax can be a much more complicated process than the removal of flectional endings, but syntax can be studied with the aid of a computer in more simple ways. It may of course require no more than a straightforward search for word forms. Such an investigation is described by Dowsing where some aspects of Old English syntax were studied with the aid of a computer. Her main interest was in the use of "have' and "be' with the past participle and in the development of compound tenses. The use of inflexional endings in Old English meant that the past participle might have several forms. A very simple method was adopted for finding all examples of the past participle. The built-in editor on the computer's terminal system was used to find all the lines which contained any of some

60 different finite forms of "have' and "be'. Those lines containing any of them were edited into a second computer file, which was then searched for all occurrences of participial endings. This search inevitably picked up some spurious material which was later removed manually, leaving the participles in a file which could then be interrogated to find the distribution of specific verbal forms. A concordance would also have provided her with the material she wanted to find. Using the editor on the terminals is easier, but it requires more computer resources. A special purpose program to identify all the past participles would have taken much longer to write, but is another way of approaching the problem. Green's study of formulas and syntax in Old English poetry depends much more on manual specification of the syntax. Each line of his input text is accompanied by a code indicating the syntactic features in that line, and the machine is only used to sort the lines by syntactic feature. If, therefore, only one or two syntactic features are being investigated which can be identified by specific forms, it is much easier to use a concordance-type program to locate occurrences of those forms and then identify the required words manually after that. However, most syntactic studies require a fuller analysis of a text, both to classify words into their grammatical forms and to determine their function in a sentence. As the rules for sentence structure and word usage vary from language to language, there can be no general purpose program to perform syntactic analysis. A special program is required for each natural language. One well-known syntactic analysis program which has been used for literary text in English is <2EYEBALL,>2 developed at the University of Minnesota by Donald Ross and Robert Rasche. The original <2EYEBALL>2 was written in <2FORTRAN.>2 Three other versions of the program exist: one from Ohio written in <2SNOBOL,>2 which is also called <2EYEBALL,>2 one from the University of Southern California called <2HAWKEYE,>2 which is written in <2FORTRAN,>2 and one at the University of Oxford called <2OXEYE,>2 which was developed from the Ohio <2EYEBALL>2 and is therefore written in <2SNOBOL. EYEBALL, HAWKEYE>2 and <2OXEYE>2 all attempt to parse English text, albeit in a limited fashion, and to produce frequency distributions of each part of speech. The parsing rules for the Minnesota <2EYEBALL>2 are described in some detail in an article by Ross, and the other programs are similar. All set out to parse a basic text, to which no additional symbols have been added to facilitate the parsing. <2OXEYE>2 accepts text in a format identical to that required by <2COCOA,>2 and the two packages can therefore easily be used on the same text. All but the Ohio program work in several stages. At each stage the user may intervene and correct any mistakes in the parsing before going on to attempt the next stage. Ross's <2EYEBALL>2 has a built-in dictionary of about 200 function words. The program looks up each word in the dictionary. If the word is in the dictionary, some additional information about the word will be found there. For "and' the dictionary indicates that it is always a conjunction. "The' is always an article, but some function words like "for' can belong to more than one syntactic category, e.g. conjunction or preposition. <2EYEBALL>2 has a hierarchy for dealing with situations like this.
It looks for phrases before it looks for clauses, since prepositional phrases are usually short and are therefore easier to identify than clauses. It tries to find the end of a phrase and then works backwards from that. The end of a phrase could be punctuation, a preposition or an interjection, or possibly the word "not', or a word which introduces a clause, or another preposition, or an

interjection. All words which can begin clauses and all prepositions or interjections would be found in the dictionary. Clauses are more difficult to analyse, especially if they are long, since it is very difficult to identify the end of a clause if it is not signalled by punctuation. Ross and Rasche attempted to include one or two other automatic features in their program. One was to mark all words ending in "-ly' as adverbs, but a number of exceptions, like "family', "ally' and "belly', were found by the program. <2OXEYE>2 does mark "-ly' words as adverbs, but it includes the main exceptions to the "-ly' rule in its dictionary. Ross's <2EYEBALL, HAWKEYE>2 and <2OXEYE>2 all work in several stages. A computer file of words is first produced from the text and each word is given a tentative grammatical category. Figure 5.2 shows such a word file created by <2OXEYE>2 before any human intervention. The file can be corrected manually before any statistics are printed. Each part of speech is identified by a letter category as shown here:
Adjective J              Noun N
Adverb "ly' A            Participle L
Adverb function B        Particle prep R
Auxiliary X              Preposition P
Coordinator C            Pronoun U
Determiner D             Subordinator S
"To' infinitive T        "There' signal H
Infinitive verb F        Verb V
Intensifier adverb I     Miscellaneous M
"Not' O                  Unknown ?
Syntactic functions can be resolved further in a second stage of the program. They are indicated by the following symbols:
Subject S                Predicate V
Predicate adjunct A      Prepositional adjunct P
Complement C             Unknown ?
When the parts of speech have been modified, the parsing can be attempted, and the results of that are shown in Figure 5.3. Figure 5.4 shows the text printed with the corrected grammatical categories and parsing underneath each word. The Ohio <2EYEBALL>2 does not allow manual intervention, and one result is seen in Figure 5.5, where "Hammersmith' is parsed as an infinitive verb, showing one major problem in the automatic parsing of English: the word "to', which can be either a preposition or part of an infinitive.
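The first, dictionary look-up stage can be pictured with the small sketch below (in Python, not the FORTRAN or SNOBOL of the actual programs). The tiny function-word dictionary and the treatment of "to' as a preposition are assumptions made for the example; the letter codes are those listed above.

FUNCTION_WORDS = {
    "the": "D", "a": "D", "an": "D",        # determiners
    "and": "C", "but": "C",                 # coordinators
    "of": "P", "in": "P", "to": "P",        # prepositions ('to' is in fact ambiguous)
    "not": "O",
    "is": "V", "was": "V",
}

def tentative_tags(sentence):
    tags = []
    for word in sentence.lower().split():
        word = word.strip(".,;:!?'\"")
        if word in FUNCTION_WORDS:
            tags.append((word, FUNCTION_WORDS[word]))
        elif word.endswith("ly"):
            tags.append((word, "A"))        # the '-ly' adverb rule, exceptions apart
        else:
            tags.append((word, "?"))        # unknown: noun, verb, adjective ...
    return tags

print(tentative_tags("The coach went slowly to Hammersmith."))

For a sentence such as this, the output leaves "Hammersmith' and the verbs unresolved, which is exactly the kind of gap the later, interactive stages are intended to fill.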

Some of the statistics produced by <2OXEYE>2 using the corrected information are shown in Figure 5.6. It is then possible to ask the computer such questions as how many sentences begin with a noun, or how many sentences have consecutive sequences of two or more adjectives. These questions are formulated as <2SNOBOL>2 patterns, a feature peculiar to <2SNOBOL>2 but very easy for the non-programmer to understand. Figure 5.7 shows the printout from such specific questions. <2EYEBALL>2 and its derivatives are the only programs which have been used to any extent to parse literary English. Ross himself has worked on Blake's <1Songs of Innocence and of Experience>1 using the program and has been able to calculate the percentage distribution of each category of speech in each poem and overall. <2EYEBALL>2 and <2HAWKEYE>2 have been used to study Tennyson, Keats, Coleridge, Blake and Wordsworth, as well as Joyce's <1Ulysses.>1 <2OXEYE>2 has been used on Mrs Gaskell, as well as on Henry Fielding and Marlowe. The procedures which have so far been described in this chapter are required for any kind of machine translation system. A sentence in the source language must be analysed grammatically -- that is, the function of every word must be determined either by a dictionary search or by suffix removal or by a combination of both, and then the clause structure must be ascertained. The translation of each word must then be determined from a dictionary and a sentence constructed in the target language which is both grammatical and meaningful. It may well be that the word order is quite different in the two languages, or that one displays much more inflection than the other. In the early days of computing it was thought that machine translation might be the answer to many problems of communication, but the first work in the field soon revealed the difficulties, and it was found to be uneconomic. There has been a recent revival of interest in the subject, notably in Hong Kong for translating Chinese, but most effort in the western world is concentrated on the first stage of the procedure, the analysis of the syntax of natural language. One machine translation system, called <2BABEL,>2 developed by <2T.D.>2 Crawford at University College Cardiff, well illustrates the procedures involved. The input text, in this case Russian in transliterated form, is processed sentence by sentence. Each sentence is broken down into words and each word looked up in a dictionary. For each Russian word, the dictionary contains information about its grammatical category and subcategories, an indication as to whether it is a homograph and a provisional English translation. In this case the dictionary includes every form of every word, though it would be equally possible for the program to

strip the morphological ending from the words and then search only for the root. Crawford's view is that the increased capacity of storage devices on modern computers makes it more economical to retain every form in the dictionary. As in all systems, the amount of storage required has to be balanced against the amount of extra program needed to perform the morphological analysis correctly. Once all the grammatical categories for a sentence have been found, the machine operates a phrase-structure-type grammar until the sentence has been resolved. If the sentence contains a homograph the grammar is rerun, and it may need to be run several times if there are several homographs. When the analysis is complete, even if it has been found impossible, another set of rules operates which attempts to construct a sentence in the target language, in this case English. If the previous grammatical analysis is unsuccessful the resulting English will be ungrammatical. The chief inadequacy of this system, like many others, is its inability to match meaning to context. It will construct sentences which are grammatically correct but make no sense. Many homographs and other ambiguities cannot be satisfactorily dealt with by machine translation, particularly if they have the same grammatical category. The same Russian word, for example, may be translated as either "about' or "in' in English. Crawford's program always prints it as <2ABOUT/IN>2, as shown in Figure 5.8. Homographs are not found so frequently in technical text, and it is on this kind of material that most machine translation systems have concentrated. The somewhat staccato English produced by a program such as <2BABEL>2 would be acceptable to a scientist if the alternative was not to be able to read the article at all. The problem of semantics in machine translation has not yet been satisfactorily resolved, and a solution seems unlikely until very large dictionaries of collocations are available. In the future it may be possible to store information with each word which would indicate which words can or cannot be juxtaposed with it. This would result in a much larger dictionary and is beyond the capacity of current computer hardware. Crawford's dictionary consists of some 20,000 items, but each new text to be translated could include words which are not already in the dictionary. Therefore new texts must first be scanned for these words and the dictionary updated before any translation is attempted. More acceptable English is produced by the Chinese University of Hong Kong project, but in that system the source text is pre-edited to alter syntactic irregularities to a form that the machine can recognise, for example by inserting a subject in sentences which do not already have one, by inserting a copula to link subject and predicate, or by breaking up a complex sentence into simpler ones. The English produced (Figure 5.9) flows much more easily than that of <2BABEL,>2 but it is doubtful whether the results justify the pre-editing necessary.
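The word-for-word stage of such a system can be pictured with the sketch below (in Python). The tiny transliterated dictionary is invented for the example and is not Crawford's; it merely shows how a homograph can carry more than one gloss and be printed with both alternatives, in the manner of <2ABOUT/IN.>2

DICTIONARY = {
    "kniga": ["book"],
    "o":     ["about", "in"],      # homograph: both glosses retained
    "more":  ["sea"],
}

def rough_gloss(sentence):
    glosses = []
    for word in sentence.lower().split():
        glosses.append("/".join(DICTIONARY.get(word, ["?" + word])).upper())
    return " ".join(glosses)

print(rough_gloss("kniga o more"))    # -> BOOK ABOUT/IN SEA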

Machine translation is regarded by many as merely an academic exercise. It is uneconomic, not only because of the complexity of the computer programs required, but also because of the time taken to keypunch the material initially. But machine translation of technical material, especially if it is available in machine readable form as a by-product of computer typesetting methods, can be of use to scholars who would otherwise be unable to read the material at all. The language of technical texts is usually of a much simpler and more straightforward structure than that of literary material. The same is also true of business correspondence, where there is also a need for translation from one language to another. It is in these areas that machine translation has been found to be least unsuccessful, where the translation does not have to be elegant, but does have to give the meaning adequately. There is a lot of work still to be done on syntax analysers. The best

method of designing such programs is by no means apparent, but they are now increasingly being written to analyse search requests in information retrieval systems. This would allow a user to submit a request in natural language, and the computer would then analyse the request before performing the search. It is arguable whether this is economic. Training a user to specify a search request in a particular form does not take long, and it reduces the amount of computing to be done for each request, freeing more time for the searches themselves. The recent revival of interest in syntax analysis and machine translation

results from the development of computer hardware to a speed and size at which such analysis might be more feasible and might have useful applications. At the present time its practical uses are limited, and it is not within the scope of this book to expand further on a topic which is itself the subject of many books. Some references are given which may serve as a useful introduction to the world of computational semantics and artificial intelligence, but readers of them will note that the success level of those systems when applied to literary text is not very high.

Except for the production of concordances, the computer has been used and abused more frequently in the analysis of literary style than for any other form of literary text analysis. In particular, it has been used to "solve' problems of disputed authorship. In this chapter we shall point out those areas where the misuse of the computer has been common and attempt to demonstrate how it might be used to real advantage in the analysis of style. Can style be defined? And if so, which definitions of style lend themselves to computational analysis? Style manifests itself, for example, in an author's choice of vocabulary, in his use of long or short sentences or in the syntactic constructions he prefers. Traditionally, opinions about style have been largely intuitive, but the computer can now provide additional data with which to make an objective study. However, in order to make any kind of stylistic analysis by computer, specific features must be defined in terms which the machine can understand. Most computer-aided analyses of style have therefore restricted themselves to those features which are easily quantifiable and easily identified by machine. These features fall broadly into two groups: on the one hand, word and sentence length, together with the positions of words within sentences; on the other hand, the study of vocabulary, that is the choice and frequency of words. A third possible category is syntactic analysis, once the syntactic features have been adequately defined and classified. There is of course nothing new in the study of these features. It is merely that their study is facilitated by using a machine. Some of the earliest stylistic analyses were made by <2T.C.>2 Mendenhall and published at the turn of the century. Among other things, Mendenhall applied himself to the question of the authorship of the Shakespearean plays, studying only word length -- that is, the number of letters per word. He employed two ladies to count words, recording the number of words with one letter, with two letters, with three letters and so on. This they did for all the Shakespearean plays as well as for extensive material from Bacon, Jonson, Marlowe and others. In all, some two million words were counted. It was shown that Shakespeare and Marlowe were alone in their high usage of four-letter words. All the others peaked at three letters, except for the two-letter peak of John Stuart Mill.
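Mendenhall's counting is easily reproduced by machine. The following is a minimal sketch in Python (the filename is an assumption); it tabulates the number of words of each length and prints a crude version of his characteristic curve.

import re
from collections import Counter

def word_length_distribution(text):
    lengths = Counter(len(w) for w in re.findall(r"[A-Za-z]+", text))
    total = sum(lengths.values())
    for n in sorted(lengths):
        share = 100 * lengths[n] / total
        print(f"{n:2}-letter words: {lengths[n]:6} ({share:4.1f}%) {'*' * round(share)}")

word_length_distribution(open("text.txt").read())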

Mendenhall's account of the counting process is of some interest: [[... excellent and entirely satisfactory manner in which the heavy task of counting was performed by the (two) ladies who undertook it ... The operation of counting was greatly facilitated by the construction of a simple counting machine by which a registration of a word of any given number of letters was made by touching a button marked with that number.]] It is a pity that Mendenhall did not live seventy years later. His results would have been obtained very much faster, although he might not have dispensed completely with his two ladies. Mendenhall presented his results as a series of graphs showing how many words of each length were found for each author. For comparative purposes the counts for two authors on one graph were placed as shown in Figure 6.1 . Mendenhall illustrated his points well by presenting his results visually. Most modern analysts of style have preferred to present tables of numbers and then to manipulate these numbers further.

This manipulation of numbers inevitably requires statistical methods, and it is in this area that computer-aided stylistic analyses have been heavily criticised. Once numbers have been collected, we need to know how to interpret them and to determine whether they are of any significance or not. The basic mathematics of this is simple and can be understood by the non-scientist. The bibliography to this chapter gives several references for elementary statistics. Muller is good for statistics applied to linguistic analysis. Here we will limit ourselves to a few simple examples. One of the simplest and most obvious statistics is what is known as the arithmetic <1mean,>1 the commonest of the three "averages' used in statistics. If we take ten words by author A and see that their lengths in letters are 2 4 2 6 1 7 1 4 2 1, the average or mean word length is the total of these divided by the number of words, that is 30 ÷ 10 = 3. Another ten words by author B might yield the following counts: 2 4 3 3 1 3 5 2 4 3, also giving a mean of 3. A simple inspection of these counts shows that author B's word lengths are grouped more closely around their mean than author A's. Statistically we can measure this spread or variation by performing another calculation, the result of which is called the <1standard deviation.>1 It is derived as follows. Take each value in turn. Find the difference between it and the mean. Square this difference and add up all the squared values.
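The same arithmetic is easily checked by machine. The following minimal Python sketch (purely illustrative; it is not a program described in this chapter) computes the mean and standard deviation of the two samples of word lengths just given:

from math import sqrt

def mean_and_sd(counts):
    m = sum(counts) / len(counts)              # arithmetic mean
    ss = sum((x - m) ** 2 for x in counts)     # sum of squares about the mean
    variance = ss / len(counts)                # dividing by n, not n - 1
    return m, sqrt(variance)

author_a = [2, 4, 2, 6, 1, 7, 1, 4, 2, 1]
author_b = [2, 4, 3, 3, 1, 3, 5, 2, 4, 3]

print(mean_and_sd(author_a))   # (3.0, 2.049...)
print(mean_and_sd(author_b))   # (3.0, 1.095...)

Note that this sketch divides the sum of squares by n; as mentioned later in this chapter, some programs and calculators divide by n - 1 instead.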

If the value of each letter count is x and the mean is x̄, then for author A the differences x - x̄ are -1, 1, -1, 3, -2, 4, -2, 1, -1 and -2, and their squares add up to 42. The total 42 is called the <1sum of squares>1 about the mean. Divided by the number of items, in this case 10, it gives the <1variance>1 4.2. The standard deviation is the square root of the variance, and this may be obtained from square root tables, which yield 2.049. For author <2B,>2 we can also calculate the standard deviation, this time abbreviating the table to show how many words there are for each count, where x is the count, n is the number of times that count occurs and x̄ is the mean count. Here we have a sum of squares of 12, a variance of 1.2 and a standard deviation of 1.095. The two sets of words are thus shown to differ in their spread. These groups of ten words are obviously insufficient to make any representative comments about an author. They merely serve to illustrate the calculation of the simplest of statistics. But how much text should be used? If the whole text is available in machine-readable form, it is obviously preferable to work on the entire text, but in many cases it will be impossible to calculate statistics from the entire text or texts under study (the <1population>1). However, in literary studies as in other disciplines valid statistical conclusions can be drawn from <1samples>1 of text. From the statistics of a sample, such as the mean and standard deviation, it is possible to estimate values for the entire population. How and where to choose samples has been the subject of much discussion, particularly in the case of text, since words do not appear in random order in a text but are related to each other in sequence. Traditionally there are two accepted ways of taking samples. One method is to categorise the text in some way (e.g. by genres of literature) and then sample from each category. This ensures that the samples are representative. The other method is to be entirely random. That is not to say that sampling should be haphazard. If you drop your text and just let it fall open, it is far more likely that it will open somewhere near the middle rather than at the first or last page. Similarly, if you close your eyes and stick a pin in a page, the pin is far more likely to fall in the middle. True random samples are chosen by inspecting tables of random numbers. The best method of sampling text falls half way between these two

methods. Inevitably much is already known about a text which is to be sampled, and it can be divided into sections on the basis of this knowledge. Within these sections samples can be chosen using random number tables to indicate the starting page, line or word, and continuous lines of text can be taken from there. The size of textual samples varies considerably, but for a prose text at least 100 and preferably many more sentences should be taken consecutively. Ellegard reckoned that a sample must contain at least ten occurrences of the feature being tested, and so if the feature is particularly rare a lot of text may be required. If several texts are being used for comparative purposes, two or more samples from the same text can be tested against each other to see whether a particular feature occurs with regularity in that text before it is used in an attempt to discriminate between authors. A simple statistical test called the χ² (chi-square) test is frequently used to indicate whether there is any significant difference between two groups of data. This test is much favoured by literary scholars because it is easy to calculate and does not involve any mathematics more complicated than squares and square roots. It should always be based on absolute counts, never on percentages or proportions, and it should not be used where the value in any category is less than five. If this is the case in the "tail' of a distribution of numbers, those at the end should be added together and treated as a single value. A χ² test would be appropriate for examining the distribution of sentence length within a text. Suppose that the text is divided into four sections <2A, B, C>2 and <2D,>2 each of 300 sentences, and we wish to discover whether section A is typical of the text as a whole. The example shows how this could be tested using χ². Sentence lengths are first tabulated:

Note the way in which the information is grouped into sections of five words. Above the level of forty words per sentence the values are small, in most cases less than five, and so they should be grouped together into a single section for 41+, giving:

            A     B     C     D     Total
   41+      7     7     9     6      29

From these totals we can calculate what the expected number of sentences in each category would be for each section if the number of words per sentence was distributed evenly throughout the text, i.e. the total in each row divided by four, giving a table of expected values showing the number of sentences expected for each range of words per sentence. The next stage is to subtract the expected values from the observed ones and square the result. This will yield a positive number whatever the result of the subtraction. This figure is then divided by the expected value. For section A this can be tabulated as follows

The figures in the final column are added up to give a value for χ². We then need to know what this value means. We want to know if the counts for section A differ significantly from those for the rest of the text. Therefore we look up our figure for χ² in the χ² tables found in statistics textbooks, for which we also need to know the number of <1degrees of freedom.>1 This is always one less than the number of values in each column, in other words one less than the number of items measured separately. In our case the values were finally divided into nine groups, which gives eight degrees of freedom. In the tables for eight degrees of freedom a <1probability>1 of 0.05 is given for a value of 15.5, which is the nearest value given to our 15.69. Probabilities are always given between 0 and 1. A value of 1 indicates that an event is bound to happen, 0 that it will never happen. A probability of less than 0.05 indicates that the event has less than a 5% chance -- that is, less than a one in twenty chance -- of happening thus by chance alone. Likewise a probability of less than 0.01 indicates less than a 1% chance, or one in a hundred chance, of occurring thus by chance alone. A result which has less than a 5% chance of happening is generally considered to be significant, while one which has less than a 1% chance is generally considered highly significant. In our case the probability of obtaining a χ² value as high as 15.69 if section A is indeed typical of the text as a whole is less than 0.05, and so we would conclude that the difference between section A and the text as a whole is significant at the 5% level; in other words, there is only a 5% probability that the difference occurred by chance alone. Examining section B of the text gives a similar table of observed and expected values. In the χ² tables the resulting figure gives a probability of about 0.85, indicating that the distribution of values may well have occurred by chance and that the results are not significant.
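The χ² arithmetic itself is easily mechanised. The sketch below is an illustration only: the full sentence-length table is not reproduced here, so any counts supplied to the function are the reader's own. The one row quoted above (29 sentences of 41 or more words in all, of which section A has 7) is used as a check on the arithmetic:

def chi_square(observed, totals, sections=4):
    # observed: counts for one section; totals: the corresponding row totals
    value = 0.0
    for o, t in zip(observed, totals):
        expected = t / sections          # even distribution over the four sections
        value += (o - expected) ** 2 / expected
    return value

# contribution of the 41+ row alone for section A
print(chi_square([7], [29]))             # about 0.009

With all nine rows supplied, the resulting total would be compared against 15.5, the 5% value for eight degrees of freedom quoted above.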

In the two examples given, each section is tested against the entire body of the text including itself. It is also possible to use a χ² test to compare one section of the text against all the others excluding itself, or against each individual section. These latter tests are more likely to produce significant differences but should be used with discretion. Means, standard deviations and χ² are simple statistics, which in many cases can be computed by a standard computer program, but it is important to understand what these numbers mean. It is useful to perform some of these simple calculations by hand on a small quantity of text to ensure that the principles behind them are thoroughly understood, before allowing the machine to take over. One other point can be noted. Some computer programs when calculating the variance divide not by <1n,>1 the number of items, but by <1n>1 - 1. This may also be true of those pocket calculators which can work out standard deviations. If the number of items is very large, this is not going to make much difference to the results, but it is as well to be aware of the possibility. As many literary scholars have begun to use statistics, so many statisticians have begun to take an interest in text, as a new and interesting source of data. A number of articles have been published by professional statisticians and mathematicians attempting to find new mathematical ways of describing text. Of particular interest is the distribution of items of vocabulary in a text. This does not conform to any known distribution and has therefore attracted many attempts to find a new one. No one has yet succeeded in finding a measure which describes all vocabulary. Non-mathematicians should be warned that these articles require a fair knowledge of mathematics. We have now established some of the groundwork for the statistical analysis of literary style and for authorship studies. Whether using a computer or not, it is necessary to understand some simple statistics in order to quantify the results. With a χ² test the 1% and 5% probability levels can be used to indicate whether an event is likely to have happened by chance or not. If at all possible, it is advisable to operate on the whole text, not on samples. If samples have to be taken, they must be chosen with care, applying random number principles. It is useful to test individual samples from a known work against each other for homogeneity before attempting to make comparisons with disputed texts. A feature of style which is not consistent in known works cannot serve as a discriminant. As many features as possible should be investigated in authorship studies, and external evidence must not be forgotten. Good historical evidence may carry more weight than stylistic analysis. In the simplest case of a disputed work known by external evidence to be by one of two or more candidates, the work can be compared with text by all the possible candidates. If the test is merely to question the authorship of a work and not an attempt to assign it to another author, it must be compared with a representative set of

material by the author, and it must not be forgotten that an author's style may change throughout his life or may depend on the subject matter. These tests are very rarely conclusive. Frequently they may only provide negative evidence. In only a very few cases have problems of disputed authorship been solved totally, and these have been ideally suited to computer analysis. Using such quantitative methods will, however, accumulate as much evidence as possible and will allow many different tests to be applied systematically. One feature of style which has attracted some interest recently is syntax. We have seen in Chapter Five how the computer may (or may not) be used to parse a text. Here we shall investigate those syntactic analyses which have been performed primarily for stylistic purposes. In many cases this has entailed manual parsing of the material rather than computer parsing. In other cases, using computer programs like <2EYEBALL>2 and <2OXEYE,>2 the computer coding can be modified to correct any errors. Statistics from <2OXEYE>2 have already been illustrated in Chapter Five. Another example in Figure 6.2 also shows means and standard deviations. Milic in his work on Swift adopted a different method of coding which used a series of numbers rather than letter codes. Though written as long ago as 1967, his book gives some useful information for the quantitative analysis of prose style, but the method does have some drawbacks. A two-digit code is assigned to each word indicating its part of speech, and a separate code is used to mark the end of a sentence. Only the codes and not the text were input to the computer. A typical sentence then becomes 31 81 01 51 31 03 01 61 05 41 05 31 81 01 01 21 33 03. On the basis of these codes Milic was able to draw up tables of the frequencies of the different parts of speech in the prose works of Swift. This method of coding was taken up by one of his pupils, Mrs Koster, in an attempt to discover the author of the <1Story of St Albans Ghost.>1 She recoded by hand some of the texts which Milic had studied and in many cases arrived at a different total for each part of speech, showing that some subjective judgments must have been made before the material was put into the machine. The results are therefore only as reliable as the coding of the text. If this kind of hand coding is to be done, rigid rules for the coding should first be established to help to iron out inconsistencies. If a machine parsing program is used it should at least be consistent, and inconsistencies will only be introduced at the stage of manual correction of the data. Milic, who also worked on samples of text from other authors, including Addison, Johnson, Gibbon and Macaulay, found three tests which he considered to be the most reliable discriminators. These were high scores for the use of verbal forms, for what he calls introductory connectives, and

for different patterns of three-word groups. This latter test was conducted by selecting each successive group of three words and counting how many times each possible group was found. Milic established that in these three areas Swift behaves very differently from the other prose samples he studied. Mrs Koster took samples of known texts of two authors, Arbuthnot and Wagstaffe, and also a collection of what she calls pseudo-Wagstaffe, prose which is attributed to Wagstaffe but not signed. Very simple computer programs can be used to operate on this numeric data to obtain the counts necessary for the first two of Milic's discriminators. The third discriminator was tested by merely counting the number of three-word patterns. When these programs were run on the disputed work, it was found to have a lower score than Swift for two out of three discriminators. Three samples of Arbuthnot and Wagstaffe were each run separately and were found to be inconsistent in themselves. Mrs Koster gives absolute totals and percentages in her tables of results and states that there are differences in the results. It would have been preferable to use a χ² or other test of significance to determine whether these differences were noteworthy. A test of homogeneity within known works is a sound starting point, but the method of coding renders these studies somewhat suspect. A similar approach using manual coding of data was adopted by Leighton in a study of seventeenth-century German sonnets. This was another attempt to distinguish the style of a number of sonnets, the intention being to produce a kind of literary geography, to determine whether certain towns or certain regions were distinguishable by specific stylistic traits. The syntactic structure of the sonnets was used, as it was felt that the rigidity of the rhyme scheme imposed certain limits on the type of structure possible. Contemporary evidence showed that most of the poets put their rhyme schemes together and then set about writing a sonnet to fit them. The investigation therefore looked at sentence structure in relation to rhyme structure. A letter coding system was devised to indicate the features to be examined for each line, for example:

   main clause                        a
   interrupted main clause            b
   completion of main clause          c
   elliptical main clause             d
   extension phrase in apposition     e

Symbols for apostrophe and question mark and for figures of speech affecting sentence structure, such as anacolouthon and inversion, were also included, and so was a marker indicating the end of a line and a separate

marker for the end of a line when it coincided with the end of a sentence. Thus there were on average three or four coded letters for each line. The data could then be input very simply, one sonnet per computer card. A short program counted the features and performed some simple calculations such as the mean number of main clauses per poem, incidences of enjambments and apostrophes, and number of sentences per poem. As each line was represented by a sequence of letters, the computer was also programmed to find the most frequently occurring patterns for each line. Even when tested on a small pilot set of data, twenty sonnets by Fleming and thirty-one by Gryphius, Leighton was able to show that each poet had his own favourite line pattern frequencies. Leighton's approach was very simple but is worth considering for the analysis of clause structure. On a one-man project such as his, difficulties over inconsistencies in coding were less likely, and the data was reduced to a very small amount for each sonnet, much less than in Milic's coding of parts of speech. Such a process could easily be included in a vocabulary study by adding the stylistic features as an extra "word' at the end of each line of text. In this way, both the actual words of the text and the codings could be subjected to analysis. Oakman's study of Carlyle's syntax adopted a different procedure. In this case samples were chosen randomly from those works known to have been written by Carlyle. A total of two hundred paragraphs were chosen by using a table of random numbers, first to select the volumes and then to select the pages within the volumes. Oakman used an automatic syntactic analyser developed at the <2IBM>2 San Jose Research Laboratory. He admits that this parser was unable to cope adequately with some of Carlyle's work, but given that it was written to analyse technical and scientific material, this is hardly surprising. Oakman does not give an error rate but merely remarks that "glaring errors in the parsing were taken into account when the results were tabulated'. It seems clear that some kind of manual correction must be applied to machine parsing of literary text before any meaningful results can be presented. <2EYEBALL>2 and <2OXEYE>2 have an advantage here as they allow correction of the parsing before compiling statistics. Much more fruitful results have been obtained in those analyses which combine the study of vocabulary with word and sentence length. One of the earliest such studies was made by Ellegard on the so-called Junius Letters, a series of letters which appeared in the <1Public Advertiser>1 in 1769-72 under the pseudonym of Junius. There were thought to be about forty possible candidates for authorship, of whom the favourite was Sir Philip Francis. Ellegard published two books about his attempts to solve this authorship problem. The first, <1Who was Junius?,>1 describes the background to the problem and includes one chapter on statistical methods. The second, A

<1Statistical Method for Determining Authorship,>1 gives a full description of the methods he used. He first looked at sentence length but found considerable variability in the works of known authorship and thus concluded that the number of words per sentence could not help isolate one candidate for the Junius letters. Sentence length is of course a very easy feature to study by computer, but on many occasions it has been found to be so variable within a group of samples known to be from the same author that it is of little use in an authorship study. Ellegard then turned to vocabulary and looked for words which occurred more or less frequently than would be expected in the Junius letters. These he called "plus words' and "minus words'. As he did not use a computer for counting words, but merely for performing calculations on the results, he had to rely on his own intuition as to which words occurred more or less frequently than expected. Ellegard's study was done in the early days of computing when it was not easy to handle so many words by machine. Clearly it would have been advantageous for him to have used a machine for the counting and to have counted all words, not just those he specifically noticed. Nevertheless Ellegard's work has served as a model for many later authorship studies and gives a useful introduction to the subject. To examine the frequency of words which do not occur very often, it is obviously necessary to look at a substantial amount of text. Ellegard used very large samples of text, up to 100,000 words, far larger than is necessary for most studies. But as he was investigating words which did not occur very often, he believed that his samples had to be large enough to include at least ten occurrences of the word or words in which he was interested. He began by testing several samples from each known author against each other for homogeneity and discarded those items of vocabulary which were not uniform in occurrence. Though he was not able to reach a firm conclusion about the authorship of the Junius letters, he did find many features of their style and vocabulary which resembled closely those of Francis. Another early computer-aided authorship study was performed by Mosteller and Wallace on the <1Federalist Papers,>1 a series of documents written in 1787-8 to persuade the citizens of New York to ratify the Constitution. Three people were known to have written these papers, John Jay, Alexander Hamilton and James Madison. The papers eventually totalled eighty-eight in number, and they were all published first in newspapers under the pseudonym of Publius and then in book form. Of these papers, the authorship of only twelve was in dispute, and then only between Hamilton and Madison -- it was known that they were not by Jay. Since there were only two known candidates and a lot of comparative material of the same nature, Mosteller and Wallace were able to test their selected criteria thoroughly in papers known to be by Hamilton and Madison

before applying them to the disputed papers. Early work on these papers by Mosteller and F. Williams consisted of a study of sentence length. After much manual counting the results produced were:

   mean sentence length      Hamilton   34.55 words
                             Madison    34.59 words

They then considered whether there was any variability of sentence length between the two men. The standard deviations of sentence lengths for each paper were calculated and the mean of these taken. The results again were very similar:

   mean standard deviation   Hamilton   19.2
                             Madison    20.3

Some years later Mosteller attacked the problem again, this time in collaboration with Wallace. They began a systematic and comprehensive study of the vocabulary of the papers. A computer was used to count the words, and they concentrated initially on those words whose presence or absence in one or other of the authors had been noticed. They rapidly discovered that Hamilton always used "while' when Madison preferred "whilst' and also found that the disputed papers preferred "whilst', as did the three papers which were considered to have been written jointly by the two. Other words such as "upon' and "enough' emerged as markers of Hamilton. "Upon' was found to appear only 7 times in the 37,095 words in Madison's papers and 110 times in the 34,577 words of Hamilton. It appeared twice in one of the disputed papers and not at all in the others. "Enough' appeared 24 times in the known Hamilton papers and not at all in any other papers. In their early work, Mosteller and Wallace calculated the percentage of nouns, adjectives and short words using manual counts. On the basis of these they constructed a statistic which they called a linear discriminant function, which was intended to separate the writings of the two authors by giving high scores to Hamilton and low scores to Madison. Though not definitive, this measure favoured Madison as the author of the disputed papers. Further evidence supported this theory more and more as they later applied more advanced statistical techniques, in particular Bayes' theorem. It is generally accepted that this particular authorship problem has been solved conclusively, but it must be remembered that the disputed Federalist Papers were very suitable for this kind of approach. There were only two possible candidates, both of whom had written a considerable amount of undisputed material on the same subject matter.
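Counts of marker words are usually compared as rates per thousand words, since the two bodies of text differ in size. A few lines of Python (illustrative only, using the figures for "upon' quoted above) make the comparison explicit:

def rate_per_thousand(occurrences, total_words):
    return 1000 * occurrences / total_words

# 'upon' in the undisputed papers, counts as quoted above
print(rate_per_thousand(7, 37095))     # Madison: about 0.19 per 1000 words
print(rate_per_thousand(110, 34577))   # Hamilton: about 3.18 per 1000 words

A disputed paper whose rate lies close to one of these figures supplies evidence, though never proof, in favour of that author.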

The quantity of material to be studied (some 70,000 words of known text divided equally between the two authors, and some 25,000 words of disputed text) was not too small to make the calculations meaningless, but also not too large to make them unwieldy. Sampling was used to test known works for homogeneity, but once this was found the calculations were performed on the entire texts. A word frequency count will show which words occur frequently in a text, and comparisons can then be made to establish whether their frequent occurrence is unusual. The most frequent words in a text are almost always function words, which are largely independent of the subject matter. Cluster analysis techniques can be applied to study function words when comparisons are to be made between several texts. Ule gives an example showing a number of Elizabethan texts measured according to the distribution of three of the most common words in English, "and', "the' and "to'. The vocabulary frequencies are given as percentages, such that the total number of occurrences of all three words is 100%. For example in Woodstock "and' is 39.70%, "the' is 36.64% and "to' is 25.66%. The figures for fifteen texts in all are given in this form. From these a cluster analysis is performed showing which texts are most like each other on the basis of the usage of these words. The occurrence and distribution of function words can prove fruitful in stylistic comparisons, for if the feature being considered itself occurs frequently, conclusions can be drawn from smaller amounts of text. Such function words have formed the basis of a large number of analyses of Greek prose made by Andrew Morton and his associates. Greek is a language which has many function words such as connective particles, and both the position and the frequency of these particles can also be considered. Morton has popularised the use of computers in authorship studies and has received some criticism for his work. In a newspaper article in the early 1960s he claimed that only four of the epistles attributed to St Paul were written by him. This claim was largely based on Paul's use of the Greek word <1kai>1 ("and'). Since then Morton and his collaborators have worked extensively on the New Testament and other Greek prose authors. They have recently turned their attention to English text and even to the analysis of statements given to the police, and they have been called upon to give evidence in court on the basis of these analyses. It would seem that police statements are too short to give any sound statistical basis for this work, but nevertheless Morton has been much in demand since his court appearance. Morton's methods of stylometry are developed from those of W.C. Wake, a scientist who was interested in Aristotle. Wake's methods required the study of a frequently occurring event which is independent of subject matter. The occurrence should be examined in a wide range of texts to establish general conclusions about a writer. Wake was soon able to show that what is most characteristic of an author is not his personal

idiosyncrasies but the rate at which he performs the operations shared with all his colleagues. In Greek this would be the choice and usage of particles in particular. Collectively Morton's stylometric methods can yield a comprehensive and systematic analysis of a text. However Morton does tend to concentrate on one or two criteria. The work of <2A.J.P.>2 Kenny on Aristotle will be seen to be much more comprehensive in its coverage, but it was based initially on Morton's methodology. The sort of table which Morton very frequently produces is shown in Figure 6.3. This table shows the number of sentences which have the particle <1gar>1 as one of the first four words. Each sample is the first two hundred sentences from each book of Herodotus. Choosing a starting point by random number tables might have been preferable, but Morton very frequently takes the first two hundred sentences of a text as his sample. A χ² test could be used to indicate whether the high incidence of <1gar>1 in Book 7 is significant or not. In this case it is found not to be so. Morton has also studied the lengths of sentences in Greek. He does not delve very deeply into what constitutes a sentence, merely taking the punctuation inserted by later editors into the text to signal the ends of sentences. The question of what is a sentence in Greek prose cannot be resolved here, but clearly the same definition of a sentence must be applied to all texts.
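Counting the sentences which have a given particle among their first four words is exactly the kind of clerical task a program does well. The sketch below is an illustration only, not Morton's program; it assumes the text has already been divided into sentences and transliterated into Roman characters:

def sentences_with_particle(sentences, particle="gar", window=4):
    # count sentences having the particle among their first few words
    return sum(1 for s in sentences if particle in s.lower().split()[:window])

# applied, say, to the first two hundred sentences of each book in turn:
# for book in books:
#     print(sentences_with_particle(book[:200]))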

When the texts have been prepared by different editors it is as well to point this out. A typical Morton sentence-length distribution is shown in Figure 6.4. The sentences can vary in length up to 155 words. They are grouped in sets of five. Sentence length seems to be a more reliable discriminant for Greek than for English. Another feature studied by Morton is the last word in a sentence, whether it be a noun, a verb, an adjective or another part of speech. Word order is significant in Greek, and although many sentences frequently end with a verb, other words may be emphasised by being placed at the end of a sentence. Morton's work should by no means be disregarded. He was a pioneer in the field of stylistic analysis of Greek and his methods could usefully be

followed by many. However, his work can be criticised for its failure to apply more than one or two tests to a set of data. Assumptions about the authorship of a text are made on what is statistically very questionable evidence. Morton has been criticised for his excessive use of the χ² test, which he also computes in a different way from other scholars. We have seen that this test is easy to calculate and can be used to indicate whether results are significant or not. For that reason it is recommended, provided the data is valid for such a test, and Morton certainly uses χ² only where it should be used. Morton's methods were adopted and expanded considerably by <2A.J.P.>2 Kenny in his study of the Aristotelian <1Ethics.>1 Within the <1Nicomachean>1 and <1Eudemean Ethics>1 three books effectively appear twice. Kenny has made a lengthy and systematic comparison of the style of the two <1Ethics>1 and the three disputed books. Using word frequency counts, he has established that the differences between the disputed books and the <1Nicomachean Ethics>1 are much greater than those between the disputed books and the <1Eudemean>1 <1Ethics.>1 Kenny again concentrated on common words, dividing them into tables according to their grammatical function. In 20 cases out of 36 particles, the difference in usage between the disputed books and the <1Nicomachean Ethics>1 is too great to be attributable to chance; but in all cases except two there is no significant difference between the disputed books and the <1Eudemean Ethics.>1 Similar results were obtained from the examination of prepositions and adverbs. A number of adverbs in particular occur much more frequently in the <1Nicomachean Ethics>1 than in the <1Eudemean Ethics.>1 Again these are not found so frequently in the disputed books. Having completed an analysis of the common words in the <1Ethics,>1 which themselves totalled some 53% of the words in the entire texts, Kenny then studied the disputed books, not as a whole but in small sections of about 1000 words each. As the samples had now become smaller, the words to be studied were taken in groups of words of similar meaning, not as occurrences of single words. The groups were selected so that they consisted of words which were favourites either of the <1Nicomachean>1 or of the <1Eudemean Ethics,>1 but not of both. Following Ellegard, a distinctiveness ratio for each group was computed. This consisted of its <1Nicomachean>1 frequency divided by its <1Eudemean>1 frequency. Thus a word which occurs more frequently in the <1Nicomachean Ethics>1 will have a distinctiveness ratio greater than one, and one which occurs more frequently in the <1Eudemean Ethics>1 will have a ratio between zero and one. Only words of high frequency were chosen for these groups, so that a frequency of not less than ten could be expected in a sample of 1000 words. By comparing the expected number of occurrences with the actual number, it was possible to determine for each sample in turn whether it resembled the <1Nicomachean>1 or the <1Eudemean Ethics.>1
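The distinctiveness ratio itself takes only a line or two of code. In the sketch below the figures are invented for illustration and are not Kenny's:

def distinctiveness_ratio(nicomachean_rate, eudemean_rate):
    # rate = occurrences per 1000 words in each treatise
    return nicomachean_rate / eudemean_rate

# a hypothetical word group: 18 per 1000 words in the Nicomachean Ethics,
# 6 per 1000 words in the Eudemean Ethics
print(distinctiveness_ratio(18, 6))     # 3.0, a strongly Nicomachean favourite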

The results showed that every one of the 1000-word samples resembled the <1Eudemean>1 much more than the <1Nicomachean Ethics.>1 Kenny has also studied other features such as word length, sentence length, the last word in a sentence and choice of synonyms. In every case, although the entire text was examined, blocks of samples were built up and each block tested with all the others from the same text before being compared with samples from another text. Those features which do not occur uniformly were noted but discarded as possible evidence for an attribution study. Radday's work on Isaiah has used more complicated mathematical techniques, although like Kenny, Morton and others he has concentrated on function words. In collaboration with a statistician, Dieter Wickmann, he has devised a new test called an <1arcsin>1 test which is used to distinguish between the rate of occurrence of a word in two samples of text. The method requires some knowledge of mathematics to understand. The calculation produces a value which is called z, and if z is greater than 1.96, Radday and Wickmann demonstrate that the two samples can be said to come from different populations with a 5% chance of error. The z value must be at least 2.58 for a 1% chance of error. Working entirely with relative frequencies of function words and with this z value, they have discovered a number of cases where books of the Old Testament are not homogeneous in the usage of a single word. Their study of Isaiah is based on twenty different words and divides the book into roughly four sections. Using the definite article alone, 14 books or sections of books in the Old Testament have a z value greater than 1.96 and many of these are greater than 2.58. Some of these can be explained -- for example, the one section in Joshua which has a z value of 19.333 but which consists of a comparison of narrative with geographical lists. It seems rather pointless to include such obvious known differences, but perhaps it does at least validate the method. In most cases, then, these analyses of style have been based on simple calculations which are not beyond the non-mathematician. A computer can ensure that such a study is comprehensive and not based solely on isolated occurrences of words. A concordance can be a valuable start to a stylistic analysis, although unlemmatised word-frequency counts should be used with care. Words should be considered first as separate items of vocabulary and then grouped under their lemmas. There has been considerable misuse of statistics in the published literature. It is not at all difficult to learn enough statistics to use the simple tests which have been described in this chapter. More complicated statistics should be avoided unless they are understood thoroughly. If necessary, a professional statistician should be consulted. It is foolish now to embark on any quantitative stylistic analysis without a knowledge of elementary statistics. With such a knowledge it is possible to judge whether the differences in style between two or more samples of text are significant

or not. However, deductions from such evidence to conclusions on the authorship of the texts should be made with the utmost caution. One or two interesting results do not necessarily indicate that the authorship of a text is in dispute. Only in the case of the <1Federalist Papers>1 has it been possible to provide a conclusive solution to a disputed authorship question, and as we have seen there were features of that problem which made it unusually suitable for a proof of that kind. Kenny's study of the Aristotelian <1Ethics>1 has found what appears to be sufficient evidence to attribute the disputed books to the <1Eudemean>1 rather than the <1Nicomachean Ethics,>1 but nevertheless he stops short of claiming a conclusive solution to the problem. The computer will not produce an absolute solution to an authorship study, but it can be used successfully to provide sufficient evidence on which powerful conclusions can be based. It is used to best advantage in work which requires lengthy and laborious counting. Such a quantitative study using comparative tests can give sufficient significant evidence, provided it attempts to cover as many as possible of the features mentioned in this chapter. Only then can any conclusion be reached, and in many cases this may only be a negative conclusion.

There are several stages involved in the preparation of a critical edition of a text. Normally the text editor begins by studying all the manuscripts or editions of his text and then collates these manuscripts to find all the places where the readings differ. Variants may be anything from a misspelling or mistyping of a small and insignificant word or incorrect punctuation to the omission or insertion of whole blocks of text which may be many lines long. Once the variants have been found, the editor must decide which if any of them represents what the author wrote. This is usually done by investigating the relationships between the manuscripts. The oldest one might be considered to have the most likely reading, or alternatively the reading supported by most manuscripts may be the true one. The text can then be reconstructed using the chosen readings from the variants, and those variant readings which are considered important may be provided for the reader in the form of an apparatus criticus usually presented as a series of footnotes at the bottom of each page. The final stage is the printing of the text and apparatus. There are therefore five stages in the preparation of a critical edition of a text:

1. Collation of manuscripts
2. Finding the relationships between manuscripts
3. Reconstruction of the text
4. Compilation of the apparatus criticus
5. Printing of the text and apparatus

The computer has been used with some success in stages one and two. Stages three and four are really a matter for human judgment, but they can be facilitated by using material already stored in the computer from stages one and two, and of course if the material is already in computer-readable form it would be sensible to use the machine for stage five -- that is, the typesetting of the final edition. Provided that there is no doubt about the readings in the manuscripts, the collation of manuscripts is a purely mechanical process, consisting of

merely identifying places where two or more texts do not match. On the face of it this lends itself easily to computing, and indeed the machine has been used sensibly on the kind of text which is suitable for it. Texts which differ considerably present rather more problems for computer collation, as the machine is far more likely to lose its place when comparing two manuscripts with frequent and major variants. To prevent this it may be necessary to make many comparisons between a word in the master text and successive words in the comparison text until the place is found again. These comparisons can use up a lot of computer time if care is not taken in working out a suitable method for doing them. Algorithms for computer collation of texts have therefore attracted the attention of computer scientists interested in finding the best method of collation, that is the one which uses the fewest comparisons without losing its place. Different procedures may be involved in collating verse and prose. Verse is much easier to deal with than prose provided the line format remains the same from manuscript to manuscript. Variants should also comply with the metre of the words they replace. It is therefore much easier for the computer to keep its place in verse than in prose, where the line numberings, beginnings and ends vary from manuscript to manuscript. The format in which the text and the variants are printed out is important in computer collation. The machine is being used to simulate a series of human operations in which certain words must be noticed quickly. The computer should therefore produce output which can easily be scanned by eye for the variants. It should not be necessary always to refer back to one of the original texts when examining the variants, though this may be desirable from the editor's point of view. Almost all of the early computer collation systems assume that the computer input from manuscripts has the same line structure. Some computer programs compare only two manuscripts at a time. Others allow a large number of manuscripts to be compared. It is usual to select one text as a base or master text and consider the others as comparison texts. The various methods developed can best be illustrated by examples to show how each system would collate the following three lines, which we shall call <2A, B>2 and <2C.>2 We shall assume initially that they are presented to the computer as lines which should be compared with each other without the possibility of further variants of the same line appearing on preceding or succeeding lines of text.

<2A.>2 The computer can collate many of your manuscripts together
<2B.>2 This computer is able to collate your manuscripts together
<2C.>2 The <2ICL>2 computer can collate many of our manuscripts at once.
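Before turning to the published systems, it may help to see how small the core of the task is in a modern scripting language. The sketch below is not one of the programs discussed in this chapter; it is merely an illustration using Python's standard difflib module, and it lists, for each comparison text, the words which differ from the base text A:

import difflib

base = "The computer can collate many of your manuscripts together".split()
comparisons = {
    "B": "This computer is able to collate your manuscripts together".split(),
    "C": "The ICL computer can collate many of our manuscripts at once".split(),
}

for siglum, text in comparisons.items():
    matcher = difflib.SequenceMatcher(a=base, b=text)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":      # report only the variant readings
            print(siglum, tag, " ".join(base[i1:i2]), "->", " ".join(text[j1:j2]))

The difficult part, as the systems described below show, is not finding such differences in a pair of short lines but keeping one's place across thousands of lines and many manuscripts at once.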

In <1La Critique des Textes et son Automatisation,>1 Froger includes some specifications for a collation system, although he has apparently tested his theories only on a very small amount of data. His system assigns a number to each word and to each space between words as the text is read in. One text is specified as a base against which all others are to be compared. The computer then starts at the beginning of the texts comparing corresponding words from each. When it finds a mismatch it compares a larger block of words until the texts match again. The variants are saved and the text realigned where the match was found. Froger's method of printing the variants is difficult to follow, although it corresponds in appearance to the traditional apparatus criticus. The variant is accompanied by an identification of the manuscript where it was found and a code indicating whether it is a deletion, addition or substitution. The place where the variant was found is indicated by the numbers assigned to each word or space as the text was read in. For our lines, Froger's program would produce:

<2A.>2 The computer can collate many of your manuscripts together
<2B.>2 This computer is able to collate your manuscripts together
<2C.>2 The <2ICL>2 computer can collate many of our manuscripts at once

2,S,this,B; <23,A,ICL,C;>2 6,S,is able to,B; 10,D,many of,<2B;>2 14,S,our,C; 18,S,at once,C

Here the first group indicates that position two -- that is, the first word (allowing for a space before it to be numbered one) -- has a substitution coded by <2S,>2 the text of which is <1this,>1 and which occurs in manuscript <2B.>2 The A in the second group indicates that there is an addition, namely <1ICL>1 in position 3 in manuscript C, and so on. The chief drawback with this method, of course, is the numbering system. Even with our short example, it is not easy to see where <1at once>1 is substituted. Finding, say, word 352 with no other identification would be impractical. However, the output does approximate to the traditional apparatus criticus. It would be vastly improved if variants were referenced by line number, or if the words in the base text of which they are variants were also given. Dearing's collation program claims to be able to handle up to fifty manuscripts at one time. They are fed into the machine in blocks or sections, the size of each section being specified by the user. At the input stage the user must know that a block from one manuscript corresponds to the same block from another and that this correspondence applies also to each line within each block. The program then compares manuscripts line by line. It compares words at the beginning of each line of text and continues until a variant is found. It then compares words at the end of the line and works backwards. All words between two points at which the texts do not match are marked as variants. The base text is compared in this manner with each of the comparison texts. If the line is missing in one

manuscript, the comparison text is searched ten lines backwards and forwards until a matching line is found. An indication is then made that the line numbers do not correspond. If no match is found, an interpolation or omission is marked. The printout contains the complete line from the base text, together with the identifiers of the manuscripts which agree with the base text. Where there is not a perfect match, the variant portion together with the identifiers of other manuscripts having this variant is printed. After all the variants have been printed, the identifiers of the manuscripts which omit the line are given. If an interpolated line is found in a comparison text, the first such text where it is found is treated as the base text and the procedure repeated. Using his system our texts would give:

A1. The computer can collate many of your manuscripts together
<2B1.>2 This computer is able to collate
<2C1. ICL>2 computer can collate many of our manuscripts at once

As is rapidly apparent, many words are classified as variants which should not be so, simply because of jumping to the right of the line and working backwards from there when a variant is found. Small variations at each end of the line would generate many unnecessary variant words. This printout contrasts with Froger's, which gives many more shorter variants for the same text. Another program by Silva and Bellamy arrives at results very similar to Dearing's, but by a method which consumes more computer time. Their system is again designed for poetry and therefore expects roughly the same words in the same line in each manuscript, thus creating a false impression of simplicity. It only operates on two texts at a time. There are two stages in their process. The first attempts to establish the alignment of the texts by comparing every fifth line from the two texts. If they do not match, the program then attempts a line-by-line search backwards and forwards for up to twenty lines in both manuscripts. When the texts are realigned, the occurrence of an interpolation or deletion is recorded. One line is assumed to be a variant of the other if no match is found in this twenty-line search. Once the texts have been realigned in this way, a letter-by-letter search left to right and then right to left, following Dearing's method, is carried out on each line. All characters between two mismatches are considered to be variants. The printout consists of the complete line from the base text, followed by the line from the comparison text indicating where the texts do not match.

(A compared with <2B)>2
A1. The computer can collate many of your manuscripts together
<2B1.>2 **is computer is able to collate *******************************

<2(A>2 compared with <2C)>2
A1. The computer can collate many of your manuscripts together
C1. **** <2ICL>2 computer can collate many of our manuscripts at once

Like Dearing's, Silva and Bellamy's method creates apparently lengthy variants when there are two simple variants at the beginning and end of a line. Widmann's collation of <1A Midsummer Night's Dream>1 adopts a very different approach. The computer is not used to search back and forth for variants, but rather to print out all the editions in a format which aids eye collation. Professor Widmann is the editor of the New Variorum Edition of this play and was therefore faced with the task of collating between 80 and 100 editions. Her method is much simpler, but it clearly requires more human endeavour. It relies very heavily on the correct alignment of the editions, line by line, before any collation is attempted. Each line of text is prepared on a separate computer card, the first four columns of which are reserved for through-line numbering. The text begins in column eight, and if it is too long for one card it is then continued on to the next card. If one edition has such a long line, this is matched by a blank card in all the other editions. Therefore a line which appears as two lines in one manuscript is allocated two cards in every edition. This allows the computer to assume that the texts will always match line for line. A sample of Widmann's output is shown in Figure 7.1. The line is given in full from the copy text Q1, together with its through-line number. Only the variants are listed after this. Manuscripts which agree completely with the copy text are indicated by their identifier only. Widmann's text is coded for upper and lower case letters and differences between these are considered variants, as are differences in punctuation. The symbols >w indicate a difference in line length. The example shows one line which is in effect two lines in many editions. This is indicated by the many occurrences of >w. For our lines, Widmann would give:

1. The computer can collate many of your manuscripts together
   B This is able to >w
   C <2ICL>2 our at once>w

It is not clear from her description how an omission in one of the comparison texts is marked on the printout. The >w shows that the line length is different, but it would be convenient to know where. In other respects, Widmann's method appears to suit her needs and presents a clear readable printout. It requires more clerical effort, as the initial alignment is

done by hand, but given that drawback it produces the most useful output we have seen so far in collating verse. Four algorithms have been published for collating prose, although only three of them have been applied to a substantial quantity of real text. The earliest published prose program, named <2OCCULT,>2 is described by George Petty and William Gibson. <2OCCULT>2 is an acronym for the Ordered Computer Collation of Unprepared Literary Text. Petty and Gibson were initially interested in comparing two texts of Henry James' short novel <1Daisy Miller.>1 The same program was later applied to Melville's <1Bartleby the Scrivener.>1 In their method the computer compares texts in blocks

of twelve words to determine whether they are similar. They consider two sets of twelve words to be matched if the sum of the number of words in the longest continuous matching sequence, plus the total number of matching sequences of two or more words, is six or more. To examine our texts by this method we need to add another sentence to bring the number of words up to twelve.

<2A.>2 The computer can collate many of your manuscripts together. It works well.
<2B.>2 This computer is able to collate your manuscripts together. It can work.
<2C.>2 The <2ICL>2 computer can collate many of our manuscripts at once. Really?

Comparisons are performed only on two texts at a time. On taking A and B we obtain the following score:

   matching sequences: "your manuscripts together. It'

Therefore, for one matching sequence of four words, the score is five. <2OCCULT>2 would say that these two lines are not variants of each other. A and C produce the following results:

   matching sequences: "computer can collate many of'

For one matching sequence of five words the score is six. According to <2OCCULT>2 criteria these lines are just variants of each other. Because of the smaller variants inserted more frequently in either A or <2B,>2 these two lines appear to <2OCCULT>2 to be more dissimilar than A and <2C.>2 Petty and Gibson have found that about 90% of matching sequences can be established by this rule. Our one simple example implies that this may be an overestimate. <2OCCULT>2 works only on additions and deletions. A substitution is considered an addition in one text and a deletion in the other, which could be a drawback in some work. The shorter text is always taken as the base text and the longer one as the comparison text, which is again not necessarily the best methodology from the editor's point of view. If the twelve words do not match, up to three hundred words are searched forwards and backwards from the point of mismatch until matching blocks of twelve words are again found. All the intervening words are considered variants.
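The matching rule itself is easy to express in code. The sketch below is an approximation only: Petty and Gibson do not publish the details of how OCCULT locates its matching sequences, so Python's standard difflib module stands in for that step here. On the three twelve-word blocks above it reproduces the scores of five and six quoted in the text:

import difflib

def occult_score(block_a, block_b):
    # longest matching sequence plus number of matching sequences of 2+ words
    blocks = difflib.SequenceMatcher(a=block_a, b=block_b).get_matching_blocks()
    sizes = [b.size for b in blocks if b.size > 0]
    if not sizes:
        return 0
    return max(sizes) + sum(1 for s in sizes if s >= 2)

a = "The computer can collate many of your manuscripts together. It works well.".split()
b = "This computer is able to collate your manuscripts together. It can work.".split()
c = "The ICL computer can collate many of our manuscripts at once. Really?".split()

print(occult_score(a, b))   # 5: not treated as variants of each other
print(occult_score(a, c))   # 6: just reaches the threshold of six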

This could well lead to a large number of spurious variants if a mismatch is found on a comparison like that between our manuscripts A and <2B. OCCULT's>2 chief disadvantage is said to be the time it takes to perform all the comparisons. It was written in <2SNOBOL>2 and tested before there were efficient <2SNOBOL>2 compilers. The time disadvantage may be less important now, but its methods of finding a mismatch are somewhat suspect. <2OCCULT>2 produces printout in two columns. The left-hand column gives those words in the base text which have variants in the comparison text and the right-hand one gives the actual variants. One word preceding the variant is printed in both cases. The references indicate page number and line number within each text. Assuming that <2OCCULT>2 was collating our original three lines and they had been found to be variants of each other, we would obtain:

(A and <2B)>2
1.1 The                        1.1 This
1.1 computer can               1.1 computer is able to
1.1 collate many of            1.1 collate

<2(A>2 and <2C)>2
1.1 The                        <21.1 The ICL>2
1.1 of your                    1.1 of our
1.1 manuscripts together       1.1 manuscripts at once

The printout is quite well presented. There is sufficient indication of what the variants are and where they occur, but it could perhaps be improved by saving the variants from several collation runs and printing them side by side. But, however the variants are printed, they are only useful if the methods of finding them are satisfactory. Another algorithm for collating prose was published by Margaret Cabaniss. Her program allows the user to specify various features such as the number of words which constitute a match -- that is, a return to alignment of the texts once a variant has been found. The user can also specify how many words the program is to compare, in order to find its place again after a variant. The ability to change these options adds more flexibility to the program, which also includes what is called a "coarse

scan', for finding the alignment. This would normally cause every tenth word in the master text to be compared with every word in the comparison text. Here again the user can specify how many words are to be skipped in this coarse scan. Cabaniss was the first to provide a method of dealing with function words: the user can supply a list of words which are to be ignored when looking for variants. The chief defect of this program is again the format of the printout. It is not lined up satisfactorily, as Figure 7.2 shows. Two lines which are variants of each other are not printed side by side. Collation by eye is then required to find the variants. It should not have been too difficult to modify the program so that lines which are variants of each other are printed side by side. This system again only allows two texts to be collated at once. Our example would appear as follows:

(A and B)
1. The computer can collate many     1. The computer (is able to) collate
   of your manuscripts together      ( ) your manuscripts together

(A and C)
1. The computer can collate many     1. The <2(ICL)>2 computer can
   of your manuscripts together      1. many of (our) manuscripts (at once)

Cabaniss's method allows the user flexibility in determining the number of comparisons to be made, but her program is marred by poor presentation of output. The variants are too lengthy and must be collated by eye. Printing out the variants side by side is reasonable when there are only two texts to be collated. Cabaniss's program was designed to collate twenty-four texts. If each of twenty-three comparison texts is to be compared with one base text, the printouts themselves would occupy too much room side by side to be useful.

None of the programs we have considered so far are all-purpose programs and they do not entirely satisfy the needs of the textual editor. In contrast much more thought was put into the most recent program to produce results from actual text, which was devised in Manitoba by Penny Gilbert and is called <2COLLATE.>2 It is part of a series of computer programs which were written to aid the production of a critical edition of Buridanus' <1Quaestiones Supra Libros Metaphysicae,>1 of which seven manuscripts exist. The collation process is divided into a number of simple stages, at each of which manual intervention is possible. The first ten columns of the

input cards were used for a reference for the line of text. The first stage of the program recreates the base text in a computer file and makes a second computer file of all the words in the text, each accompanied by its reference. The second stage performs the comparison itself. When a mismatch occurs, the computer takes the next two words in the base text and compares them to successive pairs of words in the comparison text, up to ten pairs. If this test fails, it compares the next but one, and next but two, words after the mismatch from the base text to the comparison text in exactly the same manner. Then groups of four words for up to 50 words in the base text are tested against 100 words in the comparison text. The next level of search works on groups of nine words and searches 100 words in the base text and 200 in the comparison text. Finally the program adopts a different approach and searches for paragraph markers, comparing the first four words of the next ten paragraphs in an attempt to realign the texts at the beginning of a paragraph. The actual levels of search and the numbers of words to be considered in each case were chosen by Gilbert to suit the material she was collating. The program stops if it is completely unsuccessful, and the human may then realign the texts for it; but Gilbert did include a facility for control cards to be inserted to instruct the machine to ignore badly mutilated sections. As the variants are found, they are stored in another computer file. (It is surprising that none of the other collators mention the machine's obvious ability to store the variants once they are found.) As the base text is compared against more manuscripts, the variants found can be merged into a file of variants found in previous comparisons.

Gilbert's first printout attempts to simulate a traditional text with apparatus criticus. The base text is printed out in full with identification numbers pointing to the input text. Underneath there is a list of variants, each accompanied by its identification number and the manuscript in which it appears. The identification numbers seem to consist of the line number followed by the position of the word in the line, although they are not described and are not easy to follow. Assuming this is so, for our lines we would obtain:

<2(A>2 and <2B)>2
The computer can collate many of your manuscripts together 10000
The A 10010
This B 10010
can A 10030
is able to B 10030
many of your A 10050
your B 10070

<2(A>2 and <2C)>2
The computer can collate many of your manuscripts together 10000
computer A 10020
<2ICL>2 computer C 10020
your A 10070
our C 10080
together A 10090
at once C 10011

Gilbert's system was devised after some considerable examination of the previous attempts. She appears to be unique in attempting to integrate the various stages of text editing. As we have seen, variant readings from all manuscripts can be printed underneath the text. A further stage allows variants which the editor considers insignificant to be removed. No exclusion list like that of Cabaniss is provided, because it was felt that words to be discarded were the decision of the scholar; but it would not be difficult to add this facility. The file of variants can be altered by using an editing program on the terminal, which we have seen is a very simple process. Where the same variant occurs more than once, only the first example is stored, together with the reference identifiers for the other examples. A final stage allows the text and variants to be printed in the traditional form.

Another collation algorithm is described by Cannon, although he does not show any examples of his algorithm working on text. He is a computer scientist concerned with efficiency of programs and has pointed out that most collation systems are slow, simply because of the number of comparisons they have to make when a mismatch occurs. He claims that the Petty/Gibson and Gilbert methods in particular perform many repeated comparisons. His method uses a table stored inside the machine to record whether a match has been found when two words are compared. He can then access this table very quickly when he needs to compare those two words again, rather than do a lengthier character comparison. His algorithm does not appear to have been programmed and he does not deal with the problem of what to do when a total mismatch occurs. In fact his approach is that of the programmer, whose interests may conflict with the problem which has to be solved. The text editor should always be consulted to ensure that the finished product is useful.

Computer collation is, then, much easier for verse text, although the efforts to produce suitable programs for certain prose texts have been quite successful. It is arguable whether the results can justify the work involved, especially as it is necessary to prepare all the texts in computer-readable form, no mean task if there are more than a few texts involved. Widmann's Shakespeare collation was performed on a small text of something over

2000 lines, but on 100 editions and therefore 200,000 lines of input. Gilbert's seven texts filled some 80,000 cards. The human errors which inevitably creep into such large undertakings of data preparation will no doubt be discovered by the collation process, but these must all be removed before conclusions can be drawn from the results. Collating two or three editions of a short text would be suitable for a computer, particularly if there were no large variants. It is arguable whether it is worth it for much larger amounts of text. The Gilbert method was successful, but it was a relatively large undertaking. The most suitable course seems to be to follow Gilbert and to make the collation only a part of a larger computer system for preparing a critical edition.

The methods which have so far been described are not suitable for comparing texts which are markedly different. Oral texts such as folk plays have been handed down from one generation to another and can exist in many quite different versions before they are finally written down. Many scholars have shied away from the complex textual problems associated with orally transmitted material, but M.J. Preston has proposed a different method of comparing such texts in order to quantify the differences between them. He divides a text into overlapping units. Each begins at a word and continues for fifteen characters, either letters or spaces. These blocks of text are compared with similar blocks from another text and are said to have made a match when thirteen of the fifteen characters match and the first word matches or is an inflectional variant, although he does not describe how an inflectional variant is recognised. In fact several aspects of this method are not entirely clear, but he appears to have found an approach which is suitable for his own material. The method could be explored further in a number of ways. By varying the number of letters which must match in each group, for example, it may be possible to find out how many or how few matches are required before the method fails completely.

The second stage of the task of preparing a critical edition, assuming that all the variants are known, is to try to deduce the relationship between the manuscripts. An attempt can be made to find the manuscripts which are more closely related to each other and those which may perhaps be considered to hold the original reading when a variant occurs. Even when there are not many manuscripts, this can involve many comparisons. When the number of manuscripts and variants becomes large, these can be handled much more easily by a computer. Traditionally manuscript relationships have been represented as a kind of family tree, with the oldest texts at the head and those which are thought to be derived from them shown as later generations. Reconstructing such trees can be attempted by computer even when some of the manuscripts have been lost. It is frequently possible to construct several trees from the

same set of variants. Only the scholar can decide which is more appropriate on the basis of his philological knowledge. Let us suppose that we have five manuscripts, now called <2A, B, C, D>2 and E, which read

A  Jack went up the hill
B  Jack climbed up the hill
C  Jack went up this hill
D  Jack went down this mountain
E  Jill went up this mountain

We can consider what would be the relationship between these manuscripts. The variants can be represented in a tabular form. From this table we see that for variant number one (<1Jack>1 and <1Jill)>1 manuscripts <2ABCD>2 have reading number one, that is <1Jack,>1 and E has reading number two, that is <1Jill.>1 In the right-hand column we see how the manuscripts are grouped for each reading. Following Froger's method, based on the mathematical theory of sets, we can select one manuscript, let us say <2A,>2 as an arbitrarily chosen base. The manuscripts can then be grouped in subsets of the set of all manuscripts as follows:
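By way of illustration only, the grouping step might be suggested in a few lines. This is not Froger's own procedure, merely a minimal Python sketch which, for each variant, lists the subset of manuscripts whose reading differs from the chosen base A:

readings = {
    "A": "Jack went up the hill".split(),
    "B": "Jack climbed up the hill".split(),
    "C": "Jack went up this hill".split(),
    "D": "Jack went down this mountain".split(),
    "E": "Jill went up this mountain".split(),
}
base = "A"

for position, base_word in enumerate(readings[base], start=1):
    # the subset of manuscripts which do not share the base reading here
    subset = "".join(sorted(m for m, words in readings.items()
                            if words[position - 1] != base_word))
    print(position, base_word, subset or "(none)")

# printed subsets: E, B, D, CDE, DE; nested subsets such as E within DE
# within CDE suggest the branches of a possible stemma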

From the second drawing a stemma can be reconstructed as follows, leaving the * to denote a missing manuscript which might have read <1Jack went up this mountain.>1 A computer program can construct such a diagram by examining all the variants. The computer will also provide all other possible stemmata, taking each manuscript in turn as the arbitrary base. The scholar with his knowledge of the manuscripts can then select the stemma which seems most appropriate.

Poole states quite rightly that Froger's method depends on the existence of many readings which have only two variants, and that it will only produce a satisfactory stemma if there are no coincident variations, which Poole calls anomalies. If these occur, a set of manuscripts may be included in more than one higher set, thus destroying the stemmatic relationship. He demonstrates that when pairs of variants can be arranged in an endless linked sequence there is no possible stemma into which they can all fit. One of the readings at least must be anomalous. Accordingly he has devised a computer program which can detect the presence of an anomaly in the data. He can make some assumptions about which readings are anomalous and use this procedure to eliminate them, leaving the material for constructing a stemma. His program was tested on 54 lines of <1Piers Plowman,>1 of which seventeen manuscripts exist. These 54 lines present profound textual problems because there are 400 readings with variants recorded. This trial showed that objective results could be obtained even from a very small sample of text showing such complicated variations; but the entire program depends on the assumptions made at the beginning, and these may not necessarily be true. Poole concludes that though the computer can only rarely produce a definitive stemma, it can be used with considerable success to provide reliable materials for the reconstruction of a stemma.

Dearing also has a set of programs which attempt to reconstruct manuscript stemmata. The first, called <2PRELIMDI,>2 merely provides basic genealogical connections between texts. These relationships are established without consideration of which manuscript may be at the top of the tree. Two further programs, called <2ARCHETYP>2 and <2MSFAMTRE,>2 accept the data provided by <2PRELIMDI>2 or similar input. <2ARCHETYP>2 locates the position of the archetype on the basis of fully directional variations among the texts or states of the text. It gives the information necessary for drawing up a family tree. The second program, <2MSFAMTRE,>2 constructs textual family trees on a theory of probability as to the general mechanics of growth. The user can specify the particular conditions for growth, such as the probable number of manuscripts copied before any were lost sight of or destroyed, rates of copying as compared to rates of loss, and the number of

extant manuscripts. It is assumed that the user can reflect his knowledge of historical conditions in these instructions. The program supplies all possible family trees for the extant manuscripts, leaving the scholar to decide which is the most probable or valid.

Another approach to the reconstruction of manuscript trees was adopted by Zarri, following the theories expounded by Dom Quentin in the 1920s. Zarri has constructed a number of computer algorithms to simulate and test Quentin's work. Quentin's theory of "characteristic zeros' is adopted as follows. A set of three manuscripts <2A, B>2 and C can be said to be related in a linear fashion only when A and C never agree with each other against <2B. B>2 is therefore intermediate between the two manuscripts. Only "significant' variants are used, and Zarri has adopted an interactive computer program so that he can decide which variants are to be significant when the program is actually running. Three manuscripts only provide the simplest case. When four are used, they can be divided into four sets of three manuscripts. In the case of four manuscripts <2A, B, C>2 and <2D,>2 we can suppose that the following triplets are characterised by zeros, and that <2ACD>2 has no zero. This produces a linear chain <2CABD>2 where the origin can be either C or <2D.>2 Zarri's algorithm then goes on to examine all possible subsets of size five, split into groups each of four manuscripts, using the results obtained from the previous stage to continue the linearity. It continues with larger and larger groups until either the size of the group is equal to the total number of manuscripts or until the linear construction fails. It is apparent that this algorithm can only consider linear relationships and must discard all others which cannot create a chain. In a simple example, if the <2ABD>2 group above produces <2B--A--D,>2 the chain is lost. If this relationship is discarded, it leads to a loss of information. The algorithm was therefore modified to allow non-linear structures. The analysis is now more complicated, and there are several possible groupings which may or may not allow for missing manuscripts. Zarri experimented with his algorithm on a set of real manuscripts and found that it only gave unambiguous results when the number of "linear' and "non-linear' elements did not exceed twenty.
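Quentin's basic test, as described above, is simple enough to sketch. The fragment below is a hypothetical Python illustration, not Zarri's program; it checks whether, for three manuscripts, the two outer ones ever agree against the middle one, using some of the readings from the earlier Jack-and-Jill example:

readings = {
    "A": "Jack went up the hill".split(),
    "B": "Jack climbed up the hill".split(),
    "C": "Jack went up this hill".split(),
    "D": "Jack went down this mountain".split(),
}

def characteristic_zero(outer1, middle, outer2):
    # True if the two outer manuscripts never agree against the middle
    # one, i.e. the triplet can be arranged linearly with the middle
    # manuscript intermediate between the other two.
    for r1, rm, r2 in zip(readings[outer1], readings[middle], readings[outer2]):
        if r1 == r2 and r1 != rm:
            return False
    return True

print(characteristic_zero("A", "C", "D"))   # True:  C can stand between A and D
print(characteristic_zero("A", "D", "C"))   # False: A and C agree against D ("up", "hill")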

He does not regard the ambiguous results for higher numbers as discouraging. Further work has led him to be able to reconstruct graphs of a more general type, but he claims always that the scholar must use his own knowledge to decide which of these manuscripts is most likely to be the oldest. Zarri's computer model is said to have provided some appreciable results, initially on the manuscripts of the <1Chanson de Roland.>1 It is not perfect, but it can provide the scholar with all possible interpretations and thus supersede purely manual reconstructions. When given all the significant variants, the computer can easily derive all the characteristic zeros and generate all possible reconstructions from these, but this method depends entirely on the acceptance of the validity of Quentin's arguments.

The reconstruction of family trees for manuscripts, or at least the attempt to find all relationships between them, lends itself to computing. The solution is rarely unambiguous; but any other solution, whether using computers or not, is unlikely to be conclusive. The advantage of using the computer is that it can be programmed to generate all possibilities. It is then for the scholar himself to consider the importance of the variants. It may be possible to develop further a system like Zarri's to allow for weighting of the important variants. The same manuscripts can also be analysed repeatedly, each time using a different subset of the variants. This is an area where more research would be useful. Zarri's method employs some mathematical methods of graph theory and matrix analysis, and anyone trying his methods would be well advised to consult a professional mathematician.

A very different method has been adopted by Griffith in studying the manuscript tradition of a number of classical texts. Griffith believes that the family tree methods are not appropriate to the study of the relationships between manuscripts. He points out that different scholars have produced totally different trees from the same variants, a view supported by the computer programs already described. The family tree method implies that the manuscripts are in a fixed set of relationships to each other. Griffith considers that the relationships between manuscripts should be considered as a whole rather than as a complicated series of linear structures.

Following methods well established in the biological sciences, he used cluster analysis to study and classify his manuscripts. This technique has already been described in Chapter Four with regard to variant words in dialectology and mentioned in Chapter Six in connection with stylistic analysis. It can be applied equally to manuscript variants. His first analyses were made manually by the method known as seriation. Under this technique the manuscripts were placed along a notional spectrum line such that those which were most like each other were placed together at one end of the line and as far as possible from those to which they bore least resemblance, which were at the other end of the line. Griffith soon discovered that a computer can compile a similarity matrix much more easily than a human. He now uses a simple program, to which is fed a list of variants, accompanied by the identifiers of the manuscripts which have those readings. The seriation method improves on the family-tree method in allowing more readings to be considered. The program has therefore been extended so that it can work in a multi-dimensional fashion, the number of dimensions being one less than the number of manuscripts. The values in the similarity matrix are transformed mathematically, and from these results only the first three dimensions are taken, as the others have become insignificantly small numbers. The computer's graph plotter is used to provide a visual representation of their spatial relations as a three-dimensional drawing. This method of visual presentation is preferred to the dendrogram, so that the distance between each pair of manuscripts can be measured more easily. Griffith claims rightly enough that such a cluster analysis only produces a classification of the documents, but this classification makes allowance for all the variants, whereas with the mechanism for constructing family trees many variants may be discarded. Griffith can operate his program on up to 100 variants at once from about 25 manuscripts. This could and should be expanded to use many more variants to give a clearer overall picture. He has found that the clustering appears to be constant over long stretches of the same work when he has analysed it in sections. It is of course possible to weight the variants, but he has avoided this because it introduces some subjectivity. A simple experiment of weighting some variants in Juvenal by duplicating them yielded results which were only slightly different from the non-weighted ones. Griffith uses his own programs, but standard programs which perform cluster analysis can be used on manuscript data if desired. If some of the more intractable problems of textual relations are to be analysed thoroughly, mathematical methods like these are clearly useful in modern textual criticism.
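As a rough indication of the kind of computation involved, a similarity matrix for a handful of manuscripts might be compiled by counting, for each pair, the proportion of variant passages at which they share a reading. The following is an illustrative Python sketch under these simple assumptions, not Griffith's program:

from itertools import combinations

# one row per variant passage, mapping each manuscript to its reading
variants = [
    {"A": "Jack", "B": "Jack", "C": "Jack", "D": "Jack", "E": "Jill"},
    {"A": "went", "B": "climbed", "C": "went", "D": "went", "E": "went"},
    {"A": "up", "B": "up", "C": "up", "D": "down", "E": "up"},
    {"A": "the", "B": "the", "C": "this", "D": "this", "E": "this"},
    {"A": "hill", "B": "hill", "C": "hill", "D": "mountain", "E": "mountain"},
]

manuscripts = sorted(variants[0])
similarity = {}
for x, y in combinations(manuscripts, 2):
    shared = sum(1 for v in variants if v[x] == v[y])
    similarity[(x, y)] = shared / len(variants)

for pair, value in sorted(similarity.items(), key=lambda item: -item[1]):
    print(pair, round(value, 2))

# the most similar pairs (A-B and A-C here) would lie closest together
# in a seriation or cluster analysis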

We can conclude this review of computer techniques in textual criticism with an example which is characterised by the vast number of variants, so many that they can only be organised at all by computer methods. A new critical edition of the New Testament is based on some 5000 manuscripts, and there are in addition many other witnesses to the readings, such as quotations. In the compilation of this edition, Ott does not plan to use the computer to find the variants initially but only to store them all, and then to compile the apparatus criticus (which he expects to be approximately one page for each line of text). It would clearly be impossible to include all 5000 manuscripts in the apparatus criticus. All major ones will be represented and a selection of the minor ones included. The selection was made on the basis of the relationships between manuscripts, and one thousand passages were selected on which these relationships could be established. The computer is used for storing and sorting the variants according to a number of different criteria. For each variant one card was punched containing an identification number, followed by the text of the reading and the Gregory numbers of the manuscripts which have this reading. Ott's examples show 243 manuscripts at 30 selected passages in the epistles of St Peter. He is able to print out the variants in several different tabular formats. The first printout gives the complete passage where the variant occurs, and under it is listed the first variant with the total number of manuscripts which have that variant and all their identifiers. Each variant is treated likewise in turn, so that the editor can see at a glance which is most frequent. The manuscripts which have a complete lacuna at this passage are also noted. The next printout is a table of passages against manuscripts. Within the table each element holds the number of the variant which occurs at that passage and manuscript. A similar table shows the size of the group of manuscripts which have the same reading for the specified passage, as shown in Figure <27.3.>2 A table like this immediately isolates the occurrence of single witnesses, and from it the reading supported by the majority of manuscripts can easily be found. A similarity matrix is then computed for the manuscripts, from which a cluster analysis could be performed. The whole matrix for 30 passages in 243 manuscripts must be fairly large. Ott says that it occupied 23 pages of printout. Absolute values found in the similarity matrix must be converted to percentages for clustering, and of course they cannot make any allowances for the manuscripts which have lacunae. Ott has preferred to calculate the degrees of relationship between each manuscript and all the others rather than attempt to cluster them all together, but there are many other ways in which the relationships between these manuscripts can be expressed. The computer is being used merely as a tool for handling large amounts of data. It is enabling the compilation of the most comprehensive critical edition yet to be made. Handling such large quantities of variants would be almost impossible manually, and

inaccuracies would inevitably appear. The computer can provide its classification on the basis of existing readings, thus providing the scholar with a better foundation for his judgment.

There is therefore considerable scope for using the computer in the preparation of a critical edition of a text. We have considered computer collation in some detail and found that it suffers from the major disadvantage of the need to prepare all versions of the text in computer-readable form, but it may still be worthwhile in the long run. A more straightforward application for the computer is the analysis of the relationship between manuscripts. It is preferable to input to the machine one base text together with all known variants. Providing an initial base text allows the editor to produce a concordance which may be very helpful in selecting which variants to insert, by showing their occurrences elsewhere in the text. The computer is being used to simulate those operations which have previously been performed manually. Besides speed and accuracy, its advantage is that it can handle a much larger number of variants and manuscripts. The development of a Zarri- or Griffith-type model requires some knowledge of mathematics, but procedures such as those adopted by Ott are much simpler and can provide results which are equally useful.

Once the editor of the text has selected his readings, the computer can be used to create the final version of the text. If one complete version of the text is already in the machine, an editing program on a terminal or a special-purpose program can be used to produce the required text. In either case the computer can display the line with each possible variant reading, and the editor can then choose one or insert his own emendation. A computer program called <2CURSOR>2 was initially developed at the University of Waterloo to perform such a task. The important variants from the file of variants can be retained for the apparatus criticus. Using such a program as an interactive text editor, the scholar can establish the final version of his text and then use a photocomposing machine to typeset his edition. There will then be no more need for proof-reading, as the final version will already be accurate for printing.
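The variant-choosing step described above, in which each line of the base text is displayed with its recorded variants and the editor's decision accepted, might look something like the following fragment. It is a hypothetical Python sketch, not the <2CURSOR>2 program, and the data layout is invented for the example:

def edit_text(base_lines, variants):
    # base_lines: the lines of the base text
    # variants:   a dictionary mapping a line number to the alternative
    #             readings recorded for that line by the collation stage
    final = []
    for number, line in enumerate(base_lines, start=1):
        choices = [line] + variants.get(number, [])
        if len(choices) == 1:
            final.append(line)          # no variants recorded: keep the base reading
            continue
        print("Line", number)
        for i, reading in enumerate(choices):
            print(" ", i, reading)
        answer = input("choose a number, or type an emendation: ").strip()
        final.append(choices[int(answer)] if answer.isdigit() else answer)
    return final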

The analysis of metre in poetry and rhythm in prose, and the investigation of poetic features such as alliteration and assonance, can be assisted by computer. These features are essentially incidences of regular patterns and, provided the patterns can be specified sufficiently clearly, the computer can be most useful in finding them. Once the features have been found, the computer can be used to compile tables of particular patterns or to determine where and when they are significant.

An examination of sound patterns can be very simply programmed for the computer, particularly in a language where one graphic symbol has one unique sound. A straightforward example is the investigation of sound patterns in Homer performed by David Packard, which is based on a count of letter frequencies in each line. Figure 8.1 shows a table produced by the program, from which it can be seen that 536 lines have no alphas, 1909 have one, 2540 have two and so on, in Homer's <1Odyssey.>1 The counts for L indicate a total for all liquids GREEK. P gives all labials GREEK, T gives all dentals GREEK and K all gutturals GREEK. Nasalised gamma is separated under the heading <2yK.>2 Many of the nineteenth-century Homeric scholars commented on lines in Homer which they thought had an unusually high proportion of one particular sound. While a high density of a sound in one particular line can be readily noted, it is not so easy without a computer to indicate the frequency of that sound in the work as a whole. Packard notes these scholars' attempts to isolate such lines. For example Stanford draws attention to a line in the <1Odyssey>1 which has three deltas. Packard's counts show that this happens about once in every fifteen lines in the <1Odyssey.>1 Similarly Stanford notes the "ugly guttural sounds' describing the Cyclops' cannibalism, citing one line with three gutturals and one with five. It can be seen from the table that these densities of gutturals are not at all infrequent. Packard's program also records the line numbers of those lines which have a high density of a particular sound, and it is frequently useful to examine the content of the lines. The line in the <1Iliad>1 (Book 23, line 16) which contains most alphas (11 in all) is well known for its imitation of the sound of stamping feet. Many of the lines which have five or six etas deal

with youth and lovemaking. The verse with the highest concentration of liquid letters in the <1Iliad>1 mentions the fair-flowing river Scamander. Packard cites many other lines as examples of high density of particular sounds. This type of study is extremely easy for a computer, but it would be almost impossible for the human brain to perform it accurately.
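Counts of this kind are easily reproduced. The sketch below is an illustrative Python fragment, not Packard's program (which of course works on the Greek alphabet and groups letters into liquids, labials, dentals and gutturals); it tabulates how many lines of a text contain nought, one, two or more occurrences of a given letter and lists the lines with an unusually high density:

from collections import Counter

def density_table(lines, letter):
    # how many lines contain 0, 1, 2, ... occurrences of the letter
    counts = Counter(line.lower().count(letter) for line in lines)
    return dict(sorted(counts.items()))

def dense_lines(lines, letter, threshold):
    # line numbers of lines in which the letter occurs at least
    # `threshold` times, the lines worth examining for their content
    return [number for number, line in enumerate(lines, start=1)
            if line.lower().count(letter) >= threshold]

# e.g. density_table(text_lines, "a") might give {0: 536, 1: 1909, 2: 2540, ...}
# for the alphas of the Odyssey, as in the table described above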

The occurrence of groups of letters, such as specific clusters of three or four consonants, may then be examined. Packard identified a number of Homeric lines which have unusual densities of consonant clusters. He also inspected those lines which have unusual juxtapositions of vowels, with a particular interest in the hiatus of omega and eta. Following Dionysius, he was able to calculate a "harshness' factor for each line. Each sound was assigned a harshness number, and the final harshness factor is the sum of the numbers for each sound in the line, times ten, divided by the number of sounds. It is again easy to program a computer to apply this formula to all Homeric lines and then study the context of the lines to see whether the subject matter relates to the sound.

A similar study is reported by Kjetsaa on Russian. An eighteenth-century writer, Lomonosov, had attempted to classify the letters of the Russian alphabet as either <1tender>1 or <1aggressive.>1 The object of Kjetsaa's research was to see whether those lines of Lomonosov's own poetry which had high frequencies of either tender or aggressive sounds did in fact relate to these feelings in meaning. A high frequency of a letter was defined as the occurrence of that letter three or more times in one line. The categorising of lines into tender and aggressive was done manually, as the computer could not determine the meaning of a line satisfactorily. A first experiment on some of Lomonosov's odes showed that about 70% of lines tender in meaning were also tender in letters. The correlation was less clear for aggressive lines, but there was some indication of this. Kjetsaa also carried out further studies relating to the distribution of letters within lines, particularly investigating which letters occurred frequently in the tender and aggressive lines.

These simple frequency counts of sound patterns are based on counts of letters, but the study of alliteration and assonance presents rather more problems. Let us consider alliteration in English. The letter c alliterates with s in some cases, but with k in others. c in combination with h giving <1ch>1 makes a different sound again. These distinctions could not be detected merely by letter counts. In English it would also be necessary to deal with silent letters like k in <1knight.>1 Alliteration is frequently observed only as the repetition of initial letters of words, but it may also be seen in the initial letters of syllables which do not themselves begin a word. It may therefore be necessary to use some kind of phonetic transcription when studying alliteration. But whichever method of transcription is employed, the computer can be put to best use in the mechanical searching for repetition of patterns.

An investigation of alliteration in an older Scots poem, <1Barbour's Bruce,>1 illustrates some of the ways in which alliteration may be analysed using a computer, and some of the problems which may be encountered. One program simply prints out all those lines which contain initial alliteration,

which is defined as all those lines which contain more than one word beginning with the same letter. The inadequacies of this simple definition soon became apparent. Many function words were found beginning with the same letter which could not be considered to contribute to alliteration. A number of problem letters were identified, notably <1q>1, <1th,>1 <1c>1, <1w>1 and <1v>1, which either gave faulty alliteration (<1ch>1 and <1c>1) or missed alliterations (<1k>1 and <1q>1). The program was altered to print out a line of asterisks whenever it found alliteration on one of these letters. This would draw the researcher's eye to a problem which could then be solved manually. The possible missed alliterations were also dealt with by hand, once the computer had printed out those lines which contained one occurrence of a difficult phoneme. It would surely not have been difficult to program the computer to look for combinations of letters which make certain sounds. For example in English, a letter <1c>1 is always soft when it is followed by <1e>1, <1i>1 or <1y>1 and hard before other letters except <1h.>1

Once the alliterations have been found it is possible to ask such questions as: how often does an alliterative letter continue into the next line? How many times is the letter repeated in the same line and then in successive lines? Is there double alliteration in one line? The number of such questions is endless, but the computer will supply rapid answers to all those which can be specified precisely.

There have been attempts to measure the frequency of alliteration in several different ways. Wright attempted to include embedded as well as initial consonants, to locate areas of high concentration of consonants and if possible to devise some means of measuring them. An experiment was conducted on two of Shakespeare's sonnets. The first stage was to count the number of times each letter occurred in each sonnet and then to calculate the probability of occurrence for each letter. Another method tried to partition a sonnet into fourteen equal segments, not necessarily consistent with line divisions. High occurrences of letters in specific segments were noted, but the very partitioning caused some information to be lost, as high clustering over a segment boundary would not be noticed. Wright could have overcome this problem by running his program again with different segment boundaries, but he does not appear to have done this, nor does he explain why he adopted this somewhat arbitrary method to begin with. He presents a table showing the percentages of occurrence of each letter in each segment. He chooses to omit five letters which occur less than four times in total, as, for example, the incidence of three <1q>1s in one line, and this must also have missed some noteworthy features. A further more successful attempt used <1Hildum's association measure,>1 which calculates the relationship between each possible pair of letters from the length of the gap between their occurrences. The measure gives a value of zero if there is no association and values greater than zero for higher

associations, with a maximum of 0.5. Negative values indicate a high degree of disassociation. In the table shown in Figure 8.2, letters occurring less than four times have again been eliminated and values of between -0.19 and +0.19 are omitted as insignificant, except for the diagonal, which indicates the association of each letter with itself. It can be seen that <1t>1 associates highly with <1h>1, which would be expected, but the central diagonal does not show much alliteration. From the inspection of his sonnets, Wright expected a higher figure for <1s>1 with itself, as there were two lines with six <1s>1s, but the figure given is due to the distribution of many other <1s>1s in the texts. Wright's calculations were made on letters rather than phonetic coding, but the method could equally be applied to phonetic codes. He concludes that the phenomenon of alliteration is one of domain or locale, which cannot necessarily be captured by the measurement of median gap distances alone.

Leighton measured alliteration in German sonnets by the basic criterion of three identical consonants within one line. He carried out a number of experiments before making this choice and admits that his criterion ignores alliteration which exists solely in the juxtaposition of two words with identical initial consonants. Though this latter type of alliteration is very easy to find by program, it produced too many errors to be useful. Leighton has now developed a program to break words into their syllabic components and, with a strict metrical pattern, this will allow the distribution of stresses to be taken into account in determining where there is alliteration. It will also allow medial consonants to be considered for alliteration. The bunching of function words beginning with a <1d>1 or <1w>1 in German can be a problem. If they are ignored completely, an unusual grouping of them would not be found. Leighton has overcome this problem by weighting them with a lower value than content words, so that they will still be found but only in higher concentrations.

Chisholm, in investigating phonological patterning in nineteenth-century German verse, follows Magnuson's concept of phonological frames, dividing texts into larger and larger units. Recurrences of consonants are only recognised if they have the same position within a syllable, i.e. if both are prevocalic or postvocalic. The text is transcribed phonologically using a dictionary look-up from the base text. Three rules were devised for identifying syllable boundaries. These were

1. between words
2. immediately before all free lexical items and derivational suffixes beginning with a consonant, e.g. <1Herbst-tag>1
3. after a single consonant or in the middle of consonant clusters where neither of the previous two rules apply. The cluster is divided evenly

with any extra consonants placed to the right of the syllable boundary, e.g. <1Ta-ge, fin-den.>1

Two <2SNOBOL>2 programs then operate on the data. One counts the relative frequency of all sounds in prevocalic, vocalic and postvocalic positions in stressed and unstressed syllables. The second program identifies all repetitions of sounds in a specified number of frames and prints the results as shown in Figure <28.3.>2 Here only the syllables which have repetitions are shown in their metrical positions within a 16-line frame of one poem. Since the number of syllable pairs varies from one frame to another, the measure of sound matching in each frame is expressed as a ratio of the number of sound repetition pairs to the total number of syllable pairs available for comparison. Chisholm is then merely investigating the repetition of sounds, albeit syllables not single phonemes, and examining their relationship with each other within specific units of the poems which vary in size from two lines upwards.

Leavitt attempts to quantify alliteration in a different manner. He used ten pieces of test data ranging from a Shakespeare sonnet to a <2SNOBOL>2 manual and transcribed them in three different ways: according to the International Phonetic Alphabet, then following Chomsky and Halle, and finally following Fromkin and Rodman. He first calculated both the occurrences of high frequency consonants and the largest gaps between two occurrences of these consonants. From this he can derive what he calls a gap recurrence function which is used to represent the frequency and density of the features. Certain types of phoneme such as initial sounds can be weighted if necessary. He found that the work on the <2IPA>2 transcription was in general unsatisfactory but undoubtedly an improvement on the basic text. The most satisfactory results were obtained from the Fromkin and Rodman features, using which he was able to rank his texts according to their alliteration. It is arguable whether these attempts to measure alliteration mathematically have any value in literary research. Merely finding and counting occurrences of alliteration is much simpler to perform, and the results are much easier to follow.

The computer can be programmed to select from a poem all those words which rhyme with each other, provided it has been fed the appropriate rhyme scheme. The relationship between these words can then be investigated. Joyce studied rhyming words from the standpoint of mathematical graph theory, calling them networks of sound. His networks include arrows pointing in the direction of the rhyme. He makes the second of two rhyming words point to the first, not vice versa, on the grounds that the rhyme is not noticed until the second word is reached. This may not necessarily be true, but the method can be applied for whichever direction the rhyme is considered to go. Figure 8.4 shows all words linked by the rhyme scheme to <1thee>1 in ten of Shakespeare's sonnets. The number 2 indicates that this particular link occurs twice. Larger and larger networks can then be built up as more text is analysed. This method of investigating rhyme words is very close to that described in Chapter Four for analysing

closely related vocabulary and can be applied to any text of which the rhyme scheme can be specified.

We shall now consider how the computer can be used in the study of rhythm and metre. A metrical study consists of two stages. The first scans the verse and the second investigates the frequencies of specific metrical patterns, either overall or in certain positions in a line. It is possible to scan verse automatically in some languages, but it is not so easy in others. In English, for example, metre depends on the stress of a syllable and the computer has to be told the stress for every word. Normally this would be done by looking the word up in a computer dictionary of metrical forms. In some languages, such as Classical Latin and Greek, metre depends on the length of the syllable rather than the stress. There are a number of fixed rules governing syllable lengths. These can be written into a computer program which can then be used to scan the basic text automatically. Two programs have been written to scan Latin hexameters and one for Greek. Hexameters are the easiest metre to scan. Each line consists of six feet, a foot being either two long syllables (a spondee) or a long followed by two shorts (a dactyl). The last foot of each line is always two syllables, the final syllable being either short or long. In the vast majority of hexameter lines, the fifth foot is a dactyl. Every foot begins with a long syllable and there are fixed rules for determining when a syllable is long. A vowel followed by two or more consonants which are not a mute followed by a liquid is always long. A vowel followed by another vowel is short, unless those two vowels constitute a diphthong. When a word ending with a vowel or m is followed by a word beginning with a vowel or <1h,>1 the final syllable of the first word is ignored (elided). On the basis of these rules, Ott was able to devise a computer program which scans Latin hexameters. His program would operate on a line of Horace as follows:

\quodcumque ostendis mihi sic incredulus odi\

<1Que>1 is recognised as an elision, as it comes before the o of <1ostendis,>1 and that syllable is thus discounted. The following syllables can be recognised as unequivocally long according to the rules described above.

\quodcumqu(e) ostendis mihi sic incredulus odi\

At this stage the first two feet are established as two spondees and the two-syllable final foot is also clear. The computer then attempts to fill in the rest of the line. If it made the <1mi>1 of <1mihi>1 long, that syllable would be the second of a foot and therefore the next syllable, the <1hi>1 of <1mihi,>1 would also have to be long, as the first syllable of a foot is always long. This leads to the following scansion.

\quodcumqu(e) ostendis mihi sic incredulus odi\

This scansion would force the word <1sic>1 to be scanned as long, as that is the only possible quantity for a syllable coming between two long syllables. There are then three syllables left unscanned. This scansion would result in seven feet, which contradicts the rule of six feet per line. The computer then makes a second attempt, taking <1mi>1 to be short. Short syllables always occur in pairs in a dactyl and so the scansion proceeds as follows:

\quodcumqu(e) ostendis mihi sic incredulus odi\

<1Sic>1 must therefore be long, forming the first syllable of the fourth foot. Three syllables then remain unscanned, so the fifth foot must therefore be a dactyl. The occurrence of a dactyl in foot five is so common that a scansion program should be able to assume that this is true and only discount this if it is unable to scan the rest of the line. Ott reports success in about 95% of scansions. In another 3-4% more than one scansion is proposed and the remaining 1-2% of lines are abandoned. In many of the cases where the computer has been unsuccessful it was later found that the Latin author had not in fact followed the rules.

Greenberg reports a similar success rate to Ott but adopts a slightly different method. His program begins in the same way as Ott's by finding all the long syllables. It then attempts to match this pattern of long syllables against all the possible combinations of scansions for hexameter lines, which have previously been fed into the program. An example from his program is shown in Figure 8.5, where 1 indicates a short syllable, 2 a long syllable and & an elision. The program can either print out the text, together with its scansion, or write the scansion patterns into another computer file for further analysis.

When writing a program such as this, it is of course necessary to test the results. Ott notes that very few erroneous scansions were found and they were mostly in words like <1aurea>1 where the final a is the long a of the ablative case and the e is contracted and therefore does not count as a syllable.
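Greenberg's matching step, as described above, can be suggested in outline. The sketch below is an illustrative Python fragment, not Greenberg's program; it generates every possible hexameter pattern, with each of the first five feet a dactyl or a spondee and a two-syllable final foot, and keeps those which are compatible with the syllable quantities already fixed by rule:

from itertools import product

def hexameter_candidates(known):
    # `known` holds one entry per syllable of the line (after elisions
    # have been removed): 'L' for a syllable established as long,
    # 'S' for short, and None where the quantity is still undecided.
    candidates = []
    for feet in product(("LSS", "LL"), repeat=5):   # dactyl or spondee, feet 1-5
        pattern = "".join(feet) + "LX"              # final foot: long plus anceps
        if len(pattern) != len(known):
            continue
        if all(k is None or p == "X" or k == p for k, p in zip(known, pattern)):
            candidates.append(pattern)
    return candidates

# a line of fourteen syllables with no quantities yet established
# admits several patterns; each syllable fixed by rule narrows them down
print(len(hexameter_candidates([None] * 14)))   # 10 possible patterns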

Ott modified his program to print asterisks by the side of those lines containing an <1ea>1 and the asterisks immediately draw his eye to those lines which may need interpretation. The same is true for the 5% of lines for which it was unable to provide an unequivocal scansion. Greenberg prints messages alongside those lines with alternative scansions as well as by those lines which are noteworthy in some respect.

On the basis of his automatic scansions, Ott has gone on to provide a vast number of tables. He has published a number of books in the series <1Materialen zum Metrischen und Stylistik Analysen,>1 each of which deals with a separate work in Latin hexameters. In the first section of each book each line is given with its scansion for the first five feet. There is also information on word accent, word boundaries, elision, hiatus and punctuation which has been stored with the scansions. Ott makes use of the computer's most basic storage unit, the bit, in representing his material. Each line of six feet can be considered as a maximum of eighteen syllables, allowing three for each foot. These eighteen syllables are represented by eighteen bits, which have 1 for the presence of a feature and 0 for its absence. Therefore in our example line, word boundaries are represented as 001 000 101 100 001 001. These eighteen bits are then grouped into threes, one group for each foot. If the first bit in a group is a 1, it counts as 4, the second bit counts as 2, and the third bit as a 1. The set of bits can then be added together to give an octal number, e.g. foot three has 101, giving one 4, no 2 and one 1, a total of 5. The complete octal representation of this bit pattern is 105411, which is how Ott stores the word boundary information for this particular line. 040000 gives its information for elision, as there is only a single elision at the beginning of the second foot. From the information given in the first section of his books, Ott is able to produce many different tables, including frequency distributions of metrical forms at word endings, a complete metrical index and an examination of all verse and types by metre and word ending.

Packard reports 98% success with his program to scan Greek hexameters. As the Greek alphabet distinguishes long e and o as separate vowels it is easier than in Latin to produce automatic scansions. Packard claims that, although his program made eight false scansions in 440 lines of the first book of the <1Odyssey,>1 it recognised that it may have been faulty in seven of those eight cases. He also notes that the lines which the program found difficult are those for which the rules must be relaxed. Coupled with his grammatical analysis program described in Chapter Five, it provides a powerful tool for analysing Greek hexameter poets. A recent study in Oxford has proved that it is also possible to scan Sanskrit verse automatically, as again the rules for scansion are rigid and the length of each syllable can easily be determined. It now seems clear that automatic scansion is possible and indeed practical for rigid metres in

languages where the metre depends on the length of syllable. In the case of the hexameter, there are only thirty-two possible scansions for each line (or sixteen if the fifth foot is assumed always to be a dactyl). This is not the case with the iambic senarii of Plautus, which have been the subject of a computer study by Stephen Waite. Plautus wrote some two hundred years earlier than the best known of the Latin hexameter poets and there is considerable doubt whether his metre is based on the length of syllable or is partially or wholly dependent on the stress of individual words. The fundamental unit of metre is an iamb (u--), of which there are six in each line. The iamb can be replaced by either a spondee (--), an anapaest (uu--), a dactyl (-uu), a tribrach (uuu) or a proceleusmatic (uuuu). There are therefore six to the power five, or 7776, theoretical possibilities for each line and it would be impracticable to attempt an automatic scansion similar to that for hexameters. Waite therefore adopted the dictionary look-up method for assigning metrical values to the lines of Plautus. He began by making forward and reverse concordances of the texts. A simple word list would have been inadequate for determining the metrical form of a word because the quantity of the final syllable is frequently influenced by the next word, for example by elision. Using the concordance as a reference tool, another word list of the text was produced to which the metrical forms were added. This was done using a computer terminal interactively. The computer displayed each word in turn and its metrical form was then typed in, using the number 2 to indicate a long syllable and 1 for a short syllable. There are of course a number of written forms in Latin which do not have a unique scansion. For example the nominative and ablative singular of the first declension both end in <1a,>1 but the a of the ablative is long. These forms were marked with a W in the dictionary to show that they were ambiguous. The advantage of working from a word list rather than the text itself, as in all dictionary look-up processes, is that words which occur more than once are only coded once in the dictionary. A second interactive program was then run from a terminal to insert the scansions into the text. Each word is looked up in the dictionary and its scansion thus found. If a word is found to be marked with a <2W,>2 the entire line is displayed on the terminal so that the correct scansion can be typed in. So far the text has been treated as a series of individual words which now have their scansions inserted in them. A further program then derives the scansion for each line by considering elisions and lengthening short final syllables when the next word begins with a consonant. Word endings, hiatus and elisions are marked, and the scansion is checked to see whether it conforms to an iambic senarius. Waite finally arrives at a data file consisting only of the scansions for each line.
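The look-up stage of such a system can be suggested in a few lines. The fragment below is a hypothetical Python sketch, not Waite's <2SNOBOL>2 programs; the dictionary entry and the function name are invented for the example, loosely following the coding described above (2 for a long syllable, 1 for a short one, W marking an ambiguous form):

def scan_words(line, metrical_dictionary):
    # Look each word of the line up in the dictionary of metrical forms.
    # Any W (ambiguous form) or unknown word means the whole line must be
    # referred back to the scholar at the terminal, as described above.
    codes = [metrical_dictionary.get(word.lower(), "?") for word in line.split()]
    needs_correction = "W" in codes or "?" in codes
    return codes, needs_correction

# hypothetical entry: a first-declension form ending in -a would be
# entered as W, since the nominative is short but the ablative long
dictionary = {"puella": "W"}
print(scan_words("puella puella", dictionary))   # (['W', 'W'], True)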

He is then in a position to ask the computer, for example, to search for all the lines which begin with each of the possible forms, or to count the number of times a word end coincides with a foot boundary. He can summarise his findings as shown in Figure 8.6, which indicates the iambic senarii in the <1Truculentus.>1 Waite's method initially requires much more work on the part of the scholar, but the computer is used to perform the counts and the calculations. The metrical dictionary he has prepared can be used for the other plays of Plautus, and only the new words in these plays need be added to the dictionary as they are required. In the last century it was common practice for scholars to perform all these tasks by hand. It is much faster to do it by computer, even using Waite's method, and the results are much more likely to be accurate. A further study by Waite investigated the interplay of verse ictus and word stress in the same texts of Plautus. It is unfortunate for our purposes that this second study does not discuss the computer methods used but rather concentrates on the results obtained. Much of the discussion centres on the position in which a number of words frequently occur. He notes particularly that forms of the possessive adjective make up a larger than expected proportion of disyllabic words at the ends of lines. Therefore a combination of metrical features and the words themselves must have been used to formulate the results.

Middle High German dactyls were the subject of a computer study of verse rhythm by Rudolph Hirschmann. His input consists of the basic text with the vowels of all stressed syllables marked, and his program contains a number of rules which must be satisfied for the line to scan successfully. He managed to classify correctly all but thirty-four syllables in a text of over 16,000 syllables. He prints out each line of text with the syllabic categories and types underneath, as shown in Figure 8.7. Stressed syllables are underlined. <1B>1 indicates that the next word begins with a consonant, and <1A>1 that it begins with a vowel. The rhymed caesura is marked when the text is put into the machine, and <1C>1 in the printout marks this feature. Five types of syllables are distinguished by the numerals 0, 2, 4, 6 and 8. When the vowels are stressed these numerical values are increased by one, so that all odd numbers represent stressed syllables and even numbers unstressed syllables. Once the scansion is complete the same kind of questions may be asked of the data produced.

Various features of English prosody have been investigated by Dilligan and Bender with the assistance of a computer. Their study was performed on the iambic verse of Gerard Manley Hopkins using Halle and Keyser's definition of an iambic pentameter -- that is, a sequence of ten positions in which syllables receiving primary stress occupy even-numbered positions. Two optional unstressed syllables are permitted at the end of a line, and position one may be unoccupied.
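That definition is concrete enough to express as a small test. The following is a rough Python sketch of it, under the simplifying assumption that a line is supplied as a list of syllable stresses (1 for primary stress, 0 otherwise); it is not Dilligan and Bender's program and it ignores their further conditions, such as stress maximum:

def is_iambic_pentameter(stresses):
    # Ten metrical positions; primary stresses may fall only on the
    # even-numbered positions; up to two extra unstressed syllables are
    # allowed at the end of the line; position one may be unoccupied.
    for first in (1, 2):                       # 2 means position one is unoccupied
        positions = list(range(first, 11))
        for extra in (0, 1, 2):                # optional line-final unstressed syllables
            if len(stresses) != len(positions) + extra:
                continue
            body = stresses[:len(positions)]
            tail = stresses[len(positions):]
            if any(tail):                      # the extra syllables must be unstressed
                continue
            if not any(s and p % 2 == 1 for s, p in zip(body, positions)):
                return True
    return False

# the Hopkins line quoted below, with stresses in positions 2, 4, 6, 8 and 10
print(is_iambic_pentameter([0, 1, 0, 1, 0, 1, 0, 1, 0, 1]))   # True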

The computing part of the study initially adopted similar methods to those of Waite on Plautus. First, a concordance was made for reference purposes from the basic text. This text was then transformed into a phonemic transcription, using a word list and dictionary look-up method. Dilligan and Bender do not seem to have had an interactive terminal at their disposal, and the entire dictionary was transferred to the machine by punched cards, making the process much slower than it would otherwise have been. The concordance was used to identify homographs and function words which need to be marked in the machine-readable text. The scanning program operates on this transcription as its input and

generates lists of vowels and initial consonant clusters. The occurrences of stress and punctuation are recorded as bit strings in a similar manner to Ott. For some reason, the authors choose to use the rightmost bit to represent the leftmost syllable in the line, so that the scansion of a line such as

From crown to tail-fin floating, fringed the spine,

appears as 1010101010, as it has stress, represented by 1, in positions 2, 4, 6, 8 and 10. Vowels are recorded as integer numbers, each phoneme being assigned a unique integer, the first number indicating the first vowel phoneme, the second the second vowel, and so on. This line gives the series

28 42 26 38 2 54 4 2 28 44

The initial consonants are stored as the following character codes:

<2FR KR TT -- FL --FR) S>2

When the information is recorded in this way it can easily be evaluated for different patterns. Assonance patterns can be found by the repetition of a particular vowel number and then represented as another bit string. Assonance in this line would be recorded as

0010010000

Alliteration for T would become

0000001100

and for <2FR>2

0010000001

The separate bit strings can be combined using Boolean operations to find lines where certain features coincide. A Boolean <2OR>2 can be performed to find all the positions in the line where there is alliteration. This operation produces a third bit string which has a 1 in all positions where either of the two originating strings have a 1.
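The two worked examples which follow can be reproduced with ordinary integer arithmetic. This is a minimal Python sketch of the idea, not Dilligan and Bender's program; as in their scheme, the rightmost bit stands for the first metrical position:

alliteration_t  = int("0000001100", 2)
alliteration_fr = int("0010000001", 2)
assonance       = int("0010010000", 2)

any_alliteration = alliteration_t | alliteration_fr    # Boolean OR
coincidence = assonance & any_alliteration             # Boolean AND

def positions(bits, width=10):
    # metrical positions (counted from the right) whose bit is set
    return [p for p in range(1, width + 1) if bits >> (p - 1) & 1]

print(format(any_alliteration, "010b"), positions(any_alliteration))
# 0010001101 [1, 3, 4, 8]
print(format(coincidence, "010b"), positions(coincidence))
# 0010000000 [8]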

0000001100
0010000001
----------
0010001101
showing that alliteration occurs in positions one, three, four and eight. A Boolean <2AND>2 operation shows where assonance and alliteration coincide, as it results in a 1 in only those positions which have a 1 in both the originating strings.
0010010000
0010001101
----------
0010000000
The result indicates that assonance and alliteration coincide at position eight. Similar bit strings are used to test stress maximum and metricality. This use of bit strings is the most economical way of storing information inside the computer and, combined with the use of the Boolean operations <2AND>2 and <2OR>2, makes very sensible use of the machine's capacities. Though most of their information is kept as bit strings, Dilligan and Bender record the phonemes which give assonance or alliteration and the line numbers of those lines which contain some unusual metrical feature such as those without stress maximum. Their program keeps totals of thirty-eight different features automatically and many more can be requested specifically. They draw histograms showing the distribution of assonance and alliteration between the ten positions. As we have seen before, simple diagrams made on the lineprinter present results in a manner which is much easier to read than tables of numbers. There have therefore been a number of attempts to write special-purpose programs to deal with patterns of metre. In some cases, it may not be necessary to write a special program, as patterns of rhythm can be found in much the same way as patterns of words. The <2COCOA>2 concordance program was used in one project to search for metrical patterns in Russian verse. The verse, Pasternak's <1Volny,>1 was coded by hand and the machine used merely to find patterns within the coding. Five different symbols were employed to indicate metrical features and the punctuation marks and word boundaries were retained. Rhyme was also coded in letter form, there being seven different categories of rhyme which were selected by human judgment. Codings such as <2)*)10)1)01)010,)ADE>2 may appear strange, but the result is that the computer can be asked to find all the lines which have a specific stress pattern such as 10101010, ignoring all the other symbols. Usage of punctuation could be inspected by making a

concordance of, for example, full stops, and rhyme markers can also be investigated by concordances sorted on endings. The author of this study, Wendy Rosslyn, felt that it would take less time to do the coding manually than to attempt to automate it, but she nevertheless found it worthwhile to input the coding to the computer and search for patterns automatically. The <2COCOA>2 program was also used in a brief experiment in Oxford to study Latin prose rhythm. The rhythm of prose text has not received a great deal of scholarly attention, but it is a subject that, for classical languages, is suitable for some automation. The Oxford experiment used manual coding of the text where L represented a long syllable and S a short one. Other features such as elision and hiatus were marked and a special symbol $ used to indicate a word boundary, as it was not possible with <2COCOA>2 to make a concordance of spaces. The concordance program was then used to find all occurrences of all combinations of syllables. Another program to derive the rhythm of Greek prose was recently developed in Oxford and has had some success, but Greek has the advantage of long e and o appearing as separate letters. These experiments indicate that there is much more scope for the examination of prose rhythm. A dictionary look-up similar to Waite's for Plautus may well be the best approach for larger amounts of text. Both experiments in Oxford were performed on small amounts of text within a short time scale and deserve further work. Wishart and Leach report on a study of prose rhythm in Plato which examined thirty-three samples of text. Their chief interest was in the clustering methods; and the scansion method is not described, although they do state that it was performed by a <2FORTRAN>2 program and that each successive group of five syllables was classified by its scansion. Metrical analysis can then be aided considerably by the machine. In all languages it can perform counts and statistics required of various metrical features. The scansions may have been performed entirely manually and typed into the machine, as for example in the Rosslyn study on <1Volny,>1 or they may have been generated by a dictionary look-up system such as those described by Waite and Dilligan and Bender or they may have been generated automatically as Greenberg, Ott and Packard did for classical hexameters. Whichever method of generating scansions is adopted, it is clear that the machine can take away most of the mechanical work of counting their occurrences. Storing the scansions in a computer file, as Waite and Dilligan do, seems to be more sensible than generating lists of every possible feature as Ott does, though it could be argued that his books reach a wider audience because they are not only in computer readable form. Alliteration and sound pattern studies are also suitable for computer analysis. Provided that sufficient thought is given to preparing the text, whether by using a phonemic transcription or by allowing for exceptions and peculiarities in a program, the study of these features can be greatly helped by the computer. There still does not appear to be an effective method of measuring alliteration. The mathematical methods which have been tried have not been universally adopted, but some of them may be considered useful. In many cases mere counts will suffice. Whichever method is chosen the advantage of using the computer is that it finds every line which satisfies certain criteria, not just those which immediately strike the eye.
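The bit-string technique described above for Dilligan and Bender's data can be made more concrete with a short sketch. It is written in the modern programming language Python, purely for illustration, and is not the authors' own program; the function names are invented, but the example strings are the ones quoted earlier for the Hopkins line, with the rightmost bit representing the leftmost syllable.

    # Encode positional features as bit strings, rightmost bit = position one.
    def positions_to_bits(positions):
        bits = 0
        for p in positions:
            bits |= 1 << (p - 1)
        return bits

    def bits_to_positions(bits):
        return [p for p in range(1, 11) if bits & (1 << (p - 1))]

    allit_t = positions_to_bits([3, 4])       # alliteration on T:  0000001100
    allit_fr = positions_to_bits([1, 8])      # alliteration on FR: 0010000001
    assonance = positions_to_bits([5, 8])     # assonance:          0010010000

    any_alliteration = allit_t | allit_fr     # Boolean OR
    both = assonance & any_alliteration       # Boolean AND

    print(format(any_alliteration, '010b'))     # 0010001101
    print(bits_to_positions(any_alliteration))  # [1, 3, 4, 8]
    print(format(both, '010b'))                 # 0010000000
    print(bits_to_positions(both))              # [8]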

The previous chapters have been concerned with the use of computers in analysing text, that is literary or linguistic text. In this chapter we shall consider the various ways in which the computer can be used to manipulate information which is structured in some way. By this we mean the sort of material which is traditionally recorded on file cards, where one card represents one particular object such as a book, a piece of pottery or biographical information about a person. In computer terms a set of file cards would be replaced by a <1file>1 within which each object is represented by a <1record.>1 Each record is broken down into a number of different categories of data which are normally called <1fields>1. Obvious examples for a book would be title, author and publisher. A computer file may consist of many thousands of such records which are all described by the same field structure. The number of fields per record may also be quite large, its limitation being more often due to the program which will manipulate the information than to the computer or information itself. The information contained in each field may be textual or numeric or binary in nature. Most bibliographic material is textual, but a date is numeric. Historical and archaeological data can include many numeric fields, which can either be whole numbers <1(integers)>1 or numbers containing a decimal point which are known as <1real>1 numbers or <1floating point numbers.>1 Binary fields indicate one of two possible states -- that is, normally the presence or absence of a particular feature. All three types of information can be present in the same record although they cannot really be combined in the same field. A distinction is also made between what are called <1fixed-length fields>1 and <1variable-length fields.>1 A fixed-length field occupies the same amount of room in each record, for example a four-digit date. Variable-length fields occupy amounts of space which differ from record to record. This is important in reducing the total amount of storage space on a computer tape or disc which the information occupies. If all fields in a bibliographic file were the same length, a title such as <1Loot>1 would occupy as much room as <1The Importance of Being Earnest.>1 If data is organised entirely in a fixed-length manner, the longest title must be found first and that amount of room

allowed for each title. This is clearly very wasteful of computer storage as well as not allowing for still longer titles to be added in the same format. There are two frequently used methods of storing variable-length fields. One is to insert some terminating character such as * at the end of every such field, provided of course that this special character is not expected to appear anywhere else in the data. Our titles would then appear as
<2LOOT*>2
<2THE IMPORTANCE OF BEING EARNEST*>2
The other method is to precede each title with a number indicating the count of characters in the title. Our titles would then become
<24LOOT>2
<231THE IMPORTANCE OF BEING EARNEST>2
A third possible method is to keep a table or directory at the beginning of every record indicating at which character position each field begins in the record. An example giving author, title and place of publication would be
<2THEROUX PAULTHE GREAT RAILWAY BAZAARLONDON>2
When both fixed-length and variable-length fields are present in each record the fixed-length fields usually precede the variable-length ones on the storage medium, though this does not necessarily mean that the fixed-length fields must be processed first in any analysis of the data. There may be many categories of information missing, for example, in historical and archaeological data. The treatment of such missing data depends on the programs being used for the analysis, but it is usually possible to allow for it in some way and it should not be considered as an inherent difficulty. Though the kind of data we are discussing may be very large in quantity, it is usually possible to reduce the amount of typing required to transfer it to computer readable form. The material may be repetitive and can thus be abbreviated into a form which a computer program can then expand. Let us suppose that we are compiling a catalogue of Elizabethan literature and are including all the plays of Shakespeare. It would be wasteful of time to type the name of the author for every play. Instead it could be given once with an indication that it should be taken as the author of every following record until a new author is encountered. The following could represent in

a simplified form the lines of data
<2$$SHAKESPEARE>2
<2THE TEMPEST>2
<2THE TWO GENTLEMEN OF VERONA>2
<2PERICLES>2
<2$$MARLOWE>2
<2DR. FAUSTUS>2
etc. The $$ sign is used to indicate that the information following it on the same line is a new author and not a continuation of the titles following the previous one. A program could expand the information into
<2SHAKESPEARE THE TEMPEST>2
<2SHAKESPEARE THE TWO GENTLEMEN OF VERONA>2
<2SHAKESPEARE PERICLES>2
<2MARLOWE DR. FAUSTUS>2
Another alternative is to retain the information in a computer file in the format in which it was typed, ensuring that the programs which operate on the data always know that this is its format. This economy of storage can be practised if all the information is prepared for the computer at the same time. It would not be suitable for a file which is to be continually updated with more material. Historical data allows more opportunities for reducing the amount of typing. An example from one of the computer projects at Oxford illustrates the point. The History of the University project has been using the computer to compile indexes of biographical data of students who were at Oxford. Initially the period up to 1500 was covered, but recently more information has been added to the computer files covering the years 1501-1540 for Oxford and up to 1500 for the University of Cambridge, this latter file being used for comparative purposes. In all cases the data was taken from Dr <2A.B.>2 Emden's Biographical Registers and updated by some of his unpublished records. There are now almost 30,000 biographical records on the computer. The name of each scholar is accompanied by a number of fields indicating such features as his college, hall, faculty, religious order, place of origin, whether he held any office, etc., as well as an indication of the date when he was present. A number of conventions were used for coding this

data which could usefully be adopted by others. A date such as 1428-1435 consists of nine characters. The dates were reduced to two characters, by grouping them into twenty-year periods. 1428-1435 therefore becomes date code 42, by taking the middle two digits of the first date to cover the period 1420-1439. If the scholar was recorded as 1438-1445 he was allocated two date codes 42 and 44, as he comes in two twenty-year periods, 1420-1439 and 1440-1459. It will be noted that this requires two different items to be in the same field. Although some information was lost by this method of coding dates, it did reduce the amount of storage and coding considerably. Such grouping of data into bands of measurements is common practice in many computer applications in the social and biological sciences. It is usually necessary for what is called <1continuous data>1 where the number of possible values is infinite. Such is the case for heights of people where one person may measure 5'6" and another 5'7", but where there may be many heights between these two values. <1Discontinuous data>1 allows only a finite number of values with a fixed interval between them. In the History's data, information for most of the other fields was compressed for typing, using two-letter codes. There is no need to use only numeric codes in modern computing. Letter codes are much easier to read and understand, and even single letters allow up to twenty-six different codes, whereas single digits allow only ten. Two-letter codes allow more possibilities and are also more meaningful. There were seventeen colleges in medieval Oxford and each was represented by two letters as follows:
<2AS>2 All Souls
<2BA>2 Balliol
<2BE>2 St Bernard's College
<2BC>2 Brasenose
<2CA>2 Canterbury College
<2CC>2 Corpus Christi
<2DU>2 Durham College
<2EX>2 Exeter
<2GL>2 Gloucester College
<2LI>2 Lincoln
<2MA>2 Magdalen
<2MC>2 St Mary's College
<2ME>2 Merton
<2NE>2 New College
<2OR>2 Oriel
<2QU>2 Queen's
<2UN>2 University College
Only these codes were typed in the college field. A similar set of codes was

used to identify halls, orders, offices, etc. The only two fields which were typed in full were the name of each person and a four-digit identification number. The best way of preparing such information is to rule out sheets of paper with columns for each field and enter information on these sheets as soon as it is found. This coding process can be very tedious, but it would also be necessary in some form for manual processing. In the History's project, the two letter codes, rather than the full version of the data, were stored in the computer file. This was done to save computer time in searching the file as the quantity of information would be very much larger if it was held in full. The full versions can be printed out by any program which operates on the data simply by substituting the full version for the code every time that it is encountered. Accuracy of the data is of course essential, but structured data is easier to check, since the computer can be used to find errors which in text can only be found by proof reading. If we consider the example of the History's data again, in the college field there are only seventeen possible two-letter codes. A computer program can be written to check that each code appearing in this field is one of the seventeen possibilities. If it is not, the computer will reject the record as erroneous and it can be corrected. This kind of computer checking will not of course find as an error a valid code which is the wrong one for that record, e.g. MA for Magdalen instead of ME for Merton, but it will find common typing errors such as WU instead of <2QU>2 for Queen's. A computer program can also be used to check dates to see whether they are in a valid range. This would detect for instance 2977 mistyped for 1977. Sizes, weights and classification of archaeological material can also be checked in this way. It is more difficult to write a program for the checking of textual information, but it is possible to test some fields. Another project in Oxford, which is compiling a lexicon of Greek personal names, has structured, mainly textual, data and uses a lengthy checking program before new data is added to the computer file. Each name is accompanied by three other fields, the town of the original bearer of the name, his or her date, and a reference to the source document in which the name is preserved. The name itself is transliterated from Greek and this field is checked to ensure that it contains only those characters which exist in the transliterated Greek alphabet and those used to represent diacritics. The name of the town can also be checked in the same way. Permissible dates are in the range 1000 <2BC>2 to 1000 <2AD.>2 There are also a number of codings for vague dates, like Hellenistic or Imperial, and a means of denoting which dates refer to centuries rather than exact dates (e.g. a date of the third century <2BC>2 is typed as <23B>2 and a date of <23BC>2 is typed as <23BC).>2 The towns and names can be checked further by compiling an alphabetical list and

visually searching that for any obvious spelling errors. It is not possible to perform any automatic checking on the reference field because it contains very varied information. In this project there should be no missing information, and so the checking program can also ensure that every field is present. So far we have considered those situations where the field and record format is designed for a specific problem or program and does not conform to any recognised standard. Such a standard does exist for bibliographic material in libraries. It is called <2MARC,>2 an acronym for MAchine Readable Cataloguing. <2MARC>2 was devised at the Library of Congress in 1965. Its original format was based on that Library's own catalogue cards. A revised version was introduced in 1968, as a result of collaboration between the British National Bibliography and the Library of Congress, and this format is now the standard for library cataloguing records on magnetic tape. The bodies involved began distributing cataloguing information on magnetic tapes in 1969, and since 1971 the British National Bibliography has been produced by computer typesetting from <2MARC>2 tapes. There have also been a number of retrospective conversion projects so that libraries could hold their entire catalogue on computer tape, not merely those items which have been added since <2MARC>2 began. The British Library Bibliographic Services Division now provides a service of current and retrospective <2UK>2 and Library of Congress files for subscribers. Requests for individual records can also be met. These are matched against the various <2MARC>2 files and the appropriate records sent to the library on magnetic tape. <2MARC>2 records from other countries will be added to the database as they become available. The <2BLBSD>2 is therefore a valuable source of bibliographic material in computer readable form and a number of computer packages for handling <2MARC>2 tapes already exist. The aim of <2MARC>2 is to communicate bibliographic information between libraries who will use the information in many different ways. For this reason the format of the records is lengthy and complicated, the average size of a <2UK MARC>2 record being 780 characters. National organisations have each diverged slightly from the international format known as <2UNIMARC,>2 but they should be able to produce <2UNIMARC>2 records for international exchange. <2MARC>2 data fields are variable length. A three-digit tag is used to indicate each field, and there is also a directory for each record. Each record can be divided into four parts:
1. A label which consists of twenty-four characters and contains standard information such as the length of the record and the status of the record (new, changed, deleted).

2. A directory which gives the length in characters and starting position of each field. An entry appears in the directory for each field in the record with a total of twelve characters for each field.
3. Tags 001-009 are fields which control access to the main record. For example tag 001 is the <2ISBN>2 or <2BNB>2 record number of the item.
4. The bibliographic information is contained within variable fields. Each begins with two numeric indicators and ends with a special marker. The information in each field is divided into subfields which are themselves tagged by a subfield marker. These variable fields hold a full bibliographic description and subject information under a number of different classification schemes.
Though <2MARC>2 is much more complicated in format, it should be considered for any bibliography project simply because it has become a recognised standard. Material obtained from elsewhere may easily be incorporated into the data. <2MARC>2 records are also available for serials, printed music, music monographs and sound recordings, and there are computer programs which will generate <2MARC>2 records on tape from a much simpler data format. The method of organising records within a file is fundamental to the efficient access of the data. It can vary from one file to another and may depend on the hardware facilities available for storing it (tape or disc) and also the applications for which it is required. The earliest large information files or <1databases>1 as they are now called were stored on magnetic tape as this was all that was available. Records on magnetic tape can only be accessed in sequence, which is known as <1serial access.>1 It is not very easy to read the tape backwards once the end has been reached though it can be rewound to the beginning. Data which is stored on tape is therefore organised so that one pass, or at most two, is all that is required by one program. Winding a tape backwards and forwards several times in the same program soon wears out the tape, and it makes very inefficient use of computer resources. On tape, then, each record appears in serial order, and as in our Shakespeare example above information which occurs in many records is repeated in all the records. Such a file would usually be in alphabetic or numeric order of one or more of the fields. If more records are to be added to the file, they are first sorted into the same format and this small file then merged with the larger one. Some files are so large that they may occupy several 2400-foot tapes. When discs first became available, they were frequently used as if they were tapes so that records were only accessed serially. This was mostly because programs were already written to operate in this way and because

tapes, which are much cheaper, could easily be used instead if the data file grew too big to be held on disc. The advantages of disc-based storage soon became apparent, for the data can be accessed much faster. It is no longer stored as a very long sequence where it is necessary to read everything up to the section which is required. In <1random access>1 files an index or directory of keys is used to point to the appropriate records in the file. The computer searches the index to find out where a particular data element is stored and then finds that place on disc immediately without having to read all the preceding records first. Records on disc may still be stored serially to facilitate operations which require a complete pass of the file, such as finding the average weight of pottery. Files organised in this way are called <1indexed sequential>1 files. Recently new methods of data storage have been devised and the term database technology has become increasingly used in computing. A modern database moves away from the ideas of serial or sequential files completely. Information, though conventionally viewed as records and fields, is presented to the computer as a series of sets of related information. A master item such as our Shakespeare will appear only once in the database but will "own' a number of sub-records, those of all the titles of the plays. The items are linked by a series of pointers which indicate where the "owned' records are stored on disc. The records can be interrelated in many different ways and a very complicated network of pointers established. The total amount of storage required is much less than for a serial file, as normally data which appears in many different fields is stored only once in the database. When new records are added to the file containing the same item, another pointer in the chain is set up. These databases can of course only be held on disc. Modern databases therefore require very complicated programming to organise the data. Many computer manufacturers are now providing computer packages for this purpose, and the computer user is freed from any data organisational programs. He merely has to define his data in terms of the database management program and then may concentrate entirely on using the data for various applications. One such package is called <2IDMS>2 (Integrated Database Management System). It allows the user to write programs in <2COBOL>2 or <2FORTRAN,>2 which view the data in the conventional record and field format without consideration of how it is stored on disc. <2IDMS>2 has many applications in the commercial world where most computing consists of processing very large structured files such as personnel records, bank accounts and stock holdings. It is equally relevant to the needs of arts researchers and could also be used for <2MARC>2 records. Material held in a disc-based file can be updated very easily, usually by a

terminal operating interactively. A file held on magnetic tape must be updated by a special program which copies it to another tape with the amendments inserted at the appropriate place. For the sake of security, the old file is usually retained, giving rise to the "grandfather, father, son' concept of keeping old versions of the data. When new information is added to create a new "son', the last but two version of the file, the new "son's' "great-grandfather', is discarded, leaving only the previous two versions to be retained. Now that we have seen how information can be coded and stored in a file, we can examine some of the operations which can be performed on it. In brief, everything which can be done with the traditional file card system can be done using a computer, but in addition the computer allows the data to be sorted into alphabetical order according to many different criteria very rapidly. Several files of the same data may be kept sorted according to different fields. The sorting procedures are very similar to those used for concordance programs. Several fields can be considered in the sorting process and if two records are found to have the same information in one field, other fields can be compared until the records are assigned to the correct order. Many computer packages exist which can make indexes and catalogues, and the sorting procedures required by them are some of the most widely used processes in computing. One of the commonest indexing packages in universities is called <2FAMULUS>2 and it serves as a useful example for textual data. <2FAMULUS>2 was developed at the Southwest Forest and Range Experiment Station in Berkeley, California, in the late 1960s and now runs on several kinds of computers. It is a collection of programs which are used to create, maintain, sort and print files of structured textual information. Up to ten variable-length fields are allowed for each record. Each record must of course be described by the same fields, though information may be missing for some fields. Each field is assigned a four-character tag or label chosen by the user, which is typed before the information for that field when the data is first input to <2FAMULUS.>2 The format of the input data requires that each field should be on a separate card and that there should be a blank card or line to separate each record. This is rather wasteful of typing time because of the repetition of the field labels, and also wasteful of cards if they are being used as the input medium. It is better to choose a more compact form of data entry and write a small program to convert it into the format required for <2FAMULUS.>2 In the following example it is assumed that * and $$ will never appear anywhere in the data and so * is used to separate the fields, whose labels are not typed, and $$ to indicate the end of the record.
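A conversion program of this kind can be very simple indeed. The following is a minimal sketch, written in the modern programming language Python rather than a language of the period, and it is not the program actually used; it assumes that the fields are always typed in a fixed order and that the labels are the four-character tags used in the examples which follow. The sample record is based on the Condry entry which appears in the figures described later in this chapter.

    FIELD_LABELS = ["TITL", "AUTH", "PUB", "LOC", "DATE", "ISBN", "ADD"]

    def expand(compact_text):
        lines = []
        for record in compact_text.split("$$"):    # $$ marks the end of a record
            record = record.strip()
            if not record:
                continue
            for label, value in zip(FIELD_LABELS, record.split("*")):
                if value.strip():                   # a field left empty (**) is omitted
                    lines.append(label + " " + value.strip())
            lines.append("")                        # blank line separates records
        return "\n".join(lines)

    compact = ("EXPLORING WALES*CONDRY, WILLIAM*FABER*LONDON*1970*"
               "0 571 09922*MAPS; WELSH PLACE NAMES$$")
    print(expand(compact))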

A program has added the field labels <2TITL>2, <2AUTH>2, <2PUB>2, etc., to the information which now appears as one field per line. <2FAMULUS>2 expects to read data from eighty-character cards and the single S on the last line of the first record shows how it expects to find continuation cards. In the second record the fields <2AUTH>2 and <2ISBN>2 are missing. In the data input this was indicated by two successive asterisks. The <2ADD>2 field contains a series of index terms, separated by semicolons. It will be seen that some of these terms consist of several words. These are what <2FAMULUS>2 calls descriptor terms, and it is possible to make an index of them by indicating to the computer that the semi-colon is used to delimit them. <2FAMULUS>2 can deal with cross-references rather neatly. Suppose we wanted to make a cross-reference for the term <2BLACK AND WHITE PHOTOS>2 in the <2ADD>2 field of the first record above so that it would also appear under <2PHOTOS.>2 We could create an extra record for <2FAMULUS>2 consisting of
<2TITL SEE BLACK AND WHITE PHOTOS>2
<2ADD PHOTOS>2
An index of terms in the <2ADD>2 field using the <2TITL>2 field as subsidiary information would print
<2PHOTOS>2
<2SEE BLACK AND WHITE PHOTOS>2
thus creating the appropriate cross-reference. <2FAMULUS>2 maintains its information in a serial disc file which is created by one of the programs in the package. Each record is assigned a

unique number in ascending order as it is added to the file. The field labels are recorded once in a special record at the beginning of the file and not with each record. Instead a small directory is present at the beginning of each record to indicate which fields are present and where they begin and end in the record. A file of many records can be sorted into alphabetical order of publisher, author, date or any other field, if required. <2FAMULUS>2 can print the indexes or catalogues in a catalogue-type format where the key-entry is offset to the left and is not repeated for subsequent entries until a new key term is found. Such a simple author catalogue for a few travel books is shown in Figure 9.1. The file can be rapidly inverted to make a publisher catalogue as shown in Figure 9.2 or again by date as in Figure 9.3. A simple index of the <2ADD>2 terms merely indicates the number of the record in which they occur, but using another <2FAMULUS>2 program called <2MULTIPLY>2 the file can be expanded to hold one complete entry for each subject term and a complete subject catalogue produced. MULTIPLYing our first example on the <2ADD>2 field would give us six records. The first two will suffice to indicate the results.
<2TITL EXPLORING WALES>2
<2AUTH CONDRY, WILLIAM>2
<2PUB FABER>2
<2LOC LONDON>2
<2DATE 1970>2
<2ISBN 0 571 09922>2
<2ADD MAPS>2

<2TITL EXPLORING WALES>2
<2AUTH CONDRY, WILLIAM>2
<2PUB FABER>2
<2LOC LONDON>2
<2DATE 1970>2
<2ISBN 0 571 09922>2
<2ADD WELSH PLACE NAMES>2
Part of a subject index of <2ADD>2 terms with the author, title and publisher as subsidiary fields is shown in Figure 9.4. <2FAMULUS>2 can also produce a keyword in context <2(KWIC)>2 index of any field. A <2KWIC>2 index of all titles will help a reader to find a book or article if he is not sure of the exact title. The <2KWIC>2 index is really a concordance as can be seen from Figure 9.5. Common words like "and' and "the', etc., have been omitted as they are of no use for this purpose. A <2KWIC>2 index may be compiled with a <1stop-list>1 specifying all the words not required or with a

<1go-list>1 specifying only those which are required. Any indexes or catalogues produced by <2FAMULUS>2 can be computer-typeset for publication rather than be presented in lineprinter symbols. If the output format is not appropriate, it is much easier to write a small program to change the layout of a sorted file rather than attempt to do the sorting oneself. <2FAMULUS>2 is an indexing program which creates and maintains files of sorted indexes. It does not decide which terms are to be included in the indexes. Automatic indexing and abstracting of documents is a very different matter which has attracted much interest recently. This in effect requires a parsing program, similar to those used in machine translation, to analyse the meaning of a document and abstract the most significant information from it. It must then generate an intelligible abstract which faithfully represents the meaning of the original document. One such system called <2LOUISA>2 is an indexing system which deals only with abstracts. It uses a comprehensive dictionary, together with rules for concatenation, disambiguation and amalgamation, in an attempt to produce index terms and paraphrases of abstracts. It is arguable whether this is yet economic, but the volume of material now being published makes such research inevitable, particularly for indexing and abstracting technical material. The computer can only be of marginal help in compiling an index to a book for publication. If the whole work is in machine readable form, the computer can produce an index of all the words, but even if the list is reduced to a manageable size by a comprehensive stop-list, the human must decide which entries are useful. A better method of using a computer in compiling a book index is one which was adopted for this book. The text was read, and, as items for inclusion in the index were found, they were typed into the computer together with the page number and any other subsidiary information. The computer was only used to sort and file the entries. For compiling cumulative indexes it is necessary to use a thesaurus of known terms or controlled vocabulary. The computer then keeps a record of the allowable index terms, possibly including synonyms. All the different forms of one lemma can thus be reduced to one term, e.g. <2EXPLORE>2 could stand for <2EXPLORATION, EXPLORING,>2 etc. Most indexes which are made for cumulative publication keep such thesauri, updating them as new terms are encountered. It must not be forgotten that the computer only does the mechanical part of such an indexing process. It cannot decide what to put in the index. Some such indexing projects have been known to go astray because too much was being expected of the computer and not enough of the human. Many recent bibliographies have been compiled with the help of a

computer. Packard and Meyer's Homer bibliography is one example. Two advantages in using a computer in this way are that information which is already in computer readable form can be updated much more easily and that the material is then ready for computer typesetting. Such bibliographies are now so commonplace that they rarely warrant publication of the method used. One exception is a bibliography of Scottish poetry being compiled at South Carolina, which expects to include over 15,000 entries. As the aim of the project was to compile a good literary bibliography which required a detailed description of each book, fourteen fields were used, and these differed from those in a library information system. The fourteen fields consisted of two main categories. In the first were those concerned with the author such as his epithet and dates, as well as biographical and bibliographic references. The book itself can occupy up to ten fields. Apart from the usual title-page information, the editor's name is added if appropriate; also the number of pages or volumes and the name of at least one library which holds a copy of the book. One additional field is reserved for any other useful information, such as the format of books printed before 1800. In the final printout the author's name will appear only once, followed by the author fields and then by all the references attributed to him. Another indexing project at the Oxford Centre for Postgraduate Hebrew Studies uses <2FAMULUS>2 to compile subject indexes to a collection of Hebrew periodicals dating between the French Revolution and the First World War. These periodicals cover a period of upheaval in the Jewish world and for that reason are important to students of Jewish history and Hebrew language and literature. The indexes are intended to serve as guides to the contents of the periodicals. For this reason articles may be listed under many headings ranging from the vague such as "Hebrew literature' to detailed descriptions of the contents. For each periodical it is planned to make a table of contents on the lines of the Wellesley Index to Victorian Periodicals. In the table each article is assigned an identification number which is used to refer to that article in the indexes rather than using the complete bibliographic reference. For each article up to eight categories were allowed. Apart from the identification number, author and title, a second more specific title was assigned if the main title was vague or gave little indication of the subject matter. Four types of keywords were used. The first keyword field holds keywords associated with the main title. These keywords then appeared in the main subject index with the main title as subsidiary information underneath. The second title if present also has a series of keywords associated with it. A third category of keywords was also allowed which contain much wider terms such as "poetry' or "biography'. Three separate keyword indexes are first compiled of all three categories of keywords.

These three indexes can then be merged to form a master subject index. A final category is reserved for keywords which are in Latin characters. This is used to make a supplementary index for such items as European books reviewed and discussed in Hebrew journals. This project illustrates how several different kinds of indexes can be made using a simple package like <2FAMULUS.>2 It was felt initially that the limitation of only ten fields would prove insurmountable if the full bibliographical reference for each article was to be included in the index. The Table of Contents with its reference number proved to be a neat solution to this problem. The indexes are not cluttered with long bibliographical details and the table can be used to browse through the contents of a journal. If a user is only interested in a particular period, he need only inspect that part. In many applications both inside and outside the academic world, the computer can also be used to search files of information to find all those records which satisfy specific search criteria. This process is known as <1information retrieval>1 and it is usual to specify which field or fields are to be inspected. The computer will then pass through the whole file and print out only those records which have the required information in that field or fields. A simple search will specify only one request, but it is usually possible to combine requests to make them more specific using the Boolean operators <2AND, OR>2 and <2NOT>2 with the following connotations:
<2AND>2 A record will be retrieved if it contains all of the terms linked by <2AND>2
<2OR>2 A record will be retrieved if it contains any of the terms linked by <2OR>2
<2NOT>2 A record will be retrieved if it does not contain the term following <2NOT>2
It is often possible to link many terms together with a combination of Boolean operators. Sometimes brackets are needed so that the request is not ambiguous. Consider the following search request:
<2LOUVRE AND MONET OR MANET>2
As it stands this request is ambiguous. It is not clear whether the first term is <2LOUVRE>2 and the second <2MONET OR MANET>2 or whether the first is <2LOUVRE AND MONET>2 and the second just <2MANET.>2 A computer would interpret it as the latter. To ensure that it is correctly understood, brackets should be inserted as follows

<2LOUVRE AND (MONET OR MANET)>2
This would ensure that both terms within the brackets are connected to <2LOUVRE>2 by logical <2AND.>2 Most computers have a standard package to perform retrievals on structured files. Some may be oriented towards textual data, while others provide for both textual and numerical data. One such package called <2FIND2>2 operates on <2ICL>2 1900 computers and is used for the History of the University's project in Oxford which was described earlier in this chapter. This package allows several fields to be specified in the search request. The computer may be asked such questions as "Who were all the scholars who read theology at Merton between 1400 and 1460?' It will search the entire file and print out indexes of all the records which satisfy the request. In this case three fields, faculty, college and date, would be inspected and the record only retrieved if it satisfied all parts of the request. Not all of each "hit' record need be printed out. The user may specify which fields are to be printed for those records found and the others will be omitted. Figure 9.6 shows an example of <2FIND2>2 printout showing scholars at Magdalen in the period 1440-1459, that is, at the time of its foundation. By extracting data from the raw material in this way the History of the University project has been able to analyse comprehensively the distribution of scholars among the colleges of the University, which colleges attracted the students of particular subjects, and (so far as the raw material allows) the geographical origins of scholars. By varying the dates specified in the search requests it has been a relatively simple matter to repeat these studies on different periods of the University's history, so that a picture of the changes in the scholarly population of medieval Oxford has been built up. The volume of the raw material is such that it would have been totally impractical to produce such analyses without the assistance of the computer. The computer can also be asked to transfer those records which have been found to another file, for example if further searches are to be performed on that part of the data. This saves computer time as a smaller number of records is then searched. It can also merely count how many records satisfy a search request and not print them out. If the number is too large, the search can be modified using Boolean operators in the hope that fewer records will be found the next time. For numerical data all those records which have a value greater or less than a specific value in one field can be retrieved. There are six different relationships between numerical values. They are usually (but not in <2FIND2)>2 abbreviated as follows:

<2EQ>2 equal to
<2NE>2 not equal to
<2LT>2 less than
<2LE>2 less than or equal to
<2GT>2 greater than
<2GE>2 greater than or equal to
Note the difference between <2LT>2 and <2LE>2 and between <2GT>2 and <2GE.>2 These relationships can also be combined with the logical operators to find values within or outside a particular range. Information retrieval is a very common use of computers for all kinds of data, and there are many archaeological and historical examples to illustrate its use in the humanities. In the simplest case, the record is retrieved if the search term matches the whole of the field inspected. Sometimes when there are several fixed-length entries in one field the computer is instructed to move along the field in steps of so many characters until it has either found a "hit' or exhausted the field. Such a request is performed on the History of the University's file to look for colleges, faculties and dates as there can be several entries in these fields. For example, since all the colleges are entered as two-character codes in the same field, the computer is instructed to move on two characters at a time through this field as it searches for the one or ones required. The retrieval of textual information is more complicated. It is usual for search terms to consist of complete words and for them to be found anywhere within a field or fields. Some programs permit a search for all the words which begin with a series of letters by specifying a truncation character. For example, <2RESID*>2 would find all the words which begin with the letters <2RESID.>2 This may be intended to find all records concerned with <1reside>1 and <1residence,>1 but it could also produce unwanted material like <1residue.>1 Synonyms must also be considered when searching text, which is one reason why a database of textual material usually has a thesaurus associated with it. This contains the terms which are to be included in the search request and the user may spend some time studying it before formulating a request. Even so, a search may produce some information which is not relevant and it may miss some which is relevant. These two factors are known as precision and recall and can be measured as percentages as follows:

            number of relevant references retrieved
precision = ---------------------------------------- x 100
            total number of references retrieved

         number of relevant references retrieved
recall = ---------------------------------------------------- x 100
         total number of relevant references in the database
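The effect of bracketing a request, and the two measures just defined, can be illustrated by a short sketch. It is written in the modern programming language Python purely for illustration and is not the <2FIND2>2 package; the three tiny records, the field layout and the set of "relevant' records are invented for the example.

    records = [
        {"id": 1, "terms": {"LOUVRE", "MONET"}},
        {"id": 2, "terms": {"LOUVRE", "MANET"}},
        {"id": 3, "terms": {"MANET"}},
    ]

    def retrieve(predicate):
        return {r["id"] for r in records if predicate(r["terms"])}

    # LOUVRE AND (MONET OR MANET)
    hits_a = retrieve(lambda t: "LOUVRE" in t and ("MONET" in t or "MANET" in t))
    # (LOUVRE AND MONET) OR MANET -- the reading a computer takes without brackets
    hits_b = retrieve(lambda t: ("LOUVRE" in t and "MONET" in t) or "MANET" in t)
    print(hits_a)   # {1, 2}
    print(hits_b)   # {1, 2, 3} -- the two readings retrieve different records

    def precision(retrieved, relevant):
        return 100 * len(retrieved & relevant) / len(retrieved)

    def recall(retrieved, relevant):
        return 100 * len(retrieved & relevant) / len(relevant)

    relevant = {1, 2}                   # suppose these are the records actually wanted
    print(round(precision(hits_b, relevant), 1))   # 66.7
    print(round(recall(hits_b, relevant), 1))      # 100.0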

The two measures are inversely related to each other. The only way to measure recall is to check the database and count the number of relevant references manually. Experiments to compare computer and human information retrievals have shown that the computer is much more successful than the human at finding information, but its precision is not so good. In the early days of computing, retrieval was performed as a batch process. A number of requests were prepared and submitted as a job to the computer, and the results were printed out later. Nowadays it is common to search a database interactively from a terminal. The advantage of immediate response enables the user to redefine the search many times before he is finally satisfied. He may ask for some of the retrieved records to be printed at the terminal. If these are satisfactory all of them can then be printed on the lineprinter. A number of commercial organisations provide bibliographical services for searching abstracts and indexes on a computer file. Most of these databases are designed for scientists and social scientists. The best known one is <2MEDLINE,>2 the database of medical publications operated from the National Library of Medicine, Washington <2D.C.,>2 which was originally called <2MEDLARS.>2 Lockheed of California also offer over thirty databases of mostly scientific publications on their <2DIALOG>2 service. Recently three historical databases have become available for computer searching, and it is anticipated that these services will be expanded to cover more arts-based subjects. Access to the database is provided to all libraries and other information services, who pay a subscription in addition to the cost of individual searches. The response time is very quick, some seven seconds to find over 4000 articles. Though technically these databases can be searched by a novice, it is usual to have an operator who is trained in their use. The search terms are first found from a printed thesaurus before any contact is made with the computer. The operator knows roughly how many references are likely to be found and can help a user to construct a fruitful set of terms which find a manageable number of references. There are a number of computer packages which can search textual material on-line, and these usually include programs which set up and update the database. When the database is changed, an alphabetical list of meaningful words is created which contains references indicating where the words occur in the documents. When the user makes an initial query on-line, the computer will use this index, which can be searched rapidly as it is in alphabetical order, to give a count of the occurrences of the term. The user may then need to refine the request several times before he is satisfied, and then the document references or the entire documents may be printed at the terminal. Some retrieval programs may help the user by retaining a list of common synonyms or by providing a facility for building up files of

frequently used requests. The programs may also contain messages to assist the user if he cannot remember which instruction to issue next. Among the earliest users of such computer packages were lawyers who need to search through vast amounts of statute and case law to find information relevant to a particular case. Subsequent programs have been designed with their needs in mind, though there is no reason why they should not be used for historical or other documents. <2STATUS,>2 written at the Atomic Energy Research Establishment, Harwell, is one example. It was originally designed to search those statutes relevant to atomic energy but has subsequently been used for a number of different applications, including Council of Europe legislation in both English and French. Another similar program was written at Queen's University, Belfast. The original version of it was called <2BIRD,>2 but it is now marketed by ICL's software company Dataskil as <2QUOBIRD.>2 An experimental version of it called <2QUILL>2 exists for research purposes in universities. The most recent attempts at on-line retrieval systems allow complete freedom of input from the user. Since the user's request is formulated as the user wishes, not as the computer is programmed to expect, the program must incorporate a natural language parser, which must be quite complex even for simple requests. Such a system is being developed at the Institut fur Deutsche Sprache, Mannheim, initially in connection with a project for the Stuttgart Water Board. The parser is however intended to be used with other databases and is therefore sufficiently comprehensive. An approach of this kind enables anyone to use the computer system at a cost of the extra time spent in parsing the request, but it is debatable whether it is worth the extra effort. In this chapter I have attempted to show how the computer can be used in the manipulation of structured material. These operations are performed very frequently in computing in the commercial world. A file of personnel records and a file of bibliographical data are very similar in nature, and similar operations will be performed on them. Both will be sorted into alphabetical or numerical order according to various fields and both will be searched for records which satisfy certain criteria. There exist very many programs which can sort and search these files. Though they may not satisfy your needs exactly, every computer manufacturer produces software for commercial applications, as that is where most of their customers are. As in the case of <2FIND2,>2 it may be found to be just as useful for the arts user.
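The alphabetical word index described a little earlier in this chapter, with its references to the documents in which each meaningful word occurs and its use of a truncation character such as <2RESID*,>2 can also be illustrated by a short sketch. It is written in the modern programming language Python for illustration only and is not any of the packages named above; the three one-line "documents' and the small stop-list are invented for the example.

    STOP_LIST = {"AND", "THE", "OF", "A", "IN"}

    documents = {
        1: "THE RESIDENCE OF THE SCHOLARS OF MERTON",
        2: "A NOTE ON THE RESIDUE OF THE ACCOUNTS",
        3: "SCHOLARS IN RESIDENCE AND THEIR COLLEGES",
    }

    def build_index(docs):
        index = {}
        for number, text in docs.items():
            for word in text.split():
                if word not in STOP_LIST:
                    index.setdefault(word, set()).add(number)
        return dict(sorted(index.items()))      # kept in alphabetical order

    def search(index, term):
        if term.endswith("*"):                  # truncation, e.g. RESID*
            stem = term[:-1]
            hits = set()
            for word, refs in index.items():
                if word.startswith(stem):
                    hits |= refs
            return hits
        return index.get(term, set())

    index = build_index(documents)
    print(search(index, "RESIDENCE"))   # {1, 3}
    print(search(index, "RESID*"))      # {1, 2, 3} -- includes the unwanted RESIDUE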

We have now covered the major areas of computer usage in the humanities and it should be possible for the reader to decide whether the computer can be of any use in a research project. If so, some advice on how to start will be found in this chapter. This advice is inevitably given in general terms, as there will be differences between one computer centre and another, and should therefore only serve as a guide. University computer centres in the <2UK>2 are funded largely by the Computer Board, a government body, which was set up to provide computing facilities initially for research and later also for teaching. At the time of writing there is no charge for approved research workers in the <2UK>2 who wish to use their university computer centre. The methods of organising computer users vary from country to country, but in most cases it is not necessary for the user himself to pay. Often each university department has an allocation of time or "money' units on the computer. Although you will probably not have to pay to use the computer, the computer centre will not do your computing for you. You will have to learn how to use the computer and organise your work for it or else pay somebody else to do it for you. The centre merely provides the machinery and back-up services. You will first need to register as a computer user at your local computer centre. The computer will have been programmed to accept only valid user identifiers and you will be issued with one of these if your department does not already hold one. On different computers they are called account numbers, job numbers or usernames and are used by the computer centre to record the amount of computer time each user has. They will also be used for charging if real or notional money is involved. Each account number will have a certain amount of computer time allocated to it as well as units of file space on disc, perhaps magnetic tapes and even paper for printing. These budgetary categories vary from centre to centre, but an explanatory leaflet on them will be provided. In some centres the time units are updated weekly, in others monthly. Others again operate a different system of allocating time which is divided into shares of daily usage. You will probably be allocated only small computer resources to begin with to encourage you to use them economically. Your budgets can be increased by the centre if you find you need more time. The computer centre will have a number of card punches, tape punches and terminals for users to use. Some large users of the computer will also have a terminal for themselves. These are usually allocated by the centre to the people they feel need them most and they may have to pay for the terminal itself. Working in the centre itself is often easier, as all the other equipment is also there and you will not have to walk far to collect printout. It will also be necessary to find out what facilities, if any, the centre offers for arts computing. If you are lucky they may already have several people from arts departments using the computer and a specialist in the computer centre to help them. If not, you may have to find out much more for yourself. All university centres provide some kind of advisory service for users -- that is, staff whose job it is to hold a "surgery' to help users. The level of assistance varies from centre to centre, but it is there to help you rather than to do the computing for you.
A number of centres now have specialist advisers for computing in the humanities or what the computer specialists prefer to call "non-numeric' computing. These people may well have had previous experience of problems similar to yours. If there is no such person you will have to explain your project to a computer specialist who may never have heard of terms like textual criticism or iambic pentameters. You may find that the scientists regard your work as somewhat amusing and certainly esoteric. Try to ignore this attitude and explain as clearly as you can exactly what you want the computer to do for you. The computer centre will tell you whether they have any standard packages to suit your requirements. Most centres now have access to a concordance package such as <2COCOA>2 either on their own machine or at a remote site over a network. When working on a text it is often useful to make a concordance first. From it you will be able to decide which features of your text to study further. Even without a standard program for concordances much useful information can be obtained from the data by manipulating it from a terminal. If the computer has a good editing program and allows a reasonable amount of terminal usage, you will be able to use that method to search the text for instances of particular words or phrases. The usages found may be transferred to another file and manipulated again. The main advantage of searching a text file interactively is the virtually instant response, but it can be more expensive of computer time. A concordance selecting only a few words as keywords can also provide much useful information and even a small concordance done on just a few pages of text may help you to plan your future computing. The computer centre may have other information retrieval or database programs which you could use. They will probably recommend you to read

a users' manual for any package which they think suitable for your project. Many computer manuals are written by the programmers who wrote the package, and although they may be very useful once you have started to use the computer you may find them a little difficult at first. Most computer manuals, at least those for packages, have plenty of examples. Look at these carefully before you read all of the text, for a good manual should show what comes out of the machine as well as what goes in. It is often easier to see what the package does from examples rather than from a detailed explanation. In some cases the examples may be on data which is different from yours, perhaps scientific or commercially-oriented. Do not be put off by this but look at the method used. The procedures which have been applied to the data may be exactly the ones you want to use on your data even if its nature appears to be very different. Some packages require data to be in a particular format. If your data is already in computer readable form, but in a different format, this does not mean that you cannot use it. A simple computer program can be written to transfer the data to another computer file or tape in the required format. If this is the only program you need which is not already written, it may be possible to persuade someone at the computer centre to do it for you. If there are no packages to suit your requirements and the terminal editor cannot be used for your purpose, you will have to learn to write programs yourself. Learning to program may be frustrating in the early stages but very rewarding when everything finally works. It is not at all difficult to write simple programs and these can be used as a basis for more complicated work as the project grows. The choice of computer language depends on what is available at the computer centre. The most commonly used computer language in the academic world is <2FORTRAN.>2 <2FORTRAN>2 was originally written for scientific purposes and is not ideal for handling textual information, but as it was one of the earliest computer languages it has become so well established that its use is self-perpetuating. Most programs are written in <2FORTRAN>2 because everybody knows <2FORTRAN>2 and everybody knows <2FORTRAN>2 because most programs are written in it. All computers except the very small ones support compilers for <2FORTRAN>2 and this could be a suitable choice of language if you know that you want to start your project on one computer and then move it to another. Another advantage of using <2FORTRAN>2 is that all the program advisers at the computer centre will know it and there will therefore be more help available to you. If the computer centre already does some arts computing, it will probably also support other computer languages which are more suitable, such as <2SNOBOL>2 (or <2SPITBOL>2 as it is sometimes called) or <2ALGOL68>2 or <2PASCAL.>2 Any of these would be easier to apply to an arts project than <2FORTRAN>2 or <2ALGOL60.>2 <2SNOBOL>2 in particular is very easy to learn. It is the language usually taught in specialised courses for arts computing, and is particularly recommended for the non-mathematician. The business language <2COBOL>2 is also suitable for some arts computing but it is rarely used in the academic world and an academic computer centre may be unable to give assistance for <2COBOL>2 programs. Many computer centres provide courses for their computer users to learn programming.
These are frequently "crash courses' of the kind that require attendance on every day for a week and for this reason are often held during the vacation. This is the best way to learn, as you can concentrate entirely on computing to the exclusion of other work. The courses should include some opportunity for you to try out some small programs on the computer. Do make the most of this opportunity, as trying it for yourself is by far the best way to learn. These courses are sometimes presented on videotape, but there should be somebody available to answer questions. Computing courses are, however, often directed at scientists, and the arts user may not be able to understand the examples. He will understand the logic of the programming language but find that many examples are mathematical. The same is true of many computer programming manuals. They explain the logic and syntax of the languages well, but illustrate it by mathematical examples. It is possible to learn to program from such courses or manuals provided you ignore all those examples which you cannot understand. We have seen throughout this book that very little mathematical knowledge is required to use a computer. However, most people who write computer manuals are mathematicians and scientists and do not realise that their approach is not suitable for non-scientists. Those universities which are now enlightened about arts computing sometimes put on special courses for arts users. These will contain non- mathematical examples and should be much easier to understand. If these courses are planned by computer scientists, they may include non- mathematical examples which are not particularly useful to the arts user. Try the examples during the programming course in order to learn the language and then later you can program yourself the procedures which are useful to you. On the other hand, some computer courses for arts people which have been planned by computer scientists are very simple and do not give full enough coverage of what the computer can do. The arts user is no less intelligent than the scientist and is no less capable of handling complex problems on a computer, but a different approach is required. If you attend an arts computing course which appears to be very elementary and without any useful substance, do try and find out more about the subject by reading or asking people. A very simple course will give you enough information to read some manuals and find out what else is possible.

preparation service which can transliterate from a non-Roman alphabet, and in this case you may well have to write it all out for them in transliteration. For those working on literary texts, the situation is becoming much brighter. Once a text has been prepared for the computer it can be kept in computer-readable form permanently and can therefore be made available for other people to use. Indeed it is common practice for such texts to be made available for others to use at little or no charge. There are several ways of finding out if a text is already available. There are now two archives of texts in computer-readable form which are kept specifically to act as a clearing house for such material. One, organised by Professor S.V.F. Waite at the Kiewit Computation Center, Dartmouth College, keeps Classical texts, both Greek and Latin, and now has a large library including most of the well-known classical authors. Recently another has been established at Oxford University, which specialises in English literature. Both these archives will supply copies of the texts they hold for a very small charge. (Their full addresses may be found on p. 241.)

The texts they supply are on magnetic tapes, and it is important to find out about the technical characteristics of these tapes first. Information can be written on to magnetic tapes in several different formats and it is unlikely that your local computer can read all of them. You may not understand the technicalities, but do ask someone at the computer centre about them before you order any texts. They will probably give you a small document describing the tape formats which they can read. Both archives have accepted formats in which they supply tapes, but they may be able to write you a tape in a different format if your computer cannot read their normal one. This advice about magnetic tapes applies in all cases where material is being transferred from one computer centre to another. You do not need to understand the technicalities yourself, but make sure that the people at each computer centre have the details.

There are other possible sources of text if the one you require is not held at Dartmouth or Oxford. There now exists in California a very large archive of ancient Greek literary texts, consisting of all extant literary material written up to the fourth century <2AD.>2 This archive is called the Thesaurus Linguae Graecae and is located at the University of California at Irvine under the direction of Professor T. Brunner. They supply tapes of any texts from the archive, but make a higher charge than Dartmouth or Oxford and provide only one magnetic tape format, which is compatible with <2IBM>2 computers and no others. For linguistic analysis there is the Brown University Corpus of a million words of present-day American English, made up of 500 samples each of 2000 words and covering about fifteen different kinds of material such as literature, newspapers and church sermons. All the texts in it were first published in 1961. There is also a manual which identifies each sample of text and describes the coding system used. At the University of Toronto, the entire body of Old English texts has been put into computer-readable form for the <1Dictionary of Old English>1 and it is possible to purchase texts from them. The Trésor de la Langue Française at Nancy also makes French texts available for a charge.

For other languages, in which there is no less activity but perhaps less cooperation, a little more research may be necessary to find out if a text has already been prepared. There are two main sources of information. The periodical <1Computers and the Humanities>1 began in 1966 and covers all computer applications in the humanities. Twice a year it publishes a <1Directory of Scholars Active,>1 which is in effect a catalogue of projects. It relies on scholars filling in questionnaires to keep the directory up to date, and so it is inevitably not comprehensive. Recently a cumulative edition of the first seven years of the <1Directory of Scholars Active>1 was published, which includes author and subject indexes to the material. The same periodical occasionally produces lists of texts in machine-readable form, again relying on information sent in by its readers. It is always worth looking through these lists to see if anybody else is working on the same text and, if so, writing to them about it. In 1973 an Association for Literary and Linguistic Computing <2(ALLC)>2 was formed in London. It is now an international association and publishes a <1Bulletin>1 three times a year. It has a number of specialist groups, some of which are oriented to particular languages. The names of the chairmen of these groups may be found in the <1Bulletin,>1 and they should know who holds texts in their language. You may also be able to find out from these journals if anyone else has written a program which meets your needs. It may be worth approaching them for a copy of the program rather than spending a long time trying to write one yourself. The general bibliography in this book gives a list of references, most of which are conference proceedings. You may also discover an article in one of these books about your text. Again it may be worth following up the author to enquire about the computer version of the text.

If all else fails, the text will have to be prepared somehow. If your computer centre does not have a data-preparation service, it could be a long job for you or your assistants. It will also be an exacting task, which will compel you to make some decisions about your text that you may have been putting off for a long time. You will have to devise a coding scheme to suit your text, and you should always start with a number of tests to ensure that the coding will work out as you want. If your text uses a non-standard alphabet, it is advisable to adopt a transliteration scheme already in use, such as the one at Dartmouth or Irvine for Greek. Whatever scheme you choose for coding, it is important to use a unique code for each character or symbol in the text. These codes can always be changed to other symbols later on if you wish by a simple computer program, but ambiguous codes cannot be resolved automatically; a short sketch of such a recoding program is given below. If you are planning to use a package like <2COCOA,>2 you can use more than one computer character for each character in your text, but be aware that this makes programming more difficult if you want to write some of your own programs as well. It is always worth finding out what other people have done in similar circumstances.

Once you have begun your project, or even before you start, it is useful to meet other people who have similar interests. There are now two major series of conferences on computing in the humanities. The delegates to the meetings are from a variety of disciplines, but they all have an interest in computing. Some are programmers who work on packages for arts users; others are computer users from the humanities, many of whom write their own programs. The first such meeting on computers in literary and linguistic research was held in Cambridge in 1970. This has now become a biennial event at Easter and has so far been held at Edinburgh, Cardiff, Oxford and Aston (Birmingham). In 1973 a series of conferences was begun in North America, covering all computer applications in the humanities, not just in literary and linguistic research. These have been held so far in Minnesota, Los Angeles and Waterloo, Ontario. The published proceedings of these meetings have formed valuable sources of material for this book. The conferences are open to all: though the <2UK>2 ones are now sponsored by the <2ALLC,>2 it is not necessary to be a member of the association to attend the meetings. The <2ALLC>2 also holds a one-day meeting each December, either in London or on the continent of Europe, and it organises a number of smaller local gatherings, details of which may be found in its <1Bulletin.>1 There are also specialist conferences on computers in archaeology and on information retrieval, details of which may be found in <1Computers and the Humanities.>1
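To return for a moment to the coding of texts: the sketch below shows how easily a one-for-one transliteration can be changed by program once the text has been prepared. It is written in Python purely for illustration, and the table of old and new codes and the file names are invented; the essential point is that because each old code stands for exactly one character of the text, the change is entirely mechanical.

# A minimal recoding program. The table of codes and the file names are
# invented for the example: each single-character code of the old
# transliteration is replaced by a new code.
recode = {"Q": "TH", "F": "PH", "X": "CH", "C": "KS"}

with open("text_old.txt", encoding="utf-8") as infile, \
     open("text_new.txt", "w", encoding="utf-8") as outfile:
    for line in infile:
        outfile.write("".join(recode.get(character, character) for character in line))

Had the old scheme used the same code for two different characters, no program could tell which was meant, which is why unique codes matter from the outset.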

This final chapter should have given you some ideas on the best way to start a computer project. It will not be very easy at first, particularly if there is not much local expertise in humanities computing. You will see the word 'error' appearing very frequently during your first few weeks of computing. With a little experience you will gradually begin to be more amused than depressed at your mistakes. Remember that it is very rare for a program to work the first time and very surprising if it works the second time. It may take several attempts to get it right, but once it is correct it can be run many times on different data and the results will come out very quickly. If you cannot see what is wrong, you must ask for help: the advisory service is there for just this purpose, and you will often learn more by talking to others than by reading manuals. Many beginners at computing do not like to feel that a machine knows better than they do, but if there is something wrong it is almost always your own fault.

Once the project gets going, the results will appear rapidly. You will be able to extract much more information from your data than you ever could manually. The computer will give you a much more comprehensive view of your material, and your research will be that much more thorough. Finally, it will also give you a taste of another discipline which is both exacting and rewarding for the arts user as well as for the scientist.