Information Research, Vol. 2 No. 1, August 1996
A characteristic of natural-language text databases is that a user must be able to specify all of the variant forms of each query word if high recall is to be achieved. The most common word variants are those arising from morphology, and thus most retrieval systems provide facilities for user-controlled right-hand (and occasionally left-hand) truncation to allow the retrieval of all words with the same root. A stemming algorithm, or stemmer, is a computational procedure that reduces all words with the same root to a single form by stripping the root of its derivational and inflectional affixes (Frakes, 1992; Lennon et al., 1981). In most cases, only suffixes are stripped, so that a stemmer provides an automatic equivalent of manual, right-hand truncation. Thus far, most work on stemmers has focused on present-day languages, but the increasing use of computers in the humanities has resulted in a need for comparable tools to facilitate searching in historical text databases. This paper summarises some of the initial results of a project here in Sheffield to develop such tools for databases of Latin text. Full details of the work are presented by Schinke et al. (1996).
The Hartlib Papers Collection of manuscripts is held in the Library of the University of Sheffield and runs to approximately 25,000 folios. These manuscripts represent the bulk of the surviving papers of Samuel Hartlib (ca. 1600-1662), whose life-work was to establish a network of scientific and learned correspondence across Europe on a wide range of subjects. In 1987, the Hartlib Papers Project was established in the University of Sheffield, this being a multi-disciplinary research team composed of specialists in the field from the Departments of History, English Literature and the University Library (Greengrass, 1993; Leslie, 1990). With funding from the British Academy and the Leverhulme Trust, the Project has been able to transcribe all of the surviving papers in Sheffield into electronic form. The result is a database containing more than 100 megabytes of text that was published on CD-ROM in 1995.
The present project is designed to support searches of that significant fraction, ca. 25%, of the Hartlib Papers that is in Latin. It might appear, at first glance, that Latin is ideally suited to searching by means of right-hand truncation, since it is an inflected language which makes extensive use of suffixes to convey syntactic, rather than semantic, information about words. In practice, however, a simple, right-hand truncation search for a standard Latin word produces very poor results for two principal reasons (Schinke et al., 1996).
The first problem is that many Latin words have more than one distinct root, so that an effective truncation search will be achieved if, and only if, the user knows all of the stems of the query word and carries out a separate search of the database for each of them. However, the great majority of the scholars who are expected to consult the Latin portions of the Collection will not be classicists by training, and will thus be most unlikely to be able to specify all of the morphological variants of their query terms that are necessary for high-recall searches. The second problem arises from the fact that not only are many Latin roots quite short but also many Latin words with different meanings have similar roots. These two factors result in very substantial overstemming, i.e., the retrieval of unrelated words that occurs when too short a stem remains after the removal of a word's ending(s). The converse of this, understemming, occurs when too short a suffix is removed (so that related words are not all conflated to the same stem). Both problems occur with any stemming algorithm (Lovins, 1971); in the particular context of Latin, we believe that understemming is of less importance because of the ways in which Latin text databases are used. The nature of the stored texts implies that most users will be academic scholars (classicists, historians or philologists) who wish to retrieve texts that can answer highly specific questions, e.g., the extent to which a particular poet uses a particular phrase or the ways in which the meaning of a particular word differs in texts from different periods. In such cases, understemming is strongly to be preferred to overstemming, a fact that has strongly influenced the final form of the stemmer we have developed (Schinke et al., 1996).
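The two problems with right-hand truncation can be illustrated with a minimal Python sketch. The function name `truncation_search` and the eight-word vocabulary are assumptions made for this illustration (one common reading of each word is given as a gloss); they are not drawn from the Hartlib Papers database.

```python
def truncation_search(stem, vocabulary):
    """Right-hand truncation: return every word that begins with the stem."""
    return [word for word in vocabulary if word.startswith(stem)]


# Hypothetical vocabulary for illustration only.
VOCABULARY = [
    "fero",     # I carry
    "ferunt",   # they carry
    "tuli",     # I carried (perfect of fero, built on a different stem)
    "latum",    # supine of fero (a third stem)
    "malus",    # bad
    "malum",    # apple
    "malo",     # I prefer
    "malleus",  # hammer
]

# Problem 1: a single truncated stem misses forms built on the other stems.
fer_hits = truncation_search("fer", VOCABULARY)

# Problem 2: a short stem retrieves words with unrelated meanings.
mal_hits = truncation_search("mal", VOCABULARY)
```

Here `fer_hits` contains only "fero" and "ferunt", so the searcher must also know to try "tul" and "lat", while `mal_hits` mixes four words with four different meanings.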
Latin words may readily be grouped into broad categories according to their part of speech and general form. Nouns, for example, are commonly grouped into five declensions, each with reasonably distinct endings, and most adjectives use the suffixes of these five declensions. However, additional particles are added to Latin adjectives when they are used in comparisons, and certain types of verb forms, such as present and future participles, gerunds, and forms of the gerundive, are actually treated as if they were nouns and adjectives. These considerations lead to a final list of 84 suffixes that are needed to stem nouns, adjectives and verbal nouns and adjectives to their correct linguistic roots. It is similarly possible to enumerate all of the 262 suffixes that are needed to stem correctly verbs of all tenses and moods, giving a total of no fewer than 346 suffixes that should be removed from Latin words in order to stem them correctly. Although it is straightforward to compile such a list, its use is likely to result in widespread overstemming, for the reasons given earlier. In fact, even with a severely restricted set of just 56 suffixes (25 of them relating to nouns and adjectives and 31 to verbs), our initial experiments showed that overstemming occurred for about one-third of the words in a sample dictionary (Schinke et al., 1996).
The overstemming was principally caused by the removal of suffixes relevant to verb forms from nouns and adjectives (and, to a lesser extent, the removal of suffixes relevant to noun forms from verbs). This finding led us to modify the algorithm so that it produces two separate dictionaries of stemmed words when it is applied to a text file. The algorithm is applied to an input query word using first the set of noun suffixes (those associated with the five declensions of nouns and adjectives), and the query word is stemmed if a matching suffix is identified. The set of verb suffixes (those associated with the four conjugations of verbs, including deponent verbs) is then checked in an analogous manner. The result is that one dictionary contains a list of words in which nouns and adjectives are stemmed correctly, while any other words are either not stemmed at all or are stemmed in such a way that they cannot be confused with nouns and adjectives. For the second dictionary, the converse applies: verb forms are stemmed correctly, while nouns and adjectives are processed so that they cannot be confused with the verb stems. The final algorithm contains 19 suffixes for stemming nouns and adjectives and 25 suffixes for stemming verbs. Nine of the latter suffixes are transformed into other endings, rather than being removed directly, to ensure that all verb forms of the same tense are reduced to a common stem.
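The two-dictionary idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the published rule tables: the suffix lists are a small subset chosen for the example, the names `NOUN_SUFFIXES`, `VERB_RULES`, `stem` and `build_dictionaries` are hypothetical, and the `min_stem` guard (never leave fewer than two letters) is an assumption that the text above does not state.

```python
# Illustrative subset only; the final algorithm uses 19 noun/adjective
# suffixes and 25 verb suffixes, which are not reproduced here.
NOUN_SUFFIXES = ["ibus", "ae", "am", "as", "um", "us", "is", "a", "i", "o"]

# Verb suffix -> replacement. An empty string means plain removal; a
# non-empty value is a transformation into another ending.
VERB_RULES = {
    "untur": "i", "unt": "i", "bo": "bi",            # transformed
    "mus": "", "tis": "", "nt": "", "t": "", "s": "", "m": "",  # removed
}


def stem(word, rules, min_stem=2):
    """Apply the longest matching rule; return the word unchanged if none fits.

    min_stem is an assumed safeguard, not something the paper specifies.
    """
    for suffix in sorted(rules, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)] + rules[suffix]
    return word


def build_dictionaries(words):
    """Return (noun_dict, verb_dict), each mapping every word to its stem."""
    noun_rules = dict.fromkeys(NOUN_SUFFIXES, "")
    noun_dict = {w: stem(w, noun_rules) for w in words}
    verb_dict = {w: stem(w, VERB_RULES) for w in words}
    return noun_dict, verb_dict


noun_dict, verb_dict = build_dictionaries(
    ["puellae", "puellam", "amant", "amat", "regunt", "regit"]
)
```

In this sketch the noun dictionary conflates "puellae" and "puellam" while leaving the verb forms untouched, and the verb dictionary conflates "regunt" and "regit" because the "unt" rule is a transformation rather than a removal.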
The modified algorithm was able to process correctly over 90% of the words in two large sample files drawn from the Hartlib Papers database. However, a detailed failure analysis showed that many of the failures were caused by words from some of the Latin poems in the database, where the authors often added enclitic suffixes to the ends of words instead of using conjunctions. There are just three enclitic suffixes: -QUE, which is used instead of AND; -NE, which is used to indicate a question; and -VE, which is used in place of OR. For example, instead of writing PUERI ET PUELLAE (the boys and the girls), authors may write PUERI PUELLAEQUE to give the same meaning, with the addition of the terminal -QUE effectively hiding the suffix -AE that ought to be removed to leave the stem PUELL-. It is possible to remove the three enclitic suffixes from all words before stemming starts, but such a simplistic approach would lead to widespread overstemming, since many words incorporate these syllables as a part of their stems. In fact, only about 5% of the Hartlib words ending in -VE or -NE actually included an enclitic suffix that would have to be removed in order for the word to be stemmed correctly. It was hence decided to ignore the problems caused by these two enclitic suffixes and not to attempt to remove them from words prior to stemming. However, the same analysis showed that it was necessary to remove the enclitic suffix from over 80% of words ending in -QUE before those words could be correctly stemmed. It is thus clear that this enclitic suffix is very widely used and that it cannot simply be removed whenever it occurs at the end of a word (since this would still leave almost a fifth of the words incorrectly stemmed). We have thus created a list of words from which -QUE should not be removed prior to stemming.
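The handling of -QUE described above amounts to a guarded strip with an exception list, which might be sketched in Python as follows. The function name `strip_que` and the short `KEEP_QUE` list are assumptions for illustration; the list actually compiled for the Hartlib stemmer is not reproduced here.

```python
# Hypothetical, deliberately short exception list: words in which the
# final "que" belongs to the word itself and is not an enclitic.
KEEP_QUE = {
    "atque", "quoque", "neque", "itaque",
    "usque", "quisque", "denique", "undique",
}


def strip_que(word):
    """Remove an enclitic -que unless the word is on the exception list.

    The enclitics -ne and -ve are deliberately left alone, following the
    decision described in the text.
    """
    w = word.lower()
    if w.endswith("que") and w not in KEEP_QUE:
        return w[:-3]
    return w
```

With this guard, PUELLAEQUE becomes "puellae", so that the -AE ending is exposed to the suffix rules, while a word such as ATQUE passes through unchanged.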
Schinke et al. (1996) describe various other features of the algorithm. The most important of these was the need to take account of the use of the letter J in place of the letter I, which is quite common in neo-classical Latin texts (although the Romans never used the letter J at all). This was handled by ensuring that the algorithm transformed all occurrences of the letter J to I prior to stemming. An analogous replacement strategy was adopted for the letters V and U, which are used interchangeably in neo-classical Latin.
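A minimal Python sketch of this normalisation step follows. The text states that J is mapped to I; the direction of the V/U replacement is not stated, so mapping V to U here is an assumption, as are the function name `normalise` and the choice to lower-case the input.

```python
def normalise(word):
    """Fold spelling variants before stemming.

    J -> I follows the text. V -> U is an assumed direction; the text
    says only that an analogous replacement is made for V and U.
    """
    return word.lower().replace("j", "i").replace("v", "u")
```

Under these assumptions, variant spellings such as JUVENIS and IUUENIS normalise to the same string and therefore reach the suffix rules in the same form.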
The algorithm has been implemented using some of the data structures and character manipulation routines in the implementation of the English stemmer due to Porter (1981), for which a C-language coding is given by Frakes (1992). The rules used for stemming nouns and adjectives remain separate from those rules used to stem verb forms. In both cases, suffixes are removed using a longest-match procedure in which just the longest suffix that matches the end of an input word is removed. No stemming takes place if a match cannot be obtained with any of the suffixes in either of the suffix dictionaries. By structuring the algorithm in this way, it is possible to ensure that entire classes of words in the stemmed dictionaries do not need to be processed by the retrieval routines (which are currently under development). Thus, the set of suffixes that are common to nouns and adjectives does not include any suffixes which are relevant to verbs, and thus verb forms cannot be reduced to any stem which might be identical to the stem of a noun or adjective. Likewise, the set of suffixes relevant to verb forms either allows nouns and adjectives to remain intact, or stems them in such a way that stems cannot be produced that are identical to those of verb forms. The aim of this approach is to ensure that the two main classes of words are stemmed in an appropriate manner but without the need for the extensive linguistic processing that would be required to identify the parts of speech for each of the words in a text that was to be stemmed.
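The longest-match procedure can be illustrated with a short Python sketch. It is not the C implementation referred to above; the function names and the four-suffix list are hypothetical, and the suffixes are assumed to be non-empty strings.

```python
def longest_matching_suffix(word, suffixes):
    """Return the longest suffix that ends the word, or None if none does."""
    matches = [s for s in suffixes if word.endswith(s)]
    return max(matches, key=len) if matches else None


def strip_longest(word, suffixes):
    """Remove only the longest matching suffix; leave the word alone otherwise."""
    suffix = longest_matching_suffix(word, suffixes)
    if suffix is None:
        return word  # no match: no stemming takes place
    return word[: len(word) - len(suffix)]


# Hypothetical suffix list for illustration.
SUFFIXES = ["us", "ibus", "is", "a"]
```

For REGIBUS both "us" and "ibus" match; choosing the longer one gives "reg" rather than "regib", which is why the order of preference matters.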
In this paper, we have described the main features of a stemming algorithm we have developed to facilitate searches of Latin text databases. There are two major features of our algorithm that differentiate it from other stemmers that have been described in the literature (Frakes, 1992; Lennon et al., 1981). Firstly, the algorithm generates two stem dictionaries. It does this by using two sets of stemming rules which keep nouns and adjectives separate from verb forms by default, but without the need to encode the parts of speech for the words that are to be stemmed. Secondly, the policy of deliberately understemming many words leaves enough grammatical information attached to the resultant word stems to enable the algorithm to distinguish easily between different words with similar roots. In addition, this feature enables very specific searches to be carried out for single grammatical forms of some types of words, an important requirement for the intended users of the procedure.
Thus far, we have considered only the stemming of individual words. We are now developing a retrieval system that will enable a user to present just a single query term to a database and then be presented with a list of all of the morphological variants of that term that occur in the database; these variants can then be included in the query.
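One way such a variant lookup could work is sketched below in Python. Since the retrieval system is still under development, this is purely an assumed design for illustration: the names `build_stem_index`, `variants` and `toy_stem` are hypothetical, and `toy_stem` is a stand-in for the real stemmer that strips just three endings.

```python
from collections import defaultdict


def toy_stem(word):
    """Stand-in stemmer for this sketch: strips only -ae, -am or -a."""
    for suffix in ("ae", "am", "a"):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)]
    return word


def build_stem_index(words, stem):
    """Map each stem to the set of database words that reduce to it."""
    index = defaultdict(set)
    for word in words:
        index[stem(word)].add(word)
    return index


def variants(query, index, stem):
    """List, in sorted order, the database words sharing the query's stem."""
    return sorted(index.get(stem(query), ()))


index = build_stem_index(["puella", "puellae", "puellam", "puer"], toy_stem)
found = variants("puellam", index, toy_stem)
```

The user supplies a single form, and the list in `found` (here the three forms of PUELLA) can then be reviewed and added to the query.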
This work was funded by the British Library Research and Development Department and the Library of the University of Sheffield.
How to cite this paper:
Greengrass, Mark, Robertson, Alexander M., Schinke, Robyn & Willett, Peter (1996) "". Information Research, 2(1) Available at: http://InformationR.net/ir/2-1/paper10.html
© the authors, 1996.