
Digital Epigraphy and Lexicographical and Onomastic Markup

From Gabriel Bodard on Stoa

(Prepress) Digital Epigraphy and Lexicographical and Onomastic Markup
September 9th, 2010 by Gabriel Bodard

[Note: this is part of a paper written after a conference on Digital Lexicography at the University of Cambridge in 2002, and was scheduled to appear in the print publication of the proceedings. As the publication never took place, and the paper is now rather too out of date to publish by traditional means without a lot more work, I'm posting here under a Creative Commons Attribution license that part of it (a little more than a third of the length) that might still be of some small interest. No significant changes have been made to this material since 2003 (e.g. code examples use TEI P4).]

Introduction

In this paper I discuss the digital markup of epigraphic texts, using the Aphrodisias in Late Antiquity 2004 electronic publication as an example corpus. I shall consider some of the uses to which the original electronic source code can be put, which includes the compiling of (or contributing to) indices and databases external to the original, limited project. Such external uses might include an onomastic database, a gazetteer of place names, or a digital lexicon, to suggest only three.

[I omit from this version the first two sub-sections of the paper (history of EpiDoc and thoughts on digital publication) which have now been more fully explored in this DHQ chapter and my Digital Medievalist article, respectively]

2. Onomastic and Lexicographic Markup

The aspect of the marked up inscriptions that may be of most relevance to the topic of this volume is the ability to mark-up words and names for indexing, searching, and export to external databases, such as lexica, prosopographies, and gazetteers. It is important to note that it is not necessary for the author of the electronic publication to predict these uses, for it to be possible to exploit the resource in this way. I shall give here fairly brief examples of how names are marked up and how information might be extracted from them (using the Lexicon of Greek Personal Names database as an example of the format and structure that such an output might take), followed by a similar example of the markup of two previously unattested words in the Aphrodisias material. (I should note that the LGPN database already contains data on the inscriptions of Asia Minor, so the following description is by way of example only, not how the database will be built in this case.)

2.i Personal Names

Personal names in the ALA2004 corpus are marked up in XML using the tag <persName>. This tag allows the user to specify two important pieces of information: the regularised form of the name, and a database key pointing to an authority list of individuals. The name of Asclepiodotus in inscription 54, used for our example above, is marked up as follows:

<persName key="AsclepiodotusAph" reg="Ἀσκληπιόδοτος" type="Aphrodisian">Ἀσκληπιόδοτος</persName>

The regularised form of the name is in this case identical to the form in the inscription, since there are no variant spellings and the name is in the nominative case. The @type="Aphrodisian" attribute is for internal purposes, to indicate in which of the indices this name belongs, although it could also be used for prosopographical sorting. The authority list, to which the key “AsclepiodotusAph” points, gives the following additional information:

ID: AsclepiodotusAph
Full Name: Asclepiodotus of Aphrodisias
Extra Info: philosopher, and benefactor, father of Damiane: PLRE II, Asclepiodotus 2

Now if a prosopographical database such as the LGPN were to use the electronic files of these inscriptions marked up in EpiDoc XML to extract data on personal names and people, much of the information needed could be automatically generated by a simple XML parser. The LGPN main database has five fields: name, place, date, reference, ‘final brackets’ (for miscellaneous additional information). Four of these fields could be filled in, at least to a preliminary standard, by the parser.

The ‘name’ field contains the personal name in Greek, which is the content of the @reg attribute in the XML, i.e. Ἀσκληπιόδοτος. The ‘place’ field depends on contextual information: it could be extracted by a human from the epithet in the authority list, but it could also be automated to the extent that all names extracted from eALA inscriptions with the attribute value @type="Aphrodisian" are from Aphrodisias. The ‘date’ field is more complicated: the LGPN uses a series of encoded values which expand to a range of dates in the database; for example, 4A (= fourth century A.D.) translates to 300-400. Luckily, dates in EpiDoc files are handled similarly, if not using the same system: the date of our example inscription is listed both (in prose) as ‘late fifth century A.D.’, and (behind the scenes) as 467-500. The ‘reference’ field lists the primary reference for the name, in this case our inscription; a parser can easily extract the information: ALA2004 54, 2 (and analysis of the full corpus will reveal that the name also occurs in 53).

The content of the ‘final brackets’ field would almost certainly have to be generated by a human editor (just as the first four fields would need to be checked). If the name is not in the nominative in the text, the attested form could be imported from the content of the <persName> element. Secondary references, such as this person’s entry in the PLRE, could also be extracted from the authority list.

The process of extracting the necessary information for a record in a prosopographical database is not therefore fully automated: editorial intervention and checking will always be required. The work of a parser working with files in an interchange format like EpiDoc can nonetheless speed the work up, creating preliminary records that only need collating and editing, along with most of the reference information to make the editor’s job easier. The electronic file does not do away with the author, but it provides an extra tool to facilitate their work.
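The preliminary record described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the <persName> element and its attributes come from the example above, but the <ab> wrapper, the dictionary layouts, and the function name preliminary_records are hypothetical, and the output is not the real LGPN record format.

```python
import xml.etree.ElementTree as ET

# The <persName> element from the example above, in a hypothetical <ab> wrapper.
FRAGMENT = (
    '<ab><persName key="AsclepiodotusAph" reg="Ἀσκληπιόδοτος" '
    'type="Aphrodisian">Ἀσκληπιόδοτος</persName></ab>'
)

# Authority list entry as quoted above; the dictionary layout is an assumption.
AUTHORITY = {
    "AsclepiodotusAph": {
        "full_name": "Asclepiodotus of Aphrodisias",
        "extra": "philosopher, and benefactor, father of Damiane: "
                 "PLRE II, Asclepiodotus 2",
    }
}

# Inscription-level metadata; the field names are hypothetical.
INSCRIPTION = {"ref": "ALA2004 54, 2", "not_before": 467, "not_after": 500}


def preliminary_records(xml_text, inscription, authority):
    """Build one draft record per <persName>, leaving editorial fields empty."""
    records = []
    root = ET.fromstring(xml_text)
    for el in root.iter("persName"):
        attested = "".join(el.itertext())
        records.append({
            # 'name': the regularised form, falling back to the attested form.
            "name": el.get("reg") or attested,
            # 'place': automated only for names typed as Aphrodisian.
            "place": "Aphrodisias" if el.get("type") == "Aphrodisian" else None,
            # 'date': the numeric range held behind the scenes.
            "date": (inscription["not_before"], inscription["not_after"]),
            # 'reference': the primary reference for this occurrence.
            "reference": inscription["ref"],
            # 'final brackets': always left for the human editor.
            "final_brackets": None,
            # Extra material to make the editor's job easier.
            "attested_form": attested,
            "authority": authority.get(el.get("key")),
        })
    return records


records = preliminary_records(FRAGMENT, INSCRIPTION, AUTHORITY)
```

Four of the five fields are filled in automatically and the fifth is deliberately left empty, which mirrors the division of labour between parser and editor described above.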

2.ii New Lexical Words

The EpiDoc system also allows individual lexical words to be marked up, as is done in the ALA2004 project. The word in the text is enclosed in <w> and </w> tags (standing for ‘word’), and the <w> element carries an attribute @lemma="x". The @lemma attribute is equivalent to the @reg attribute of <persName>: it gives the word in its dictionary form (nominative, normalized spelling, first person, etc.). In ALA2004 the principal purpose of this lemmatizing is the generation of indices: in addition to indices of personal and place names, the publication includes an index of Greek words. (Not all words are indexed of course, but they are all marked up for completeness; the stylesheet can be told not to index words like ὁ, καί, δέ, and so forth.)
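A minimal sketch of such index generation follows, in Python rather than in a stylesheet. It assumes a hypothetical <text> container whose <l> children carry their reference in an @n attribute; only <w> and @lemma are taken from the markup described here, and the stop list simply repeats the three words mentioned above.

```python
import unicodedata
import xml.etree.ElementTree as ET
from collections import defaultdict


def nfc(text):
    """Normalise Greek so that equivalent accented forms compare equal."""
    return unicodedata.normalize("NFC", text)


# Lemmata that are marked up but should not appear in the index.
STOP_LEMMATA = {nfc("ὁ"), nfc("καί"), nfc("δέ")}

# Hypothetical container markup; only <w> and @lemma follow the text above.
SAMPLE = """<text>
  <l n="80, 5"><w lemma="ὁ">τῶν</w> <w lemma="σελλοφόρος">σελλοφόρων</w></l>
  <l n="208, 1"><w lemma="ὁ">τῆς</w> <w lemma="μυδροστασία">μυδροστασίας</w></l>
</text>"""


def build_word_index(xml_text, stop_lemmata=STOP_LEMMATA):
    """Map each indexed lemma to a list of (reference, attested form) pairs."""
    index = defaultdict(list)
    root = ET.fromstring(xml_text)
    for line in root.iter("l"):
        for w in line.iter("w"):
            lemma = w.get("lemma")
            if lemma is None or nfc(lemma) in stop_lemmata:
                continue  # unlemmatised words and stop words are skipped
            index[nfc(lemma)].append((line.get("n"), "".join(w.itertext())))
    return dict(index)


index = build_word_index(SAMPLE)
```

Unicode normalisation is applied before comparison because the same accented Greek letter can be encoded in more than one way, and an index keyed on raw strings would otherwise split a single lemma into several entries.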

As an aside, there is software, such as that developed by Perseus, which will lemmatize a Greek text with a greater or lesser degree of automation. This software needs to refer to a digitized dictionary, and so is of limited value when it comes to new words, errors, or misspellings; even in a perfectly clear text a human editor will need to resolve ambiguous forms. In inscriptions especially, the exceptions will be many, but work is ongoing to speed the process of lemmatizing by means of such tools.

Although ALA2004 has not yet marked up any further grammatical information, such as part of speech, syntax or linguistic structure (nor, as far as I know, has any other EpiDoc project), this possibility is not precluded in the future. There are other projects in both Greek (ancient and modern) and Latin, that use TEI to encode linguistic features from grammar, morphology and syntax, to narrative structures and discourse features in their texts. Such information could further enrich the value of a digital text for lexicographic use.

Inscriptions are a particularly rich source of previously unattested words, words that might be added to a supplement or new edition of a Greek (or Latin) lexicon. The two hundred and fifty inscriptions in ALA2004 provide several new Greek words, of which I shall here give two as an example. The words in the index whose lemmata do not appear in the standard lexica, but are interpretable in their contexts, are σελλοφόρος and μυδροστασία.

Automatic extraction from the text could provide both the lemma, and the attested form, since the words would occur in the files in the following forms:

<w lemma="ὁ">τῶν</w> <w lemma="σελλοφόρος">σελλοφόρων</w>

<w lemma="ὁ">τῆς</w> <w lemma="μυδροστασία">μυδροστασίας</w>

The parser could also extract references for both words: ALA2004 80, 5 and 208, 1 respectively. This may be all that can be fully automated by a parser without human intervention, and an editor would need to check even this information for relevance and correct formatting for each lexicon entry. Nevertheless, this would be a start, and if the parser also extracted the immediate context of the word in the Greek text, the translation, and the immediate discussion from the electronic file, the editor’s task would be greatly facilitated.
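The extraction just described, including the immediate context, might look something like the following Python sketch. The <text> and <l n="..."> wrappers, the set KNOWN_HEADWORDS (standing in for the headword list of an existing lexicon), and the function name are all assumptions made for illustration; a real comparison would run against a full digitized dictionary.

```python
import unicodedata
import xml.etree.ElementTree as ET


def nfc(text):
    """Normalise Greek so that equivalent accented forms compare equal."""
    return unicodedata.normalize("NFC", text)


# Hypothetical stand-in for the headwords of an existing lexicon.
KNOWN_HEADWORDS = {nfc("ὁ")}

# Hypothetical container markup around the two <w> examples above.
SAMPLE = """<text>
  <l n="80, 5"><w lemma="ὁ">τῶν</w> <w lemma="σελλοφόρος">σελλοφόρων</w></l>
  <l n="208, 1"><w lemma="ὁ">τῆς</w> <w lemma="μυδροστασία">μυδροστασίας</w></l>
</text>"""


def candidate_new_words(xml_text, known_headwords, window=2):
    """List lemmata absent from the lexicon, with form, reference and context."""
    candidates = []
    root = ET.fromstring(xml_text)
    for line in root.iter("l"):
        words = list(line.iter("w"))
        forms = ["".join(w.itertext()) for w in words]
        for i, w in enumerate(words):
            lemma = nfc(w.get("lemma", ""))
            if not lemma or lemma in known_headwords:
                continue
            start = max(0, i - window)
            candidates.append({
                "lemma": lemma,
                "form": forms[i],
                "ref": "ALA2004 " + line.get("n"),
                # Neighbouring words, for the editor to judge the sense.
                "context": " ".join(forms[start:i + window + 1]),
            })
    return candidates


candidates = candidate_new_words(SAMPLE, KNOWN_HEADWORDS)
```

Each result is only a candidate: a lemma missing from the headword list may be a genuinely new word, but it may equally be a variant spelling or an editorial error, which is why the context is carried along for checking.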

At present, as mentioned above, there is no part of speech or morphological information in the EpiDoc markup scheme, so an editor would have to specify that both of these words are nouns, and interpret their declensions so as to give the genitive ending, for example, or decide if a verb is athematic, irregular, vel sim. Likewise, the gloss or definition of the word (depending on the nature of the lexicon) would have to be derived from the translation and commentary rather than automatically extracted by even the most intelligent of parsers. But the information provided by our parser would quickly allow the compilation of a preliminary entry for each word along the following lines:

σελλοφόρος, ου, chair-bearer, cf. διφροφόρος, Lat. sellarius; ALA 80, 5.

μυδροστασία, ας, place of the μύδρος, anvil = ?forge; ALA 208, 1, τό(πος) τῆς μυδροστασίας.

Both the extraction of personal names by a prosopographical project, and of lemmatised word-forms by a new lexicon (and similarly place names by an atlas or gazetteer project, which I have not discussed), could be facilitated by a preliminary pass of a parser over epigraphical texts marked up in EpiDoc XML. This would not remove the human editor from the chain: even if the processor could create a complete entry from our format, one would want the result checked by a human at the very least. Nor would this process completely replace traditional research in the creation of slips and compiling new words, or instances of personal and place names; a human editor is still needed to check results and search for references from other sources.

But the electronic edition of epigraphic data (or, in sister projects, papyrological, numismatic, or other textual material) allows for greater accessibility to the information. Not only do more people have access to a web edition than to a printed library book, but the publication of source files in the form of XML and other code allows the data to be queried and manipulated in ways that do not have to be predicted by the original project’s editors.

NoDictionaries: Latin texts with adjustable interlinear vocabulary

Here is a post on a very helpful tool developed by Lee Butterman: NoDictionaries (which is being presented by Professor Susan Setnik at the Department of Classics at Tufts University).

Read Laura Gibbs’s overview of NoDictionaries:

In addition to the usual Bestiaria Latina blog round-up, I wanted to write a special post here today dedicated to a wonderful new online service made available by Lee Butterman: NoDictionaries.com. This is a Latin dictionary look-up tool which generates an interlinear word list for you to look at as you read.

The program is built on a core of vocabulary from Whitaker’s Words – a great program I’m sure many of you are familiar with already. What is different about NoDictionaries is that instead of a single word-by-word look-up, it generates word lists for an entire text and displays them line by line.

Interlinear Word Lists

This screenshot will give you a good sense of how the interlinear word lists look on the screen. The Latin text is in blue, with each word a clickable link (more about that below). The interlinear word list is in green.


Library of Latin Texts

There is a library of Latin texts already available at NoDictionaries.com, already equipped with these interlinear word lists. To see the list of available texts, go to NoDictionaries: Latin Literature.

Enter Your Own Text

In addition to the available texts, you can enter your own Latin text and generate the interlinear word lists. To enter your own text, go to: NoDictionaries: Novifex. You will see a text box on that page where you can either type the Latin text you want to read, or cut-and-paste the text from an existing digital version.

So, for example, if you are reading one of the fables at the Ictibus Felicibus fable blog and would like some help with the vocabulary, just cut-and-paste the plain version of the text into the box, and you’ll get a word list!


Ambiguous Words and Multiple Dictionary Entries

Of course, the biggest pitfall any automatic program like this must face is the large number of homographs in Latin – words with the same spelling which may derive from quite different dictionary entries. For example, you will often meet a canis in Aesop’s fables, a “dog” – but the first response of NoDictionaries.com is to supply the dictionary entry for the verb cano, “sing,” rather than the noun canis. What you can do, however, is click on the underlined blue Latin word and see all possible dictionary entries it could come from. So, as shown in the screenshot below, if I click on cani, the word list expands to include all the possible dictionary options:


Make sure you explore the dictionary in this way, rather than trying to force the meaning of the text to fit the word list. If something is just not making sense, explore the dictionary possibilities by clicking on the Latin word you are wondering about.

Correcting the Word List

As you explore the multiple dictionary options, you can correct and update the word list based on what you learn. Just click on the button to the right which reads “fix any definition selected” which will allow you to choose the correct dictionary entry and update the word list accordingly.


When you have clicked the “fix” button, the lists of alternative words will appear with checkmarks next to them.


If you know which word is the correct choice, then just click on the checkmark, and the word list will appear with the item corrected. If you make corrections to the texts in the Library, those corrections will be saved and the corrected word list will be displayed for the next user, benefiting everyone!

Use the Slider to Hide/Unhide the Lists

One of the very best things about NoDictionaries is that you can choose to hide and unhide the word lists. To do this, just slide the triangle to the left or right to hide or unhide the word lists:

So, as a reading strategy, you can look at just the Latin text without any English prompts by sliding the triangle to the left. Do your very best reading the Latin, going through it slowly, out loud, getting a sense of the overall passage and seeing what words you do recognize. Then, slide the triangle to the right and view the word lists. When you are done reading with the help of the dictionary, then slide the triangle back to the left again, and read the Latin text on its own.

Individual Word Look-Up

Here’s another great feature: even when the word lists are completely hidden, you can still look up an individual word, since the blue Latin text is still clickable. So, with the word lists hidden, you can still consult the dictionary entries for any word just by clicking on it:

What a flexible tool! So, make sure you use the slider in order to get just the right amount of help that you need – not too much, and not too little. You can also choose to have more or less information displayed in the dictionary entries; there’s a link under the slider which allows you to turn on or off the display of the different Latin forms for each word.


If you turn on the principal parts option, it will display the principal parts of verbs, the genitive forms of nouns, etc.

Share Your Word Lists

You can create an account at NoDictionaries.com so that you can save the word lists you create with the Novifex. Even better, you can share those word lists with others by giving them the address!

So, for example, you can let other people look at your entire collection of word lists like this:
http://nodictionaries.com/people/lauragibbs/passages
(you can see my username listed there – “lauragibbs” – as part of the address)

You can also link to a specific word list, like this for example:
http://nodictionaries.com/people/lauragibbs/267-trinity-8–camelus

This way you can share a marked-up text with your students by linking to it, either in an email or on your webpage or blog. The students click on the link, and they can then use the slider to adjust just how much vocabulary help they want when they read the text. I’m now including links to NoDictionaries.com word lists for all the Aesop poems I’m publishing in my Aesopus Elegiacus blog, for example – I hope it will be a way to make the poems easier for people to read, but without actually providing an English translation.

WHAT A GREAT TOOL… THANKS, LEE!

Kudos to Lee for this absolutely wonderful tool! You can send feedback to Lee by clicking the feedback button at the bottom of each page at NoDictionaries, which expands into a feedback box:

Lee’s program is a great way to build on and expand the range of William Whitaker’s excellent Words program – I hope this will be a big help to people who want to read Latin on their own, making use of the Internet to help them as they do so!

Classics in the Age of Wikipedia by David Bamman

Here is the abstract by David Bamman for talks at the XXI Simposio Nacional de Estudios Clásicos in Argentina (21-24 September, 2010):

Classics in the Age of Wikipedia: Creating, Sharing and Accessing Information in a Global, Networked Environment

Lecture Series at the XXI Simposio Nacional de Estudios Clásicos, Santa Fe, Argentina (Sept. 22-24, 2010)

David Bamman
The Perseus Project, Tufts University


I. Introduction to Computational Methods for Classical Philology

Wednesday, Sept. 22, 2010

From the Gutenberg Press to the Internet, one of the biggest impacts of the use of information technologies within the sphere of Classical Studies has been in providing ever-increasing levels of access – not simply physical access to the primary texts of the Classical tradition, but intellectual access as well.

For traditional textual scholars and Classical philologists, the availability of texts online is a first big step – we can now look at highly detailed images of the Venetus A manuscript of Homer’s Iliad without having to go to Venice, or use the Internet Archive to read an 1891 Teubner edition of Sallust without going to our university libraries.

The related fields of computational linguistics and natural language processing (NLP) are pushing this innovation even further by helping to provide increased intellectual access to these cultural heritage materials as well. For early learners of Greek and Latin, the methods of automatic linguistic analysis and machine translation help to lower the barrier of entry to interacting with primary source texts. For advanced researchers, computational methods not only expedite the traditional research that’s being done already, but also help uncover new information about the Greco-Roman world and its reception throughout the written record of history.

This talk will provide a general introduction to computational methods for Classicists, with a special focus on the digital resources and technologies that traditional scholars can use.


II. Linguistic Annotation of Classical Texts

Thursday, Sept. 23, 2010

One of the biggest contributions that traditional scholars can make to the field of computational philology is leveraging their expert knowledge in Greek and Latin to create linguistically annotated texts, and then publishing those texts for the entire community to use. This annotation can take several forms, including encoding the citation structure (like chapter and section breaks) in a digital edition, disambiguating the people and places mentioned in the text (e.g., annotating a given instance of “Alexander” in a text as Alexander the Great and not as Paris, the son of Priam), and marking the explicit syntactic relationship for each word in a sentence of Vergil.

This level of annotation does not require a sophisticated technological background – in many cases, it simply requires navigating the web. The data created by such annotation, however, is tremendously powerful: it can provide the training material for computational methods, and it can also provide a quantified foundation on which to explore many traditional questions – if we have a large collection of syntactic analyses for Latin poetry, for example, we can automatically identify rhetorical devices (such as hyperbaton) that involve the interaction between linear word order and syntax.

In all of this work, collaboration between different groups (and across languages) is crucial. Since many annotation tasks are performed with a strictly controlled vocabulary, the only language proficiency required is that of Greek or Latin – enabling students and researchers who are native speakers of English, Spanish, Arabic or Chinese to work together.

In this talk, I will illustrate several varieties of annotation tasks for Greek and Latin texts, focusing especially on 1) the creation of new digital editions; 2) disambiguating people and places; and 3) creating syntactic analyses (or “treebanks”) of texts. One goal of this talk will be an outline of the practical steps required for researchers to immediately begin this work themselves.


III. Mapping the Greek and Latin Genome: A Workshop on Treebanking

Friday, Sept. 24, 2010

Treebanking is a form of linguistic annotation that involves marking the explicit syntactic relation for every word in a sentence (e.g., annotating the subject of the verb, its objects, which adjectives modify which nouns), as in the following annotation of ista meam norit gloria canitiem (“that glory will know my old age”) from Propertius 1.8.
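A dependency analysis of this kind can be held in a very small data structure, and once it is machine-readable a discontinuous phrase (hyperbaton) can be found mechanically. The Python sketch below is an illustrative reconstruction, not the original annotation: the head assignments follow the agreement of ista with gloria and of meam with canitiem, while the relation labels (PRED, SBJ, OBJ, ATR) and the function names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    id: int        # 1-based position in the line
    form: str
    head: int      # id of the governing word, 0 for the root
    relation: str  # illustrative labels, not an authoritative tagset


# ista meam norit gloria canitiem
SENTENCE = [
    Token(1, "ista", 4, "ATR"),      # modifies gloria
    Token(2, "meam", 5, "ATR"),      # modifies canitiem
    Token(3, "norit", 0, "PRED"),    # main verb
    Token(4, "gloria", 3, "SBJ"),    # subject of norit
    Token(5, "canitiem", 3, "OBJ"),  # object of norit
]


def subtree_ids(tokens, root_id):
    """Return the ids of root_id and everything that depends on it."""
    ids = {root_id}
    changed = True
    while changed:
        changed = False
        for t in tokens:
            if t.head in ids and t.id not in ids:
                ids.add(t.id)
                changed = True
    return ids


def discontinuous_modifiers(tokens):
    """Find attributes separated from their head by words outside the phrase."""
    by_id = {t.id: t for t in tokens}
    found = []
    for t in tokens:
        if t.relation != "ATR" or t.head == 0:
            continue
        phrase = subtree_ids(tokens, t.head)
        lo, hi = sorted((t.id, t.head))
        if any(i not in phrase for i in range(lo + 1, hi)):
            found.append((t.form, by_id[t.head].form))
    return found


pairs = discontinuous_modifiers(SENTENCE)
```

In this line both modifiers are separated from their nouns by words that belong to other phrases, so both pairs are reported; on a whole treebank the same test would give a rough count of such interlocking word order.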

Treebanks exist for many modern languages (where they provide valuable training material for automatic parsers) but several are arising for historical languages as well, including those for Old English, Middle English, Early Modern English, Medieval Portuguese, Classical Chinese, Ugaritic, and our own work on the Ancient Greek and Latin Dependency Treebanks. Over the past three years and with the help of over 200 students, scholars and university classes from across the world, we have published almost 250,000 words of treebanked texts from a variety of authors (Homer, Hesiod, Aeschylus, Caesar, Cicero, Jerome, Ovid, Petronius, Propertius, Sallust and Vergil). This talk will give a working introduction to this kind of syntactic annotation, including an overview of the grammatical style, the community of treebankers, and a tutorial on the online annotation environment. The goal of this workshop is to provide the audience with the basic skills needed to undertake treebanking of any Greek or Latin text themselves.

Versioning Machine 4.0

Here is another tool for displaying multiple versions of text encoded according to the Text Encoding Initiative (TEI) Guidelines:

Versioning Machine 4.0 – A Tool for Displaying and Comparing Different Versions of the Same Text:

The Versioning Machine is a framework and an interface for displaying multiple versions of text encoded according to the Text Encoding Initiative (TEI) Guidelines. VM 4.0 has been updated to be P5 compatible. While the VM provides for features typically found in critical editions, such as annotation and introductory material, it also takes advantage of the opportunities afforded by electronic publication to allow for the comparison of diplomatic versions of witnesses, and the ability to easily compare an image of the manuscript with a diplomatic version.

The Versioning Machine is also a tool for textual editors, providing an environment that allows editors to immediately see the consequences of their editorial decisions. The Versioning Machine can be used locally on a Mac or a PC, or it can be mounted on the WWW for public access. The documentation provided with the software not only provides information about the use of the software, but builds upon the Critical Apparatus chapter of the TEI Guidelines to give further guidance to those who wish to use this method of encoding.

Juxta receives Google Digital Humanities Award

Here is a post from Juxta – Collation software for scholars:

Good news! Google has offered its support to help us develop Juxta into a web application:

http://googleblog.blogspot.com/2010/07/our-commitment-to-digital-humanities.html

We are thrilled to have received this competitive award, and look forward to working to optimize Juxta for the web.

Here is an abstract of our application for the Google Award:

With the support of a Google Digital Humanities Research Award, we propose to transform Juxta into a web-based application integrated with Google Books. Scholars could use such a tool to track changes in language over time and to test literary and historical theories through comparative analysis of texts.

As the largest single part of the general remediation of the global library to digital formats, the 12,000,000+ books digitized by Google represent a major opportunity for scholars interested in the history of texts and editions. We want to know how Charles Dickens and Henry James changed their novels as they went through different editions in their lifetimes; and we also want to see the changes introduced by later editors, in later printings.  We want to collate versions of poems published by Sylvia Plath and Walt Whitman to discover their revisions.  We want to compare digital texts of uncertain origin with known versions, as a mode of authentication.

Using Juxta, a scholar can answer these questions and many more. Juxta comes with several kinds of analytic visualizations. The primary collation gives a split frame comparison of a base text with a witness text, along with a display of the digital images from which the base text is derived. Juxta displays a heat map of all textual variants and allows the user to locate all witness variations from the base text. The histogram visualization displays the density of all variation from the base text and serves as a useful finding aid for specific variants.
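The general idea of collating a base text against a witness and summarising where the variants fall can be sketched with Python's standard difflib module. This is not Juxta's own algorithm, which is not described here; the word-level tokenisation, the function names, and the two sample sentences are assumptions made purely for illustration.

```python
import difflib


def collate(base, witness):
    """List the word-level differences between a base text and a witness."""
    base_tokens = base.split()
    witness_tokens = witness.split()
    matcher = difflib.SequenceMatcher(
        a=base_tokens, b=witness_tokens, autojunk=False
    )
    variants = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        variants.append({
            "type": tag,      # 'replace', 'delete' or 'insert'
            "base_pos": i1,   # word offset in the base text
            "base": " ".join(base_tokens[i1:i2]),
            "witness": " ".join(witness_tokens[j1:j2]),
        })
    return variants


def variant_histogram(variants, base_length, bins=3):
    """Count variants per equal-sized stretch of the base text."""
    counts = [0] * bins
    for v in variants:
        bucket = min(bins - 1, v["base_pos"] * bins // max(1, base_length))
        counts[bucket] += 1
    return counts


# Invented sample texts, not drawn from any real edition.
BASE = "the quick brown fox jumps over the lazy dog"
WITNESS = "the quick red fox leaps over the lazy dog"

variants = collate(BASE, WITNESS)
histogram = variant_histogram(variants, len(BASE.split()))
```

The histogram is a crude finding aid in the same spirit as the one described above: it shows which stretch of the base text carries the most variation without listing every reading.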

A web based Juxta would be very similar in function to the Juxta desktop application. Scholars could upload texts into a private storage area and compare them against books from the Google Books corpus. The scholar could also embed the collation into their own website (as with Google Maps) with an HTML code snippet that we will generate. Our goal would be to eventually integrate Juxta directly into the Google Books interface, allowing scholars to compare any two books for which they have access to the full text.

Iliad 10 and the Poetics of Ambush

Casey Dué & Mary Ebbott, Iliad 10 and the Poetics of Ambush. A Multitext Edition with Essays and Commentary, Hellenic Studies Series 39, Center for Hellenic Studies 2010 – ISBN 9780674035591

This edition, commentary, and accompanying essays focus on the tenth book of the Iliad, which has been doubted, ignored, and even scorned. Casey Dué and Mary Ebbott use approaches based on oral traditional poetics to illuminate many of the interpretive questions that strictly literary approaches find unsolvable. The introductory essays explain their textual and interpretive approaches and explicate the ambush theme within the whole Greek epic tradition. The critical texts (presented as a sequence of witnesses, including the tenth-century Venetus A manuscript and select papyri) highlight the individual witnesses and the variations they offer. The commentary demonstrates how the unconventional Iliad 10 shares in the oral traditional nature of the whole epic, even though its poetics are specific to its nocturnal ambush plot.

Monica Berti & Marco Büchler on Fragmentary Texts (Digital Classicist Seminar, London – July 30th, 2010)

Fragmentary Texts and Digital Collections of Fragmentary Authors

Monica Berti (Torino) and Marco Büchler (Leipzig)

Digital Classicist and Institute of Classical Studies Seminar 2010

Friday July 30th at 16:30, in room STB9, Senate House, Malet Street, London WC1E 7HU

The term fragment is applicable to a wide range of ancient evidence, which includes archaeological ruins, epigraphical and papyrological documents, and many other pieces of the material record. By “fragmentary texts” we mean not only material remains of ancient writings, but also quotations of lost texts preserved through other texts. A huge number of quotations of lost texts has been gathered in print collections, enabling scholars to reconstruct lost works and depict the personality of fragmentary authors.

Information technologies and hypertextual models permit the expression of every element of print conventions, thus building a cyberinfrastructure for new digital collections of ancient sources. Representing textual fragments first involves focusing on the complex relation between the fragment and its source of transmission, given that a quotation is only a shadow of the original text. Consequently, encoding fragments is ultimately the result of interpreting them, and this involves developing a language for representing every element of their textual features, thus creating meta-information through an accurate and elaborate semantic markup. Editing fragments signifies producing meta-editions that are different from printed ones, because they consist not only of isolated quotations but also of pointers to the original contexts from which the fragments have been extracted.

Moreover, the automatic and unsupervised detection of fragmentary authors is one of the most challenging tasks in the field of Natural Language Processing. Even if computational models developed from the knowledge and skills of classicists – based on observations in texts – can be trained faster, their overall quality will not be comparable to the level of classicists in the coming years. For this reason we separate the field of collecting fragmentary authors into four working areas to support the work of classicists:

  • Associations between author and work names: This kind of association graph supports tasks such as finding all authors that have written works with the same or similar names.
  • Extraction of fragments of an author: Based on different patterns, text fragments are aligned to a fragmentary author whenever this author or his work is mentioned in the text.
  • Finding new quotations and parallel texts: Given such extracted fragments, additional quotations and parallel texts are determined.
  • Expansion of the fragments’ set: The use of all the extracted fragments, their quotations and their parallel texts, allows us to determine the semantic space or spaces of an author in order to find new possible fragment candidates of the same space.
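The first of these working areas can be illustrated with a toy association graph in Python. The sketch is not the seminar's method: the handful of author and title pairs is a small sample chosen for illustration (with one title deliberately given in a variant transliteration), and the similarity test simply uses difflib from the standard library with an arbitrary cutoff.

```python
import difflib
from collections import defaultdict

# Illustrative author/work pairs; "Persika" is a variant transliteration.
ATTRIBUTIONS = [
    ("Hellanicus", "Atthis"),
    ("Androtion", "Atthis"),
    ("Philochorus", "Atthis"),
    ("Ctesias", "Persica"),
    ("Dinon", "Persika"),
]


def authors_by_work(attributions):
    """Build the association graph as a mapping from title to authors."""
    graph = defaultdict(set)
    for author, title in attributions:
        graph[title].add(author)
    return dict(graph)


def authors_of_similar_works(title, graph, cutoff=0.8):
    """Collect authors of works whose titles match or closely resemble title."""
    titles = difflib.get_close_matches(
        title, list(graph), n=len(graph), cutoff=cutoff
    )
    authors = set()
    for t in titles:
        authors |= graph[t]
    return authors


graph = authors_by_work(ATTRIBUTIONS)
persica_authors = authors_of_similar_works("Persica", graph)
atthis_authors = authors_of_similar_works("Atthis", graph)
```

A string-similarity cutoff is a blunt instrument for "similar names"; it catches spelling variants but would need to be supplemented by an authority list for titles cited in different languages or in abbreviated form.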

During the Digital Classicist seminar two of these four working areas (whichever have made the best progress by the time of the presentation) will be explained in detail. From a more general view, it will be shown how the objective and quantitative methods of computer scientists can be combined with the qualitative in-depth working methodologies of classicists in this entirely unfunded collaboration in order to bring benefits to both communities.

ALL WELCOME

The seminar will be followed by wine and refreshments.