Euripides Scholia: The XML Structure

The XML Structure and Technical Details

XML

The base form of the digital edition of the scholia is an XML document. XML (eXtensible Markup Language) is an international standard for markup, allowing the creation of computer data structures that are easily reprocessed and do not depend on particular operating systems or applications. XML documents are encoded in Unicode, the international standard for encoding the world’s various language scripts and other systems of symbols. This allows for the use of polytonic Greek as well as roman characters, plus metrical and other symbols in the edition.

TEI

TEI stands for the Text Encoding Initiative, a non-profit project providing a standard for sophisticated markup of complex textual documents. TEI originated with the precursor to XML, SGML (Standard Generalized Markup Language), but in recent years TEI definitions have been rewritten in XML. The version of the TEI structure that has been adopted for this edition is known as P5. TEI has been and is being used in a number of projects (for example, EpiDoc) and is looked upon with favor by the U.S. National Endowment for the Humanities in relation to its support of digital projects in the humanities.

A Structure for the Euripides Scholia

TEI allows a vast range of possibilities for markup, but each project is entitled to use whatever subset seems most appropriate. The level of detail in the markup may vary justifiably according to the purposes of the edition and the time available. In a TEI digital edition, various metadata, background information, and declarations of particular usages are included in a teiHeader element that precedes the text element of the document. Within the text element, there are elements for front, body, and back. So far, I have created content within the XML edition itself only for the body element (much of the content of this web site could be converted to parts of the front and back). The structure of this edition is based on the use of four levels of the TEI division-type element, from the largest, div1, to the smallest needed here, div4. Every division element can be given an attribute called ‘type’ (attribute names are conventionally shown as follows: @type), and this attribute is essential to differentiating various structures in the edition.

The div1 element serves to enclose all the material that relates to one tragedy. So far, therefore, there is just one div1: its @type is ‘subdivisionByPlay’, and it also has another attribute, @xml-id, with the value ‘Orestes’. The div1 for Hecuba will have the same value for @type, but its @xml-id will be ‘Hecuba’. At a later point, there will also be a div1 with @type of ‘preliminaryTexts’ to contain the forms of the Life of Euripides found in the manuscripts of the tragedies and any other prefatory items related to the whole corpus (for instance, epigrams on Euripides).

The div1 element encloses one or two div2 elements. If there is any prefatory material in the manuscript tradition of a play, then the first div2 contains this (@type is ‘hypotheseis’ and @xml-id is ‘hypOrestes’). There will always be a div2 containing the scholia on the play (@type is ‘scholia’ and @xml-id is ‘schOrestes’).
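
The nesting of the first two division levels can be sketched as a skeleton and walked programmatically. What follows is only an illustration in Python, not an excerpt from the project’s files: the fragment omits the teiHeader, the TEI namespace, and all content, and it assumes that the identifier called @xml-id in this description is written as the standard xml:id attribute.

```python
import xml.etree.ElementTree as ET

# Illustrative skeleton only: teiHeader, namespace declaration, and all real
# content are omitted. Writing the identifier as xml:id is an assumption.
SKELETON = """
<text>
  <body>
    <div1 type="subdivisionByPlay" xml:id="Orestes">
      <div2 type="hypotheseis" xml:id="hypOrestes"/>
      <div2 type="scholia" xml:id="schOrestes"/>
    </div1>
  </body>
</text>
"""

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def outline(xml_text):
    """Return (depth, tag, type, id) for every div1 and div2 in document order."""
    root = ET.fromstring(xml_text)
    rows = []

    def walk(elem, depth):
        if elem.tag in ("div1", "div2"):
            rows.append((depth, elem.tag, elem.get("type"), elem.get(XML_ID)))
        for child in elem:
            walk(child, depth + 1)

    walk(root, 0)
    return rows


for depth, tag, div_type, div_id in outline(SKELETON):
    print("  " * depth + f"{tag} type={div_type} id={div_id}")
```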

I will first describe the scholia division in detail. Each item that I have decided to treat as a separate scholion is contained in its own division of the next level, div3. In the structure adopted here, div3 always has three required attributes and occasionally has an optional fourth attribute. The first two required attributes classify the scholia. @type is used to classify the scholia as older or younger or connected to a named Palaeologan scholar, and in some cases this category has to have a mixed value (as when the same item is both old and Moschopulean). Thus, @type takes a value from the following list: vet, rec, mosch, thom, tri, plan, vetMosch, vetThom, vetMoschThom, recMosch, recThom, recMoschThom, moschThom. @subtype is used for a rough classification of the content and takes a value from the following list: exeg (an exegetic scholion, explaining some matter of textual interpretation, mythography, genealogy, customs, staging, or the like); gloss (an annotation usually of only one or two words, not counting an introductory word like ἤγουν, ἤτοι, καί or an optional δηλονότι, giving a synonym or supplying an understood term); paraphr (a grouping of glosses into a paraphrase of more than a couple of words, without the sort of elaboration that might be found in an exegetic scholion that also uses paraphrase); gram (a grammatical note centered on etymology or usage per se without explicit relation to the passage at hand); metr (metrical notes, including technical descriptions of cola and notations about synizesis, resolution, vowel length, and the like); artGloss (a gloss that consists only of the article agreeing with the glossed word); etaGloss (an eta placed over a Doric alpha in a lyric passage to indicate the normal form). The lists of possible values can be expanded if that seems desirable, or if there is time to make finer distinctions among exegetic scholia. I hesitated for a while over when to use the value gloss. I considered limiting it to synonyms of the glossed word, since some other one-word notations are in a sense exegetical, supplying an understood verb form or a clarifying possessive. But short glosses, whether synonyms or not, reflect the same kind of pedagogical activity or intellectual practice, so I have adopted the wider definition, and this also means that suppressing the display of glosses removes the distraction of all the short and usually elementary annotations.
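
Because both attributes draw on closed lists, a classification can be checked mechanically. The Python sketch below simply restates the two lists given above as sets; the function name and the wording of the error messages are hypothetical, and in the project itself this kind of check is the job of the schema, not of a script like this one.

```python
# Allowed values, copied from the lists in the description above.
SCHOLION_TYPES = {
    "vet", "rec", "mosch", "thom", "tri", "plan",
    "vetMosch", "vetThom", "vetMoschThom",
    "recMosch", "recThom", "recMoschThom", "moschThom",
}
SCHOLION_SUBTYPES = {
    "exeg", "gloss", "paraphr", "gram", "metr", "artGloss", "etaGloss",
}


def classification_errors(div_type, subtype):
    """Return a list of problems with a div3's @type and @subtype (empty if none)."""
    errors = []
    if div_type not in SCHOLION_TYPES:
        errors.append(f"unknown type: {div_type!r}")
    if subtype not in SCHOLION_SUBTYPES:
        errors.append(f"unknown subtype: {subtype!r}")
    return errors


print(classification_errors("vetMosch", "gloss"))  # no problems reported
print(classification_errors("old", "gloss"))       # 'old' is not in the list
```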

The third required attribute is @xml-id, which must be unique for each div3. The unique value is built as follows: the first two letters of the Latin title of the play (He, Or, Ph, Me, Hi, Al, An, Tr, Rh); the line number of the only line to which the scholion applies or of the first line of a range of lines to which the scholion applies, expanded with leading zeros to make a four-digit number (0003, 0046, 0589, 1532); a decimal point; and two digits representing the sequence in which I have decided to arrange the notes under a single line number, from 01 to (theoretically) 99. This system will suffice for the initial compilation, but there must also be a mechanism for adding new scholia at an appropriate point within the sequence. If a new item needs to be placed after the item with @xml-id of Or0014.06 and before Or0014.07, it will be Or0014.06a (and if more than one, then Or0014.06b and so forth).
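
The construction rule for these identifiers is regular enough to express in a few lines. The following Python sketch is an illustration under stated assumptions, not code from the project: the function names are hypothetical, and the regular expression allows only a single lowercase letter as the insertion suffix.

```python
import re

# Two-letter play prefixes listed in the description above.
PLAY_PREFIXES = ("He", "Or", "Ph", "Me", "Hi", "Al", "An", "Tr", "Rh")

# prefix, four-digit line, point, two-digit sequence, optional insertion letter
ID_PATTERN = re.compile(r"^([A-Z][a-z])(\d{4})\.(\d{2})([a-z]?)$")


def make_scholion_id(play, line, seq, suffix=""):
    """Build an identifier such as Or0014.06 or Or0014.06a."""
    if play not in PLAY_PREFIXES:
        raise ValueError(f"unknown play prefix: {play!r}")
    if not 0 <= line <= 9999:
        raise ValueError("line number must fit in four digits")
    if not 1 <= seq <= 99:
        raise ValueError("sequence number must be between 01 and 99")
    return f"{play}{line:04d}.{seq:02d}{suffix}"


def parse_scholion_id(xml_id):
    """Split an identifier into (play, line, sequence, suffix)."""
    match = ID_PATTERN.match(xml_id)
    if match is None:
        raise ValueError(f"malformed identifier: {xml_id!r}")
    play, line, seq, suffix = match.groups()
    return play, int(line), int(seq), suffix


print(make_scholion_id("Or", 14, 6))
print(make_scholion_id("Or", 14, 6, "a"))
```

One consequence of the zero-padding is that, within a single play, a plain string sort already puts an inserted item such as Or0014.06a between Or0014.06 and Or0014.07.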

The optional attribute of each scholion div3 is @n. This is necessary only for a scholion that applies to a range of lines, and it provides the explicit value to be displayed in the HTML version. When a scholion belongs to a single line, the line number to be displayed is generated instead by a function in the processing instructions that extracts it from the @xml-id.
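
The fallback from @n to the identifier can be sketched as follows. In the edition this logic lives in the stylesheet; the Python version here is only a stand-in to show the idea, the function name is hypothetical, and the form of the range value (‘140-142’) is an assumed example, since the exact format of @n is not specified above.

```python
def displayed_line(xml_id, n=None):
    """Return the line reference to display for a scholion.

    If an explicit @n is present (a range of lines), it is used as given.
    Otherwise the four digits after the two-letter play prefix are taken
    from the identifier and the leading zeros are dropped.
    """
    if n is not None:
        return n
    return str(int(xml_id[2:6]))


print(displayed_line("Or0014.06"))             # single line, taken from the id
print(displayed_line("Or0140.01", "140-142"))  # hypothetical explicit range
```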

The core of the structure, and what makes possible the optional inclusion of different kinds of information and the display of various levels of detail to different users, is the sequence of div4 elements that are the children of each scholion div3. The only one of these that is mandatory is the one with @type of ‘schText’, enclosing the text of a single scholion with its lemma and its witness list. TEI requires the use of child element p (paragraph) here, but forbids giving it a @type, so this element doesn’t contribute usefully to the tagging of content or the processing. Before the text of the scholion there may or may not be an element seg (segment) with @type of ‘lemma’ and @subtype of either ‘inMS’ or ‘added’, to reflect whether there is an explicit lemma in at least one of the witnesses or the lemma has been added by the editor (added lemmas are processed to be displayed between angle brackets, which are U+27E8 and U+27E9, not the less-than and greater-than signs, U+003C and U+003E). This segment is optional because occasionally it does not seem justified to supply a lemma (as when a scholion applies to a whole line). If the text of the scholion is more than one sentence, then each sentence is tagged as an s element with an attribute @n to provide sentence numbers. These numbers are needed to make the references in the apparatus criticus easier. The lineation of a digital edition is not fixed, so it is impossible to key an apparatus item to a line number. Anchoring each apparatus item to a single word or phrase is possible, but the markup would be far too time-consuming and in my opinion out of proportion to any possible gain. After the text of the scholion, a required seg with @type of ‘witnesses’ contains the sigla of the manuscripts that contain the scholion. (For the information conveyed by superscripts after a siglum in the HTML display, see the discussion below concerning the div4 for lemma and position.)
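
To make the shape of a ‘schText’ division concrete, here is a small Python sketch that parses a made-up fragment and flattens it into one display string. Everything in the sample is placeholder material (the lemma, the two sentences, the sigla), the namespace is omitted, and the output layout is a guess for illustration, not the format the stylesheet actually produces.

```python
import xml.etree.ElementTree as ET

# Placeholder fragment: invented content, no namespace, only the elements
# and attributes named in the description above.
SAMPLE = """
<div4 type="schText">
  <p>
    <seg type="lemma" subtype="added">LEMMA</seg>
    <s n="1">First sentence.</s>
    <s n="2">Second sentence.</s>
    <seg type="witnesses">XXaXb</seg>
  </p>
</div4>
"""


def render_sch_text(xml_text):
    """Flatten a schText div4 into 'lemma sentences witnesses' on one line."""
    paragraph = ET.fromstring(xml_text).find("p")
    parts = []
    for child in paragraph:
        text = (child.text or "").strip()
        if child.tag == "seg" and child.get("type") == "lemma":
            if child.get("subtype") == "added":
                # Editorial lemma: U+27E8 and U+27E9, not U+003C and U+003E.
                text = "\u27e8" + text + "\u27e9"
            parts.append(text)
        elif child.tag == "s":
            parts.append(text)
        elif child.tag == "seg" and child.get("type") == "witnesses":
            parts.append(text)
    return " ".join(parts)


print(render_sch_text(SAMPLE))
```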

There are seven other kinds of div4 that may or may not follow the text of each scholion. In order, the @type of these is drawn from the following list: engTrans (translation will be provided for some of the more interesting scholia; in the current sample, only a few of these are included for testing purposes); lemmaPosNote (details about variations in the lemma, the presence of reference symbols linking the scholion to the text, and the position of the note); appCrit (main apparatus criticus); appCrit2 (for orthographica and other minor matters); commentSim (for commentary and citation of similia); collNotes (collation notes, that is, record of difficulties in reading images, of divergences from previous reports, and reminders to check the original or a better image at some future date); keywords (keywords, for additional description of the content in aid of searching or indexing at a later point). The last two are not really meant for public display, but for the author or eventual collaborators, though they are accessible in the current sample.
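
The ordering constraint on these optional divisions can be stated compactly in code. The Python sketch below is illustrative only: it covers the regular pattern beginning with ‘schText’, it assumes that each optional type occurs at most once per scholion, and the function name is hypothetical.

```python
# Optional div4 types in the order given in the description above.
OPTIONAL_DIV4_ORDER = [
    "engTrans", "lemmaPosNote", "appCrit", "appCrit2",
    "commentSim", "collNotes", "keywords",
]


def div4_sequence_ok(types):
    """Check that a scholion's div4 @type values follow the prescribed order.

    The sequence must start with 'schText'; the remaining values must come
    from the optional list, in list order, each at most once (an assumption).
    """
    if not types or types[0] != "schText":
        return False
    rest = types[1:]
    if any(t not in OPTIONAL_DIV4_ORDER for t in rest):
        return False
    positions = [OPTIONAL_DIV4_ORDER.index(t) for t in rest]
    return all(a < b for a, b in zip(positions, positions[1:]))


print(div4_sequence_ok(["schText", "lemmaPosNote", "appCrit"]))
print(div4_sequence_ok(["schText", "keywords", "appCrit"]))
```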

The div4 for the translation contains nothing but a p for the text of the translation.

The div4 for lemma and position contains a p with one to three seg elements: values for @type of these segments are ‘lemmaNote’, ‘refSymb’, ‘pos’. The position segment has two kinds of information: first, it records whether items are above the line, marginal, or intermarginal (all as opposed to being part of a recognizable block of scholia); second, it records variations in the ordering of scholia with respect to each other, or whether a scholion is continued from a previous item without apparent separation. Some editors of scholia suppress information about location, and there may be justification for that in some circumstances. This information seems to have some value, however, in that this edition is intended to be expandable and to provide details that may turn out to be useful to someone who later collates a witness never used before. One might have wanted simply to list the witnesses with superscript indications of position right after the text of the scholion. But XML does not handle such modifications easily, and for practical reasons I have therefore kept the use of items needing to be displayed as superscripts to a minimum. Thus, instead of listing after a gloss shared by Moschopulean and Thoman witnesses the sequence X^sXa^sXb^sT^sY^sGr^sZ^sZa^sZm^s, I have preferred to list the witnesses as XXaXbTYGrZZaZm and to enter the note ‘s.l.’ in the position segment. This does not mean that superscript modifications of sigla do not occur at all: they are still necessary to distinguish different hands (1, 2, 3), or different versions of the same note at different locations in the same witness (for instance, R^a for scholia in the margins of the text of R, but R^b for the scholia written in a continuous block after the end of the text of Orestes; or M^s along with M when the same note is both in the scholia block of M and written above the line in the text). To handle such cases, I use a seg with @type of ‘witMod’, and such a segment can occur within the witness list, in remarks about lemma or position, in the apparatus criticus and in other div4 elements except the translation and keywords.

The div4 for the apparatus criticus contains a p with one or more seg with @type of ‘appItem’. For scholia of more than one sentence, an untagged number is added to the first item of the apparatus located in a particular sentence. The apparatus criticus is an area in which I have decided not to use the TEI mechanisms for apparatus criticus readings and variants, because in a project of this kind it seems to me that it would involve an unjustifiably large overhead of markup. I believe the information familiar to those who know how to read the apparatus criticus of a classical text can be provided in textual segments. This does mean that one will not be able to take my XML document and process it to produce a text that reflects the textual choices and errors of a particular witness, which probably would be possible with a more elaborate markup of readings and witnesses with pointers to specific words in the text. Such a project would require more personnel and a much larger budget, and I don’t think the benefit would be worth the cost. The secondary apparatus, for orthographica and minor curiosities that don’t need to take up space in the main apparatus but may be useful to collators, has a similar structure, except that its segments have @type of ‘orthogr’.

The div4 for the comment and similia contains nothing but a p for the text added here.

The div4 for the collation notes contains a p with one or more seg elements with @type of ‘other’.

The div4 for the keywords contains a p with one or more seg elements with @type of ‘keywds’. Each such seg contains a word or phrase.

The vast majority of the scholia have markup as described so far. There is an alternative pattern of markup for the metrical scholia that describe the metrical form colon by colon (for testing purposes, the initial sample contains Triclinius’ scholia on the parodos of Orestes). In this case, the first div4 element has @type of ‘schTextMetrAna’; this is structured as for regular scholia, but any part of the note that precedes the description of the first colon is tagged as a single s with @n of 0, so that the sentence describing the first colon will have @n of 1; also, if Triclinius describes two successive cola as the same, then that s will have a range for @n (for example, 5-6 if he says the fifth and sixth cola have the same pattern). When a div4 of this type occurs, it is always followed by another div4 with @type of ‘metrScheme’. This contains one p enclosing s elements with @n corresponding to the numbering of the sentences in the scholion itself. Each s has within it two seg elements, the first to contain the metrical scheme in symbols for long, short, etc., the second to contain the Greek text of the colon as it appears in Triclinius. The two @type values are ‘metrScheme’ and ‘triColon’ (despite the latter name, the same value can be used when an anonymous metrical scholion is marked up: the author of the scholion is conveyed by the tagging of @type at the level of the div3 parent). After this, the other div4 possibilities are identical to those available for the other scholia. By treating the metrical scholia with a different tagging, it becomes possible to process the XML into a modified display so that the metrical scheme and actual text of Triclinius are seen side by side with the scholion (rather than separately at the back of the book, as in De Faveri’s printed edition).
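
The numbering convention for the metrical divisions can be illustrated with a short Python sketch. The fragment is invented (SCHEME-1, COLON-1, and so on are placeholders for the metrical symbols and the Greek text), the namespace is omitted, and the function names are hypothetical; the sketch shows only how an @n such as ‘5-6’ can be expanded and how the rows of a ‘metrScheme’ division can be looked up by the same @n that numbers the sentences of the scholion.

```python
import xml.etree.ElementTree as ET

# Placeholder fragment for a metrScheme div4; contents are invented.
METR_SCHEME = """
<div4 type="metrScheme">
  <p>
    <s n="1"><seg type="metrScheme">SCHEME-1</seg><seg type="triColon">COLON-1</seg></s>
    <s n="2"><seg type="metrScheme">SCHEME-2</seg><seg type="triColon">COLON-2</seg></s>
  </p>
</div4>
"""


def cola_covered(n_value):
    """Expand an @n value such as '0', '3', or '5-6' into colon numbers."""
    first, _, last = n_value.partition("-")
    start = int(first)
    end = int(last) if last else start
    return list(range(start, end + 1))


def scheme_rows(xml_text):
    """Map each @n to its (metrical scheme, colon text) pair."""
    rows = {}
    for sentence in ET.fromstring(xml_text).iter("s"):
        segs = {seg.get("type"): (seg.text or "") for seg in sentence.findall("seg")}
        rows[sentence.get("n")] = (segs.get("metrScheme"), segs.get("triColon"))
    return rows


print(cola_covered("5-6"))
print(scheme_rows(METR_SCHEME)["2"])
```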

The argumenta or prefatory material have a very similar structure to the scholia. Recall that the relevant div2 has @type of ‘hypotheseis’. Each prefatory item is tagged as a div3, with @type classifying the different sorts (values: epitome, AristByz, misc, argThom (the long Thoman argument), Thoman (miscellaneous notes in Thoman witnesses), dramatisPersonae) and @n supplying a numeration. The first div4 then contains the actual item, and further div4 elements can be added for apparatus criticus and the other types discussed above (at present there is no division for translation, but it could easily be added to the structure; and there is no division for lemma and position). For more details on the markup, examine the .rng file or the .xml file which are posted at the License & Source Files page.

XML Validation

XML editing for this project has been performed with the Oxygen XML Editor, a Java application that I run under Mac OS X 10.6.x. It is a commercial product, but has an affordable academic license. In working with XML it is normal to have the document validated against some template or schema to ensure that all elements and attributes are being used in the correct fashion. TEI P5 offers an array of modules for different kinds of content and structures, and so far the scholia edition uses only a limited range of modules. One can create a validation document using the Roma tool on the TEI site. Initially I used a fairly complete document generated by Roma. In Oxygen, one associates the validation document with the XML file being worked on, and the program continuously checks and flags errors if any are found. As the XML file became bigger and its structure clearer and less tentative, I realized that it would be a great advantage to have a more specific validation document, and so I created from scratch a RELAX NG (XML format) schema document (and Oxygen’s built-in tools and validation mechanism helped greatly with this). The source file for anyone who is interested is ‘teiScholiaModelMar24.rng’ (or a similar name with a more recent date). The advantage of this document is that it contains the specifications of the allowable values for all attributes and other information about the logical structure. Because of this, Oxygen is able to automatically supply or complete some parts of what is being typed as well as to flag any mistakes in typing the markup, mistakes that might not be caught by the non-specific Roma-generated schema and that would result in omissions or odd display at a later stage of the project. Periodically, one can switch between the more generalized and the more specific validation files to ensure that the structure has maintained TEI-compliance after any revisions to the structure.

XSLT

XSLT is an acronym for eXtensible Stylesheet Language: Transformations. It is an XML-based programming language that can be used to process XML into other formats (such as differently tagged XML or XHTML or HTML or PDF). XSL documents can be written and validated in Oxygen, and Oxygen also has the capacity to apply the transformation to a document in an environment for debugging. After reading much of a large book on XSLT, I built up a stylesheet gradually, partly by trial and error, and eventually arrived at the ones used in the current version of the project. The first task was to generate an HTML file containing everything in the body element of the TEI structure (and this means the text, since there is not yet any content in front or back). This is partly a matter of processing each element in the right way, and partly a matter of deciding how to tag for HTML formatting (see next, under CSS). The most confusing problem I encountered in the process was dealing with what are known as namespaces. When I used the Roma validation and declared the TEI namespace in my XML edition, it was necessary to use the namespace prefix ‘tei:’ in front of every element name in the stylesheet instructions; when I switched to my more specific validation document, it was necessary to remove all those prefixes. Namespace prefixes still seem somewhat troublesome, since the transformation to HTML inserted namespace attributes into some tags, and those were in turn flagged as not allowed when the HTML was validated with Bare Bones BBEdit. I don’t quite understand what is involved here, but it does not seem to matter; in practice I do a global removal of those namespace attributes in the HTML document with BBEdit in a matter of seconds.
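
The behavior of the prefixes can be reproduced outside XSLT. The Python sketch below is an analogy only, using the standard library’s ElementTree and not a stylesheet: it shows that once a document declares the TEI namespace, an element has to be addressed by its qualified name, and that the unprefixed path works only when no namespace is declared. The two tiny documents are invented for the demonstration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# The same structure, once in the TEI namespace and once in no namespace.
WITH_NS = f'<TEI xmlns="{TEI_NS}"><text><body/></text></TEI>'
WITHOUT_NS = "<TEI><text><body/></text></TEI>"


def find_body(xml_text, use_prefix):
    """Look for text/body, with or without a 'tei:' prefix on each step."""
    root = ET.fromstring(xml_text)
    if use_prefix:
        return root.find("tei:text/tei:body", {"tei": TEI_NS})
    return root.find("text/body")


print(find_body(WITH_NS, use_prefix=True) is not None)     # found
print(find_body(WITH_NS, use_prefix=False) is not None)    # not found
print(find_body(WITHOUT_NS, use_prefix=False) is not None) # found
```

On the XSLT side, namespace declarations that are copied into the output can generally be suppressed with the exclude-result-prefixes attribute on the xsl:stylesheet element; whether that would remove the attributes seen in this project’s HTML has not been tested here.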

Processing the XML file with the XSLT file requires the use of a processing program. The free open-source program Saxon-HE 9.2.x.x can be used within Oxygen during debugging, but once debugging is finished, it is much faster to install the Java archive (.jar) of Saxon-HE and run it from the command line in Terminal.
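
A command-line run of this kind can be sketched as follows. The Python function only assembles the argument list and does not launch anything; the jar location and the three file names are hypothetical, and the options shown (-s:, -xsl:, -o:) are those of the Saxon 9 command-line interface, so they should be checked against the documentation for the release actually installed.

```python
def saxon_command(source, stylesheet, output, jar="saxon9he.jar"):
    """Build the argument list for a Saxon-HE 9.x transformation.

    Nothing is executed here; pass the list to subprocess.run() to run it.
    The jar path and all file names are placeholders.
    """
    return [
        "java", "-jar", jar,
        f"-s:{source}",        # source XML document
        f"-xsl:{stylesheet}",  # XSLT stylesheet
        f"-o:{output}",        # output file
    ]


# Hypothetical file names, for illustration only.
print(" ".join(saxon_command("scholia.xml", "scholia.xsl", "scholia.html")))
```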

Once a stylesheet that generated the full data was tested out and found successful, it involved only a few minor edits of the stylesheet to cause it to generate instead some subset of the data (old scholia only, scholia without glosses, and the like). One then ends up with a series of .html files with the same markup but different content. Although XSLT could also be used to generate files to vary the display (scholia only, scholia with main apparatus only, scholia with extras except collation notes and keywords displayed), it is more efficient to use different CSS stylesheets to effect the variations. The XSLT stylesheet, on the other hand, could be modified for some kinds of revisions to the format, such as the wording of the labels of the paragraphs following each scholion, or removal or rephrasing of the display of the @type and @subtype, which are visible after the line number in the current version partly for the sake of proofreading.
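
The idea of deriving subsets by testing @type and @subtype can be shown with a small Python sketch. This is a stand-in for the stylesheet edits, not a transcription of them: the three div3 elements are invented, the namespace is omitted, the identifier is assumed to be written as xml:id, and two further assumptions are built into the filters (that every @type beginning with ‘vet’ counts as old, and that artGloss and etaGloss are suppressed together with gloss).

```python
import xml.etree.ElementTree as ET

# Invented sample: three empty scholia with classification attributes only.
SAMPLE = """
<div2 type="scholia">
  <div3 type="vet" subtype="exeg" xml:id="Or0001.01"/>
  <div3 type="mosch" subtype="gloss" xml:id="Or0001.02"/>
  <div3 type="vetMosch" subtype="gloss" xml:id="Or0002.01"/>
</div2>
"""

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
GLOSS_SUBTYPES = {"gloss", "artGloss", "etaGloss"}  # assumption, see lead-in


def is_old(div3):
    """True for vet and for the mixed types that begin with 'vet' (assumption)."""
    return div3.get("type", "").startswith("vet")


def not_gloss(div3):
    """True for anything that is not one of the gloss subtypes."""
    return div3.get("subtype") not in GLOSS_SUBTYPES


def select_ids(xml_text, keep):
    """Return the identifiers of the div3 elements accepted by the filter."""
    root = ET.fromstring(xml_text)
    return [div3.get(XML_ID) for div3 in root.findall("div3") if keep(div3)]


print(select_ids(SAMPLE, is_old))
print(select_ids(SAMPLE, not_gloss))
```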

CSS

Almost every element in the HTML code that is generated has a ‘class’ attribute, and thus the formatting of the browser display can be handled through yet another document, in the language known as CSS (Cascading Style Sheets). Margins, indentation, font-family, font-size, superscript position, colors, backgrounds, etc. can all be modified by adjustments to the CSS stylesheet. Suppressing particular elements of the display is accomplished by using a slightly altered stylesheet, one in which any elements that are to be omitted (such as collation notes and keywords) have the CSS property ‘display’ added to the associated paragraph style and given the value ‘none’. In the current web site, all the different forms of the HTML document are generated in advance, and the choice among them is offered through a simple HTML select element that uses client-side JavaScript to load the proper document. The site can therefore be tested or used on the local file system of one’s computer without a web server or CGI programming. Eventually, however, a home might be found where the application of different CSS files would be handled through programming; but that would require a different level of technological support.
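
The altered stylesheets differ from the full one only by a few suppression rules, which could be generated rather than typed. The Python sketch below is purely illustrative: it assumes that the paragraphs carry class names identical to the div4 @type values (collNotes, keywords), which is a guess about the generated HTML, and the function name is hypothetical.

```python
def hidden_rules(class_names):
    """Return CSS rules that suppress the paragraphs with the given classes."""
    return "\n".join(f"p.{name} {{ display: none; }}" for name in class_names)


# Class names are assumed to match the div4 @type values.
print(hidden_rules(["collNotes", "keywords"]))
```

Appending these rules to a copy of the full stylesheet would give the variant in which collation notes and keywords are not shown.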

The Applications

Oxygen XML Editor has a variety of excellent capabilities, but it has the disadvantage of running in a Java virtual machine rather than natively in Mac OS X. While its performance on files of several dozen lines is fine, some processes involving a large file (such as loading the document, fixing its indentation, or changing the size of the window) are painfully slow, even after adjusting the memory of the program. Adjusting the memory involves entering the application bundle and editing the Info.plist, and you must do this again every time a new release is downloaded. Nevertheless, after a time one does learn to take advantage of various shortcuts, such as autocompletion based on the schema file written for the specifics of the scholia or the use of code templates for the chunks of markup that one uses most frequently. For some purposes it is handy to open up the .xml file in Bare Bones BBEdit instead, to enjoy that program’s excellent design and speed for comparing versions or doing global replacements. BBEdit is also essential for cleaning up the HTML files generated by the XSLT processing (the apparently objectionable namespace attributes were mentioned above).

The collations began their life as MS Word files (.doc). It would be nice if one could collate directly into the XML version instead of having to copy and paste the material from Word into Oxygen. Now that the structure to be used is less tentative, there may be a way to format future Word documents containing scholia and the sequenced information gathered in collation so that XSLT could be applied to Word's XML format (.docx) to get much of the markup for the scholia edition. But experimentation with that will have to wait for another day.