Contributing to Oracc: How and Why
At a minimum, an Oracc project consists of an educational 'portal' website; it may also include a corpus of cuneiform texts. Projects come in all shapes and sizes, from small student projects to large international research collaborations. We welcome them all. Here we outline what is involved in creating an Oracc project, and what the benefits are.
Why and how to build an Oracc 'portal' site
Here are some reasons why you should consider using Oracc to create an educational online resource about the ancient cuneiform world, whether or not you are also building an Oracc corpus.
- Your project will reach a far wider readership than any book you write. Oracc's public websites rank very highly in Google searches, so people will find your project. Oracc's Ancient Mesopotamian Gods and Goddesses [http://oracc.museum.upenn.edu/amgg/] site gets thousands of visitors a month, for instance.
- You will have far more control over your work (subject to Oracc's basic editorial standards), and much clearer identification as its creator, than if you contribute to Wikipedia or similar crowd-sourced initiatives.
- Oracc's default licensing encourages people to re-use your material while insisting that they clearly attribute it to you.
- If you are a teacher, you will be able to direct your students away from less reliable sites to your own resources. You can even get your students to contribute. Many of the pages of AMGG [http://oracc.museum.upenn.edu/amgg/] were written by graduate students, for instance.
- You can create your website in private and release it only when you are ready to do so. You can correct, update, and add to your website at any time.
- You can draw on all of Oracc's existing resources, easily linking to cuneiform texts in other projects, or creating a corpus of your own.
- You only need to know some basic HTML and a few extra tags. We will provide training and guidance, and there is full documentation on this website.
- You can be confident that your site will be readable on any web browser on any computer or handheld device. Your site will be securely backed up and maintained to current web standards for as long as you want it to be available.
How building an Oracc corpus works
Our basic model for corpus and tool development is that text
corpora are edited and annotated
at the source (or manuscript or tablet) level. We also know that it is often desirable to add new texts, joins, and fragments to the corpus; and to improve or update existing transliterations, translations, and annotations. Oracc thus works by merging manuscript files with lists of varying complexity to produce tools for describing and exploring the corpora in many ways. Whenever an editor or project manager edits, adds or updates the texts or data lists, the tools are rebuilt programmatically from scratch, so that the latest improvements to the annotated texts and the lists of data are automatically incorporated throughout the project.
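The rebuild-from-scratch model can be illustrated with a minimal Python sketch. It is not Oracc's actual build code: the function name `build_index`, the text identifiers, and the catalogue fields are all hypothetical, and the "tool" produced here is just a simple word index. The point it shows is that nothing is patched in place; the index is regenerated from the manuscripts and the data list every time either changes.

```python
from collections import defaultdict


def build_index(manuscripts, catalogue):
    """Rebuild a word index from scratch by merging texts with a catalogue list.

    manuscripts: dict mapping a text ID to its transliterated lines (hypothetical shape)
    catalogue:   dict mapping a text ID to metadata such as its period (hypothetical shape)
    """
    index = defaultdict(list)
    for text_id in sorted(manuscripts):
        period = catalogue.get(text_id, {}).get("period", "unknown")
        for line_no, line in enumerate(manuscripts[text_id], start=1):
            for word in line.split():
                index[word].append((text_id, line_no, period))
    return dict(index)


# Made-up identifiers and data, for illustration only.
manuscripts = {"X000001": ["a-na be-li2-ia", "qi2-bi2-ma"]}
catalogue = {"X000001": {"period": "Old Babylonian"}}
index = build_index(manuscripts, catalogue)

# An editor adds a new fragment and its catalogue entry.
# The old index is discarded and everything is rebuilt.
manuscripts["X000002"] = ["a-na szar-ri"]
catalogue["X000002"] = {"period": "Neo-Assyrian"}
index = build_index(manuscripts, catalogue)
```

After the second call, the entry for `a-na` lists occurrences in both texts, each carrying the period taken from the catalogue.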
The core Oracc standard for entering textual data is known as ATF, the ASCII Transliteration Format. ATF can support multiple translations, in any language.
Lemmatization is the process of annotating instances of forms of
words according to their dictionary headword. Oracc uses interlinear
lemmatization in the ATF transliterations to enable lemmatization
data to remain synchronized with textual changes. Even for completely new projects, the lemmatizer can be set up to draw on relevant glossaries from existing Oracc projects, thus automating much of the process.
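As a rough illustration of why interlinear lemmatization stays synchronized with the text, the Python sketch below pairs each word on a transliteration line with the entry on the `#lem:` line directly beneath it. The sample lines are a simplified approximation of ATF, and the function names `pair_lemmata` and `suggest_lemmata` are hypothetical; the ATF documentation remains the authority on the real syntax and on how the Oracc lemmatizer behaves.

```python
def pair_lemmata(atf_lines):
    """Pair each word of a text line with the entry on the #lem: line below it."""
    pairs = []
    pending_words = None  # words of the most recent text line, awaiting a #lem: line
    for raw in atf_lines:
        line = raw.strip()
        if line.startswith("#lem:"):
            if pending_words is None:
                raise ValueError("#lem: line has no text line above it")
            lemmata = [entry.strip() for entry in line[len("#lem:"):].split(";")]
            if len(lemmata) != len(pending_words):
                raise ValueError("word count and lemma count differ: " + line)
            pairs.extend(zip(pending_words, lemmata))
            pending_words = None
        elif line and line[0].isdigit():
            # A text line: a line label such as "1." followed by the words.
            _label, _, text = line.partition(" ")
            pending_words = text.split()
    return pairs


def suggest_lemmata(words, glossary):
    """Propose a lemma for each word from an existing glossary; None if unknown."""
    return [glossary.get(word) for word in words]


# Simplified, made-up sample; real ATF files carry more structure than this.
sample = [
    "1. a-na be-li2-ia",
    "#lem: ana[to]PRP; bēlu[lord]N",
]
pairs = pair_lemmata(sample)
glossary = dict(pairs)  # stand-in for a glossary borrowed from another project
suggestions = suggest_lemmata(["a-na", "szar-ri"], glossary)
```

Because the lemma line sits next to the words it annotates, a mismatch in word count is caught as soon as the text line is edited. The second function hints at how an existing glossary can pre-fill known forms, leaving only the unknown ones for the editor.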
Why build a corpus using the Oracc tools?
The CDLI catalogue provides a global repository of unique
identifiers for inscribed objects:
- you can use this catalogue to develop a corpus and eliminate duplication;
- you can also use additional catalogue data for fields not in the CDLI
You can use old data and enter new data easily:
- The heart of the Oracc tools is a strictly defined text format which
has an ASCII or Unicode input version, ATF, and an XML version used by programs (XTF).
- The Oracc group has extensive experience in legacy data
conversion and we are willing to help with substantial conversion jobs
by bringing old data into ATF.
- ATF makes easy things easy and difficult things possible.
- ATF provides a complete solution to
transliteration needs for cuneiform texts in any of the languages
written in cuneiform.
- An online template generator takes some of the drudgery out of
entering new texts.
- Many data preparers can still create consistent results thanks to
the simplicity and thorough documentation of ATF.
The Oracc tools help you get data into a well-defined and highly
consistent format and keep it that way:
- An online service, the web-based ATF checker, identifies hundreds
of different kinds of errors and can also do content validation of
graphemes and more.
- The ATF checker can produce lists of graphemes and words, with
their frequencies, and these lists can be edited to eliminate
inconsistent transliterations. The checker can then use the revised
lists to check the content of your data, ensuring that it stays valid
both in structure and content.
- ATF is backed by a rigorously defined XML document
definition, written in the international standard RELAX NG schema language.
Both the ATF input mechanism and the XML schemas are fully documented
and available on the web.
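The frequency-list workflow described above can be sketched in a few lines of Python. This is not the ATF checker itself: the function names are hypothetical, and splitting words on hyphens is a deliberate simplification of how graphemes are really delimited in ATF. The sketch counts words and graphemes, lets an editor strike an inconsistent spelling from the approved list, and then checks the data against the revised list.

```python
from collections import Counter


def frequency_lists(lines):
    """Count words and graphemes (here: hyphen-separated parts) in the given lines."""
    words = Counter()
    graphemes = Counter()
    for line in lines:
        for word in line.split():
            words[word] += 1
            graphemes.update(word.split("-"))
    return words, graphemes


def check_content(lines, approved_graphemes):
    """Report (line number, word, grapheme) for every grapheme not on the approved list."""
    problems = []
    for line_no, line in enumerate(lines, start=1):
        for word in line.split():
            for grapheme in word.split("-"):
                if grapheme not in approved_graphemes:
                    problems.append((line_no, word, grapheme))
    return problems


# Made-up data: the second line transliterates the same sign inconsistently.
lines = ["a-na be-li2-ia", "a-na be-li-ia"]
words, graphemes = frequency_lists(lines)

# The editor reviews the grapheme list and removes the inconsistent "li".
approved = set(graphemes) - {"li"}
problems = check_content(lines, approved)
```

Here `problems` points at the second line, where `be-li-ia` uses the grapheme that was removed from the approved list.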
Data Backup and Version History
The Oracc server can look after your data:
- the repository is backed up nightly;
- files are available from anywhere to download, work on, and upload again;
- a history of changes is maintained; old versions can be retrieved easily.
Once texts are entered they can be enhanced in various ways:
- Lemmatization can be added either with interlinear tags or
dynamically during ATF processing.
- Sentence boundaries may be added.
- Full support is provided for different translation styles;
translation units can be lines, groups of lines, or sentences.
- The XML format, XTF, gives programs a very rich version of your
textual data to work on. Because you type simple ASCII or Unicode
texts and augment them automatically via lists, the benefits of XTF
are achieved with as little human effort as possible.
The same transliterations and translations can be presented in several ways:
- Online, as HTML; we can even host your project either in its
entirety or in part.
- In print, by converting and importing the results into most modern
word processors and page-layout programs.
- Projects hosted on the Oracc web server can easily take advantage of
sophisticated searching and web-display facilities.
- Glossaries derived from your lemmatization, and/or the various lists produced by the ATF web service, can be used as
indices of print publications.
Corpora prepared with these tools are reusable and more useful:
- they add to the globally searchable and browsable Oracc;
- they provide more instances, and a greater variety of them, for use in online sign lists and dictionaries.
For more information on how to manage a project, go to
the Manager section of the Oracc
23 Jul 2014
Steve Tinney & Eleanor Robson
Steve Tinney & Eleanor Robson, 'Contributing to Oracc: How and Why', Oracc: The Open Richly Annotated Cuneiform Corpus, Oracc, 2014 [http://oracc.museum.upenn.edu/doc/about/contributing/howandwhy/]