Lemmatising: How To Use L2

This page summarises the steps required to use L2, the lemmatiser used by Oracc. First we describe what you need to know about editing ATF files, then glossary management, then rebuilding the whole project.

This page is designed as a refresher for those already familiar with lemmatisation. If you have not already done so, read the tutorial on linguistic annotation first.

Editing and fixing ATF files


If your project uses Akkadian, set the default language of each text with a protocol line [http://oracc.museum.upenn.edu/ns/xtf/1.0/protocols.html] of the form #atf: lang akk-x-[DIALECT]. For instance, if your project's language is Old Babylonian, write:

#atf: lang akk-x-oldbab

If you need to switch languages or dialects in the middle of a text, use a short code [http://oracc.museum.upenn.edu/doc/developer/l2/languages/#Language_codes]. For instance, to mark a Neo-Assyrian dialect word in an otherwise Standard Babylonian text, you can write:

{d}NIN.LIL₂ ana {d}BAD %na a-bu-su %sb DAB-su

Here, the code %na marks the switch into Neo-Assyrian, while %sb marks the switch back to Standard Babylonian.

You do not need to mark a language switch at the end of a line, as the processor automatically returns to the default language at the start of each line.
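The switching behaviour described above can be sketched in a few lines of Python. This is only an illustration of the rule, not how the Oracc processor is implemented: the function name tag_languages and the two-entry SHORT_CODES table are assumptions made for this example, and splitting on whitespace is a simplification of real ATF tokenisation.

```python
# Illustrative subset of short codes (assumption: only the two used in the
# example above; the full list is in the Oracc languages documentation).
SHORT_CODES = {"na": "akk-x-neoass", "sb": "akk-x-stdbab"}


def tag_languages(line, default_lang):
    """Pair each word of one ATF text line with the language in force."""
    current = default_lang  # every line starts in the default language
    tagged = []
    for token in line.split():
        if token.startswith("%"):
            # A %code switches language for the following words of this line.
            current = SHORT_CODES[token[1:]]
        else:
            tagged.append((token, current))
    return tagged


line = "{d}NIN.LIL₂ ana {d}BAD %na a-bu-su %sb DAB-su"
for word, lang in tag_languages(line, "akk-x-stdbab"):
    print(word, lang)
```

Because the current language is reset on every call, a line that ends in a switched language has no effect on the next line, which is why no closing switch is needed.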

For more about L2's handling of languages see the languages section of the Inline Tutorial.


When you add a new lemmatisation which has a SENSE as well as a GW, you always need to add an EPOS too, even when it is the same as the POS. For instance, instead of +šaknu[appointee//governor]N$ the correct entry is:

+šaknu[appointee//governor]N'N$

But if there is no SENSE, there is no need to add an EPOS:

+šaknu[appointee]N$

If you forget to add an EPOS where it's needed, the checker will tell you!
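If you want to scan your own notes for this mistake before running the checker, the rule can be approximated as below. This is a rough sketch and not the Oracc checker itself: the regular expression assumes the simplified shape +CF[GW//SENSE]POS'EPOS$ for a new lemmatisation, and the name missing_epos is invented for this example.

```python
import re

# Assumed shape of one new lemmatisation: +CF[GW//SENSE]POS'EPOS$NORM,
# where //SENSE, 'EPOS and NORM are all optional.
LEMMA = re.compile(
    r"\+(?P<cf>[^\[\]]+)"
    r"\[(?P<gw>[^\]/]+)(?://(?P<sense>[^\]]+))?\]"
    r"(?P<pos>[A-Z]+)"
    r"(?:'(?P<epos>[A-Z]+))?"
    r"\$\S*"
)


def missing_epos(entry):
    """Return True if the entry has a SENSE but no EPOS."""
    m = LEMMA.fullmatch(entry)
    if m is None:
        raise ValueError(f"unrecognised lemmatisation: {entry!r}")
    return m.group("sense") is not None and m.group("epos") is None


for entry in ["+šaknu[appointee//governor]N$",
              "+šaknu[appointee//governor]N'N$",
              "+šaknu[appointee]N$"]:
    print(entry, "-> EPOS missing" if missing_epos(entry) else "-> ok")
```

Only the first entry is reported, since it is the only one that gives a SENSE without an EPOS.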

COFs and PSUs

Lemmatise Compound Orthographic Forms (COFs) as described here.

You can add SENSEs to individual components of a Phrasal Semantic Unit (PSU) if this is appropriate. An overview of PSUs in L2 is given here.

Sentence boundaries

If you are in the habit of marking sentence boundaries in the lemmatisation with +., make sure that they occur before the semi-colons that mark the end of a lemmatisation, not after them. That is, the correct form is:

iddâk[kill]V +.; šumma[if]MOD;

and not:

iddâk[kill]V; +. šumma[if]MOD;
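If many existing lines have the boundary marker on the wrong side of the semi-colon, a small script can move it. The following is a minimal sketch that works on a single lemmatisation line held as a string; the function name fix_sentence_boundaries is an assumption for this example, and you should check the result by eye before saving it to your ATF files.

```python
import re

# Matches a "+." that was typed after the semi-colon closing a lemmatisation.
MISPLACED = re.compile(r";\s*\+\.")


def fix_sentence_boundaries(lem_line):
    """Move each misplaced "+." so that it precedes its semi-colon."""
    return MISPLACED.sub(" +.;", lem_line)


print(fix_sentence_boundaries("iddâk[kill]V; +. šumma[if]MOD;"))
# prints: iddâk[kill]V +.; šumma[if]MOD;
```

Lines that already have the marker in the right place are returned unchanged, so the function can be applied to a whole file line by line.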

Editing and fixing glossaries

Language/dialect glossaries

There is a glossary for each dialect of the languages in your corpus (as defined by the language tags in your ATF files), with names such as akk-x-oldass.glo and akk-x-stdbab.glo. The higher-level language glossaries, such as akk.glo, are now generated from these lower-level ones. So, when you need to hand-edit glossary entries, do so in the relevant dialect-level glossary or glossaries, not in the top-level language glossaries as before.


You can now use the byforms mechanism in your Sumerian glossary to handle phenomena such as suppletive verbs, collapsed compounds and variant frozen forms.

Byforms are not yet implemented for Akkadian, but if you see a need for them in your project please contact your liaison.

COF and PSU handling

L2 handles Compound Orthographic Forms (COFs) in exactly the same way as before. You should not need to fix COF entries in the glossary if they are already entered correctly. A brief overview of COFs in L2 glossaries is given here.

Error-checks of Phrasal Semantic Units (PSUs) are rigorous. You should not need to fix PSU entries in the glossary if they are already entered correctly, unless they also contain a COF. A brief overview of PSUs in L2 glossaries is given here.

Rebuilding L2 projects

Here are some hints on how to fix most of the error messages relating to lemmatisation. If you notice error messages that you cannot interpret, please contact your liaison for help.

Project configuration: Glossaries

You can control which glossaries are used to lemmatise your project, language by language. Use the following option as many times as you need to:

<option name="[LANGUAGE]" value="[PROJECT AND/OR GLOSSARY NAMES]"/>

For instance:

<option name="%akk-x-ltebab" value="hbtin cams/gkab"/>
<option name="%akk-x-neoass" value=". .:akk-x-stdbab"/>

Here, the lemmatiser is told to look up forms tagged as Late Babylonian first in the HBTIN project's glossary (which is all LB), then in CAMS/GKAB's LB glossary. Neo-Assyrian forms are to be looked up first in the project's own NA glossary (the meaning of .) and then in the project's own SB glossary (the . followed by a : and the relevant language code).

If you do not add an entry to the config file for a particular language, the system will just use the project glossary for that language, as expected.
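To check how a given option will be read, the lookup order described above can be spelled out with a short script. This is a sketch of the rules as stated on this page, not the parser used by L2: the function name lookup_order and the returned (project, glossary) pairs are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def lookup_order(option_xml):
    """Return the language and the ordered (project, glossary) pairs to search."""
    option = ET.fromstring(option_xml)
    lang = option.get("name").lstrip("%")
    order = []
    for item in option.get("value").split():
        # "project:glossary" names a specific glossary in that project;
        # a bare project name means that project's glossary for this language.
        project, _, glossary = item.partition(":")
        order.append((project, glossary or lang))
    return lang, order


for xml in ['<option name="%akk-x-ltebab" value="hbtin cams/gkab"/>',
            '<option name="%akk-x-neoass" value=". .:akk-x-stdbab"/>']:
    lang, order = lookup_order(xml)
    print(lang, order)
```

In the output, a project of . stands for your own project, so the second option resolves to the project's own akk-x-neoass glossary followed by its own akk-x-stdbab glossary.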

23 Jul 2014 osc at oracc dot org

Steve Tinney & Eleanor Robson

Steve Tinney & Eleanor Robson, 'Lemmatising: How To Use L2', Oracc: The Open Richly Annotated Cuneiform Corpus, Oracc, 2014 [http://oracc.museum.upenn.edu/doc/help/lemmatising/lemmatising/]


Released under a Creative Commons Attribution Share-Alike license 3.0, 2014.