L2: Signature/Lemmatization Syntax

This document describes extant and planned elements of the syntax of signatures and the lemmatization specifications that use them.

The forms of signatures and inline lemmatizations proper are identical as far as core and adjunct fields are concerned. Signatures are simply lemmatizations prefixed with a project and a lang/form pair.

Prefix Fields

Key Char   Field
@...       Project name
:...=      Form (Unicode text, no 'equals' signs)

Core Fields

Key Char   Abbrev   Full Name
(none)     CF       Citation Form
[...]      GW       Guide Word
(none)     POS      Part of Speech
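
As a minimal sketch of how the core fields fit together, the following Python splits a lemma of the shape CF[GW]POS into its three parts. The function name parse_core and the regular expression are illustrative assumptions, not part of Oracc. Adjunct fields are not handled, and POS is assumed here to consist of letters only.

```python
import re

# Hypothetical pattern for the core fields only: CF, then [GW], then an optional POS.
# Assumption: CF and GW contain no square brackets, and POS is purely alphabetic.
CORE_RE = re.compile(r"^(?P<cf>[^\[\]]+)\[(?P<gw>[^\[\]]*)\](?P<pos>[A-Za-z]*)$")


def parse_core(lemma):
    """Split a lemma such as 'Anu-uballit[1]PN' into CF, GW and POS."""
    match = CORE_RE.match(lemma)
    if match is None:
        raise ValueError(f"not a CF[GW]POS lemma: {lemma!r}")
    return {
        "CF": match.group("cf"),
        "GW": match.group("gw"),
        # POS is optional in this sketch; an empty match is reported as None.
        "POS": match.group("pos") or None,
    }
```

Under these assumptions, parse_core("abišu[father]") yields a CF of abišu, a GW of father and no POS, while a string without a bracketed Guide Word raises ValueError.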

Adjunct Fields

Key Char   Abbrev   Full Name
'          EPOS     Effective Part of Speech
/          BASE     Word Base
+          CONT     Base Continuation
*          STEM     Word Stem
#          M1       Morphology 1
##         M2       Morphology 2

Note: augmentation and disambiguation do not need to be handled explicitly in signatures because they are rewritten as part of the FORM or M1 fields.

Para-lemma Features


Properties can also be specified on lemmata using the '$'-notation. The full form is:


No spaces are allowed. If 'VALUE' is unique among the values given in the project's 00lib/properties.xml, the PROPERTY component is optional, giving the short form:



Any lemma can be labeled with an anchor which can be used as the target of a reference. This can be used to handle anaphora:

#lem: ...  Anu-uballit[1]PN @1 ...

#lem: abišu[father] =1

A simple label consists of the at-sign (@) followed by digits, but arbitrary labels may be given subject to the constraint that no label may contain spaces:

#lem: ...  Anu-uballit[1]PN @mystery-man ...

#lem: abišu[father] =mystery-man
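
The anchor and reference mechanism can be illustrated with a small Python sketch that pairs each =label with the lemma carrying the matching @label. It is not how the Oracc tools work internally. The function name resolve_references is hypothetical, and the sketch assumes that tokens are separated by whitespace, that a lemma is any token containing '[', and that a label token applies to the lemma immediately before it.

```python
def resolve_references(lem_lines):
    """Return (referring lemma, label, anchored lemma or None) triples."""
    anchors = {}      # label -> lemma that carries @label
    references = []   # (lemma that carries =label, label)
    for line in lem_lines:
        last_lemma = None
        for token in line.split():
            token = token.rstrip(";")
            if token in ("#lem:", "...", ""):
                continue  # line prefix and elided material
            if token.startswith("@") and last_lemma is not None:
                anchors[token[1:]] = last_lemma
            elif token.startswith("=") and last_lemma is not None:
                references.append((last_lemma, token[1:]))
            elif "[" in token:
                last_lemma = token  # assumption: lemmata contain a [GW] part
    # Resolve after the full pass so the order of anchor and reference is free.
    return [(src, label, anchors.get(label)) for src, label in references]
```

Applied to the two lines of the mystery-man example, the sketch links abišu[father] to Anu-uballit[1]PN. A reference with no matching anchor resolves to None.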

Syntax Hinting

Top vs Internal

Top-level boundaries may be given to mark discourse (:), sentence (.), clause (;) and phrase (,) boundaries.

Bracketing is implicit between top-level constituents:

a b ; c , d e ; f g . h i : j k

is identical to:

( ( (a b) ; (c) , (d e) ; (f g) ) . ( h i ) ) : ( j k )
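
One way to picture the implicit bracketing is as repeated splitting on the boundary markers, from the widest (discourse) to the narrowest (phrase). The Python below is a sketch under that assumption. The function name bracket is hypothetical, tokens are assumed to be separated by whitespace, and explicit parentheses are not handled. Unlike the rendering above, it brackets every level uniformly, so the clause "c , d e" receives its own pair of parentheses.

```python
# Boundary markers ordered from widest scope to narrowest:
# discourse (:), sentence (.), clause (;), phrase (,).
BOUNDARIES = [":", ".", ";", ","]


def bracket(tokens, level=0):
    """Return a fully parenthesised string for a flat token list."""
    if level == len(BOUNDARIES):
        return " ".join(tokens)  # no boundary markers left at this depth
    separator = BOUNDARIES[level]
    parts, current = [], []
    for token in tokens:
        if token == separator:
            parts.append(current)
            current = []
        else:
            current.append(token)
    parts.append(current)
    if len(parts) == 1:
        # This boundary does not occur here; try the next narrower one.
        return bracket(parts[0], level + 1)
    inner = ["(" + bracket(part, level + 1) + ")" for part in parts]
    return f" {separator} ".join(inner)
```

For the token sequence of the example, bracket("a b ; c , d e ; f g . h i : j k".split()) produces (((a b) ; ((c) , (d e)) ; (f g)) . (h i)) : (j k).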

To annotate internal phrase structure one can add parentheses explicitly:

a b ; c , (d e , f g) , h i .

Here, (d e , f g) is first parsed as a top-level constituent, then recursively parsed.

Labeling Units

A unit can always be labelled by giving the label after its opening parenthesis. For units with explicit dividers, the label may be given after the divider:

(S a b ; (PRP c d)) :DATE e f

Conjunction and Modification

+&, +>, +< imply a phrase boundary, i.e., they are equivalent to +,+& etc.

kud[fish]; +& muszen[bird]
kud[fish]; tur[small]; +& (muszen[bird]; gal[big])


Linksets allow arbitrary collections of words to be grouped as discontinuous units. These are generally identified by the various analyzer programs, but we define a mechanism for specifying them manually to supplement or override the programs.

Linksets can be defined and populated using two notations:



Where INDEX may be a simple integer or a more complex symbol:

 ##date/from ... #from ... #from

 ##date/to ... #to ... #to

[MEMBER] in each case enables the lemma(ta) to be associated with an element in the linkset structure. Suppose any date should consist of a year, month and day element. A date linkset might then look something like this:

 #lem: mu[year] ##date/doc/year; Šulgir[1]RN #doc/year; lugal[king] #doc/year

 #lem: iti[month]; Ubigu[1]MN #doc/month; ud[day]; n #doc/day

Note that many dates can be parsed successfully by machine, but this mechanism allows manual tagging of dates that aren't handled by the machine.
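
The date example above can be turned into a small Python sketch that gathers the members of each linkset. It rests on assumptions inferred from the examples only: a '##' tag is read as TYPE/INDEX with an optional /MEMBER, a '#' tag as INDEX with an optional /MEMBER, lemmata are separated by ';', tags follow their lemma after whitespace, and the lemma carrying the defining '##' tag is itself counted as a member. The function name collect_linksets is hypothetical.

```python
from collections import defaultdict


def collect_linksets(lem_lines):
    """Return ({index: type}, {index: {member: [lemmata]}})."""
    types = {}
    members = defaultdict(lambda: defaultdict(list))
    for line in lem_lines:
        body = line.strip()
        if body.startswith("#lem:"):
            body = body[len("#lem:"):]
        for entry in body.split(";"):
            tokens = entry.split()
            if not tokens:
                continue
            lemma, tags = tokens[0], tokens[1:]
            for tag in tags:
                if tag.startswith("##"):
                    # Defining tag: TYPE/INDEX[/MEMBER] (assumed layout).
                    link_type, _, rest = tag[2:].partition("/")
                    index, _, member = rest.partition("/")
                    types[index] = link_type
                elif tag.startswith("#"):
                    # Populating tag: INDEX[/MEMBER] (assumed layout).
                    index, _, member = tag[1:].partition("/")
                else:
                    continue
                members[index][member].append(lemma)
    return types, {index: dict(by_member) for index, by_member in members.items()}
```

For the two #lem: lines of the date example, this yields a single linkset with index doc and type date, whose year element holds mu[year], Šulgir[1]RN and lugal[king], whose month element holds Ubigu[1]MN, and whose day element holds n.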

23 Jul 2014 osc at oracc dot org

Steve Tinney

Steve Tinney, 'L2: Signature/Lemmatization Syntax', Oracc: The Open Richly Annotated Cuneiform Corpus, Oracc, 2014 [http://oracc.museum.upenn.edu/doc/help/lemmatising/syntax/]

Released under a Creative Commons Attribution Share-Alike license 3.0, 2014.