This document describes extant and planned elements of the syntax of signatures and the lemmatization specifications that use them.
The forms of signatures and inline lemmatizations proper are identical as far as core and adjunct fields are concerned. Signatures are simply lemmatizations prefixed with a project and a lang/form pair.
|:...=||Form (Unicode text, no 'equals' signs)|
|Key Char||Abbrev||Full Name|
|POS||Part of Speech|
|Key Char||Abbrev||Full Name|
|EPOS||Effective Part of Speech|
Note: augmentation and disambiguation do not need to be handled explicitly in signatures because they are rewritten as part of the FORM or M1 fields.
Properties can also be specified on lemmata using the '$'-notation. The full form is:
No spaces are allowed. If 'VALUE' is unique among the values given in the project's 00lib/properties.xml, the PROPERTY component is optional, giving the short form:
Any lemma can be labeled with an anchor which can be used as the target of a reference. This can be used to handle anaphora:
#lem: ... Anu-uballitPN @1 ...
#lem: abišu[father] =1
A simple label consists of the at-sign (@) followed by digits, but arbitrary labels may be given, subject to the constraint that no label may contain spaces:
#lem: ... Anu-uballitPN @mystery-man ...
#lem: abišu[father] =mystery-man
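The anchor and reference mechanism can be illustrated with a short Python sketch. It is not part of the Oracc tools: the function name resolve_references and the simplified input (one string per lemma, with any label token separated from the lemma by a space) are assumptions made purely for illustration.

```python
def resolve_references(lemmata):
    """Pair each '=label' reference with the lemma carrying '@label'.

    `lemmata` is a list of strings such as 'abišu[father] =1'
    (a hypothetical, simplified input format, not the real parser input).
    """
    anchors = {}   # label -> lemma that carries the anchor
    pending = []   # (referring lemma, label) pairs in document order
    for entry in lemmata:
        parts = entry.split()
        if not parts:
            continue
        lemma, extras = parts[0], parts[1:]
        for token in extras:
            if token.startswith("@"):
                anchors[token[1:]] = lemma
            elif token.startswith("="):
                pending.append((lemma, token[1:]))
    # A reference to an undefined label resolves to None.
    return [(lemma, anchors.get(label)) for lemma, label in pending]


links = resolve_references([
    "Anu-uballitPN @mystery-man",
    "abišu[father] =mystery-man",
])
```

Because labels may not contain spaces, splitting on whitespace is enough to isolate them in this simplified setting.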
Top-level boundaries may be given to mark discourse (:), sentence (.), clause (;) and phrase (,) units.
Bracketing is implicit between top-level constituents:
a b ; c , d e ; f g . h i : j k
Is identical to:
( ( (a b) ; (c) , (d e) ; (f g) ) . ( h i ) ) : ( j k )
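The implicit bracketing can be sketched in Python as a recursive split on the boundary markers. This is an illustrative sketch, not the parser used by Oracc. It assumes whitespace-separated tokens, and it assumes that clause and phrase dividers produce sibling units inside a sentence, as the expansion above shows. The names LEVELS and bracket are hypothetical.

```python
# Boundary markers from the widest unit to the narrowest.
# Assumption: ';' and ',' split at the same level, following the expansion above.
LEVELS = [{":"}, {"."}, {";", ","}]


def bracket(tokens, level=0):
    """Return nested lists that mirror the implicit bracketing of `tokens`."""
    if level == len(LEVELS):
        return tokens  # innermost unit: a plain list of words
    parts, current = [], []
    for tok in tokens:
        if tok in LEVELS[level]:
            parts.append(current)
            current = []
        else:
            current.append(tok)
    parts.append(current)
    parts = [p for p in parts if p]  # ignore empty units, e.g. after a final divider
    if len(parts) == 1:
        # No divider at this level: do not add an extra layer of brackets.
        return bracket(parts[0], level + 1)
    return [bracket(p, level + 1) for p in parts]


tree = bracket("a b ; c , d e ; f g . h i : j k".split())
# tree == [[[['a', 'b'], ['c'], ['d', 'e'], ['f', 'g']], ['h', 'i']], ['j', 'k']]
```

The nested lists drop the dividers themselves; a fuller treatment would record which divider closed each unit.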
To annotate internal phrase structure one can add parentheses explicitly:
a b ; c , (d e , f g) , h i .
(d e , f g) is first parsed as a top-level constituent, then recursively parsed.
A unit can always be labelled by giving the label after its opening parenthesis. For units with explicit dividers, the label may be given after the divider:
(S a b ; (PRP c d)) :DATE e f
+< imply a phrase boundary, i.e., they are equivalent to
kud[fish]; +& muszen[bird]
kud[fish]; tur[small]; +& (muszen[bird]; gal[big])
Linksets allow arbitrary collections of words to be grouped as discontinuous units. These are generally identified by the various analyzer programs, but we define a mechanism for specifying them manually to supplement or override the programs.
Linksets can be defined and populated using two notations:
Where INDEX may be a simple integer or a more complex symbol:
##date/from ... #from ... #from ##date/to ... #to ... #to
[MEMBER] in each case enables the lemma(ta) to be associated with an element in the linkset structure. Suppose any date should consist of a year, month and day element. A date linkset might then look something like this:
#lem: mu[year] ##date/doc/year; ŠulgirRN #doc/year; lugal[king] #doc/year
#lem: iti[month]; UbiguMN #doc/month; ud[day]; n #doc/day
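As a rough illustration of how such tags could be gathered into a structure, the following Python sketch collects the members of each linkset from the date example. It rests on assumptions inferred only from the examples in this section: a tag of the shape ##TYPE/INDEX/MEMBER opens a linkset and also attaches its own lemma, a tag of the shape #INDEX/MEMBER adds a lemma to an existing linkset, and lemmata within a #lem: line are separated by semicolons. The function name collect_linksets is hypothetical.

```python
def collect_linksets(lem_lines):
    """Group tagged lemmata by linkset INDEX (sketch under the stated assumptions)."""
    linksets = {}  # INDEX -> {"type": TYPE or None, "members": [(MEMBER, lemma)]}
    for line in lem_lines:
        if line.startswith("#lem:"):
            line = line[len("#lem:"):]
        for entry in line.split(";"):
            parts = entry.split()
            if not parts:
                continue
            lemma, tags = parts[0], parts[1:]
            for tag in tags:
                ls_type = None
                if tag.startswith("##"):
                    # Assumed defining form: ##TYPE/INDEX[/MEMBER]
                    ls_type, _, rest = tag[2:].partition("/")
                    index, _, member = rest.partition("/")
                elif tag.startswith("#"):
                    # Assumed populating form: #INDEX[/MEMBER]
                    index, _, member = tag[1:].partition("/")
                else:
                    continue  # not a linkset tag
                linkset = linksets.setdefault(index, {"type": None, "members": []})
                if ls_type is not None:
                    linkset["type"] = ls_type
                linkset["members"].append((member or None, lemma))
    return linksets


date_sets = collect_linksets([
    "#lem: mu[year] ##date/doc/year; ŠulgirRN #doc/year; lugal[king] #doc/year",
    "#lem: iti[month]; UbiguMN #doc/month; ud[day]; n #doc/day",
])
```

Lemmata that carry no tag, such as iti[month] in the example, are simply left out of the linkset in this sketch.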
Note that many dates can be parsed successfully by machine, but this mechanism allows manual tagging of dates that aren't handled by the machine.
23 Jul 2014
Steve Tinney, 'L2: Signature/Lemmatization Syntax', Oracc: The Open Richly Annotated Cuneiform Corpus, Oracc, 2014 [http://oracc.museum.upenn.edu/doc/help/lemmatising/syntax/]