
GDL: Grapheme Description Language

(http://oracc.org/ns/gdl/1.0)

Steve Tinney
Version of 2017-08-10

Introduction

GDL is a Grapheme Description Language designed for embedding in higher-order document types such as text editions and signlists. This document gives a formal definition as an RNC schema, interwoven with the ATF conventions for representing each element in the schema.

Grapheme

The term "grapheme" as used in this document refers to a string of letters, numbers, modifiers and operators used to specify a Sumero-Akkadian cuneiform sign by name or to render one of the values of such a sign. While sign names are often glyph-descriptive (e.g., KA×A meaning sign A written inside sign KA), this document does not provide a glyph description language. Rather, we define a Grapheme Description Language.

GDL & ATF

GDL is not intended to be generated manually; rather, it is the XML result of processing ASCII Transliteration Format (ATF) with the ATF processor. This document includes implementor notes on ATF interwoven with the technical documentation. Unless you are an implementor, or are pathologically curious (or both), you don't need to read this document! Read the tutorial instead. If you are a developer who is new to GDL and ATF it is recommended that you first read the tutorial, and then this document.

An XSL script to convert from GDL back to ATF can be found in the resources section below. The script does not convert the character set from Unicode to ASCII.

charset.rnc

In this section we provide a model for constraining the lexical representation of graphemic atoms. This aspect of grapheme description does not constrain the validity of values within a given signiary; that is handled elsewhere.

Atoms are tightly constrained sequences of characters separated into distinct lowercase and uppercase sets to permit finer-grained constraints.

Characters

GDL does not support any of the common ASCII approximations of the various non-ASCII characters used in cuneiform transliteration; GDL uses only the specific Unicode codepoints listed below for the representation of these characters. Details and images of the Unicode characters can be found at http://www.unicode.org/charts.

These days most ATF is generated in Unicode. However, it can also be restricted to ASCII characters, for which we define simple equivalents for the characters used in cuneiform transliteration which are not in the ASCII character set. The following table gives the ASCII sequences and the Unicode codepoints to which the ATF processor translates them. Certain conventions are not used in CDLI-strict notation; this is indicated in another column.

ATF Character Conventions

ASCII-ATF  Unicode-ATF    Codepoint        CDLI-Strict? [1]
sz         š              U+0161           yes
SZ         Š              U+0160           yes
s,         ṣ              U+1E63           yes
S,         Ṣ              U+1E62           yes
t,         ṭ              U+1E6D           yes
T,         Ṭ              U+1E6C           yes
s'         ś              U+015B           yes
S'         Ś              U+015A           yes
'          ʾ              U+02BE           yes
0-9        subscript ₀-₉  U+2080-U+2089    yes
x [2]      subscript ₓ    U+208A           yes
X [2]      subscript ₓ    U+208A           yes
h,         ḫ              U+1E2B           no
H,         Ḫ              U+1E2A           no
j          ŋ              U+014B           no
J          Ŋ              U+014A           no

[1] Characters not in the strict repertoire are not permitted in CDLI archival ATF.

[2] Lowercase x is permitted only in sign values; in sign names, only uppercase X is permitted as a notation for subscript-x. In sign names, lowercase x is an operator.
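
The ASCII-to-Unicode mapping in the table can be sketched in a few lines of Python. This is an illustration only, not the ATF processor's implementation: the function name ascii_to_unicode is hypothetical, it handles a single sign value at a time, it treats only trailing digits as an index to be subscripted, and it leaves x/X alone because their treatment depends on whether the grapheme is a value or a name.

```python
import re

# ASCII-ATF sequences and their Unicode equivalents, taken from the table.
# Order matters: "sz", "s," and "s'" must be replaced before the bare "'".
DIGRAPHS = [
    ("sz", "\u0161"), ("SZ", "\u0160"),   # s with caron
    ("s,", "\u1e63"), ("S,", "\u1e62"),   # s with dot below
    ("t,", "\u1e6d"), ("T,", "\u1e6c"),   # t with dot below
    ("s'", "\u015b"), ("S'", "\u015a"),   # s with acute
    ("h,", "\u1e2b"), ("H,", "\u1e2a"),   # h with breve below
    ("j", "\u014b"), ("J", "\u014a"),     # eng
    ("'", "\u02be"),                      # right half ring
]
SUBSCRIPTS = str.maketrans("0123456789", "\u2080\u2081\u2082\u2083\u2084"
                                         "\u2085\u2086\u2087\u2088\u2089")


def ascii_to_unicode(grapheme: str) -> str:
    """Convert one ASCII-ATF sign value to Unicode-ATF (simplified sketch)."""
    # Split into a base ending in a non-digit and an optional trailing index.
    m = re.fullmatch(r"(.*?[^0-9])([0-9]*)", grapheme)
    if m is None:          # empty string or digits only: nothing to convert
        return grapheme
    base, index = m.groups()
    for ascii_seq, unicode_char in DIGRAPHS:
        base = base.replace(ascii_seq, unicode_char)
    return base + index.translate(SUBSCRIPTS)
```

For instance, ascii_to_unicode("sza3") gives š followed by a and subscript three, while a bare count such as "12" is returned unchanged.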

Characters are combined into atom specifications by grouping them in classes which are used to place lexical constraints on the atoms.

lV = Permitted lowercase vowels
a e i u
uV = Permitted uppercase vowels
A E I U
lC = Permitted lowercase consonants
b d g h k l m n p q r s u w y z
U+014B LATIN SMALL LETTER ENG
U+1E2B LATIN SMALL LETTER H WITH BREVE BELOW
U+015B LATIN SMALL LETTER S WITH ACUTE
U+0161 LATIN SMALL LETTER S WITH CARON
U+1E63 LATIN SMALL LETTER S WITH DOT BELOW
U+1E6D LATIN SMALL LETTER T WITH DOT BELOW
U+02BE MODIFIER LETTER RIGHT HALF RING
uC = Permitted uppercase consonants
B D G H K L M N P Q R S U W Y Z
U+014A LATIN CAPITAL LETTER ENG
U+1E2A LATIN CAPITAL LETTER H WITH BREVE BELOW
U+015A LATIN CAPITAL LETTER S WITH ACUTE
U+0160 LATIN CAPITAL LETTER S WITH CARON
U+1E62 LATIN CAPITAL LETTER S WITH DOT BELOW
U+1E6C LATIN CAPITAL LETTER T WITH DOT BELOW
U+02BE MODIFIER LETTER RIGHT HALF RING
Si = Subscript initial characters
U+2081-U+2089 (Unicode subscript 1 through 9)
Sc = Subscript continuation characters
U+2080-U+2089 (Unicode subscript 0 through 9)
Sx = Subscript x character
U+208A SUBSCRIPT PLUS SIGN

This yields the following base character sets and definitions (dollar-variables are expanded by a preprocessor to generate the actual RNC schema):

$lV = [aeiu]
$lC = [\x{2BE}bdegh\x{1E2B}i\x{14B}klmnpqrs\x{161}\x{1E63}\x{15B}t\x{1E6D}uwyz]
$uV = [AEIU]
$uC = [\x{2BE}BDEGH\x{1E2A}I\x{14A}KLMNPQRS\x{160}\x{1E62}\x{15A}T\x{1E6C}UWYZ]
$Si = [\x{2081}\x{2082}\x{2083}\x{2084}\x{2085}\x{2086}\x{2087}\x{2088}\x{2089}]
$Sc = [\x{2080}\x{2081}\x{2082}\x{2083}\x{2084}\x{2085}\x{2086}\x{2087}\x{2088}\x{2089}]

$subscript = (${Si}${Sc}?|\x{208A})?

lV = xsd:string {
   pattern = "${lV}${subscript}"
}

lVCv = xsd:string {
  pattern = "(${lV}${lC})+${lV}?${subscript}"
}

lCVc = xsd:string {
  pattern = "(${lC}${lV})+${lC}?${subscript}"
}

lVCCvc = xsd:string {
  pattern = "(${lV}${lC}{1,2})+(${lV}${lC}?)${subscript}"
}

lCVCCvc = xsd:string {
  pattern = "(suen|kuara|${lC}(${lV}${lC}{1,2})+(${lV}${lC}?))${subscript}"
}

uV = xsd:string {
   pattern = "${uV}${subscript}"
}

uVCv = xsd:string {
  pattern = "(${uV}${uC})+${uV}?${subscript}"
}

uCVc = xsd:string {
  pattern = "(${uC}${uV})+${uC}?${subscript}"
}

uVCCvc = xsd:string {
  pattern = "(${uV}${uC}{1,2})+(${uV}${uC}?)${subscript}"
}

uCVCCvc = xsd:string {
  pattern = "${uC}(${uV}${uC}{1,2})+(${uV}${uC}?)${subscript}"
}
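
As a rough check of how these patterns behave, the lowercase lCVc definition can be transcribed into Python. This is a sketch under two assumptions: the \x{...} escapes are rewritten as Python \u escapes, and re.fullmatch stands in for the implicit anchoring of an XSD pattern facet. The names L_CVC and is_lcvc are hypothetical.

```python
import re

# Character classes transcribed from the $lV, $lC, $Si and $Sc definitions.
LV = "[aeiu]"
LC = ("[\u02bebdegh\u1e2bi\u014bklmnpqrs"
      "\u0161\u1e63\u015bt\u1e6duwyz]")
SI = "[\u2081-\u2089]"                     # subscript 1-9 (initial)
SC = "[\u2080-\u2089]"                     # subscript 0-9 (continuation)
SUBSCRIPT = f"(?:{SI}{SC}?|\u208a)?"       # optional index or subscript x

# Equivalent of: pattern = "(${lC}${lV})+${lC}?${subscript}"
L_CVC = re.compile(f"(?:{LC}{LV})+{LC}?{SUBSCRIPT}")


def is_lcvc(value: str) -> bool:
    """True if value is lexically valid under the lCVc pattern."""
    return L_CVC.fullmatch(value) is not None
```

Values such as du₃ or gu₁₀ pass this check, while du₀ fails because a subscript may not begin with zero, and DU fails because the lowercase and uppercase sets are kept distinct.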

grapheme.rnc

namespace g = "http://oracc.org/ns/gdl/1.0"

grapheme = v | q | s | n | c | gloss | g | nongrapheme | punct | gsurro
form     = attribute form { text }
sb       = element g:b { s.model }
vb       = element g:b { v.model }
punct    = element g:p { p.model }
lang     = attribute xml:lang { xsd:language }
gsurro   = 
  element g:surro {
    delim? , (s|c|n|punct) , groupgroup
  }

# Values
#v.model  = "x" | lV | lVCv | lCVc | lVCCvc | lCVCCvc
v.model = text
v        = element g:v { form? , g.meta , lang? , (v.model | (vb , mods+)) }
#dingir   = element g:v { g.meta , lang? , ("d") }
#mister   = element g:v { g.meta , lang? , ("m") }

# Names
#s.model  =  "N" | "X" | uV | uVCv | uCVc | uVCCvc | uCVCCvc | lst | num
s.model  = text

lst    = xsd:string {
  pattern="(..?SL|ABZ|BAU|HZL|KWU|LAK|M|MEA|MZL|REC|RSP|ZATU)\d+[a-z]*"
}

#[ABCD] is a stop-gap until lateuruk numbers are fixed
num      = xsd:string { pattern = "N|N\d+[ABCD]?" }

s        = element g:s { form? , g.meta , (s.model | (sb , mods+)) }

# Qualified graphemes
q        = element g:q { form? , g.meta , (v|s|c) , (s|c|n) }

# Numbers
n.model  = r , (v|s|c|q)?

r        = element g:r {
             xsd:string {
	       pattern = "[nN]\+[0-9]+|[nN]|[0-9]+|[n1-9]+/[n1-9]" } }

n        = element g:n { form? , g.meta , sexified?, n.model , mods* }
sexified = attribute sexified { text }

# Modifiers
mods     = modifier | allograph | formvar

modifier = element g:m { xsd:string { pattern = "[a-z]|[0-9]{1,3}" } }

allograph= element g:a { xsd:string { pattern = "[a-wyz0-9]+" } }

formvar = element g:f { xsd:string { pattern = "[a-z0-9]+" } }

# Compounds
c.model  = (compound , (o.join , compound)+) | unary | binary | ternary | (g , mods+)

c        = element g:c { form? , g.meta , c.model , mods* }

g        = element g:g { g.meta , c.model , mods* }

compound = single | unary | binary

single   = n | s | c | (g,mods*) | q

unary    = o.unary , single

binary   = single , o.binary , single

ternary   = single , o.binary , single , o.binary , single

o.join   = element g:o { attribute g:type { "beside" | "joining" | "reordered" } }

o.unary  = element g:o { attribute g:type { "repeated" } , xsd:integer }

o.binary =
  element g:o {
    attribute g:type {
      "containing" | "above" | "crossing" | "opposing"
    }
  }

# Punctuation
p.model =
    attribute g:type { "*"|":"|":'"|':"'|":."|"::"|"|"|"/"|":r:" } , 
    g.meta , 
    (v|q|s|n|c)?

Preamble

As a design principle, all of the most common GDL elements have single character names. In order to minimize possible confusion with similar names in other vocabularies, it is recommended that GDL elements always be namespace-qualified. To reinforce this point, the definition of the GDL schema does not use a default namespace.

The examples in this document all assume that the prefix g is bound to the namespace of the GDL schema.

namespace g = "http://oracc.org/ns/gdl/1.0"

grapheme = v | q | s | n | c | gloss | g | nongrapheme | punct | gsurro
form     = attribute form { text }
sb       = element g:b { s.model }
vb       = element g:b { v.model }
punct    = element g:p { p.model }
lang     = attribute xml:lang { xsd:language }
gsurro   = 
  element g:surro {
    delim? , (s|c|n|punct) , groupgroup
  }

Signs

We call the core alphanumeric portion of a sign an atom. This is a single grapheme component which for the purposes of this grapheme description instance is not susceptible to further sub-description.

All sign values are by definition atoms.

Sign names consist of one or more atoms. In the grapheme A there is a single atom; in the grapheme KA×A there are two atoms, KA and A. In another context, that same grapheme might be named as NAG; this version of the name contains a single atom, despite the fact that a sign list might describe the sign as KA×A. In other words, atomicity in grapheme names is determined by the naming scheme rather than the underlying construction of the glyph.

Two simple elements are defined for atoms: g:v, for sign values, and g:s for sign names.

Values

# Values
#v.model  = "x" | lV | lVCv | lCVc | lVCCvc | lCVCCvc
v.model = text
v        = element g:v { form? , g.meta , lang? , (v.model | (vb , mods+)) }
#dingir   = element g:v { g.meta , lang? , ("d") }
#mister   = element g:v { g.meta , lang? , ("m") }

Names

# Names
#s.model  =  "N" | "X" | uV | uVCv | uCVc | uVCCvc | uCVCCvc | lst | num
s.model  = text

lst    = xsd:string {
  pattern="(..?SL|ABZ|BAU|HZL|KWU|LAK|M|MEA|MZL|REC|RSP|ZATU)\d+[a-z]*"
}

#[ABCD] is a stop-gap until lateuruk numbers are fixed
num      = xsd:string { pattern = "N|N\d+[ABCD]?" }

s        = element g:s { form? , g.meta , (s.model | (sb , mods+)) }

Two special classes of sign name are signlists and numerical sign names. Numerical sign names match the pattern N<DIGITS>. Signlist names consist of an uppercase alphabetic prefix and an ASCII digit suffix; the prefix is the name of the sign list and the suffix is the number of the sign in that list. Prefixes fall into one of two groups. Generic signlist prefixes consist of any one or two uppercase letters followed by SL; hence, CDSL, PSL, PCSL are all valid signlist prefixes. The second group is the built-in set of historic sign lists.

Built-in Sign List Names

Name  Bibliography
ABZ   R. Borger, Assyrisch-babylonische Zeichenliste (AOAT 33; Neukirchen-Vluyn 1978)
BAU   E. Burrows, Archaic Texts (UET 2; London 1935)
HZL   C. Ruster and E. Neu, Hethitisches Zeichenlexikon (Harrassowitz Verlag 1989)
KWU   N. Schneider, Die Keilschriftzeichen der Wirtschaftsurkunden von Ur III (Rome 1935)
LAK   A. Deimel, Liste der archaischen Keilschriftzeichen (WVDOG 40; Berlin 1922)
MEA   R. Labat, Manuel d'épigraphie akkadienne (6th ed. Paris 1988)
MZL   R. Borger, Mesopotamisches Zeichenlexikon (AOAT 305; Ugarit-Verlag 2003)
REC   F. Thureau-Dangin, Recherches sur l'origine de l'écriture cunéiforme (Paris 1898)
RSP   Y. Rosengarten, Répertoire commenté des signes présargoniques sumériens de Lagash (Paris 1967)
ZATU  M. Green and H. J. Nissen, Zeichenliste der Archaischen Texte aus Uruk (ATU 2; Berlin 1987)
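
The lst and num patterns from the schema can be tried out directly in Python. The sketch below copies the two pattern strings verbatim and uses re.fullmatch to imitate the whole-string matching of an XSD pattern; the helper names are hypothetical. Note that, as written, the generic prefix ..?SL accepts any one or two characters before SL, not only uppercase letters.

```python
import re

# Patterns copied from the lst and num definitions in the schema.
LST = re.compile(
    r"(..?SL|ABZ|BAU|HZL|KWU|LAK|M|MEA|MZL|REC|RSP|ZATU)\d+[a-z]*")
NUM = re.compile(r"N|N\d+[ABCD]?")


def is_signlist_name(name: str) -> bool:
    """True if name is a signlist reference such as MZL592 or PCSL12."""
    return LST.fullmatch(name) is not None


def is_numerical_name(name: str) -> bool:
    """True if name is a numerical sign name such as N14."""
    return NUM.fullmatch(name) is not None
```

With these helpers MZL592, LAK350a and PCSL12 are accepted as signlist names, XYZ12 is rejected, and N, N14 and N30A are accepted as numerical sign names.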

Qualified

Qualified graphemes consist of a sign value followed by a sign name in parentheses, e.g., pu(BU). (In normalized text the superficially similar construct is used to indicate the logograms used for the normalized form, e.g., %akk/n bēlu(EN).)

# Qualified graphemes
q        = element g:q { form? , g.meta , (v|s|c) , (s|c|n) }
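
A minimal sketch of how the ATF surface form of a simple qualified grapheme could be split into its two parts is given here. It is an illustration only: the function name split_qualified is hypothetical, and compound sign names in pipes or nested parentheses are deliberately rejected rather than parsed.

```python
import re

# value(NAME), where neither part contains parentheses or pipes.
QUALIFIED = re.compile(r"([^()|]+)\(([^()|]+)\)")


def split_qualified(atf: str) -> tuple[str, str]:
    """Split a simple qualified grapheme into (value, sign name)."""
    m = QUALIFIED.fullmatch(atf)
    if m is None:
        raise ValueError(f"not a simple qualified grapheme: {atf!r}")
    return m.group(1), m.group(2)
```

For example, split_qualified("pu(BU)") returns the value pu and the sign name BU, which correspond to the first and second children of g:q.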

Number

Numerical graphemes have a special form. Each numerical grapheme consists of at least two parts: the repetition count and the sign value, sign name or compound sign. A special case is made for numerical graphemes by allowing them to have modifiers even if the graphemic base is a sign value.

The repetition count must have one of the following forms:

digits
This is the normal case.
n
This is a special case for circumstances where the repetition is completely uncertain.
n+digits
This is a special case for circumstances where the repetition is partly uncertain.

While it would in principle be possible to constrain the value space of GRAPHEME in the schema we do not do so; instead, as with non-numerical graphemes, we constrain the lexical space and require the values of numerical graphemes to be validated elsewhere. This allows the schema to be open-ended with respect to the identification of new numerical systems.

# Numbers
n.model  = r , (v|s|c|q)?

r        = element g:r {
             xsd:string {
	       pattern = "[nN]\+[0-9]+|[nN]|[0-9]+|[n1-9]+/[n1-9]" } }

n        = element g:n { form? , g.meta , sexified?, n.model , mods* }
sexified = attribute sexified { text }
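
The lexical constraint on the repetition count (g:r) can be checked with the same pattern in Python. This is a sketch, with re.fullmatch standing in for XSD whole-string matching and a hypothetical helper name; besides the three forms listed above, the pattern as written also admits uppercase N and simple fractions.

```python
import re

# Pattern copied from the definition of r in the schema.
REPETITION = re.compile(r"[nN]\+[0-9]+|[nN]|[0-9]+|[n1-9]+/[n1-9]")


def is_repetition_count(text: str) -> bool:
    """True if text is a lexically valid repetition count."""
    return REPETITION.fullmatch(text) is not None
```

Counts such as 3, n, n+5 and 1/2 are accepted; an empty string, 1/0 and x are not.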

Modifier

Sign names and numerical sign value atoms may be described by reference to modifications of the base sign, as summarized in the table below. The lexical representation of modifiers is restricted to either a single lower case letter or a sequence of one, two or three ASCII digits. The semantics of these modifiers is indicated in the table, but is irrelevant from the point of view of the schema. A single GDL element, g:m, contains the modifier.

Modifiers may not follow a compound sign's terminating pipe character; if an entire compound is to be modified, the compound's content must be grouped and the modifiers suffixed between the closing parenthesis and the closing pipe.

# Modifiers
mods     = modifier | allograph | formvar

modifier = element g:m { xsd:string { pattern = "[a-z]|[0-9]{1,3}" } }

Allograph

It is sometimes desirable to distinguish between grapheme instances which have otherwise been considered the same sign, or which actually are the same sign, for semantic or glyph-analytic reasons. This is expressed in GDL by the g:a element whose content is a sequence of one or more lowercase letters, excluding x, and ASCII digits. Sign list creators are free to assign whatever meanings they like to any combinations of these characters; in PCSL, for example, sequences such as a1a versus a1b and a2a versus a2b are used to implement multi-level distinctions between variants of a sign. An allograph may follow the closing parenthesis of a group within a compound sign, but may not follow the final vertical bar of the compound.

The reason for the exclusion of x in the allowable set of lowercase letters in an allograph is that allowing it in ASCII transliterations would introduce an ambiguity at the ATF level between x in allographs and x as a compound operator.

allograph= element g:a { xsd:string { pattern = "[a-wyz0-9]+" } }

Formvars

formvar = element g:f { xsd:string { pattern = "[a-z0-9]+" } }

Form variants is the GDL name for minor differences in the construction of signs which may be of interest in analysis of a corpus for handwritings, but which are not important enough to be displayed or included in the version of the writing used for linguistic analysis.
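
The three lexical patterns for g:m, g:a and g:f can be compared side by side with a short Python sketch. The pattern strings are copied from the schema; the dictionary and function names are hypothetical, and re.fullmatch is assumed to be an adequate stand-in for XSD pattern matching.

```python
import re

# Content patterns copied from the modifier, allograph and formvar definitions.
MOD_PATTERNS = {
    "g:m": re.compile(r"[a-z]|[0-9]{1,3}"),   # modifier
    "g:a": re.compile(r"[a-wyz0-9]+"),        # allograph (no x)
    "g:f": re.compile(r"[a-z0-9]+"),          # form variant
}


def is_valid_mod(element: str, content: str) -> bool:
    """True if content is lexically valid for the given mods element."""
    return MOD_PATTERNS[element].fullmatch(content) is not None
```

This makes the differences easy to see: 180 is a valid modifier but 1234 is not, a1b is a valid allograph but anything containing x is not, and x is acceptable only as form variant content or as a single-letter modifier.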

Compound

Compound graphemes are combinations of sign names and operators; the definition is recursive meaning that compound grapheme atoms may be grouped and the group treated as a compound in its own right. Atoms and compounds may both have associated modifier and/or allograph qualifications. We call a single combination of a sign or compound sign and its qualifiers a constituent.

The possible operator types are:

beside
Constituents are written sequentially beside each other.
joining
Constituents are written such that they share at least one common wedge.
containing
The constituent preceding the operator contains the constituent following the operator; the containment may be partial.
above
The constituent preceding the operator is written in the upper part of the line, with the following constituent written beneath it in the lower part of the line.
crossing
The constituents cross one another similarly to the diagonals of an X.
opposing
The constituent preceding the operator is opposite the following constituent, which is turned upside down.
repeated
The following constituent is repeated N times.

The beside and joining operators are in fact joiners which mark boundaries; any number of joiner/compound pairs may be siblings.

The containing, above, crossing and opposing operators all have binary scope: a compound which contains one of these operators is constrained to having exactly two compound children, one before and one after the operator.

The repeated operator is a unary prefix with the content of the operator giving the repetition count. Compounds containing this operator may have only one compound child.

The repeated operator is a unary postfix with the content of the operator giving the number of degrees the sign is rotated in a clockwise direction. Compounds containing this operator may have only one compound child.

# Compounds
c.model  = (compound , (o.join , compound)+) | unary | binary | ternary | (g , mods+)

c        = element g:c { form? , g.meta , c.model , mods* }

g        = element g:g { g.meta , c.model , mods* }

compound = single | unary | binary

single   = n | s | c | (g,mods*) | q

unary    = o.unary , single

binary   = single , o.binary , single

ternary   = single , o.binary , single , o.binary , single

o.join   = element g:o { attribute g:type { "beside" | "joining" | "reordered" } }

o.unary  = element g:o { attribute g:type { "repeated" } , xsd:integer }

o.binary =
  element g:o {
    attribute g:type {
      "containing" | "above" | "crossing" | "opposing"
    }
  }

The difference between a simple sign and a compound sign is that a compound sign is a sequence of sign names which contains at least one operator, i.e., a character which represents a relationship between multiple graphemes. In ATF the set of characters used for operators is: × % @ & . : +.

In ATF compound graphemes are enclosed at the outer level in vertical bars ("pipes", |...|):

|KA×A|

Signs are frequently modified or operated on as a group; parentheses are used to group multi-part constituents:

|GA₂×(ME.EN)|      |(GI&GI)×ŠE₃|

Note that modifiers and allographs must not be placed after the closing pipe; instead, they must be put inside the pipe adding grouping characters if necessary:

|GA₂~a×EN|       |GA₂×EN~a|          |(GA₂×EN)~a|  

The examples above all mean different things. The first, |GA₂~a×EN|, means: "the a-allograph of the sign GA₂ containing sign EN". The second, |GA₂×EN~a|, means: "GA₂ containing the a-allograph of sign EN". The third, |(GA₂×EN)~a|, means: "the a-allograph of the group consisting of sign GA₂ containing sign EN". In example three the bad form *|GA₂×EN|~a would result in a parse error.

Each of the compound operations has its own ATF notation as summarized in the table below:

Summary of Compound Grapheme Operators in ATF/GDL

GDL               ATF  Example        Sign
beside            .    |DU.DU|        DUDU
joining           +    |LAGAB+LAGAB|  NIGIN2
containing        ×    |GA₂×AN|       GA TWO TIMES AN
containing/group  ×    |GA₂×(ME.EN)|  GA TWO TIMES ME PLUS EN
above             &    |DU&DU|        DU OVER DU
crossing          %    |GI%GI|        GI CROSSING GI
opposing          @    |LU₂@LU₂|      LU TWO OPPOSING LU TWO
repeated               |3×AN|         THREE TIMES AN
repeated               |4xLU2|        FOUR TIMES LU TWO
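
The correspondence between ATF operator characters and g:type values in the table can be sketched as a small Python tokenizer. This is a deliberately limited illustration and not the ATF processor's parser: the names are hypothetical, it accepts only flat compounds without parenthesised groups, it does not handle modifiers, allographs or the repeated operator, and the : operator is left out because the table does not give its GDL name.

```python
import re

# ATF operator characters and the g:type values listed in the table.
ATF_TO_GDL = {
    ".": "beside",
    "+": "joining",
    "\u00d7": "containing",   # MULTIPLICATION SIGN
    "&": "above",
    "%": "crossing",
    "@": "opposing",
}
OPERATOR_SPLIT = re.compile("([.+\u00d7&%@])")


def split_flat_compound(atf: str) -> list[tuple[str, str]]:
    """Tokenize a flat compound such as |KA×A| into ('s', name) and
    ('o', type) pairs, in document order."""
    if len(atf) < 3 or not (atf.startswith("|") and atf.endswith("|")):
        raise ValueError("compound must be enclosed in pipes")
    body = atf[1:-1]
    if "(" in body or ")" in body:
        raise ValueError("grouped compounds are not handled by this sketch")
    tokens = []
    for part in OPERATOR_SPLIT.split(body):
        if not part:
            continue
        if part in ATF_TO_GDL:
            tokens.append(("o", ATF_TO_GDL[part]))
        else:
            tokens.append(("s", part))
    return tokens
```

Applied to |KA×A|, the function yields the sign KA, a containing operator and the sign A, which mirrors the g:s, g:o, g:s sequence inside a g:c element.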

Punctuation

Several types of cuneiform punctuation are supported in ATF and all of them must be preceded and followed by a space (in the case of * and / the punctuation may be immediately followed by a sign name in parentheses and then the following space). The recognized punctuation codes are:

* = Bullet
The "1" used at the start of each line in lexical texts, omen compendia, etc.
*(GRAPHEME)
Generic punctuation; most often used where scribes use signs other than a "1" at the start of the line in lexical texts, but may be used to transliterate arbitrary or unusual kinds of punctuation that are not otherwise covered below.
: = cuneiform vertical colon.

The vertical "colon" sign often found in commentaries.

N.B.: If the single colon occurs within a word it must be transliterated with the grapheme name form P₂

:' (colon+right-quote) =
Borger MZL 592 variant b; a variant on the vertical two-wedge colon
:" (colon+double-quote) = cuneiform diagonal colon
The diagonal "colon" sign often found in commentaries. Note that the three different double-wedge colon signs are mnemonically two-dots, two-dots-prime and two-dots-double-prime
:. = cuneiform triple wedge colon
The triple-wedge "colon" sign sometimes found in commentaries.
:: = ??
(A colon convention defined in the SAA style manual, form unspecified.)
/ = word divider
Word divider; if unqualified, this is the single vertical wedge word-divider as used, e.g., in Old Assyrian texts. May be qualified as, e.g., /(P2).

Punctuation Sign Names

The punctuation signs may also be transliterated using the following names: P1 (cuneiform word divider); P2 (cuneiform colon); P3 (cuneiform diagonal colon); P4 (cuneiform triple wedge colon); MZL592~b (as :').

# Punctuation
p.model =
    attribute g:type { "*"|":"|":'"|':"'|":."|"::"|"|"|"/"|":r:" } , 
    g.meta , 
    (v|q|s|n|c)?

graphmeta.rnc

namespace g = "http://oracc.org/ns/gdl/1.0"
g.meta = 
  break? , status.flags? , status.spans? , 
  paleography.attr? , linguistic.attr? , proximity.attr? ,
  opener? , closer? , hsqb_o?, hsqb_c? , emhyph? ,
  varnum? , sign_attr? , utf8? , delim? ,
  attribute xml:id { xsd:ID }? ,
  breakStart? , breakEnd? ,
  damageStart? , damageEnd? ,
  surroStart? , surroEnd? ,
  statusStart? , statusEnd? ,
  accented?

accented = attribute g:accented { text }
breakStart = attribute g:breakStart { "1" }
breakEnd = attribute g:breakEnd { xsd:IDREF }
damageStart = attribute g:damageStart { "1" }
damageEnd = attribute g:damageEnd { xsd:IDREF }
surroStart = attribute g:surroStart { "1" }
surroEnd = attribute g:surroEnd { xsd:IDREF }
statusStart = attribute g:statusStart { "1" }
statusEnd = attribute g:statusEnd { xsd:IDREF }

break = attribute g:break  { "damaged" | "missing" }
opener = attribute g:o     { text }
closer = attribute g:c     { text }
hsqb_o = attribute g:ho    { "1" }
hsqb_c = attribute g:hc    { "1" }
emhyph = attribute g:em    { "1" }
sign_attr = attribute g:sign  { text }
utf8   = attribute g:utf8  { text }
delim  = attribute g:delim { text }
varnum = (
  attribute g:varo { text }? , 
  attribute g:vari { text }? ,  
  attribute g:varc { text }?
)

status.flags =
  attribute g:collated { xsd:boolean } ? ,
  attribute g:queried  { xsd:boolean } ? ,
  attribute g:remarked { xsd:boolean } ?

gloss = det | glo
pos = attribute g:pos { "pre" | "post" | "free" }
#det = element g:d { pos , dtyp , delim? , emhyph? , notemark? , surroStart? , g.meta ,
#                    (dingir | mister | word.content*)}
det = element g:d { pos , dtyp , delim? , emhyph? , surroStart? , g.meta ,
                    (word.content*)}
dtyp= attribute g:role { "phonetic" | "semantic" }
glo = element g:gloss { attribute g:type { "lang" | "text" } , delim? , pos , words }

status.spans =
  attribute g:status {
    "ok" | "erased" | "excised" | "implied" | "maybe" | "supplied" | "some"
  }

paleography.attr =
  attribute g:script      { xsd:NCName }

linguistic.attr =
  attribute xml:lang      { xsd:language } ? ,
#  attribute g:rws         { "emegir" | "emesal" | "udgalnun" }? ,
  (attribute g:role       { "sign" | "ideo" | "num" | "syll" }
  | (attribute g:role     { "logo" } ,
     attribute g:logolang { xsd:language }))

proximity.attr = 
  attribute g:prox { xsd:integer }

nongrapheme = 
  element g:x {
    ( attribute g:type { "disambig" | "empty" | "linebreak" | "newline" | "user" | "dollar" | "comment" }
    | ( attribute g:type { "ellipsis" | "word-absent" | "word-broken" | "word-linecont" } 
        , status.spans , opener? , closer? , break? )),
    delim? , text? , varnum? ,
    attribute xml:id { xsd:ID }? ,
    breakStart? , breakEnd? ,
    damageStart? , damageEnd? , emhyph? ,
    surroStart? , surroEnd? ,
    statusStart? , statusEnd? ,
    status.flags?
    }

Preamble

This module defines attributes which are essentially graphemic metadata supplied by the editor of the text. They fall into several groups: properties of the grapheme imputed to derive from the scribe; properties assigned by the editor; physical preservation properties; paleographic properties; and linguistic properties. We describe these principally in the form of the tutorial aimed at end-users and allow the sequence of definitions in the schema to follow the tutorial.

namespace g = "http://oracc.org/ns/gdl/1.0"
g.meta = 
  break? , status.flags? , status.spans? , 
  paleography.attr? , linguistic.attr? , proximity.attr? ,
  opener? , closer? , hsqb_o?, hsqb_c? , emhyph? ,
  varnum? , sign_attr? , utf8? , delim? ,
  attribute xml:id { xsd:ID }? ,
  breakStart? , breakEnd? ,
  damageStart? , damageEnd? ,
  surroStart? , surroEnd? ,
  statusStart? , statusEnd? ,
  accented?

accented = attribute g:accented { text }
breakStart = attribute g:breakStart { "1" }
breakEnd = attribute g:breakEnd { xsd:IDREF }
damageStart = attribute g:damageStart { "1" }
damageEnd = attribute g:damageEnd { xsd:IDREF }
surroStart = attribute g:surroStart { "1" }
surroEnd = attribute g:surroEnd { xsd:IDREF }
statusStart = attribute g:statusStart { "1" }
statusEnd = attribute g:statusEnd { xsd:IDREF }

Breakage

break = attribute g:break  { "damaged" | "missing" }
opener = attribute g:o     { text }
closer = attribute g:c     { text }
hsqb_o = attribute g:ho    { "1" }
hsqb_c = attribute g:hc    { "1" }
emhyph = attribute g:em    { "1" }
sign_attr = attribute g:sign  { text }
utf8   = attribute g:utf8  { text }
delim  = attribute g:delim { text }
varnum = (
  attribute g:varo { text }? , 
  attribute g:vari { text }? ,  
  attribute g:varc { text }?
)

Other flags

status.flags =
  attribute g:collated { xsd:boolean } ? ,
  attribute g:queried  { xsd:boolean } ? ,
  attribute g:remarked { xsd:boolean } ?

Glosses

ATF divides glosses into three types:

Determinatives
Determinatives include semantic and phonetic modifiers which are part of the current word; each may be a single grapheme or several hyphenated graphemes. Determinatives are enclosed in single braces {...}; semantic determinatives require no special marking, but phonetic glosses and determinatives should be indicated by adding a plus sign (+) immediately after the opening brace, e.g., AN{+e}. Multiple separate determinatives must each be enclosed in their own braces, but a single determinative may consist of more than one sign (as is the case with Early Dynastic pronunciation glosses).
Linguistic
Linguistic glosses are defined for the purposes of this specification as glosses which give an alternative to the word(s) in question. Such alternatives are typically either variants or translations. Linguistic glosses are enclosed in the double brackets {{...}}.
Document-oriented
Document-oriented glosses are used for scribal comments on the document including 10-marks, line-count summaries and asides such as he-pi₂ ("(text) broken"). Document-oriented glosses are enclosed in the compound brackets {(...)}.

Glosses must have a space or hyphen on one side or the other. They may have spaces on both sides. Glosses may not touch directly both the preceding and following graphemes; nor may they have hyphens at both ends.

{d}utu   larsa{ki}   {+u₃-mu₂}u₂-mu₁₁    AN{+e}

du₃-am₃{{mu-un-<(du₃)>}}

{(1(u))}    {(%a he-pi₂ eš-šu₂)}

The ATF processor sets type=text when the gloss is enclosed in {(...)} and type=lang when the gloss is enclosed in {{...}}.

The ATF processor sets pos=pre when the gloss has no space or boundary following it; pos=post when the gloss has no space or boundary preceding it; and pos=free when the gloss has spaces on both sides.
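
The bracket conventions for the three gloss types can be summarised in a short Python sketch. It assumes, following the list of gloss types above, that linguistic glosses are the ones written {{...}} and that determinatives map to g:d with a phonetic or semantic role; the function name classify_gloss is hypothetical and the function looks only at the delimiters, not at the content.

```python
def classify_gloss(token: str) -> tuple[str, str]:
    """Return (element, role-or-type) for a gloss token, judged by its
    opening and closing delimiters only."""
    # Check the two-character openers before the plain single brace.
    if token.startswith("{{") and token.endswith("}}"):
        return ("g:gloss", "lang")       # linguistic gloss
    if token.startswith("{(") and token.endswith(")}"):
        return ("g:gloss", "text")       # document-oriented gloss
    if token.startswith("{+") and token.endswith("}"):
        return ("g:d", "phonetic")       # phonetic determinative/gloss
    if token.startswith("{") and token.endswith("}"):
        return ("g:d", "semantic")       # semantic determinative
    raise ValueError(f"not a gloss: {token!r}")
```

Under these assumptions {d} is a semantic determinative, {+e} a phonetic one, {{mu-un-<(du₃)>}} a linguistic gloss and {(1(u))} a document-oriented gloss.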

gloss = det | glo
pos = attribute g:pos { "pre" | "post" | "free" }
#det = element g:d { pos , dtyp , delim? , emhyph? , notemark? , surroStart? , g.meta ,
#                    (dingir | mister | word.content*)}
det = element g:d { pos , dtyp , delim? , emhyph? , surroStart? , g.meta ,
                    (word.content*)}
dtyp= attribute g:role { "phonetic" | "semantic" }
glo = element g:gloss { attribute g:type { "lang" | "text" } , delim? , pos , words }

Presence

status.spans =
  attribute g:status {
    "ok" | "erased" | "excised" | "implied" | "maybe" | "supplied" | "some"
  }

Programming note: Graphemic elements which can carry graphemic content (i.e., g:v, g:s, g:c, g:p, g:q, g:n, and g:x where the type is ellipsis) always have a g:status attribute. Because the attribute is always present, it can be used to locate the preceding or following grapheme that can carry bracketing, and so to determine when bracketing opens or closes. Graphemes which have no explicit presence-status have g:status="ok".
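
One way a renderer might use the always-present g:status attribute is to group consecutive graphemes into runs of equal status, opening bracketing at the start of a run and closing it at the end. The following Python sketch shows only the grouping step; the input representation (pairs of text and status) and the function name status_runs are assumptions made for illustration, and the choice of bracket characters per status is left out.

```python
from itertools import groupby


def status_runs(graphemes):
    """Group (text, status) pairs into runs of consecutive equal status.

    Returns a list of (status, [texts]) tuples in document order.
    """
    return [
        (status, [text for text, _ in group])
        for status, group in groupby(graphemes, key=lambda g: g[1])
    ]
```

Given a sequence whose statuses are ok, supplied, supplied, ok, the result contains three runs, so bracketing would open before the second grapheme and close after the third.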

Scripts

paleography.attr =
  attribute g:script      { xsd:NCName }

Languages

Defining the default language

At the start of the ATF file, amongst the other protocols, you need to define the language of your (ancient) text. (For instructions on how to define the language of your translation, see the page on translations.)

After the &-line but before the text begins, enter a single protocol line which begins #atf: lang, followed by a space and the relevant language or dialect code in lower-case. This example is for Neo-Babylonian:

#atf: lang nb

This line ensures that all transliterated and lemmatised words in the text will be treated as Neo-Babylonian--unless you explicitly mark otherwise, as described below.

Shifting to other languages

To shift to a different language within the text, write a percent sign followed immediately by the relevant language code. You will also need to explicitly signal the shift back to the default language. For instance, if you had not defined Emesal as the default alternative language in the first example you could write:

8. %e še-eb %s e₂-kur-ra ba-du₃-a-bi

As before, the text is assumed to switch back to the default language at the start of every new line.

Language codes

Here is a list of the most frequently used language and dialect codes. The full set, including peripheral dialects of Akkadian, is given on the Language Tags page of the developer documentation.

The Main Language and Dialect Codes
| Language or dialect | Protocol Code | Inline Code(s) | Notes |
|---|---|---|---|
| Akkadian | (none: must specify dialect too) | a or akk | |
| Early Akkadian | akk-x-earakk | eakk | For pre-Sargonic Akkadian. |
| Old Akkadian | akk-x-oldakk | oakk | |
| Ur III Akkadian | ua | ur3akk | |
| Old Assyrian | akk-x-oldass | oa | |
| Old Babylonian | akk-x-oldbab | ob | |
| Old Babylonian peripheral | akk-x-obperi | | |
| Middle Assyrian | akk-x-midass | ma | |
| Middle Babylonian | akk-x-midbab | mb | |
| Middle Babylonian peripheral | akk-x-mbperi | | |
| Neo-Assyrian | akk-x-neoass | na | |
| Neo-Babylonian | akk-x-neobab | nb | |
| Late Babylonian | akk-x-ltebab | nb | |
| Standard Babylonian | akk-x-stdbab | sb | |
| Conventional Akkadian | akk-x-conakk | ca | The artificial form of Akkadian used in lemmatisation Citation Forms. |
| normalised | (none: main text must be transliteration) | n | Used in lexical lists and restorations; try to avoid wherever possible. |
| transliterated (graphemic) Akkadian | (none: must specify dialect too) | g | Only for use when switching from normalised Akkadian. |
| Hittite | hit | h or hit | |
| Sumerian | sux or sux-x-emegir | s, sux, or eg | The abbreviation eg stands for Emegir (main-dialect Sumerian). |
| Emesal | sux-x-emesal | e, es | |
| Syllabic | sux-x-syllabic | sy | |
| Udgalnun | sux-x-udgalnun | u | |
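
A lookup from inline codes to protocol codes can be derived from this table. The sketch below is illustrative only: INLINE_TO_PROTOCOL and protocol_code are hypothetical names, rows with a missing or clashing inline code are left out, and the Sumerian inline codes are mapped to sux although the table also permits sux-x-emegir.

```python
# Inline code -> protocol code, taken from the rows of the table above
# where both codes are given. This is a partial, illustrative mapping.
INLINE_TO_PROTOCOL = {
    "eakk": "akk-x-earakk",
    "oakk": "akk-x-oldakk",
    "oa": "akk-x-oldass",
    "ob": "akk-x-oldbab",
    "ma": "akk-x-midass",
    "mb": "akk-x-midbab",
    "na": "akk-x-neoass",
    "nb": "akk-x-neobab",
    "sb": "akk-x-stdbab",
    "ca": "akk-x-conakk",
    "h": "hit",
    "hit": "hit",
    "s": "sux",
    "sux": "sux",
    "eg": "sux",
    "e": "sux-x-emesal",
    "es": "sux-x-emesal",
    "sy": "sux-x-syllabic",
    "u": "sux-x-udgalnun",
}

def protocol_code(inline_code):
    """Map an inline code (with or without its leading %) to a protocol code."""
    return INLINE_TO_PROTOCOL.get(inline_code.lstrip("%"))

print(protocol_code("%e"))
```

Codes that are not in the mapping return None, so a caller can decide whether to warn or to fall back to the full list on the Language Tags page.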

Roles

The role of a grapheme may be annotated on the grapheme element, but there is no ATF syntax for specifying it: the ideo, num or syll values of the role attribute should be determined by linguistic services processors and added directly to the XTF version of the text.

Logograms

The surface syntax for logograms is described under Sign Names above.

A normalization may be given after a word containing at least one logogram by following the word immediately with (=...), e.g., SAL(=mimma).
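
Splitting such a token into the written word and its normalization might look like the following. The regular expression is an assumption made for illustration; it only recognises a (=...) suffix at the very end of a token and does not attempt to cover the rest of the ATF word syntax.

```python
import re

# A word followed immediately by "(=...)", e.g. SAL(=mimma).
LOGO_NORM = re.compile(r"^(?P<word>.+?)\(=(?P<norm>[^()]*)\)$")

def split_normalization(token):
    """Return (word, normalization); normalization is None when absent."""
    match = LOGO_NORM.match(token)
    if match:
        return match.group("word"), match.group("norm")
    return token, None

print(split_normalization("SAL(=mimma)"))
```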

linguistic.attr =
  attribute xml:lang      { xsd:language } ? ,
#  attribute g:rws         { "emegir" | "emesal" | "udgalnun" }? ,
  (attribute g:role       { "sign" | "ideo" | "num" | "syll" }
  | (attribute g:role     { "logo" } ,
     attribute g:logolang { xsd:language }))
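
The choice in linguistic.attr means that g:logolang accompanies g:role exactly when the role is logo. A processor that adds roles could check this with something like the sketch below, where attributes are held in a plain dictionary keyed by prefixed name; the helper name and that representation are assumptions made for the example.

```python
PLAIN_ROLES = {"sign", "ideo", "num", "syll"}

def role_is_valid(attrs):
    """Check the g:role / g:logolang pairing described by linguistic.attr."""
    role = attrs.get("g:role")
    has_logolang = "g:logolang" in attrs
    if role == "logo":
        return has_logolang                  # logo requires a logogram language
    return role in PLAIN_ROLES and not has_logolang

print(role_is_valid({"g:role": "logo", "g:logolang": "akk"}))
print(role_is_valid({"g:role": "logo"}))
```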

Proximity

proximity.attr = 
  attribute g:prox { xsd:integer }

Intrusions

The g:x type empty is generated by the ATF processor when a word begins with a boundary. This can happen in GDL fragments within a note, e.g.:

#note: @akk{-ir}*: copy has @akk{NI}.

The g:x type disambig cannot be generated by the ATF processor, and so cannot occur outside of glossaries. Glossaries are processed using gdlme2, which generates a disambig for disambiguated forms such as a\abs. The use of the backslash (\) for form variants cannot conflict with this, because form variants are by definition non-linguistic and are not rendered when producing forms for the lemmatizer. If a form variant has linguistic importance, it should be expressed using either an @-modifier or a ~-modifier.

nongrapheme = 
  element g:x {
    ( attribute g:type { "disambig" | "empty" | "linebreak" | "newline" | "user" | "dollar" | "comment" }
    | ( attribute g:type { "ellipsis" | "word-absent" | "word-broken" | "word-linecont" } 
        , status.spans , opener? , closer? , break? )),
    delim? , text? , varnum? ,
    attribute xml:id { xsd:ID }? ,
    breakStart? , breakEnd? ,
    damageStart? , damageEnd? , emhyph? ,
    surroStart? , surroEnd? ,
    statusStart? , statusEnd? ,
    status.flags?
    }

words.rnc

namespace g = "http://oracc.org/ns/gdl/1.0"
namespace n = "http://oracc.org/ns/norm/1.0"
namespace note = "http://oracc.org/ns/note/1.0"
namespace syn = "http://oracc.org/ns/syntax/1.0"

word.content = text | group | grapheme | nongrapheme

words = (word | sword.head | sword.cont | nonword | nongrapheme)*

word = 
  element g:w {
    word.attributes,
    word.content*
  }

sword.head = 
  element g:w {
    attribute headform { text },
    attribute contrefs { xsd:IDREFS },
    word.attributes,
    word.content*
  }

sword.cont = 
  element g:swc {
    attribute xml:id { xsd:ID } ,
    attribute xml:lang { xsd:language } ,
    attribute form  { text }? ,
    attribute headref { xsd:IDREF },
    attribute swc-final { "1" | "0" },
    delim? ,
    word.content*
  }

word.attributes = 
    attribute xml:id { xsd:ID } ,
    attribute xml:lang { xsd:language } ,
    attribute form  { text }? ,
    attribute lemma { text }? ,
    attribute guide { text }? ,
    attribute sense { text }? ,
    attribute pos   { text }? ,
    attribute morph { text }? ,
    attribute base  { text }? ,
    attribute norm  { text }? ,
    delim? ,
    syntax.attributes*

nonword = 
  element g:nonw {
    (
    attribute xml:id { xsd:ID }? ,
    attribute xml:lang { xsd:language }? ,
    attribute type { "comment" | "dollar" | "excised" | "punct" | "surro" | "vari" }? ,
    attribute form { text }? ,
    attribute lemma { text }? ,
    syntax.attributes* ,
    break? , status.flags? , status.spans? , opener? , closer? , delim? , g.meta , 
    word.content*
    )
    |
    (
    attribute type { "notelink" },
    noteref,
    noteauto?,
    text
    )
  }

group = 
  element g:gg {
    attribute g:type { 
      "correction" | "alternation" | "group" | "reordering" | "ligature" | "implicit-ligature" | "logo" | "numword"
    } ,
    g.meta ,
    (group | grapheme | normseg)+
  }

groupgroup = 
  element g:gg {
    attribute g:type { "group" } ,
    g.meta ,
    (group | grapheme | normword)+
  }

syntax.attributes = 
  (attribute syn:brk-before { text } |
   attribute syn:brk-after  { text } |
   attribute syn:ub-before  { text } |
   attribute syn:ub-after   { text } )

normword = 
  element n:w { 
    word.attributes , 
    break? , status.flags? , status.spans? , opener? , closer? , 
    hsqb_o? , hsqb_c? ,
    (normwordword | normwordgroup | gloss | nongrapheme | group)* ,
    syntax.attributes*,
    breakStart? , breakEnd? ,
    damageStart? , damageEnd? ,
    statusStart? , statusEnd?
  }

normwordgroup = 
  element n:word-group {
     attribute g:type { "alternation" } ,
     attribute g:delim { "-" }? ,
     element n:grouped-word { normwordword }+
  }

normwordword = ( text | (normseg | normgroup)+)

normseg =
  element n:s {
    n.meta ,
    g.meta ,
    text
  }

n.meta = normnum?

normnum = attribute n:num { "yes" }

normgroup = 
  element n:g {
    attribute g:type {
      "correction" | "alternation" | "group" | "reordering" | "ligature" | "numword"
    } ,
    g.meta ,
    (normgroup | normseg)+
  }
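
The split-word definitions link a head g:w to its g:swc continuations in both directions: contrefs on the head lists the continuation IDs, and headref on each continuation points back. A consistency check over an XML fragment could be sketched as follows; the line wrapper element, the IDs and the word text are invented for the example, and only the attribute names come from the schema.

```python
import xml.etree.ElementTree as ET

G = "http://oracc.org/ns/gdl/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def split_word_errors(root):
    """List inconsistencies between g:w heads and their g:swc continuations."""
    errors = []
    heads = {w.get(XML_ID): w for w in root.iter(f"{{{G}}}w") if w.get("contrefs")}
    conts = {c.get(XML_ID): c for c in root.iter(f"{{{G}}}swc")}
    for head_id, head in heads.items():
        for ref in head.get("contrefs").split():
            cont = conts.get(ref)
            if cont is None:
                errors.append(f"{head_id}: contrefs names missing g:swc {ref}")
            elif cont.get("headref") != head_id:
                errors.append(f"{ref}: headref does not point back to {head_id}")
    return errors

doc = ET.fromstring(
    f'<line xmlns:g="{G}">'
    '<g:w xml:id="w1" xml:lang="sux" headform="an-na" contrefs="w2">an-</g:w>'
    '<g:swc xml:id="w2" xml:lang="sux" headref="w1" swc-final="1">na</g:swc>'
    "</line>"
)
print(split_word_errors(doc))  # an empty list means the references agree
```

Full validation against the schema is a separate matter; this check covers only the cross-references, which a grammar-based validator treats as opaque IDREF values.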

Words

For the purposes of transliteration, a "word" is anything between spaces, including isolated and uninterpretable signs.

In GDL, words are sequences of graphemes or grapheme-groups. The following kinds of grapheme-groups are defined:

alternation
Simple alternation of the common transliterational form KI/DI. An alternation may contain more than one choice, but always applies to a sequence of single graphemes.
reordering
Reordering of graphemes within a word commonly expressed by use of the colon (:) as a grapheme joiner in transliterations. The original order of the signs on the tablet is not indicated within a word; the structural mechanism Multiplexing must be used instead.
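
As a rough illustration of how the two joiners map onto g:type values, the sketch below classifies a single token by its joiner. It is deliberately naive: classify_group is a hypothetical helper, KI:DI is an invented reordering example, and real ATF uses these characters in other contexts that are ignored here.

```python
# Joiner character -> g:type of the resulting grapheme-group.
JOINERS = {"/": "alternation", ":": "reordering"}

def classify_group(token):
    """Return (group type, member graphemes) for a token such as KI/DI."""
    for joiner, group_type in JOINERS.items():
        parts = token.split(joiner)
        if len(parts) > 1 and all(parts):
            return group_type, parts
    return None, [token]                # not a grapheme-group

print(classify_group("KI/DI"))
print(classify_group("KI:DI"))
```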

Normalization

In normalization, sequences like mû/pû generate an outer n:w containing a n:word-group which in turn contains a sequence of n:grouped-word elements.
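
The nesting described here can be built with a few lines of code. The sketch below uses Python's ElementTree and only the element and attribute names given in the schema; the word ID, the language tag and the helper name are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

N = "http://oracc.org/ns/norm/1.0"
G = "http://oracc.org/ns/gdl/1.0"
XML = "http://www.w3.org/XML/1998/namespace"

def normalized_alternation(word_id, lang, alternatives):
    """Build n:w > n:word-group > n:grouped-word+ for an alternation."""
    word = ET.Element(
        f"{{{N}}}w", {f"{{{XML}}}id": word_id, f"{{{XML}}}lang": lang}
    )
    group = ET.SubElement(
        word, f"{{{N}}}word-group", {f"{{{G}}}type": "alternation"}
    )
    for alternative in alternatives:
        ET.SubElement(group, f"{{{N}}}grouped-word").text = alternative
    return word

# Use the conventional prefixes when serialising.
ET.register_namespace("n", N)
ET.register_namespace("g", G)

w = normalized_alternation("w1", "akk-x-stdbab", "mû/pû".split("/"))
print(ET.tostring(w, encoding="unicode"))
```

Each alternative is stored as plain text inside its n:grouped-word, which is the simplest of the content models that normwordword allows.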

gdl.rnc

A simple entry point so that users don't have to include several separate schemas.

include "charset.rnc"
include "grapheme.rnc"
include "graphmeta.rnc"
include "words.rnc"

Resources


Questions about this document may be directed to the Oracc Steering Committee (osc at oracc dot org).