Introduction


cbd.rnc

Preamble

Document

Declaration

Entries

Structure

Semantics

Properties


Resources

CBD: Corpus-Based Dictionary

(http://oracc.org/ns/cbd/1.0)

Steve Tinney
Version of 2014-08-06

Introduction

The CBD format is used for glossaries and lexicons. The architecture is designed to support the inclusion of arbitrary items of information tailored to the needs of different languages or types of name books.

cbd.rnc

default namespace = "http://oracc.org/ns/cbd/1.0"
namespace cbd = "http://oracc.org/ns/cbd/1.0"
start = cbd

cbd = element cbd { cbd-attr , declaration? , entry* }
cbd-attr    = (target-lang , target-rws , xml-lang)
target-lang = attribute cbd:target-lang { text }
target-rws  = attribute cbd:target-rws  { text }
xml-lang    = attribute xml:lang        { text }

declaration =   element declaration { prop-replace? , prop-def* }
prop-replace =  attribute cbd:property-replace { xsd:boolean }
prop-def =      element property {
                  prop-scope , prop-name , prop-type , prop-sort ,
                  prop-gaps , prop-val* }
prop-scope =    attribute cbd:property-scope { "cbd" | "entry" }
prop-name  =    attribute cbd:property-name  { xsd:NMTOKEN }
prop-type =     attribute cbd:property-type  { "singleton" | "list" | "complex" }
prop-sort =     attribute cbd:property-sort  { 
                  "none" | "numeric" | "alpha" | "list" }
prop-gaps =     attribute cbd:property-gaps-ok { xsd:boolean }
prop-val =      element property-value { prop-val-type , text }
prop-val-type = attribute cbd:prop-ok-type {
                  "number" | "letter" | "token" | "pattern" }

entry = element entry { cf , gw , pos , sense* , properties }

cf = element cf { text }
gw = element gw { text }
pos = element pos { text }

sense      = element sense { (gw? , pos? ,
                              ((glosses , definition?) | definition)) ,
                             sense*
                           }

glosses    = element glosses { text }
definition = element definition { text | anyElement }
anyElement = element * { attribute * { text }* , (anyElement | text)* }

properties = element prop { name ,
                            ((value , key?) | ref) ,
                            properties }*

name  = attribute n { xsd:NMTOKEN }
value = attribute v { text } | element v { text | anyElement }
key   = attribute k { text }
ref   = attribute r { text }

Preamble

default namespace = "http://oracc.org/ns/cbd/1.0"
namespace cbd = "http://oracc.org/ns/cbd/1.0"
start = cbd

Document

The document element is cbd and has attributes to specify various fundamental glossary parameters as follows:

target-lang
The language of which the present CBD is a glossary. This must be a three-letter ISO 639 code, with private-use definitions as given in the GDL documentation.
target-rws
The register/writing-system (or dialect) for which the CBD is a glossary. For example, a Sumerian glossary might focus only on Emesal, in which case target-rws would be set to ES.
xml:lang
The default language for definitions.
cbd = element cbd { cbd-attr , declaration? , entry* }
cbd-attr    = (target-lang , target-rws , xml-lang)
target-lang = attribute cbd:target-lang { text }
target-rws  = attribute cbd:target-rws  { text }
xml-lang    = attribute xml:lang        { text }
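
As a rough illustration of the attributes above, a document element for an Emesal-only Sumerian glossary might look like the following sketch. The code sux is the ISO 639 code for Sumerian and ES is the register value mentioned above; the choice of en for xml:lang is an assumption made for the example. Because the schema declares target-lang and target-rws with the cbd: prefix, the prefix has to be bound in the instance.

```xml
<!-- Sketch only: an empty glossary; the declaration and entries are optional in the schema -->
<cbd xmlns="http://oracc.org/ns/cbd/1.0"
     xmlns:cbd="http://oracc.org/ns/cbd/1.0"
     cbd:target-lang="sux"
     cbd:target-rws="ES"
     xml:lang="en">
</cbd>
```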

Declaration

The CBD declaration defines parameters for the glossary, including its languages and properties. A library of default parameters, organized by language, is available in cdl/lib/cbd/<LANG>.xml; if there is no declaration, or if the declaration sets cbd:property-replace to false, the processor reads the system declaration for the target-lang before proceeding.

declaration =   element declaration { prop-replace? , prop-def* }
prop-replace =  attribute cbd:property-replace { xsd:boolean }
prop-def =      element property {
                  prop-scope , prop-name , prop-type , prop-sort ,
                  prop-gaps , prop-val* }
prop-scope =    attribute cbd:property-scope { "cbd" | "entry" }
prop-name  =    attribute cbd:property-name  { xsd:NMTOKEN }
prop-type =     attribute cbd:property-type  { "singleton" | "list" | "complex" }
prop-sort =     attribute cbd:property-sort  { 
                  "none" | "numeric" | "alpha" | "list" }
prop-gaps =     attribute cbd:property-gaps-ok { xsd:boolean }
prop-val =      element property-value { prop-val-type , text }
prop-val-type = attribute cbd:prop-ok-type {
                  "number" | "letter" | "token" | "pattern" }
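
A declaration matching this schema could be sketched as follows. The property name stem is borrowed from the example in the Properties section; the attribute values are chosen purely for illustration, and how a processor interprets each of them is not specified here. The fragment assumes the cbd prefix is bound on the enclosing cbd element. Setting cbd:property-replace to true presumably means the system declaration is not consulted, but that reading is an inference from the text above.

```xml
<declaration cbd:property-replace="true">
  <!-- Hypothetical definition of a "stem" property; values are illustrative only -->
  <property cbd:property-scope="entry"
            cbd:property-name="stem"
            cbd:property-type="complex"
            cbd:property-sort="none"
            cbd:property-gaps-ok="true"/>
</declaration>
```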

Entries

A CBD entry consists of several core elements and an open-ended list of properties. In this way the structure is adaptable to different kinds of glossaries and languages. For example, a glossary of personal names can have properties giving genealogical information for the persons referenced in the entries.

entry = element entry { cf , gw , pos , sense* , properties }

The core elements provide the essential structural data for the glossaries and, optionally, the semantic outline.

Structure

The central structural mechanism for entries in the CBD architecture is formed from three pieces of data: the Citation Form (CF), i.e., the form of the word that is given as the headword in the entry; the Guide Word (GW), i.e., a disambiguating label which separates homophones; and the Part Of Speech (POS), i.e., the syntactic function typically fulfilled by the word.

In more traditional dictionaries the GW function is fulfilled by letters or numbers, and there is nothing in the CBD definition to prevent this from being the case in a CBD. However, it is also common in CBDs for the GW to be a word or phrase which orients the dictionary user to the meaning or semantic realm of the term--hypernyms often make good choices for GWs. The use of unordered symbols of this kind to disambiguate words permits deferral of decisions about the number and ordering of homonyms, and is particularly useful for the development phase of glossaries where the complete lexicon is unknown during the corpus-building process.

The permitted values for GW can be specified in the declaration by giving a prop-def entry for the property gw. Thus, to declare that GW values are numbers, the following entry can be given:

<property cbd:property-scope="cbd" cbd:property-name="gw"
  cbd:property-type="singleton" cbd:property-sort="numeric"
  cbd:property-gaps-ok="false">
  <property-value cbd:prop-ok-type="number"/>
</property>

Similarly, the range of values of the pos element can be constrained by giving a specification for the property pos in the CBD declaration.
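
A pos specification along these lines might look like the sketch below. The labels N and V are hypothetical part-of-speech values, and treating the text of each property-value as one permitted token is an assumption about how the processor reads it; the schema itself only requires the cbd:prop-ok-type attribute and text content.

```xml
<property cbd:property-scope="cbd" cbd:property-name="pos"
  cbd:property-type="singleton" cbd:property-sort="alpha"
  cbd:property-gaps-ok="true">
  <!-- N and V are hypothetical part-of-speech labels -->
  <property-value cbd:prop-ok-type="token">N</property-value>
  <property-value cbd:prop-ok-type="token">V</property-value>
</property>
```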

The content model of the structural elements is very weak because there is an operating assumption that validation will be carried out on the content--at least for gw and pos--based on allowable values specified in the property declarations.

cf = element cf { text }
gw = element gw { text }
pos = element pos { text }
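
For instance, the three structural elements for the Sumerian verb ŋar[place], which this document uses as an example in the Properties section, could be written as in this minimal sketch. The POS label V is an assumption; the actual set of labels depends on the property declarations in force.

```xml
<entry>
  <cf>ŋar</cf>   <!-- citation form: the headword -->
  <gw>place</gw> <!-- guide word: disambiguates homophones -->
  <pos>V</pos>   <!-- hypothetical part-of-speech label -->
</entry>
```

Senses and properties are omitted here; the schema allows an entry to have none of either.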

Semantics

The CBD provides a built-in structure for dictionary definitions. The structure is recursive to arbitrary levels and provides support for multiple glossary-writing styles as well as tie-ins with the CDL lemmatizer.

Note that the specification for definition permits arbitrary structured content to be included here and subsequently processed by glossary-specific plug-ins.

sense      = element sense { (gw? , pos? ,
                              ((glosses , definition?) | definition)) ,
                             sense*
                           }

glosses    = element glosses { text }
definition = element definition { text | anyElement }
anyElement = element * { attribute * { text }* , (anyElement | text)* }
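
The recursion and the two content styles can be sketched as follows. The glosses, the definition text, the POS labels and the note element are all hypothetical, chosen only to show the shape the schema permits. Note that definition holds either plain text or a single element, not a mixture of the two at its top level.

```xml
<entry>
  <cf>ŋar</cf>
  <gw>place</gw>
  <pos>V</pos>
  <sense>
    <!-- glosses followed by an optional plain-text definition -->
    <glosses>to place</glosses>
    <definition>to put something in a location</definition>
    <!-- a nested sense may override pos and may carry a definition only -->
    <sense>
      <pos>V/t</pos>
      <definition>
        <note>structured content for a <em>glossary-specific</em> plug-in</note>
      </definition>
    </sense>
  </sense>
</entry>
```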

Properties

The property framework enables CBDs to be configured to the needs of various languages and target data types. The framework is generic and recursive, meaning that complex properties are also supported.

Properties consist of at most three components: a required name, given in the n attribute; a required value or reference, given either as a v attribute, a v element or, for references, an r attribute; and an optional key symbol, given with a k attribute.

The key symbol may be used for sorting and cross-referencing. In the following example of the use of cross-referencing, the Sumerian verb ŋar[place] has several stems, each of which may be written in several ways. We call the writing of a stem a base, so the relationship between stem and base can be defined as follows:

    <prop n="stem" v="B">
      <prop n="form" v="ŋar"/>
      <prop n="func" v="perf"/>
    </prop>
    <prop n="stem" v="B">
      <prop n="form" v="mar"/>
      <prop n="func" v="perf"/>
      <prop n="rws" v="ES"/>
    </prop>
    <prop n="base" v="ŋar">
      <prop n="stem" r="#form=ŋar"/>
    </prop>
    <prop n="base" v="ma·ra">
      <prop n="stem" r="#form=mar"/>
    </prop>

The (incomplete and simplified) example enumerates the stems of the verb--there is a B (base) stem with perfective function in the default register/writing-system with the form ŋar, and another form of the B-perfective, mar, used in Emesal. The orthographic bases on which forms are constructed are given in the base property. The bases reference the stems via the form property of the stem properties. This permits complex co-validation of the lemmatized instance-data: the lemmatizer uses the glossary to ensure, for example, that a lemmatized form uses a combination of base and stem which is valid according to the entry's properties. If a base can be used to write several stems, the lemmatizer can issue a diagnostic if no stem is specified; if the base is used to write only one stem, the lemmatizer can supply the stem from the base if the user has not done so.

properties = element prop { name ,
                            ((value , key?) | ref) ,
                            properties }*

name  = attribute n { xsd:NMTOKEN }
value = attribute v { text } | element v { text | anyElement }
key   = attribute k { text }
ref   = attribute r { text }
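
The value forms that the schema allows but the stem example does not show can be sketched as follows. The key b1, the property name note and the element em are hypothetical; the schema constrains only the shape, and how a processor uses the key is not specified here.

```xml
<!-- Value given as a v attribute, with a hypothetical key symbol in k -->
<prop n="stem" v="B" k="b1">
  <prop n="form" v="ŋar"/>
</prop>
<!-- Value given as a v element, which may hold text or a single element -->
<prop n="note">
  <v><em>simplified example</em></v>
</prop>
```

In the schema, a key may accompany a value but not a reference: a prop carries either v (with an optional k) or r.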

Resources


Questions about this document may be directed to the Oracc Steering Committee (osc at oracc dot org).