5 Defining SGML Document Structures: The DTD
Rules such as those described above are the first stage in the creation of a
formal specification for the structure of an SGML document, or document
type definition, usually abbreviated to DTD. In creating a
DTD, the document designer may be as lax or as restrictive as the occasion
warrants. A balance must be struck between the convenience of following simple
rules and the complexity of handling real texts. This is particularly the case
when the rules being defined relate to texts which already exist: the designer
may have only the haziest of notions as to an ancient text's original purpose or
meaning and hence find it very difficult to specify consistent rules about its
structure. On the other hand, where a new text is being prepared to an exact
specification, for example for entry into a textual database of some kind, the
more precisely stated the rules, the better they can be enforced. Even in the
case where an existing text is being marked up, it may be beneficial to define a
restrictive set of rules relating to one particular view or hypothesis about the
text---if only as a means of testing the usefulness of that view or hypothesis.
It is important to remember that every document type definition is an
interpretation of a text. There is no single DTD which encompasses any kind of
absolute truth about a text, although it may be convenient to privilege some
DTDs above others for particular types of analysis.
At present, SGML is most widely used in environments where uniformity of
document structure is a major desideratum. In the production of technical
documentation, for example, it is of major importance that sections and
subsections should be properly nested, that cross references should be properly
resolved and so forth. In such situations, documents are seen as raw material
to match against pre-defined sets of rules. As discussed above, however, the
use of simple rules can also greatly simplify the task of tagging accurately
elements of less rigidly constrained texts. By making these rules explicit, the
scholar reduces his or her own burdens in marking up and verifying the
electronic text, while also being forced to make explicit an interpretation of
the structure and significant particularities of the text being encoded.
5.1 An Example DTD
A DTD is expressed in SGML as a set of declarative statements, using a
simple syntax defined in the standard. For our simple model of a poem, the
following declarations would be appropriate:
<!ELEMENT anthology - - (poem+)>
<!ELEMENT poem - - (title?, stanza+)>
<!ELEMENT title - O (#PCDATA) >
<!ELEMENT stanza - O (line+) >
<!ELEMENT line O O (#PCDATA) >
These five lines are examples of formal SGML element declarations. A
declaration, like an element, is delimited by angle brackets; the first
character following the opening bracket must be an exclamation mark, followed
immediately by one of a small set of SGML-defined keywords, specifying the kind
of object being declared. The five declarations above are all of the same type:
each begins with an ELEMENT keyword, indicating that it declares
an element, in the technical sense defined above. Each consists of three parts:
a name or group of names, two characters specifying minimization rules,
and a content model. Each of these parts is discussed further
below. Components of the declaration are separated by white space, that is one
or more blanks, tabs or newlines.
The first part of each declaration above gives the generic identifier of the
element which is being declared, for example
`poem', `title', etc. It is
possible to declare several elements in one statement, as discussed below.
5.2 Minimization Rules
The second part of the declaration specifies what are called
minimization rules for the element concerned. These rules
determine whether or not start- and end-tags must be present in every occurrence
of the element concerned. They take the form of a pair of characters, separated
by white space, the first of which relates to the start-tag, and the second to
the end-tag. In either case, either a hyphen or a letter O (for ``omissible''
or ``optional'') must be given; the hyphen indicating that the tag must be
present, and the letter O that it may be omitted. Thus, in this example, every
element except <line> must have a start-tag. Only the
<poem> and <anthology> elements must have end-tags as well.
5.3 Content Model
The third part of each declaration, enclosed in parentheses, is called the
content model of the element, because it specifies what element
occurrences may legitimately contain. Contents are specified either in terms of
other elements or using special reserved words. There are several such reserved
words, of which by far the most commonly encountered is PCDATA, as in
this example. This is an abbreviation for parsed character data,
and it means that the element being defined may contain any valid character
data. If an SGML declaration is thought of as a structure like a family tree,
with a single ancestor at the top (in our case, this would be <anthology>),
then almost always, following the branches of the tree downwards (for example,
from <anthology> to <poem> to <stanza> to <line>
and
<title>) will lead eventually to PCDATA. In our example, <title>
and <line> are so defined. Since their content models say PCDATA
only and name no embedded elements, they may not contain any embedded elements.
5.4 Occurrence Indicators
The declaration for <stanza> in the example above states that a
stanza consists of one or more lines. It uses an
occurrence indicator (the plus sign) to indicate how many times
the element named in its content model may occur. There are three occurrence
indicators in the SGML syntax, conventionally represented by the plus sign, the
question mark, and the asterisk or star.
[See note 6] The plus sign means that there may be one or more occurrences of the
element concerned; the question mark means that there may be at most one and
possibly no occurrence; the star means that the element concerned may either be
absent or appear one or more times. Thus, if the content model for <stanza>
were (LINE*), stanzas with no lines would be possible as well as
those with more than one line. If it were (LINE?), again empty
stanzas would be countenanced, but no stanza could have more than a single line.
The declaration for <poem> in the example above thus states that a <poem>
cannot have more than one title, but may have none, and that it must have at
least one <stanza> and may have several.
5.5 Group Connectors
The content model (TITLE?, STANZA+) contains more than one
component, and thus needs additionally to specify the order in which these
elements (<title> and <stanza>) may appear. This ordering is
determined by the
group connector (the comma) used between its components. There are
three possible group connectors, conventionally represented by comma, vertical
bar, and ampersand.
[See note 7] The comma means that the components it connects must both appear in the
order specified by the content model. The ampersand indicates that the
components it connects must both appear but may appear in any order. The
vertical bar indicates that only one of the components it connects may appear.
If the comma in this example were replaced by an ampersand, a title could appear
either before the stanzas of a <poem> or at the end (but not between
stanzas). If it were replaced by a vertical bar, then a <poem> would
consist of either a title or just stanzas---but not both!
5.6 Model Groups
In our example so far, the components of each content model have been either
single elements or PCDATA. It is quite permissible however to define
content models in which the components are lists of elements, combined by group
connectors. Such lists, known as
model groups, may also be modified by occurrence indicators and
themselves combined by group connectors. To demonstrate these facilities, let
us now expand our example to include non-stanzaic types of verse. For the sake
of demonstration, we will categorize poems as one of stanzaic,
couplets, or
blank (or stichic). A blank-verse poem consists
simply of lines (we ignore the possibility of verse paragraphs for the moment)
[See note 8] so no additional elements need be defined for it. A couplet is defined
as a <line1> followed by a <line2>.
<!ELEMENT couplet O O (line1, line2) >
The elements <line1> and <line2> (which are distinguished to
enable studies of rhyme scheme, for example) have exactly the same content model
as the existing <line> element. They can therefore share the same
declaration. In this situation, it is convenient to supply a name group
as the first component of a single element declaration, rather than give a
series of declarations differing only in the names used. A name group is a list
of GIs connected by any group connector and enclosed in parentheses, as follows:
<!ELEMENT (line | line1 | line2) O O (#PCDATA) >
The declaration for the <poem> element can now be changed to
include all three possibilities:
<!ELEMENT poem - O (title?, (stanza+ | couplet+ | line+) ) >
That is, a poem consists of an optional title, followed by one or
several stanzas, or one or several couplets, or one or several lines. Note the
difference between this definition and the following:
<!ELEMENT poem - O (title?, (stanza | couplet | line)+ ) >
The second version, by applying the occurrence indicator to the group
rather than to each element within it, would allow for a single poem to contain
a mixture of stanzas, couplets or blank verse.
Quite complex models can easily be built up in this way, to match the
structural complexity of many types of text. As a further example, consider the
case of stanzaic verse in which a refrain or chorus appears. A refrain may be
composed of repetitions of the line element, or it may simply be text, not
divided into verse lines. A refrain can appear at the start of a poem only, or
as an optional addition following each stanza. This could be expressed by a
content model such as the following:
<!ELEMENT refrain - - (#PCDATA | line+)>
<!ELEMENT poem - O (title?,
( (line+)
| (refrain?, (stanza, refrain?)+ ) )) >
That is, a poem consists of an optional title, followed by either a
sequence of lines, or an un-named group, which starts with an optional refrain,
followed by one of more occurrences of another group, each member of which is
composed of a stanza followed by an optional refrain. A sequence such as
`refrain - stanza - stanza - refrain' follows this
pattern, as does the sequence
`stanza - refrain - stanza - refrain'. The sequence
`refrain - refrain - stanza - stanza' does not, however,
and neither does the sequence ``stanza - refrain - refrain - stanza.''
Among other conditions made explicit by this content model are the requirements
that at least one stanza must appear in a poem, if it is not composed simply of
lines, and that if there is both a title and a stanza they must appear in that
order.
Back to table of contents
On to next section
Back to previous section