2 What's Special about SGML?
There are three characteristics of SGML which distinguish it from other
markup languages: its emphasis on descriptive rather than procedural markup;
its document type concept; and its independence of any one system
for representing the script in which a text is written. These three aspects are
discussed briefly below, and then in more depth in sections 4 (SGML Structures)
and 5 (Defining SGML Document Structures: The DTD).
2.1 Descriptive Markup
A descriptive markup system uses markup codes which simply provide names to
categorize parts of a document. Markup codes such as <para> or \end{list}
identify a portion of a document and assert of it that "the following
item is a paragraph" or "this is the end of the most recently begun list."
By contrast, a procedural markup system defines what processing is to be
carried out at particular points in a document:
"call procedure PARA with parameters 1, b and x here" or "move the left
margin 2 quads left, move the right margin 2 quads right, skip down one line,
and go to the new left margin." In SGML, the instructions needed to
process a document for some particular purpose (for example, to format it) are
sharply distinguished from the descriptive markup which occurs within the
document. Usually, they are collected outside the document in separate
procedures or programs.
With descriptive instead of procedural markup, the same document can readily
be processed by many different pieces of software, each of which can apply
different processing instructions to those parts of it which are considered
relevant. For example, a content analysis program might disregard entirely the
footnotes embedded in an annotated text, while a formatting program might
extract and collect them all together for printing at the end of each chapter.
Different sorts of processing instructions can be associated with the same parts
of the file. For example, one program might extract names of persons and places
from a document to create an index or database, while another, operating on the
same text, might print names of persons and places in a distinctive typeface.
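The point that one descriptively marked-up text can serve several programs
can be sketched in Python. This is only an illustration under stated
assumptions: the sample text and the element names (text, para, note, name)
are hypothetical, and the standard library's XML parser stands in for an SGML
parser, which works here only because the sample happens to be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are illustrative only.
SAMPLE = (
    "<text>"
    "<para>Met <name>Ada</name> in <name>London</name>."
    "<note>Date uncertain.</note> They talked.</para>"
    "<para>Later, <name>Ada</name> left.</para>"
    "</text>"
)

def content_words(elem):
    """Content-analysis view: every word except those inside <note>."""
    words = []
    if elem.tag != "note":
        words.extend((elem.text or "").split())
        for child in elem:
            words.extend(content_words(child))
            # Text following a child belongs to the parent, so keep it
            # even when the child itself is a skipped note.
            words.extend((child.tail or "").split())
    return words

def collect_notes(root):
    """Formatter view: gather all notes, e.g. to print them at the end."""
    return [note.text for note in root.iter("note")]

def index_names(root):
    """Indexer view: sorted, de-duplicated names of persons and places."""
    return sorted({name.text for name in root.iter("name")})

root = ET.fromstring(SAMPLE)
print(content_words(root))
print(collect_notes(root))
print(index_names(root))
```

None of the three functions changes the document; each simply attaches its
own processing to the parts it considers relevant.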
2.2 Types of Document
Secondly, SGML introduces the notion of a document type, and
hence a document type definition (DTD). Documents are regarded as
having types, just as other objects processed by computers do. The type of a
document is formally defined by its constituent parts and their structure. The
definition of a report, for example, might be that it consists of a title and
possibly an author, followed by an abstract and a sequence of one or more
paragraphs. Anything lacking a title, according to this formal definition,
would not formally be a report, and neither would a sequence of paragraphs
followed by an abstract, whatever other report-like characteristics these might
have for the human reader.
If documents are of known types, a special-purpose program (called a
parser) can be used to process a document claiming to be of a
particular type and check that all the elements required for that document type
are indeed present and correctly ordered. More significantly, different
documents of the same type can be processed in a uniform way. Programs can be
written which take advantage of the knowledge encapsulated in the document
structure information, and which can thus behave in a more intelligent fashion.
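The kind of check such a parser performs can be sketched for the report
example above. This is a minimal illustration, not how an SGML parser is
implemented: a real parser reads the rules from a DTD, whereas here the rule
"title, optional author, abstract, one or more paragraphs" is hard-coded as a
regular expression over element names. The names REPORT_MODEL, is_report and
para are assumptions made for the example.

```python
import re

# Content model from the text: a title, possibly an author, an abstract,
# then one or more paragraphs. Each element name is followed by a space.
REPORT_MODEL = re.compile(r"title (author )?abstract (para )+")

def is_report(element_names):
    """Return True if the sequence of element names matches the model."""
    sequence = "".join(name + " " for name in element_names)
    return REPORT_MODEL.fullmatch(sequence) is not None

print(is_report(["title", "author", "abstract", "para", "para"]))
print(is_report(["abstract", "para"]))           # lacks a title
print(is_report(["title", "para", "abstract"]))  # abstract after paragraphs
```

The last two sequences fail the check for exactly the reasons given above:
one lacks a title, and the other places its paragraphs before the abstract.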
2.3 Data Independence
A basic design goal of SGML was to ensure that documents encoded according
to its provisions should be transportable from one hardware and software
environment to another without loss of information. The two features discussed
so far both address this requirement at an abstract level; the third feature
addresses it at the level of the strings of bytes (characters) of which
documents are composed. SGML provides a general-purpose mechanism for string
substitution, that is, a simple machine-independent way of stating that a
particular string of characters in the document should be replaced by some other
string when the document is processed. One obvious application for this
mechanism is to ensure consistency of nomenclature; another, more significant
one, is to counter the notorious inability of different computer systems to
understand each other's character sets, or of any one system to provide all the
graphic characters needed for a particular application, by providing descriptive
mappings for non-portable characters. The strings defined by this
string-substitution mechanism are called entities and they are
discussed below in section 8 (SGML Entities).
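The string-substitution idea can be sketched in a few lines of Python. The
sketch assumes the familiar &name; form of an entity reference; the entity
names and replacement strings in ENTITIES are hypothetical, and the function
expand_entities is an illustration rather than part of any SGML tool.

```python
import re

# Hypothetical entity declarations: entity name -> replacement string.
ENTITIES = {
    "tei": "Text Encoding Initiative",  # consistency of nomenclature
    "eacute": "\u00e9",                 # descriptive name for a non-portable character
}

# An entity reference: ampersand, a name, then a semicolon.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")

def expand_entities(text, entities):
    """Replace each &name; reference with its declared string.

    Unknown references are left untouched in this sketch; a real SGML
    parser would normally treat an undeclared entity as an error.
    """
    def replace(match):
        return entities.get(match.group(1), match.group(0))
    return ENTITY_REF.sub(replace, text)

print(expand_entities("The &tei; met in a caf&eacute;.", ENTITIES))
```

Because the document itself contains only the names, the replacement strings
can be changed for each target system without touching the text.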