








The Exeter SGML Project has granted authorization for adaptation and use of this document by ISUG. The original document content may be examined in the source, the canonical home for which is: http://www.ex.ac.uk/SGML/whysgml1.html.
SGML (Standard Generalized Markup Language) is an internationally agreed standard for information representation. Information is the operative term. The focus upon information representation and use of a formal notation for encoding document structure -- rather than upon 'file format' or 'data format' -- constitute foundational concepts in SGML. The central concern of SGML for information structure often requires users to adopt a new way of thinking about the encoding of information in structured objects we call 'documents.'
SGML can be used to support publishing in its broadest definition: single medium conventional publishing on paper, online multi-media database publishing, or any other publication deliverable that may be derived from a single neutral source. SGML-encoded documents typically use a preponderance of 'plain text' representation for the encoding of information; this portable representation can be read by people, as well as exchanged between machines and applications in a rigorous and unambiguous manner. The following document provides an introduction to the main features of SGML, using non-technical [and thus, technically 'imprecise'] language.
SGML provides an internationally recognized, non-proprietary language for designing your own markup schemes. Markup, broadly speaking, is character text or binary codes 'added to' data content in order to convey particular information about that data. In a file produced by a typical word processor, markup is represented by the proprietary codes that the software inserts into the file to indicate which words should be printed in a certain font, which paragraphs should be centered, where page breaks occur, etc. In the case of a database system, markup is represented by proprietary codes in the data file which indicate where one field or record ends and another begins, and so on.
SGML embraces the principles of descriptive markup -- where textual markup is used to delimit and describe document subcomponents as information objects. The textual markup used is semantically perspicuous to a human: natural language is employed in the selection of names for textual objects, their attributes, and other features. Markup is 'descriptive' in that it indicates the nature, function, or type of content in a document, rather than specifying how that data content should be displayed, printed on paper, or otherwise processed. Markup which is specific to pre-determined processing goals is sometimes called 'procedural' or 'presentational' because it uses codes that specify processing directives or intended display effects. SGML's descriptive markup identifies delimited information objects in terms of what they are, not in terms of how they are to be rendered on a computer screen or formatted for paper printout.
SGML has no provision for expressing and validating primitive relational semantics through its encoding. Validity checking in SGML is thus primarily syntactic, based upon a small number of defined semantics for lexical primitives. However, a certain kind of 'semantic-level' encoding is achieved in SGML through the use of names for elements and attributes that have a well understood (or informally documented) semantic value for SGML users. The integrity of document structure is more readily examined and apprehended by users if a section heading is identified as a "heading" -- e.g., <heading> -- rather than a piece of text that has to be printed or displayed in "20 point Times Bold".
SGML languages provide rigorous control over the syntactic structure of the encoded information, which supports consistency and logical structure that enhance document usability. The control can be rigorous because markup schemes that you design using SGML declare a set of rules which unambiguously prescribe how document objects must be marked up in order to be correctly structured in context. An SGML parser that understands the declared rules is incorporated into SGML-compliant software in such a way that it can apply the rules to the structure of a document instance, and can ensure that any markup in the document conforms to the appropriate set of rules. Such validity checking (and error reporting) can be used to guarantee that the data in an SGML-encoded document is structured in a known way. For example, if the set of rules you are working with declares that text labeled as a "sub-section" can occur only within text labeled as a "section", SGML-compliant software will provide feedback to the user during text creation and editing indicating whether this rule has been obeyed in all situations.
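The "sub-section only within section" rule described above can be illustrated with a minimal Python sketch. It is not an SGML parser and does not read a real DTD: the ALLOWED_PARENTS table, the element names, and the check_structure function are all hypothetical stand-ins, and features such as attributes and omitted tags are ignored.

```python
import re

# Hypothetical rule table standing in for a DTD: each element name maps to
# the set of parents it may appear in (None means the document root).
ALLOWED_PARENTS = {
    "document": {None},
    "section": {"document"},
    "sub-section": {"section"},
    "heading": {"section", "sub-section"},
    "para": {"section", "sub-section"},
}

# Matches simple start and end tags such as <section> or </section>.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)>")


def check_structure(text):
    """Return a list of error messages; an empty list means the rules are obeyed."""
    errors = []
    stack = []  # names of the elements that are currently open
    for match in TAG.finditer(text):
        closing, name = match.group(1) == "/", match.group(2)
        if closing:
            if stack and stack[-1] == name:
                stack.pop()
            else:
                errors.append(f"unexpected end tag </{name}>")
            continue
        parent = stack[-1] if stack else None
        where = f"<{parent}>" if parent else "the document root"
        if name not in ALLOWED_PARENTS:
            errors.append(f"undeclared element <{name}>")
        elif parent not in ALLOWED_PARENTS[name]:
            errors.append(f"<{name}> is not allowed inside {where}")
        stack.append(name)
    # Anything still open at the end was never closed.
    errors.extend(f"missing end tag for <{name}>" for name in reversed(stack))
    return errors


good = ("<document><section><heading>A</heading>"
        "<sub-section><para>x</para></sub-section></section></document>")
bad = "<document><sub-section><para>x</para></sub-section></document>"

print(check_structure(good))
print(check_structure(bad))
```

The first document passes, while the second is reported because its sub-section sits directly inside the document rather than inside a section. A real SGML parser derives such rules from the declared document type instead of a hand-written table.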
If you possess an electronic document that contains data (1) descriptively tagged, and (2) structured according to a known set of rules in an unambiguous manner, then you are empowered to process that information however you see fit. Information encoded in SGML should be free of contamination by proprietary markup codes that are optimal for one kind of processing but are utterly useless for another. A document structured in a processing-neutral way can be readily prepared for formatting (i.e., for paper printout) or for online display. Alternately, you could map the data into a database structure to create text archives, or import the data from a database to create active documents. You can create new documents by extracting or combining information taken from one or more source documents. You can incorporate document objects into a hypertext or multi-media system. All this you can do without altering the source file -- which means you can process the same source information in many different ways simultaneously.
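As a rough illustration of processing one descriptively tagged source in more than one way, the Python sketch below derives both an outline and a plain-text rendering from the same string without modifying it. The sample markup, the element names, and the outline and plain_text functions are assumptions made for this example. The regular expressions handle only simple tags without attributes, which is far less than a real SGML processing system would do.

```python
import re

# Hypothetical descriptively tagged source; it is never modified below.
SOURCE = (
    "<section><heading>Descriptive markup</heading>"
    "<para>Markup says what the text is.</para></section>"
    "<section><heading>Validation</heading>"
    "<para>A parser checks the structure.</para></section>"
)


def outline(source):
    """Collect the content of every <heading> element, in document order."""
    return re.findall(r"<heading>(.*?)</heading>", source, flags=re.S)


def plain_text(source):
    """Drop all tags and keep only the character data, one piece per line."""
    pieces = re.split(r"<[^>]+>", source)
    return "\n".join(p.strip() for p in pieces if p.strip())


print(outline(SOURCE))
print(plain_text(SOURCE))
```

Both results come from the same unchanged source, which is the point of keeping processing decisions out of the document itself.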
SGML thus makes it possible to re-use and share information across applications as well as across hardware platforms and operating systems. People working at different sites, using different authoring tools on different machines, can produce SGML documents conforming to an agreed-upon formal definition that is expressible in plain text. By validating their source documents, they can be assured that the different document (sub)components produced in different locations can be combined to produce a single document that has structural uniformity. Provided that you know the markup scheme which was used to create an SGML document, you can take any arbitrary SGML document and process it however you see fit. Thus, several sites could download a copy of an SGML document from an archive and each print it out in conformance with their local house style rules.
SGML will probably change the way you work, but it will most assuredly change the way you think about your work. Given meaningful DTDs, appropriate software, and a commitment to the long-term usefulness of encoded data, it can simplify many of the things you do already, and make possible many of the things you have always wanted to do.
SGML encourages you to think of information you create as units within a larger document object: a container full of smaller containers, storing re-usable and rigorously structured blocks of information. You will be urged by SGML software to comply with the constraints you have declared to be true for the model of "text" you have adopted for a particular document type -- rather than viewing the authoring environment (as with a word processor invoked on a text file) as presenting a blank slate for undifferentiated characters, or a formatted on-line help screen, or a database file, or a completed hypertext document. Your concern as an author will be the intellectual enterprise as such: the creation of structured information. By focusing upon the creation of information rather than worrying about how the text will subsequently be processed (a task that can be deferred, or handed to a style sheet), you will be free as an author to concentrate upon the task of thinking. It is the task of the supporting SGML software to ensure that the document contains properly-structured content, according to the appropriate markup scheme.
One of the advantages of SGML-encoded data is that your documents can be processed in a consistent manner. To impose a "house-style" on all your printed documents, it is necessary only to ensure that all your SGML documents go through the same translation process to map their contents into, say, a file with LaTeX commands, or into a word processor's style sheet.
You only need to write one translation process. If your house-style changes, you simply need to alter the translation process and pass all your old document files through the amended stylesheet version to give them the new look. You do not need to edit the documents themselves, because they never contained any formatting instructions in the first place!
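The single translation process described above can be sketched in Python as a table that maps element names to LaTeX output. This is only an illustration under stated assumptions: the two house-style tables, the element names, and the translate function are hypothetical, and the sketch does not escape LaTeX special characters or handle attributes.

```python
import re

# Two hypothetical house styles: each maps an element name to the LaTeX
# text emitted for its start tag and for its end tag.
HOUSE_STYLE_V1 = {
    "heading": ("\\section*{", "}\n"),
    "para": ("", "\n\n"),
    "emph": ("\\emph{", "}"),
}
# The amended style changes only the mapping, never the documents.
HOUSE_STYLE_V2 = dict(
    HOUSE_STYLE_V1,
    heading=("\\subsection*{\\sffamily ", "}\n"),
    emph=("\\textbf{", "}"),
)

# Matches simple start and end tags such as <para> or </para>.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)>")


def translate(source, style):
    """Replace every tag with the LaTeX text the given style assigns to it."""
    def replace(match):
        closing, name = match.group(1) == "/", match.group(2)
        start, end = style[name]  # raises KeyError for an unmapped element
        return end if closing else start
    return TAG.sub(replace, source)


source = "<heading>Why SGML?</heading><para>Markup is <emph>descriptive</emph>.</para>"

print(translate(source, HOUSE_STYLE_V1))
print(translate(source, HOUSE_STYLE_V2))
```

Passing the same source through the second table gives the new look, and the source string itself is identical before and after both runs.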
Finally: remember that you can use exactly the same source documents for your printed output as the source for your on-line (hypertext) documents, or for mapping information to or from your text database. Re-purposing and re-use of SGML-encoded information is the benefit of separating the specifications (and representations) for structure and content from those for processing.
Scientific and reference publishers are major users of SGML: these include Elsevier, Springer-Verlag, Kluwer, and Oxford University Press. Organizations with large scale information handling needs were early adopters of SGML: for example, the International Organization for Standardization (ISO), HMSO (Her Majesty's Stationery Office), the European Patent Office, the European Commission, and the US Department of Defense. The major suppliers of UNIX software and hardware have chosen SGML to deliver their next generation of documentation for publication on paper and as on-line man pages. The latest version of the Oxford English Dictionary is an SGML document available both on paper and as a CD-ROM.
Within academia, the Text Encoding Initiative (TEI) is a major international project that recommends SGML for the encoding and interchange of any electronic text intended for scholarly analysis. Detailed guidelines have been published by the TEI to assist scholarly research projects in their use of SGML encoding. The American Chemical Society and American Mathematical Society will be using SGML for all their electronic publishing needs. CERN and the publishing wing of the Institute of Physics have also adopted SGML.
SGML has been adopted widely within industry, serving the needs of large multinational corporations and specific industry sectors where standardized data representation is critical to information interchange. For example: the automotive and truck industries have adopted SAE J2008 and T2008, representing a suite of related standards that use SGML for data interchange; the Air Transport Association of America (ATA) has adopted a set of recommended specifications for aerospace industry documents; a Committee of the TCIF/IPI (Telecommunications Industry Forum/Information Products Interchange) has developed DTDs governing telecommunications document interchange; the semiconductor industry, led by a seven-member consortial group named Pinnacles, is now using SGML encoding for Electronic Component Information Exchange (ECIX).
SGML documents can be created using any authoring tool which can produce files that do not contain application-specific codes (e.g. can write plain ASCII or EBCDIC to a disk file). On the other hand, few users enjoy typing all the markup by hand, unaided by software that can guide the creation of properly-structured information. The advantage of an SGML-based authoring tool is that the software -- having read the rules for the designated document type -- can prompt the user for necessary information, or complain about missing information, or otherwise interpret the constraints of the document grammar in such a manner as to leave the author free from worrying about the required structure. Dedicated SGML software is available for virtually every platform and environment. High-quality commercial software products, as well as public-domain software tools, are available for all the major machine types and operating systems (DOS, Windows, MacOS, UNIX, VMS, etc.).
Many non-profit agencies, educational institutions, service providers, and consortial bodies can supply additional information about SGML. Most of these providers offer free online access via standard Internet protocols.
Some representative resources:
Contact Robin Cover with corrections and updates, or to submit contributions to the ISUG online document database.
