XML and Databases

XML and databases are a natural fit for each other in three important ways. First, XML documents provide a platform-neutral mechanism for transporting data between databases and applications. Second, databases provide an efficient way to store and query XML documents. And third, the loosely-structured hierarchy of XML documents provides a useful way to store semi-structured or unstructured data, which does not easily fit into existing database models.

Before we start talking about XML and databases, we need to answer a question that occurs to many people: "Is XML a database?" An XML document is a database only in the strictest sense of the term. That is, it is a collection of data. A more useful question is whether XML and its surrounding technologies constitute a database management system (DBMS). The answer to this question is, "Not really." While XML provides many of the things found in databases (storage, schemas, query languages, and so on), it lacks many of the important features of DBMSs (efficient storage, indexes, security, transactions and data integrity, multi-user access, and so on). So while it might be possible to use XML as a database in environments with small amounts of data, few users, and modest performance requirements, this will fail in most production environments.

Most people's usage of XML and databases falls into two broad categories, based on whether they are using the database to store data or documents. For example, is XML used simply as a data transport between the database and an application? Or is its use integral, as in the case of XHTML and DocBook documents? This is usually a matter of intent, but it is important because all data-centric documents share a number of characteristics, as do all document-centric documents, and these influence how XML is stored in the database.

Data-centric documents are documents that use XML as a data transport.
They are designed for machine consumption and the fact that XML is used at all is usually superfluous. That is, it is not important to the application or the database that the data is, for some length of time, stored in an XML document. Examples of data-centric documents are sales orders, flight schedules, scientific data, and stock quotes.

Document-centric documents are (usually) documents that are designed for human consumption. Examples are books, email, advertisements, and almost any hand-written XHTML document. They are characterized by less regular or irregular structure, larger grained data (that is, the smallest independent unit of data might be at the level of an element with mixed content or the entire document itself), and lots of mixed content. The order in which sibling elements and PCDATA occur is almost always significant.

In practice, the distinction between data-centric and document-centric documents is not always clear. For example, an invoice (data-centric) might contain large-grained, irregularly structured data, such as a part description. And a user's manual (document-centric) might contain fine-grained, regularly structured data (often metadata), such as an author's name and a revision date. In spite of this, characterizing your documents as data-centric or document-centric will help you decide what kind of database to use.

As a general rule, data is stored in a traditional database, such as a relational, object-oriented, or hierarchical database. This can be done by third-party middleware or by capabilities built in to the database itself. In the latter case, the database is said to be XML-enabled. Documents are stored in a native XML database (a database designed especially for storing XML) or a content management system (an application designed to manage documents and built on top of a native XML database).

These rules are not absolute.
Data, especially semi-structured data, can be stored in native XML databases, and documents can be stored in traditional databases when few XML-specific features are needed. Furthermore, the boundaries between traditional databases and native XML databases are beginning to blur, as traditional databases add native XML capabilities and native XML databases support the storage of document fragments in external (usually relational) databases.

In order to transfer data between XML documents and a database, it is necessary to map the XML document schema (DTD, XML Schema, RELAX NG schema, etc.) to the database schema. The data transfer software is then built on top of this mapping. The software may use an XML query language (such as XPath, XQuery, or a proprietary language) or transfer data strictly according to the mapping (the XML equivalent of SELECT * FROM Table). Two mappings are common: the table-based mapping and the object-relational mapping.

The table-based mapping is used by many of the middleware products that transfer data between an XML document and a relational database. It models XML documents as a set of tables. That is, the structure of an XML document must mirror that of a table or set of tables: a root element that contains one element per table, each of which contains one element per row, each of which contains one child element per column. The table-based mapping is useful for serializing relational data, such as when transferring data between two relational databases. Its obvious drawback is that it cannot be used for any XML documents that do not match this format.

The object-relational mapping is used by all XML-enabled relational databases and some middleware products. It models the data in the XML document as a tree of objects that are specific to the data in the document. Element types with attributes and/or element content (complex element types) are modeled as classes. Attributes, and element types with PCDATA-only content (simple element types), are modeled as scalar properties. The model is then mapped to relational databases using traditional object-relational mapping techniques.
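
As a minimal sketch of the modeling step just described, the Python below builds a tree of data-specific objects from a small sales order document. The document layout, the class names SalesOrder and Item, and every field name are assumptions invented for illustration; they are not taken from any particular product.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List

# Hypothetical sales order document, invented for this illustration.
SALES_ORDER_XML = """
<SalesOrder number="1234">
  <OrderDate>2002-03-15</OrderDate>
  <Item number="1">
    <Part>A-10</Part>
    <Quantity>12</Quantity>
  </Item>
  <Item number="2">
    <Part>B-43</Part>
    <Quantity>600</Quantity>
  </Item>
</SalesOrder>
"""


@dataclass
class Item:
    # Complex element type -> class.
    number: int    # attribute -> scalar property
    part: str      # PCDATA-only child element -> scalar property
    quantity: int  # PCDATA-only child element -> scalar property


@dataclass
class SalesOrder:
    number: int
    order_date: str
    # Nested complex elements -> object-valued property.
    items: List[Item] = field(default_factory=list)


def load_sales_order(xml_text: str) -> SalesOrder:
    """Build the data-specific object tree from the XML text."""
    root = ET.fromstring(xml_text)
    order = SalesOrder(
        number=int(root.get("number")),
        order_date=root.findtext("OrderDate"),
    )
    for elem in root.findall("Item"):
        order.items.append(
            Item(
                number=int(elem.get("number")),
                part=elem.findtext("Part"),
                quantity=int(elem.findtext("Quantity")),
            )
        )
    return order


order = load_sales_order(SALES_ORDER_XML)
print(order)
```
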
That is, classes are mapped to tables, scalar properties are mapped to columns, and object-valued properties are mapped to primary key / foreign key pairs.

It is important to understand that the object model used in this mapping is not the Document Object Model (DOM). The DOM models the document itself and is the same for all XML documents, while the model described above models the data in the document and is different for each set of XML documents that conforms to a given XML schema. For example, a sales order document that contains a SalesOrder element with two nested Item elements could be modeled as a tree of objects from two classes, SalesOrder and Item: a single SalesOrder object that refers to two Item objects. Were a DOM tree built from the same document, it would instead be composed of generic objects such as Element, Attr, and Text.

Data can also be stored in a native XML database. The main reasons to do this are semi-structured data and retrieval speed. Semi-structured data has regular structure, but that structure varies enough that mapping it to a relational database results in either a large number of columns with null values (which wastes space) or a large number of tables (which is inefficient to retrieve). Such data can be stored in a native XML database in the form of an XML document.

Native XML databases can often retrieve data as XML much faster than relational databases. The reason for this is that native XML databases often store entire documents together physically or use physical (rather than logical) pointers between the parts of the document. This allows the documents to be retrieved either without joins or with physical joins, both of which are faster than the logical joins used by relational databases. Unfortunately, the increase in speed may apply only to queries along the document hierarchy. Queries outside this hierarchy may be slower than those in relational databases, although this probably depends on the database.
Such queries also require substantial indexing, which may slow update speed.

There are two basic strategies for storing XML documents in databases: store them as a BLOB in a relational database and accept limited XML functionality, or store them in a native XML database.

The main advantage to storing an XML document as a BLOB in a relational database is that you don't have to buy or learn new software. The database provides transactional control, security, multi-user access, and so on. In addition, many relational databases have tools for searching text and can do such things as full-text searches, proximity searches, synonym searches, and fuzzy searches. Some of these tools are being made XML-aware, which will eliminate the problems involved with searching XML documents as pure text.

If you need more features than are offered by a BLOB in a relational database, you should consider a native XML database. Native XML databases are databases designed especially to store XML documents. Like other databases, they support features like transactions, security, multi-user access, programmatic APIs, query languages, and so on. The only difference from other databases is that their internal model is based on XML and not something else, such as the relational model.

Native XML databases are most clearly useful for storing document-centric documents. This is because native XML databases preserve things like document order, processing instructions, comments, CDATA sections, and entity usage, while XML-enabled databases do not. Furthermore, native XML databases support XML query languages, allowing you to ask questions like, "Get me all documents in which the third paragraph after the start of the section contains a bold word." Such queries are clearly difficult to ask in a language like SQL.

Native XML databases are also useful for storing documents whose "natural format" is XML, regardless of what those documents contain.
For example, suppose XML documents are used for messaging in an e-commerce system. Although these documents are probably data-centric, their natural format as messages is XML. Thus, when they are stored in a message queue, it makes more sense to use a message queue built on a native XML database than a non-XML database. The native XML database offers XML-specific capabilities, such as XML query languages, and will usually be faster at retrieving whole messages. Another example of this sort of usage is a Web cache.

Several other uses for native XML databases are to store semi-structured data, to increase retrieval speed in certain situations, and to store documents that do not have DTDs (well-formed documents). The latter capability is useful for applications like Web search engines, where no single DTD or set of DTDs applies to all the documents in question.

Although no formal definition of native XML databases exists, one possible definition, developed by members of the XML:DB mailing list, is that a native XML database is one that:

  - Defines a (logical) model for an XML document, as opposed to the data in that document, and stores and retrieves documents according to that model. At a minimum, the model must include elements, attributes, PCDATA, and document order. Examples of such models are the XPath data model, the XML Infoset, and the models implied by the DOM and the events in SAX 1.0.

  - Has an XML document as its fundamental unit of (logical) storage, just as a relational database has a row in a table as its fundamental unit of (logical) storage.

  - Is not required to have any particular underlying physical storage model. For example, it can be built on a relational, hierarchical, or object-oriented database, or use a proprietary storage format such as indexed, compressed files.

Perhaps the easiest way to think of a native XML database is as one that maps the DOM directly to an object-oriented database or to Elements, Attributes, PCDATA, etc. tables in a relational database. (In fact, some native XML databases do just this, although the mappings involved are somewhat more complex to get better performance.)

In practice, most native XML databases are built on proprietary data stores. These may simply be compressed, indexed text or (more commonly) stores designed for a specific model of an XML document.

Copyright 2002 by Ronald Bourret. Used by permission.
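
The idea of mapping a document to Elements, Attributes, and PCDATA tables in a relational database can be sketched with Python's standard library. This is a deliberately naive illustration and not a description of how any particular native XML database works: the table names (elements, attributes, text_nodes), their columns, and the sample document are all assumptions invented here. The point it tries to show is that the stored model describes the document itself, including document order, rather than the data in it.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical schema: one table per node type, with an explicit position
# column so that document order survives storage.
SCHEMA = """
CREATE TABLE elements   (id INTEGER PRIMARY KEY, parent_id INTEGER,
                         position INTEGER, name TEXT);
CREATE TABLE attributes (element_id INTEGER, name TEXT, value TEXT);
CREATE TABLE text_nodes (element_id INTEGER, position INTEGER, value TEXT);
"""


def store(conn, elem, parent_id=None, position=0):
    """Store an element, its attributes, and its children; return its row id."""
    cur = conn.execute(
        "INSERT INTO elements (parent_id, position, name) VALUES (?, ?, ?)",
        (parent_id, position, elem.tag),
    )
    elem_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO attributes VALUES (?, ?, ?)",
        [(elem_id, name, value) for name, value in elem.attrib.items()],
    )
    slot = 0  # text nodes and child elements share one ordering
    if elem.text:
        conn.execute("INSERT INTO text_nodes VALUES (?, ?, ?)",
                     (elem_id, slot, elem.text))
        slot += 1
    for child in elem:
        store(conn, child, elem_id, slot)
        slot += 1
        if child.tail:  # text that follows the child inside this element
            conn.execute("INSERT INTO text_nodes VALUES (?, ?, ?)",
                         (elem_id, slot, child.tail))
            slot += 1
    return elem_id


def children_in_order(conn, elem_id):
    """Return (kind, value) pairs for an element's children in document order."""
    rows = conn.execute(
        """
        SELECT position, 'element' AS kind, name AS value
          FROM elements WHERE parent_id = ?
        UNION ALL
        SELECT position, 'text', value
          FROM text_nodes WHERE element_id = ?
        ORDER BY position
        """,
        (elem_id, elem_id),
    )
    return [(kind, value) for _, kind, value in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
doc = '<p id="intro">Native <b>XML</b> databases keep <i>order</i>.</p>'
root_id = store(conn, ET.fromstring(doc))
for kind, value in children_in_order(conn, root_id):
    print(kind, repr(value))
```

Note that ElementTree's default parser discards comments and processing instructions, so a store that aims to preserve them, as the article says native XML databases do, would need a richer parser and additional node tables.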