Markup Languages
As you saw in tutorial 1, XML is becoming an essential part of the corporate Digital Nervous System (DNS). Microsoft's focus is on using XML to accomplish three goals: creating messages in a standard format (using BizTalk), separating data and presentation when building Web pages (using Microsoft Internet Explorer 5), and calling methods through firewalls and between different platforms (using the Simple Object Access Protocol [SOAP]). In this tutorial, we will look at some of the reasons XML is better suited to accomplish these goals than other markup language options, such as Hypertext Markup Language (HTML) or Standard Generalized Markup Language (SGML).
A markup language uses special notation to mark the different sections of a document. In HTML documents, for example, angle brackets (<>) are used to mark the different sections of text. In other kinds of documents, you can have comma-delimited text, in which commas are used as special characters. You can even use binary code to mark up the text, as could be done in a Microsoft Office document. For every markup language, software developers can build an application to read documents written in that markup language. Web browsers will read HTML documents and Microsoft Office will read Office documents. Documents written in XML can be read by customized applications using various parsing objects, or they can be combined with Extensible Stylesheet Language (XSL) and presented in a Web browser.
Documents created using a markup language consist of markup characters and text. The markup characters define the way the text should be interpreted by an application reading the document. For example, in HTML <h1>Introduction</h1> contains the markup characters <h1> and </h1> and the text Introduction. When read by an application that reads HTML (say, a Web browser), the markup characters tell the application that the text Introduction should be displayed using the h1 (heading 1) font.
Thus, when you are using a markup language, you should consider the following three elements:
- The markup language, which defines the markup characters
- The markup document, which uses the markup language and consists of markup characters and text
- The interpreted document, which is a markup document that has been read and interpreted by an application
However, in XML the markup language itself is the only element that is predefined: the designer of an XML document defines the structure of the document and the markup characters. This feature makes XML flexible and allows the data in the interpreted document to be used for a wide variety of purposes. For example, the formatted data in an XML document could be parsed and then displayed to a user, placed in a database, or used by another application.
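As a rough illustration of that flexibility, the sketch below parses a small XML document and uses the same data twice: once as text for display and once as rows in a database. The element names (customers, customer, id, name) and the table layout are invented for this example and are not part of any standard. The code relies only on Python's standard xml.etree.ElementTree and sqlite3 modules.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document: the tag names are chosen by the author, not by XML.
DOCUMENT = """
<customers>
  <customer><id>001</id><name>John Smith</name></customer>
  <customer><id>002</id><name>Jane Doe</name></customer>
</customers>
"""

def read_customers(xml_text):
    """Return a list of (id, name) tuples found in the document."""
    root = ET.fromstring(xml_text)
    return [(c.findtext("id"), c.findtext("name")) for c in root.findall("customer")]

def as_display_lines(customers):
    """Use 1: format the data for a person to read."""
    return [f"{cid}: {name}" for cid, name in customers]

def store(customers):
    """Use 2: place the same data in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id TEXT PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO customer VALUES (?, ?)", customers)
    conn.commit()
    return conn

customers = read_customers(DOCUMENT)
```

Neither function needs to know how the other uses the data; both work from the structure that the document's author defined.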
This tutorial focuses on three markup languages: XML, HTML, and SGML. Let's begin with SGML, the parent language of both HTML and XML.
As mentioned, you can think of a Microsoft Office document as being built from a type of markup language. However, Microsoft Office documents can be read only by Microsoft Office or by an application that can convert a Microsoft Office document. Thus, Microsoft Office documents are not application-independent and can be shared only with people who have Microsoft Office or a converter. Because corporations need to share data with a large number of partners, customers, and different departments within the corporation, they need documents that are application-independent. SGML was designed to meet this need; it is a markup language that is completely independent of any application.
SGML uses a document type definition (DTD) to define the structure of the document. The DTD specifies the elements and attributes that can be used within the document and specifies what characters will be used to mark the text. In SGML, you can use brackets (<>), dashes (-), or any other character to mark up your document as long as the special character is properly defined in the DTD.
SGML has existed for more than a decade and is older than the Web. It is a metalanguage that was created to maintain repositories of structured documentation in an electronic format. As a metalanguage, SGML describes the document structures for other markup languages. SGML is used to define the markup characters and structure for XML. An SGML definition for HTML has also been created. Both HTML and XML can be considered applications of SGML.
SGML is an extremely versatile, powerful language. Unfortunately, these features come with a price: SGML is difficult to use. Training people to use SGML documents and creating applications that read SGML documents require a great deal of time and energy. Because of these difficulties, SGML is not suited for Web development. The specification for SGML is over 500 pages long, with over 100 pages of annexes. It is a very complex specification designed for large, complex systems, which makes it overkill for our three goals of standardized messages, separation of data and presentation, and method calling.
Nearly every computer user is familiar with HTML. HTML is a fairly simple language that has helped promote the wide usage of the Internet. HTML has come a long way since it was originally designed so that scientists could use hyperlinked text documents to share information. Let us begin by looking at HTML's original version.
In its original conception, HTML was supposed to include elements that could be used to mark information within the HTML document according to meaning. Tags such as <title>, <h1>, <h2>, and so on were created to represent the content of the HTML document.
How the marked text would actually be interpreted and displayed would depend on the Web browser's settings. Theoretically, any two browsers with the same user settings would present the same HTML document in the same way. This flexibility would enable users with special needs or specific preferences to customize their Web browsers to view HTML pages in their preferred format, an especially useful feature for people with impaired vision or people using older Web browsers.
In this scenario, the HTML developer uses tags defined by an HTML standard, and the browser displays them according to the user's preferences. This works only if every browser follows the same standard. The current Web standards can be found at http://www.w3.org.
HTML proved to be a great language for the initial development of the Internet. As the Internet matures, however, the need has developed for a language that can be used for more complex and large-scale purposes such as fulfilling corporate functions, and here HTML falls short. Let's look at some of the problems with HTML.
Conflicting standards
In 1994, Netscape created a set of HTML extensions that worked only in Netscape's Web browser. This was the beginning of the browser wars, and the first casualty was the HTML standard. Using these extensions, Netscape could now allow the author of the HTML document to specify font size, font and background color, and other features. Eventually, Netscape added frames. Of course, none of these extensions would display properly in any other browser. The HTML extensions were so popular that by 1996 Netscape was the number one browser.
Although Netscape won a major victory, Web developers and users suffered a major loss. In addition to the problem of handling nonstandard extensions, different browsers handle the standard tags in different ways. This means that Web designers now have to create different versions of the same HTML document for different Web browsers. The extensions force users to accept pages that are formatted according to the author's wishes.
NOTE
In most browsers, you can create default settings that will override the settings in the HTML pages. Unfortunately, most users do not know how to use these settings, and if you do set your own defaults, most pages will not display correctly.
Creating HTML documents that will appear approximately the same in all browsers is a difficult, and at times impossible, task. For information about this topic, see the Web Standards Project at http://www.webstandards.org.
NOTE
It is beyond the scope of this book to go into the details of HTML standardization, but the Web Standards Project site will provide you with the information and resources you need.
No international support
The Internet has created a global community and made the world a much smaller place. Corporations are expanding into this global marketplace and extending their partnerships and operations around the globe, linking everything through the Internet. A few proposals to create an international HTML standard have been put forward, but no standard has actually materialized. Early versions of HTML offered no way to identify the language in which a document is written; the lang attribute arrived only with HTML 4.0.
Inadequate linking system
When you create HTML documents, links are hard-coded into the document. If a link changes, the Web developer must search through all the HTML documents to find all references to the link and then update them. With Web sites that are dynamic and constantly evolving and growing to meet the needs of the users, this lack of a linking system can create substantial problems. We need a much more sophisticated method of linking documents than HTML can provide. HTML does not allow you to attach a link to an arbitrary element, nor does it allow a single link to point to multiple locations, whereas the linking system in XML does provide these features. In Chapter 6, you will learn more about XML's linking capability.
Faulty structure and data storage
HTML does have a structure, but this structure is not very rigid. For example, you can place heading 3 (<h3>) tags before heading 1 (<h1>) tags. Within the <body> tag, you can place any legitimate tag anywhere you want. You can validate HTML documents, but this validation only confirms that you have used the tags properly. Even worse, if you leave off end tags, the browser will try to figure out where the end tags should be and add them in. Thus, you can create HTML code that is not properly written but will still be interpreted properly by the browser.
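The difference in strictness can be seen with Python's standard library, used here only as a stand-in for a browser and an XML parser. The html.parser.HTMLParser class reports tags as it meets them and never complains about a missing end tag (unlike a browser, it does not add the tag back in), while xml.etree.ElementTree rejects markup that is not well formed. The TagLogger class and the sample markup are invented for this sketch.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Heading levels out of order and a <p> that is never closed.
SLOPPY = "<body><h3>Details</h3><h1>Title</h1><p>No end tag</body>"

class TagLogger(HTMLParser):
    """Record every start and end tag the lenient HTML parser reports."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

def html_events(markup):
    """Feed the markup to the HTML parser and return the tag events."""
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.events

def is_well_formed_xml(markup):
    """True if an XML parser accepts the markup, False if it rejects it."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

events = html_events(SLOPPY)
```

The HTML parser processes the sloppy markup without error, whereas is_well_formed_xml(SLOPPY) returns False because the p element is never closed.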
Another problem arises if you try to put data into an HTML document. You will find it very difficult to do so. For example, suppose we are trying to put information from a database into an HTML document. We have a database table named Customer with the following fields: customerID, customerName, and customerAddress. When we create an HTML document with this data, every customer should have a customerID and a customerName value. The customerAddress value is optional. We could present this data in HTML in a table, as follows:
<body>
  <table border="1" width="100%">
    <tr>
      <th width="33%">Name</th>
      <th width="33%">Address</th>
      <th width="34%">ID</th>
    </tr>
    <tr>
      <td width="33%">John Smith</td>
      <td width="33%">125 Main St. Anytown NY 10001</td>
      <td width="34%">001</td>
    </tr>
    <tr>
      <td width="33%">Jane Doe</td>
      <td width="33%">2 Main St. Anytown NY 10001</td>
      <td width="34%">002</td>
    </tr>
    <tr>
      <td width="33%">Mark Jones</td>
      <td width="33%">35 Main St. Anytown NY 10001</td>
      <td width="34%"></td>
    </tr>
  </table>
</body>
In a browser, this table would appear as shown in Figure 2-1.
Figure 2-1. Database table created using HTML.
This document is completely valid HTML code. There are no errors in the HTML code for the table; it is syntactically correct. Yet in terms of the validity of the data, the information is invalid. The third entry, Mark Jones, is missing an ID. Although it is possible to write applications that perform data validation on HTML documents, such applications are complex and inefficient. HTML was never designed for data validation.
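To give a sense of what such a hand-written check involves, here is a minimal sketch using Python's standard html.parser module. It hard-codes the assumption that every data row has three cells in the order name, address, ID; nothing in the HTML itself states this, which is exactly the weakness described above. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

TABLE = """
<table>
  <tr><th>Name</th><th>Address</th><th>ID</th></tr>
  <tr><td>John Smith</td><td>125 Main St. Anytown NY 10001</td><td>001</td></tr>
  <tr><td>Jane Doe</td><td>2 Main St. Anytown NY 10001</td><td>002</td></tr>
  <tr><td>Mark Jones</td><td>35 Main St. Anytown NY 10001</td><td></td></tr>
</table>
"""

class RowCollector(HTMLParser):
    """Collect the text of each <td> cell, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text of the cell being read, or None outside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._cell = ""

    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data

    def handle_endtag(self, tag):
        if tag == "td" and self._row is not None and self._cell is not None:
            self._row.append(self._cell.strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # the header row has only <th> cells, so it is skipped
                self.rows.append(self._row)
            self._row = None

def customers_missing_id(markup):
    """Return the names of customers whose ID cell is empty."""
    collector = RowCollector()
    collector.feed(markup)
    collector.close()
    # Assumption: the column order is name, address, ID.
    return [row[0] for row in collector.rows if len(row) == 3 and not row[2]]

missing = customers_missing_id(TABLE)
```

If the page author reorders the columns or adds a fourth one, this check silently stops meaning anything, because the rule lives in the program rather than in the document.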
HTML was also not designed to store data. The table is the most common way of both presenting and storing data in HTML. You can use <div> tags to create more complex structures to store data, but once again you are left with the task of writing your own data validation code.
What we need instead is something that enables us to put the data in a structured format that can be automatically validated for syntactical correctness and proper content structure. Ideally, the author of the document will want to define both the format of the document and the correct structure of the data. As you will see in Chapters 4 and 5, this is exactly what XML and DTDs do.
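As a preview, the customer data might be written in XML with a small DTD, as in the sketch below. The element names and the DTD are invented for illustration. Python's standard xml.etree.ElementTree checks only that a document is well formed and does not validate it against a DTD, so the invalid_customers function is a simplified, hand-written stand-in for the rule that a validating parser would enforce from the customer declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD: every customer needs an ID and a name; the address is optional.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE customers [
  <!ELEMENT customers (customer*)>
  <!ELEMENT customer (customerID, customerName, customerAddress?)>
  <!ELEMENT customerID (#PCDATA)>
  <!ELEMENT customerName (#PCDATA)>
  <!ELEMENT customerAddress (#PCDATA)>
]>
<customers>
  <customer>
    <customerID>001</customerID>
    <customerName>John Smith</customerName>
    <customerAddress>125 Main St. Anytown NY 10001</customerAddress>
  </customer>
  <customer>
    <customerName>Mark Jones</customerName>
  </customer>
</customers>
"""

# Mirrors the required children named in the DTD above (element order is not checked).
REQUIRED = ("customerID", "customerName")

def invalid_customers(xml_text):
    """Return the names of customers that lack a required child element."""
    root = ET.fromstring(xml_text)
    problems = []
    for customer in root.findall("customer"):
        if any(customer.find(tag) is None for tag in REQUIRED):
            problems.append(customer.findtext("customerName", default="(unnamed)"))
    return problems

problems = invalid_customers(DOCUMENT)
```

The important difference from the HTML table is that the rule about required fields travels with the document, in the DTD, instead of being buried in one application's code.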
In 1996, the World Wide Web Consortium (W3C) began to develop a new standard markup language that would be simpler to use than SGML but with a more rigid structure than HTML. The W3C established the XML Working Group (XWG) to begin the process of creating XML.
The goals of XML as given in the version 1.0 specification (http://www.w3.org/TR/WD-xml-lang#sec1.1) are listed here, followed by a description of how well these have been implemented in the current XML standard:
- XML shall be straightforwardly usable over the Internet. Currently, only minimal support for XML is provided in most Web browsers. Internet Explorer 4 and Netscape Navigator 4 both provide minimal support. Internet Explorer 5 provides additional support for XML, which will allow Web developers to use XSL pages to present XML content.
- XML shall support a wide variety of applications. With the introduction of BizTalk and SOAP, XML will be used in a wider range of applications. Other applications, such as Lotus Domino, also use XML. Many applications are now available for viewing and editing XML content and DTDs.
- XML shall be compatible with SGML. Many SGML applications and SGML standard message formats are currently in existence. By making XML compatible with SGML, many of these SGML applications can be reused. Although the conversion process can be complex, XML is compatible with SGML.
- It shall be easy to write programs that process XML documents. For XML to become widely accepted, the applications that process XML documents must be easy to build. If these applications are simple, it will be cost-effective to use XML. The current specification does meet this goal, especially when you use a parser such as the ones provided by Microsoft and IBM.
- The number of optional features in XML is to be kept to the absolute minimum, ideally zero. The more optional features, the more difficult it will be to use XML. The more complex a language, the more it costs to develop with it and the less likely anyone will be to use it. The XML standard has met this goal.
- XML documents should be human-legible and reasonably clear. Ideally, you should be able to open an XML document in any text editor and determine what the document contains. With a basic understanding of XML, you should be able to read an XML document.
- The XML design should be prepared quickly. It is essential that the standard be completed quickly so that XML can be used to solve current problems.
- The design of XML shall be formal and concise. It is essential that computer applications be able to read and parse XML. Making the language formal and concise will allow it to be easily interpreted by a computer application. XML can be expressed in Extended Backus-Naur Form (EBNF), which is a notation for describing the syntax of a language. EBNF in turn can be easily parsed by a computer software program. SGML cannot be expressed in EBNF. For more information about EBNF, refer to http://nwalsh.com/docs/articles/xml/toc.html#EBNF.
- XML documents shall be easy to create. Several XML editors are now available that make it easy to create XML documents; these editors will be discussed in Chapter 3. You can also create your own custom XML editor.
- Terseness in XML markup is of minimal importance. Making the XML markup extremely concise is less important than keeping the XML standard concise. The standard could define an entire set of acceptable shortcuts for shortening the markup (as SGML does), but doing so would make XML much more complex. XML has successfully avoided this.
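To illustrate the "formal and concise" goal in the list above: a production written in EBNF translates almost mechanically into code. The sketch below uses a simplified, ASCII-only version of a Name production. It is not the production as it appears in the specification, which also admits many non-ASCII characters; the regular expression and the function name are illustrative.

```python
import re

# Simplified, ASCII-only grammar (an assumption made for this sketch):
#   Name     ::= (Letter | '_' | ':') (NameChar)*
#   NameChar ::= Letter | Digit | '.' | '-' | '_' | ':'
NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9._:\-]*")

def is_name(text):
    """True if the whole string matches the simplified Name production."""
    return NAME.fullmatch(text) is not None
```

For example, is_name("customerID") is true, while is_name("1st") is false because a name may not begin with a digit.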
These goals are geared toward making XML the ideal medium for creating Web applications. As an added bonus, XML will also be perfect for creating standard messages and passing messages to call methods.
Four specifications define XML and specify how it will achieve these goals:
- The XML specification defines XML syntax. It is available at http://www.w3.org/TR/WD-xml-lang.
- The XLL specification defines the Extensible Linking Language. It is available at http://www.w3.org/TR/xlink.
- The XSL specification defines Extensible Style Sheets. It is available at http://www.w3.org/TR/NOTE-XSL.html.
- The XUA specification defines the XML User Agent. This specification will define an XML standard similar to SOAP; it has not yet been created.
The current XML specification is only 26 pages long, as opposed to several hundred pages for the SGML specification. XML is easy to use and, with BizTalk, can be used to create messages in a standardized format. XML allows you to separate content and presentation using XML documents and XSL pages. Using SOAP, you can package a request for a method on a remote server in an XML document, which can be used by a server to call the method. Thus, XML can fulfill the three basic goals perfectly.
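The method-calling idea can be sketched as follows. This is not the SOAP wire format: the element names (call, method, arg) and the add function are invented to show the principle of packaging a request as XML text on the client and unpacking it on the server.

```python
import xml.etree.ElementTree as ET

def build_request(method, *args):
    """Client side: package a method name and its arguments as XML text."""
    call = ET.Element("call")
    ET.SubElement(call, "method").text = method
    for value in args:
        ET.SubElement(call, "arg").text = str(value)
    return ET.tostring(call, encoding="unicode")

# Hypothetical server-side method that remote callers may invoke.
def add(a, b):
    return a + b

METHODS = {"add": add}

def handle_request(xml_text):
    """Server side: parse the XML, look up the method, and call it."""
    call = ET.fromstring(xml_text)
    name = call.findtext("method")
    if name not in METHODS:
        raise ValueError(f"unknown method: {name}")
    # This sketch only handles integer arguments.
    args = [int(arg.text) for arg in call.findall("arg")]
    return METHODS[name](*args)

request = build_request("add", 2, 3)
result = handle_request(request)
```

Because the request is plain text, it can pass through a firewall as ordinary Web traffic and be read on any platform that has an XML parser.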
The following features of XML make it well suited for the corporate DNS:
- XML is international. XML is based on Unicode. Unicode uses more bits for each character than ASCII does, which gives it room for the characters of many different writing systems. SGML and HTML are based on ASCII, which does not work well with many non-English languages.
- XML can be structured. Using DTDs, XML can be structured so that both the content and syntax can be easily validated. This enhanced structure will enable you to create standardized valid XML documents.
- XML documents can be built using composition. Using the more powerful linking methods of XML, documents can be created from a composite of other documents. This enhanced linking system will enable you to create customized documents by selecting only the pieces of other documents you need.
- XML can be a data container. XML is ideally suited to be a container for data. Using DTDs, you can efficiently represent almost any data so that it can be read by humans, computer parsers, and applications.
- XML offers flexibility. XML allows you either to omit the DTD, in which case the document need only be well formed, or to define the structure of your document to the smallest detail using a DTD. With a DTD, you can define the exact structure of your document so that both the structure of the data and the content can be easily validated.
- XML is easy to use. XML is only slightly more complicated than HTML. As more browsers support XML and more tools are available for working with XML, it is likely that more developers will take advantage of XML.
- XML has standard formats. Standard formats for XML documents can be easily produced.
With these advantages, XML can be used to cater to the more complex corporate needs.
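The international point in the list above can be checked with a short sketch. The names are arbitrary sample data and the element names are invented; the code serializes a document to UTF-8 bytes, as it might travel over a network, and parses it back with Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# Sample names in Greek, Japanese, and Russian (arbitrary test data).
NAMES = ["Ελένη", "田中", "Иван"]

def build_document(names):
    """Build a small XML document and serialize it to UTF-8 bytes."""
    root = ET.Element("customers")
    for name in names:
        ET.SubElement(root, "customerName").text = name
    return ET.tostring(root, encoding="utf-8")

def read_names(xml_bytes):
    """Parse the bytes back into a list of names."""
    root = ET.fromstring(xml_bytes)
    return [element.text for element in root.findall("customerName")]

payload = build_document(NAMES)
round_trip = read_names(payload)
```

The names survive the trip unchanged, with no special handling for any of the three scripts.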
HTML was well suited for the birth of the Internet, but the Internet has become a center for commerce and information and a central focus of business operations, and HTML is no longer capable of meeting its needs. The failure of Internet browsers to meet the HTML standards, the difficulty of validating HTML documents, a poor linking system, and a lack of international support have made HTML a poor choice for the future. SGML is an excellent, powerful tool capable of documenting complex systems, but unfortunately, SGML is far too complex for the current needs of the Internet.
XML is ideally suited for the next generation of Internet applications, for ecommerce, and for the corporate DNS. XML is a simpler, lighter markup language, which is flexible, is easy to use, and can be used for international documents. XML is ideal for storing data and sending messages, and XML documents can be validated.
At the time this book is being written, a large portion of the XML standard is complete, and it's likely to remain the same for some time. The XML 1.0 specification, defining the syntax of the XML language and XML DTDs, is well accepted and is not likely to change in the near future. Other elements of XML are still evolving, including schemas, which are similar to DTDs, and XML Path Language (XPath), which is a replacement for some of the current XML linking mechanisms. Over the next few years, XML will be refined to become an incredibly powerful tool that will create the next evolution of the Internet. This book will present both the current XML standard and a glimpse into the XML, and the XML applications, of the future.
