  • Introduces the Python xml_pickle object

    Author: 2007-08-25 14:44:24 From:

    In the first installment of his new 'XML Matters' column -- and as part of his ongoing quest to create a more seamless integration between XML and Python -- David Mertz presents the xml_pickle module. Mertz discusses the design goals and decisions that went into xml_pickle and provides a list of likely uses.

    What is XML? What is Python?
    XML is a simplified dialect of the Standard Generalized Markup Language (SGML). Many of you are familiar with SGML via HTML. Both XML and HTML documents are composed of text interspersed with, and structured by, markup tags in angle brackets. But XML encompasses many systems of tags that allow XML documents to be used for many purposes, including:

    • Magazine articles and user documentation
    • Files of structured data (like CSV or EDI files)
    • Messages for interprocess communication between programs
    • Architectural diagrams (like CAD formats)

    A set of tags can be created to capture any sort of structured information you might want to represent, which is why XML is growing in popularity as a common standard for representing diverse information.

    Python is a freely available, very high-level, interpreted language developed by Guido van Rossum. It combines clear syntax with powerful (but optional) object-oriented semantics. Python is available for a range of computer platforms and offers strong portability between platforms.

    Introduction to the project
    There are a number of techniques and tools for dealing with XML documents in Python. (The Resources section provides links to two developerWorks articles in which I discuss general techniques. It also provides links to other documents on XML/Python topics.) However, one thing that most existing XML/Python tools have in common is that they are much more XML-centric than Python-centric. Certain constructs and coding techniques feel "natural" in a given programming language, and others feel much more like they are imported from other domains. But in an ideal environment all constructs fit intuitively into their domain, and domains merge seamlessly. When they do, programmers can wax poetic rather than merely make it work.

    I've begun a research project of creating a more seamless and more natural integration between XML and Python. In this article, and subsequent articles in this column, I'll discuss some of the goals, decisions, and limitations of the project; and hopefully provide you with a set of useful modules and techniques that point to easier ways to meet programming goals. All tools created as part of the project will be released to the public domain.

    Python is a language with a flexible object system and a rich set of built-in types. The richness of Python is both an advantage and a disadvantage for the project. On one hand, having a wide range of native facilities in Python makes it easier to represent a wide range of XML structures. On the other hand, the range of native types and structures of Python makes for more cases to worry about in representing native Python objects in XML. As a result of these asymmetries between XML and Python, the project -- at least initially -- contains two separate modules: xml_pickle, for representing arbitrary Python objects in XML, and xml_objectify, for "native" representation of XML documents as Python objects. We'll address xml_pickle in this article.

    Part I: xml_pickle
    Python's standard pickle module already provides a simple and convenient method of serializing Python objects that is useful for persistent storage or transmission over a network. In some cases, however, it is desirable to perform serialization to a format with several properties not possessed by pickle. Namely, a format that:

    • Is human readable
    • May be parsed, manipulated, and its objects imported by languages other than Python
    • Supports validation of stored serialized objects

    xml_pickle provides each of these features while maintaining interface compatibility with pickle. However, xml_pickle is not a general purpose replacement for pickle since pickle retains several advantages of its own such as faster operation (especially via cPickle) and a far more compact object representation.
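
    For readers who have not used pickle, the following is a minimal sketch of the standard-library round trip whose interface xml_pickle mirrors. It uses only the pickle module; the `record` data is invented for illustration.

```python
import pickle

# Invented sample data built from ordinary Python types.
record = {"num": 37, "str": "Hello World", "lst": [1, 3.5, 2, 4 + 7j]}

# dumps() serializes to a compact byte string that is not human readable.
data = pickle.dumps(record)

# loads() rebuilds an equal but independent object from those bytes.
restored = pickle.loads(data)

print(type(data).__name__, restored == record)
```

    The byte string is compact and fast to produce, but it cannot be read, edited, or validated with general-purpose tools, which is the gap xml_pickle aims to fill.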

    Using xml_pickle
    Even though the interface of xml_pickle is mostly the same as that of pickle, it is worth illustrating the (quite simple) usage of xml_pickle for those who are not familiar with Python or pickle.

    Python code to demonstrate [xml_pickle]
    
    
    import xml_pickle  # import the module

    # declare some classes to hold some attributes
    class MyClass1: pass
    class MyClass2: pass

    # create a class instance, and add some basic data members to it
    o = MyClass1()
    o.num = 37
    o.str = "Hello World"
    o.lst = [1, 3.5, 2, 4+7j]

    # create an instance of a different class, add some members
    o2 = MyClass2()
    o2.tup = ("x", "y", "z")
    o2.num = 2+2j
    o2.dct = { "this": "that", "spam": "eggs", 3.14: "about PI" }

    # add the second instance to the first instance container
    o.obj = o2

    # print an XML representation of the container instance
    xml_string = xml_pickle.XML_Pickler(o).dumps()
    print(xml_string)
    

    Everything except the first line and the next-to-last line is generic Python for working with object instances. It might be a little contrived and a little simple, but essentially everything you do with instance data members (including nesting instances as container data, which is how most complex structures are built in Python) is contained in the example above. Python programmers only need to make one method call to encode their objects as XML.

    Of course, once you have "pickled" your objects, you'll want to restore them later (or use them elsewhere). Supposing the above few lines have already run, restoring the object representation is as simple as:

    new_object = xml_pickle.XML_Pickler().loads(xml_string)

    Obviously, in real cases you would want to do something more interesting with the created XML document than just hold it in memory during runtime. For example, you might save the XML document to disk (maybe using the XML_Pickler.dump() method), or transmit it over a communication channel. As written, the example only prints the document to standard output, although that output could itself be redirected to a file or a printer.

    Sample PyObjects.dtd document
    Running the sample code above will produce a pretty good example of the features of an xml_pickle representation of a Python object. But the following example is a hand-coded test case I've developed that has the advantage of containing every XML structure, tag, and attribute allowed in the document type. The specific data is invented, but it is not hard to imagine the application the data might belong to.

    <?xml version="1.0"?>
    <!DOCTYPE PyObject SYSTEM "PyObjects.dtd">
    <PyObject class="Automobile">
       <attr name="doors" type="numeric" value="4" />
       <attr name="make" type="string" value="Honda" />
       <attr name="tow_hitch" type="None" />
       <attr name="prev_owners" type="tuple">
          <item type="string" value="Jane Smith" />
          <item type="tuple">
             <item type="string" value="John Doe" />
             <item type="string" value="Betty Doe" />
          </item>
          <item type="string" value="Charles Ng" />
       </attr>
       <attr name="repairs" type="list">
          <item type="string" value="June 1, 1999:  Fixed radiator" />
          <item type="PyObject" class="Swindle">
             <attr name="date" type="string" value="July 1, 1999" />
             <attr name="swindler" type="string" value="Ed's Auto" />
             <attr name="purport" type="string" value="Fix A/C" />
          </item>
       </attr>
       <attr name="options" type="dict">
          <entry>
             <key type="string" value="Cup Holders" />
             <val type="numeric" value="4" />
          </entry>
          <entry>
             <key type="string" value="Custom Wheels" />
             <val type="string" value="Chrome Spoked" />
          </entry>
       </attr>
       <attr name="engine" type="PyObject" class="Engine">
          <attr name="cylinders" type="numeric" value="4" />
          <attr name="manufacturer" type="string" value="Ford" />
       </attr>
    </PyObject>

    Informally, the structure of a PyObjects.dtd XML document is not difficult to see from the sample. Where something is not immediately evident, the formal document type definition (DTD), available in Resources, will disambiguate it.

    Looking at the sample XML document, you can see that the three stated design goals of xml_pickle have been met:

    • The format is human readable
    • The XML representations may be manipulated by means other than xml_pickle -- whether they are unrelated Python/XML modules, XML libraries in other programming languages, XML-enhanced editors and utilities, or just simply text-editors (as was used in creation of the sample)
    • XML representations of Python objects may be validated using standard XML validators and PyObjects.dtd

    All documents that conform to the DTD and only documents that conform to the DTD will be representations of valid Python objects.
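
    To illustrate the second goal, here is a hedged sketch of reading such a document without xml_pickle, using only the standard xml.etree.ElementTree module. The functions `read_value` and `read_object` are hypothetical helpers written for this example, not part of xml_pickle; they return plain dictionaries rather than class instances, handle only integer and float numerics, and work on an abbreviated excerpt of the sample.

```python
import xml.etree.ElementTree as ET

# Abbreviated excerpt of the sample document above.
SAMPLE = """\
<PyObject class="Automobile">
   <attr name="doors" type="numeric" value="4" />
   <attr name="make" type="string" value="Honda" />
   <attr name="tow_hitch" type="None" />
   <attr name="prev_owners" type="tuple">
      <item type="string" value="Jane Smith" />
      <item type="string" value="Charles Ng" />
   </attr>
   <attr name="options" type="dict">
      <entry>
         <key type="string" value="Cup Holders" />
         <val type="numeric" value="4" />
      </entry>
   </attr>
   <attr name="engine" type="PyObject" class="Engine">
      <attr name="cylinders" type="numeric" value="4" />
   </attr>
</PyObject>
"""

def read_value(node):
    """Convert one <attr>, <item>, <key>, or <val> element to plain data."""
    kind = node.get("type")
    if kind == "None":
        return None
    if kind == "string":
        return node.get("value")
    if kind == "numeric":
        text = node.get("value")
        # Simplification: only ints and floats are handled in this sketch.
        try:
            return int(text)
        except ValueError:
            return float(text)
    if kind == "tuple":
        return tuple(read_value(item) for item in node.findall("item"))
    if kind == "list":
        return [read_value(item) for item in node.findall("item")]
    if kind == "dict":
        return {read_value(entry.find("key")): read_value(entry.find("val"))
                for entry in node.findall("entry")}
    if kind == "PyObject":
        return read_object(node)
    raise ValueError("unknown type: %r" % kind)

def read_object(node):
    """Represent a PyObject element as a dict of its named attributes."""
    result = {"__class__": node.get("class")}
    for attr in node.findall("attr"):  # direct children only
        result[attr.get("name")] = read_value(attr)
    return result

car = read_object(ET.fromstring(SAMPLE))
print(car["make"], car["prev_owners"], car["engine"]["cylinders"])
```

    A reader in another language would follow the same recursive shape: dispatch on the type attribute, and descend into subelements for the container types.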

    Design features, caveats and limitations

    Content model
    The content models of Python and XML are simply different in certain respects. One significant difference is that XML documents are inherently linear in form. Python object attributes -- and also Python dictionaries -- have no definitional order (although implementation details create arbitrary ordering, such as of hashed keys). In this respect, the Python object model is closer to the relational model; rows of a relational table have no "natural" sequence, and primary or secondary keys may or may not provide any meaningful ordering on a table. The keys are always orderable by comparison operators, but this order may be unrelated to the semantics of the keys.

    An XML document always lists its tag elements in a particular order. The order may not be significant to a particular application, but the XML document order is always present. The effect of the differing significance of key order in Python and XML is that the XML documents produced by xml_pickle are not guaranteed to maintain element order through "pickle"/"unpickle" cycles. For example, a hand-prepared PyObjects.dtd XML document, such as the one above, may be "unpickled" into a Python object. If the resultant object is then "pickled," the <attr> tags will most likely occur in a different order than in the original document. This is a feature, not a bug, but the fact should be understood.
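
    The point about ordering can be shown with a small sketch that uses only the standard library. Two documents that differ only in the order of their <attr> tags are different as text, but describe the same set of named attributes. The helper `attr_map` is a hypothetical name introduced for this example.

```python
import xml.etree.ElementTree as ET

DOC_A = """\
<PyObject class="Automobile">
   <attr name="doors" type="numeric" value="4" />
   <attr name="make" type="string" value="Honda" />
</PyObject>
"""

# Same attributes as DOC_A, listed in the opposite order.
DOC_B = """\
<PyObject class="Automobile">
   <attr name="make" type="string" value="Honda" />
   <attr name="doors" type="numeric" value="4" />
</PyObject>
"""

def attr_map(xml_text):
    """Map each attribute name to its (type, value) pair, ignoring order."""
    root = ET.fromstring(xml_text)
    return {a.get("name"): (a.get("type"), a.get("value"))
            for a in root.findall("attr")}

same_object = attr_map(DOC_A) == attr_map(DOC_B)
same_text = DOC_A == DOC_B
print(same_object, same_text)
```

    This prints `True False`: comparisons of pickled objects should therefore be made on the parsed structure, not on the raw XML text.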

    Limitations
    Several known limitations occur in xml_pickle as of the current version (0.2). One potentially serious flaw is that no effort is made to trap cyclical references in compound/container objects. If an object attribute refers back to the container object (or some recursive version of this), xml_pickle will exhaust the Python stack. Cyclical references are likely to indicate a flaw in object design to start with, but later versions of xml_pickle will certainly attempt to deal with them more intelligently.
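
    One way a caller could guard against this limitation is to check for cycles before pickling. The sketch below is an assumption about how such a check might look, not code from xml_pickle; `has_cycle` is a hypothetical helper. It tracks the identities of the containers on the current path, so an object that is merely shared by two attributes is not mistaken for a cycle.

```python
def has_cycle(obj, _active=None):
    """Return True if obj contains itself, directly or through nested containers."""
    if _active is None:
        _active = set()
    # Atomic values cannot hold a reference back to a container.
    if isinstance(obj, (str, bytes, int, float, complex, type(None))):
        return False
    if id(obj) in _active:
        return True  # obj is already on the path from the root: a cycle
    _active.add(id(obj))
    try:
        if isinstance(obj, dict):
            children = list(obj.keys()) + list(obj.values())
        elif isinstance(obj, (list, tuple)):
            children = list(obj)
        else:
            # Instances: look at their data members, if they have any.
            children = list(getattr(obj, "__dict__", {}).values())
        return any(has_cycle(child, _active) for child in children)
    finally:
        # Leaving this container: it is no longer on the current path.
        _active.discard(id(obj))
```

    Calling `has_cycle(o)` before `XML_Pickler(o).dumps()` would let a program fail with a clear message instead of exhausting the stack. The check itself is recursive, so extremely deep (but acyclic) structures could still hit the recursion limit.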

    Another limitation is that the namespace of XML attribute values (such as the "123" in <attr name="123">) is larger than the namespace of valid Python variables and instance members. Attributes created manually outside the Python namespace will have the odd status of existing in the .__dict__ magic attribute of an instance, but being inaccessible by normal attribute syntax (e.g. "obj.123" is a syntax error). This is only an issue where XML documents are created or modified by means other than xml_pickle itself. At this time, I simply haven't determined the best way of handling this (somewhat obscure) issue.
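
    The odd status described above is easy to reproduce in plain Python. In this sketch, `Record` is a hypothetical empty class standing in for an unpickled instance.

```python
class Record:
    """Hypothetical empty class standing in for an unpickled instance."""

obj = Record()

# A name that is not a Python identifier can still be stored on an instance.
setattr(obj, "123", "odd")  # same effect as obj.__dict__["123"] = "odd"

# It can be read back through __dict__ or getattr(), but never as obj.123.
via_dict = obj.__dict__["123"]
via_getattr = getattr(obj, "123")

# str.isidentifier() is one way a loader could detect such names up front.
print("123".isidentifier(), "doors".isidentifier(), via_dict, via_getattr)
```

    A loader could use a check like `isidentifier()` to warn about, rename, or reject such attributes, although which of those policies is best is exactly the open question.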

    A third limitation is that xml_pickle does not handle all attributes of Python objects. All the "usual" data members (strings, numbers, dictionaries, etc.) are "pickled" well. But instance methods, and class and function objects as attributes, are not handled. As with pickle, methods are simply ignored in "pickling." If class or function objects exist as attributes, an XMLPicklingError is raised. This is probably the correct ultimate behavior, but a final decision has not been made.
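
    The behavior described above can be approximated in a few lines. This is a sketch under stated assumptions, not the module's actual code: `UnsupportedAttributeError` is a hypothetical stand-in for XMLPicklingError, and `picklable_attributes` is an invented helper. Methods defined on the class never appear in the instance's own attribute dictionary, so they are skipped without any special handling.

```python
import inspect

class UnsupportedAttributeError(Exception):
    """Hypothetical stand-in for the XMLPicklingError mentioned above."""

def picklable_attributes(obj):
    """Return the instance's data members, rejecting class and function values."""
    members = {}
    # vars(obj) holds only per-instance data; methods live on the class.
    for name, value in vars(obj).items():
        if inspect.isclass(value) or inspect.isroutine(value):
            raise UnsupportedAttributeError(
                "attribute %r holds a class or function" % name)
        members[name] = value
    return members
```

    For an instance with a `honk()` method and a `doors` data member, only `doors` is returned; assigning a function or class to an instance attribute makes the call raise.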

    Design choices
    One genuine ambiguity in XML document design is the choice of when to use tag attributes and when to use subelements. Opinions on this design issue differ, and XML programmers often feel strongly about their conflicting views. This was probably the biggest issue in deciding the xml_pickle document structure.

    The general principle adopted was that a thing that is naturally "plural" should be represented by subelements. For example, a Python list can contain as many items as you like, and is therefore represented by a sequence of <item> subelements. On the other hand, a number is a singular thing (the value might be more than 1, but there is only one thing in it). In that case, it seemed much more logical to use an XML attribute called "value." The really difficult case was Python strings. In a basic way, they are sequence objects -- just like lists. But representing each character in a string using a hypothetical tag would destroy the goal of human readability, and make for enormous XML representations. The decision was made to put strings in the XML "value" attribute, just as with numbers. From an aesthetic point of view, this is probably less desirable than placing the string within a tag container, especially for multiline strings. But this decision seemed more consistent, since there was no other "naked" #PCDATA in the specification.
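
    The "plural versus singular" principle can be sketched with the standard xml.etree.ElementTree module. This is an illustration of the design rule, not xml_pickle's own code; `add_value` is a hypothetical helper that handles only lists, tuples, strings, and numbers.

```python
import xml.etree.ElementTree as ET

def add_value(parent, tag, value, name=None):
    """Append an element for value: plural data as children, singular as an attribute."""
    node = ET.SubElement(parent, tag)
    if name is not None:
        node.set("name", name)
    if isinstance(value, (list, tuple)):
        # Plural: one <item> subelement per member, nested as deeply as needed.
        node.set("type", "list" if isinstance(value, list) else "tuple")
        for member in value:
            add_value(node, "item", member)
    elif isinstance(value, str):
        # Strings are treated as singular and stored in the "value" attribute.
        node.set("type", "string")
        node.set("value", value)
    elif isinstance(value, (int, float, complex)):
        node.set("type", "numeric")
        node.set("value", repr(value))
    else:
        raise TypeError("not handled in this sketch: %r" % type(value))
    return node

root = ET.Element("PyObject", {"class": "Automobile"})
add_value(root, "attr", 4, name="doors")
add_value(root, "attr", ["radiator", "A/C"], name="repairs")
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

    The recursion in the container branch is what lets lists of tuples of strings, and so on, fall out of the same rule.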

    In part because strings are stored in XML "value" attributes -- but mostly to keep the XML document syntactically valid -- Python strings needed to be stored in a "safe" form. There are a few unsafe things that could occur in Python strings. The first type is the basic markup characters like greater-than and less-than. A second type is the quote and apostrophe characters that set off attributes. The third type is questionable ASCII values, such as a null character. One possibility considered was to encode whole Python strings in something like base64. This would make strings "safe," but also completely unreadable to humans. The decision was made to use a mixed approach. The basic XML characters are escaped in the style of "&amp;", "&gt;" or "&quot;". Questionable ASCII values are escaped in Python style, such as "\000". The combination makes for human-readable XML representations, but requires a somewhat mixed approach to decoding stored strings.
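
    A mixed scheme of this kind might look like the following sketch. It is an assumption about one workable encoding, not the exact rules xml_pickle uses: `safe_string` and `restore_string` are hypothetical names, control characters below 32 are written as three-digit octal escapes, and literal backslashes are doubled so that decoding stays unambiguous.

```python
import re
from xml.sax.saxutils import escape, unescape

QUOTES = {'"': "&quot;", "'": "&apos;"}
UNQUOTES = {"&quot;": '"', "&apos;": "'"}

def safe_string(text):
    """Encode text for storage in an XML attribute value."""
    # Double literal backslashes first (an assumption of this sketch).
    text = text.replace("\\", "\\\\")
    # Escape &, <, > and both quote characters as XML entities.
    text = escape(text, QUOTES)
    # Write control characters, including NUL, as Python-style octal escapes.
    return "".join("\\%03o" % ord(ch) if ord(ch) < 32 else ch for ch in text)

def restore_string(stored):
    """Reverse safe_string(): entities first, then backslash escapes."""
    text = unescape(stored, UNQUOTES)
    return re.sub(
        r"\\(\\|[0-7]{3})",
        lambda m: "\\" if m.group(1) == "\\" else chr(int(m.group(1), 8)),
        text,
    )

sample = 'a < b & "q"\x00\\n'
stored = safe_string(sample)
print(stored)
print(restore_string(stored) == sample)
```

    Escaping newlines and tabs along with the other control characters has a side benefit: an XML parser normalizes literal whitespace inside attribute values to spaces, so multiline strings would not otherwise survive a round trip.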

    Anticipated uses
    There are a number of things that xml_pickle is likely to be good for, and some user feedback has indicated that it has entered preliminary usage. Below are a few ideas.

    • XML representations of Python objects may be indexed and cataloged using existing XML-centric tools (not necessarily written in Python). This provides a ready means of indexing Python object databases (such as ZODB, PAOS, or simply shelve).
    • XML representations of Python objects could be restored as objects of other OOP languages, especially ones having a similar range of basic types. This is something that has yet to be done. Much "heavier" protocols like CORBA, XML-RPC, and SOAP have an overlapping purpose, but xml_pickle is pretty "lightweight" as an object transport specification.
    • Tools for printing and displaying XML documents can be used to provide convenient human-readable representations of Python objects via their XML intermediate form.
    • Python objects can be manually "debugged" via their XML representation using XML-specific editors, or simply text editors. Once hand-modified objects are "unpickled," the effects of the edits on program operation can be examined. This provides an additional option to other existing Python debuggers and wrappers.

    Please send me your feedback if you develop additional uses for xml_pickle or see enhancements that would open the module to additional uses.
