• home
  • forum
  • my
  • kt
  • download
  • Disk-based Data Structures (DBM)

    Author: 2007-08-10 09:39:38 From:

    We've discussed the interaction between Perl and relational databases, courtesy of the DBI module, in several installments of the Perl You Need to Know. But suppose you don't want to get tied up with a relational database and all the supporting software needed, yet you still need to store a dataset to disk for future use. This month we'll look at several Perl DBM, or database management, solutions which have much less overheard than DBI and are a quick way to store and use Perl data structures to and from disk.

    One of the most oft-used data structures in Perl is the venerable hash. Simply put, in a hash structure each item is referenced by a "key" which is associated with a "value". Also known as a key/value pair, a hash item is elementary Perl:

    my %car = ( 'make'  => 'Nissan',
                'model' => 'Maxima',
                'year'  => '1997',
                'color' => 'evergreen'
              );
    

    Here we have a typical hash, named %car, with four keys: make, model, year, and color. Each key has an associated value, such as 1997 or evergreen. It's not a big leap of faith to see this hash as a very simple database. In fact, we can see the hash %car as a database table, and each key as a field name. Each value is the value for each field.

    There are two reasons we might want to save this hash to disk. For one, we might want to remember its contents to consult or even modify on future invocations of a script. This is a form of state preservation. Second, the hash might be very large -- suppose it has 1,000 or 40,000 keys -- and working with the whole hash in memory could be an inefficient use of resources.

    Using Perl's dbmopen function, we can "tie" this hash to a disk file, rather than keep it in memory. It's important to note that you need to tie the hash to disk before you begin using it -- a pre-populated hash won't simply save its existing data to disk once it becomes tied. dbmopen typically accepts three parameters:

    dbmopen (%hashName, "filename", $permissions_mode)
    

    Imagine then that we want to create the %car hash in a disk-based file:

    my %car=();
    dbmopen (%car, "car_data", 0666) ||
     die "Could not open or create database!";
    %car = ( 'make'  => 'Nissan',
             'model' => 'Maxima',
             'year'  => '1997',
             'color' => 'evergreen'
            );
    dbmclose (%car);
    

    First we initialize the hash so that it is empty, and then the hash is tied to a disk file. Using the dbmopen function we create a disk file named car_data. Note that on Unix systems the resulting file may automatically grow a ".db" extension onto the filename, while on Windows it may not. The permissions mode is given as a permissions mask applied to the database file only if it is created from scratch -- exactly how this number behaves will depend on the operating system.

    This particular example illustrates only creating a new hash and saving it to disk. In fact, the real utility of a DBM file is that we can use the hash while it resides on disk, just as if it resided in memory.

    my %car=();
    dbmopen (%car, "car_data", 0666) ||
     die "Could not open or create database!";
    print $car{model}."\n";
    dbmclose (%car);
    

    The example above assumes that the %car hash already contains data on disk, as if we had run the previous example first. Simply, this script grabs a key value from the hash and prints it to output -- the result in this case would be 'Maxima'. Once you tie a hash to the DBM file using dbmopen, you simply proceed to use that hash just as any hash, despite the fact that it resides on the disk. If you add new data to the hash, it will automatically be stored to the disk.

    When Perl stores your hash to disk using dbmopen, it is actually relying on a particular DBM system, known as either ODBM or NDBM. The latter, or the "old DBM", was the original DBM system distributed with Perl. Newer implementations of Perl use NDBM or -- wait for it -- the "new DBM". The old DBM, for example, was slower. The new DBM is faster. Also, the old DBM had a very limited storage capacity -- less than 1K for each value. The new DBM supports storage capacities ranging from 1 to 4K depending on the Perl distribution.

    Yet, ODBM and NDBM aren't the only DBM's out there. You can actually use any available DBM to tie your hash to disk. Other DBM's may offer advantages in speed, or storage capacity, or other features that we'll soon see. Remember that DBM's are independent of and incompatible with each other, and if you tie a hash to one type of DBM, you can't simply use that hash with another DBM system.

    Many Perl distributions come with additional DBM's, or you can find them at CPAN, the Perl module repository. In particular, four alternative DBM's are commonly used with Perl.

    SDBM is the "simple DBM", and is optimized for speed with a small storage capacity per value. Worth considering for large hashes with small values.

    GDBM is the Gnu Free Software Foundation implementation of DBM. It offers several advances over the other DBM's, including compatibility with GDBM databases used on many platforms. GDBM is also very fast, and features some synchronization support for shared DBM's which may be accessed by multiple users simultaneously.

    DB_File is the module name for BerkeleyDB version 1. This is an even more sophisticated DBM which is fast, supports large values, and offers several different storage architectures: normal hashes, BTrees, and disk-based arrays. We'll take a closer look at DB_File shortly.

    BerkeleyDB is the DBM module for BerkeleyDB versions 2 and 3. As you can imagine, these DBM's are the most advanced, and are evolutions from version 1 of this software.

    On several occassions we've said that you can "tie" a hash to a disk-based file. In fact, to "tie" a hash means something specific in Perl. To use any of the DBM's we've talked about, one must first tie a hash to the DBM manager. So far we've done this using dbmopen, which itself actually calls the tie function. But that only works for the standard ODBM/NDBM database -- if we want to use any of the fistful of alternative DBM's, we must learn to tie on our own. No more clip-ons.

    Suppose SDBM is attractive for a particular project, because we have a large hash with small values -- which plays right into SDBM's hand. Tie-ing a hash to any DBM follows a standard template that we can illustrate using SDBM.

    use SDBM;
    use Fcntl;
    
    my %car=();
    tie (%car, "SDBM", "car_data", O_CREAT|O_RDWR, 0666) ||
     die "Could not open or create database.";
    %car = ( 'make'  => 'Nissan',
             'model' => 'Maxima',
             'year'  => '1997',
             'color' => 'evergreen'
            );
    untie (%car);
    

    Two use statements begin this code, as we include the SDBM module and the Fcntl module for filesystem control. Some Perl distributions may not include SDBM, or other alternative DBM's, so this or other DBM modules may need to be installed separately. The Fcntl module gives us some keywords that we'll use with the tie function.

    The %car hash is initialized and then the tie call is made. The syntax of tie should certainly recall dbmopen. Following the hash name, we specify the DBM module being used ("SDBM") and the filename for the disk-based file ("car_data"). Next we list some keywords, courtesy of Fcntl, that specify the mode for this file. Specifically, O_CREAT will create the file anew if it doesn't yet exist, while O_RDWR will allow reading or writing to and from the file.

    If, instead, we know that the database file exists and we want read-only access -- no modification will be made to the hash -- we can leave off the O_CREATE keyword and instead write:

    tie (%car, "SDBM", "car_data", O_RDONLY, 0)
    

    By and large, the fact that we're tieing to an SDBM database rather than, say, ODBM or GDBM is relatively transparent to us. Once the hash is tied, we proceed as usual in handling the hash. The exception to this is when the DBM provides features beyond the basic Perl hash. Let's consider two examples of this: DB_File's BTree hash, and MLDBM's support for multi-level hash structures.

    A BTree does not grow bees. Thankfully. We mentioned earlier the DB_File module, which is another DBM that supports the BerkeleyDB version 1. Using the same syntax seen above for tieing a hash to SDBM, we could use DB_File in its normal hash mode. But DB_File also supports another data architecture for your hash -- the BTree.

    A crucial difference between hashes and lists in Perl is that lists are ordered. You can predict the fourth item in a list, because it is always the fourth item (unless you move it or change the list). But the keys of a hash are not ordered -- you can't predict the fourth key in a hash. You can only reference hash values by their associated key, regardless of order to other keys.

    Enter the BTree -- an ordered hash! When you tie a hash to a DB_File BTree, the hash not only preserves keys and values, but order as well. By default, when you add a new key to such a tied hash, it is placed in lexical (alphabetical) order relative to the other keys. You can define your own alternative sorting routing for the order of the BTree, but we won't get into that here. The DB_File documentation does, though.

    So, let's create our %car hash in a DB_File Btree:

    use DB_File;
    use Fcntl;
    
    my %car=(); tie (%car, "DB_File", "car_data", O_CREAT|O_RDWR, 0666, $DB_File::DB_BTREE) ||
    die "Could not open or create database."; %car = ( 'make' => 'Nissan', 'model' => 'Maxima', 'year' => '1997', 'color' => 'evergreen' ); untie (%car);

    What does it mean now that %car is stored as a BTree? For one thing, BTrees are very fast to search, so when you ask for $car{year} DB_File will be able to pull that value very quickly. This is especially useful when your hash is very large, with thousands of keys. Not quite as dramatic when there are only four keys!

    Beyond that, this hash has order. Specifically, the key color is going to be the first key in the hash, followed by make, model, and year -- because that's the lexical order of these keys. DB_File maintains a "cursor" at a given position in the hash -- this is a hypothetical pointer that says "this is where I am in the hash". Much like a cursor in a text document. You may never need to use this cursor, but doing so allows for some funky actions beyond the operations that normal Perl hashes are restricted to.

    For example, let's say we want to find the first key that begins with the letter 'm', and the next key after that. First, the code, assuming the previous example has already run and the contents of %hash are already stored on disk as a BTree:

    use DB_File;
    use Fcntl;
    
    my %car=(); my $dbm = tie (%car, "DB_File", "car_data", O_RDONLY, 0, $DB_File::DB_BTREE) ||
    die "Could not open or create database."; #Set cursor to first key that matches my $matchKey="m"; my $matchValue=0; my @matchedKeys=(); $dbm->seq ($matchKey, $matchValue, R_CURSOR); push (@matchedKeys, $matchKey);
    #Move cursor to next key
    $dbm->seq ($matchKey, $matchValue, R_NEXT); #Release tied hash undef $dbm; untie (%car);

    Several new concepts are introduced in this example. Because we're dealing with functions that normal Perl hashes don't support, we need to rely on functions provided by DB_File itself. When the hash is tied, we assign it to an object handle $dbm, through which we'll call DB_File methods.

    One DB_File method is illustrated here, seq(). Provide seq() with a scalar variable for the key and the value, along with a keyword that tells the method what to do. Once "it" is done, the current key and value will be assigned to the scalar variables provided.

    In the first call to seq(), we specify the keyword R_CURSOR. This tells DB_File to set the cursor at the first key that matches $matchKey. Since we only provided a key of "m", DB_File will set the cursor at the key "make", the first and closest match. Next, the key "make" and its value, "Nissan", will be assigned to $matchKey and $matchValue respectively.

    With the cursor in position, we call seq() again, this time with the R_NEXT keyword. Consequently, DB_File moves the cursor to the next key in order ("model") and fills the two variables with the relevant data ("model" and "Maxima"). Of course, we could have placed the seq( ... R_NEXT) calls inside a loop to traverse many keys of the hash.

    Once complete, we must set the object handle $dbm to undef along with untieing the hash.

    All of the hashes we've seen today have been one-level. Put another way, each key is simply associated with a scalar value. This works great with all the DBM's we've seen for storing these hashes onto disk-based files.

    But the game changes when you want to store a multi-level hash. After all, Perl doesn't restrict key values to scalars -- the value of a key could be a list of scalars, or another hash, or a list of hashes, or a hash of lists, and so on ad infinitum. In fact, Perl hashes can be arbitrarily complex with many levels, and this is what makes Perl hashes especially powerful (and sometimes mind-numbingly confusing). None of the DBM approaches we've seen thus far can properly store such a multi-level hash, though.

    Fortunately, there is a solution to everything. Well, not everything, but at least to this problem.

    MLDBM is actually a module which sits on top of one of the other DBM's we've seen today, and lets you store multi-level hashes transparently, as if they were single-level hashes. The key to this, not to steal any of MLDBM's thunder, is rather simple and it's called serialization.

    A serializer is an algorithm which essentially "flattens" a multi-level data structure into a single scalar value. Of course, like a compressed text file, you can't actually use the data structure in its serialized state, but it is a way to make such complex data portable. There are several serializers available for Perl, especially Data::Dumper and Storable. MLDBM essentially lets you hook up one of these serializers with one of your preferred DBM's, transparently so that the serializing and de-serializing happens without your intervention.

    By default, MLDBM uses the SDBM database with the Data::Dumper serializer. But I prefer Storable, because it is faster. I also prefer DB_File for a DBM, and since we've seen that in use, let's illustrate using MLDBM with DB_File and Storable to tie a multi-level hash to disk. That's quite a mouthful.

    use MLDBM qw(DB_File Storable);
    use Fcntl;
    
    my %car=(); tie (%car, "MLDBM", "car_data", O_CREAT|O_RDWR, 0666, $DB_File::DB_BTREE) ||
    die "Could not open or create database."; $car{'JN1HU11P1HX875232'} =
    { 'make' => 'Nissan', 'model' => 'Maxima', 'year' => '1997', 'color' => 'evergreen' };
    $car{'1GNDM15Z2HB187252 '} =
    { 'make' => 'Chevrolet', 'model' => 'Astro', 'year' => '1999', 'color' => 'black' };
    print "Inventory contains: \n"; foreach (keys %car) {
    print $_ . ":\t".$car{$_}{make}." ".$car{$_}{model}."\n";
    } untie (%car);

    When we use MLDBM we express our preference for the DB_File DBM and the Storable serializer. From there, we tie the hash and work with it rather normally. In this case, we again create a BTree hash. Our %car hash is a two-level hash, because the values of the top-level keys are themselves hashes. Each top-level key is a vehicle identification number, or VIN, representing a fictional inventory. The value for each VIN is a hash describing the car. A short output loop near the end of this script illustrates how we can dig into the hash levels.

    There's nothing special about the code for managing this hash, and that's the point. Thanks to MLDBM, we deal with a multi-level hash just like any other, and the fact that it is being stored to disk, serialized, and de-serialized is entirely transparent.

    Storing hashes on disk is a quick and convenient way to keep a simple database, whose data is already in a format native to Perl (hashes). While DBM databases are not as sophisticated as relational databases accessed through DBI, they can be a surprisingly flexible solution for state preservation and storage of large quanities of simple data without the overhead and software managment issues that relational databases inevitable incur. Whereas DBI is the SUV of data management, consider DBM the nimble Miata.

    One of the most popular events of the summer Olympics is the men's 100 meter sprint, an event that has appeared at every Olympics since the original at Athens. The competitors are often fiery tempered, and the victors declared among the fastest men on Earth. The millions who watch the 100 meter are captured, ultimately, by some 10 seconds worth of action. Actually, 9.79 seconds, considering Maurcie Greene's current world record. But what would Mr. Greene say if we told him that a Perl subroutine can split and sort a sentence into an alphabetically ordered list of letters 98,253 times in that same 9.79 seconds? What do you have to say for yourself now, Mr. Fast Pants? Well, he'd most likely pummel our soft and fleshy behinds, and those of anyone who knows this much about Perl. But it's still true. And we know it's true because of the Benchmark module ¡ª our handy Perl stopwatch with which we can time, optimize, and slim down on code. I think Mr. Greene would appreciate that. After the pummeling.

    Quality is Job Number One

    It's been said that there are a thousand ways to solve any single goal in Perl. Indeed, Perl is a very flexible language, liberating our quirky brains to find solutions peculiar to our individual personalities, although a thousand may be on the optimistic side. But let's not be fooled into complacency by the tolerance and open-minded nature of the cult of Perl -- not all solutions are created equal, especially when it comes to execution time.

    Interpreted languages like Perl are often looked down upon by speed demons, who prefer the endorphine highs of compiled languages or the extremists spending their days inhaling assembly language and machine code. Despite their sneers, it's quite easy to conjure up several Perl subroutines all of which solve the same problem, and do so in wildly varying execution times. This may not matter for the script that is only run once, or on special occasions -- but with Perl backing many Web servers, it's entirely possible for Perl scripts to be executed millions of times a day. Like picking pennies from the floor, it all adds up over time.

    Perl's Benchmark module is an extremely useful tool for measuring the speed of your scripts, whether in their entirety, or down to the level of particular subroutines, or single lines of code. While some may want to comb over a script to optimize the speed of every expression, benchmarking is also incredibly helpful in finding significant bottlenecks in a script ¡ª segments of code that eat up the majority of processing time. Often, focusing on optimizing one or two main bottlenecks can improve an entire script's execution time dramatically.

    discuss this topic to forum

    relation tutorial

    No relevant information

    New

    Hot