  • Python Squeezes the Web

    Author: 2007-08-25 14:34:27 From:

    Sometimes, when developing a web application, you want to acquire data from another source to put on your site. For example, portal sites like portaloo place the latest news headlines on their pages, and the news headlines are constantly updated without human intervention.

    I ran into a similar problem while designing the web site and database system for Memphis Scholastic Chess. The staff were spending 5-6 hours per week on the United States Chess Federation (USCF) web site, manually locating the latest ratings for 900 players: they had to navigate through around 75 individual pages and skim each list for players in our organization. I used Python to write a server-side program that runs automatically once a week and downloads and parses pages from the USCF's web site.

    Here's a sample set of data from the USCF's web site that we need to parse with Python:

    
    From http://www.64.com/cgi-bin/ratings.pl?nm=T&st=TN:
    12-97  373p        TN 03-97 <A HREF=/cgi-bin/ratings.pl/USCF/21005567>TARVER,NATHAN</A>
    12-94  401p        TN 02-95 <A HREF=/cgi-bin/ratings.pl/USCF/12613391>TASHIE,DAPHNE</A>
    10-99  385         TN 11-99 <A HREF=/cgi-bin/ratings.pl/USCF/12752592>TATE,JEREMY</A>           10-05  367
    
    From http://www.64.com/cgi-bin/ratings.pl?nm=P&st=MS:
    12-96 1167p        MS 04-97 <A HREF=/cgi-bin/ratings.pl/USCF/12660161>PATTERSON,RAPHAEL C</A>
    08-99 1452  1261   MS 09-00 <A HREF=/cgi-bin/ratings.pl/USCF/12499243>PATTILLO,BILLY R</A>      09-22 1476  1261
    12-94  960p        MS 04-95 <A HREF=/cgi-bin/ratings.pl/USCF/12619152>PATTON,SAM R</A>
    08-99  863         MS 04-00 <A HREF=/cgi-bin/ratings.pl/USCF/12739657>PAYNE,DANIEL</A>
    
    

    The logic for the program is quite simple:

    • Retrieve a list of the USCF ID #s of players in our organization
    • Retrieve a list of state/letter combinations that we need to fetch
    • Fetch each page
    • Go through the pages line by line. If a newer rating exists to the far right, use it; otherwise use the existing rating.
    • Write the changes to the database, where the PHP3 scripts on the web server will automatically pick up on the changes

    And the Python program that does the dirty work (in under 100 lines of code!):

    #! /usr/bin/python
    # A Python daemon that checks the uschess.org ratings against a
    # mysql database and updates them accordingly.
    
    import urllib
    import re
    import string
    import calendar
    import MySQL
    # called on a per-state basis
    def ProcessUSCFInfo(letter, state):
            "Processes the information from the USCF Web site and imports it into the Mysql database."
    
            print "Downloading data for state", state, "letter", letter
            uscf_data = urllib.urlopen("http://www.64.com/cgi-bin/ratings.pl?st=" +state + "&nm=" + letter).read()
    
            # the data that we want is inside the <pre> tag, but after the <b>-enclosed title
            # get a list of players
            beginMatch = beginDataRegexp.search(uscf_data)
            endMatch = endDataRegexp.search(uscf_data)
            uscf_data = uscf_data[beginMatch.end():endMatch.start()]
            uscf_player_lines = string.split(uscf_data, "\n")
    
            # parse the lines and fill up a list with USCFPlayer instances
            uscf_players = []
            playercount = 0
            for data_line in uscf_player_lines:
                    # Use a regexp to extract the needed information from a line of data
                    this_player = USCFPlayer()
                    regexp_match = perLineRegexp.match(data_line)
                    if regexp_match == None: continue
                    this_player.USCFRating, exp_month, exp_year, this_player.PlayerId, w_rating = regexp_match.groups()
    
                    # make sure that this player is in the database
                    if this_player.PlayerId not in player_id_list: continue
    
                    # the weekly updates
                    if(w_rating != None): this_player.USCFRating = w_rating
    
                    # handle Life memberships and the Y2K issue in the expiration dates
                    if exp_month == None:
                            this_player.ExpDate = "2099/12/31"
                    else:
                            if string.atoi(exp_year) > 70: exp_year = "19" + exp_year
                            else: exp_year = "20" + exp_year
    
                            # get the last day of the month
                            exp_day = (calendar.monthrange(string.atoi(exp_year), string.atoi(exp_month)))[1]
                            this_player.ExpDate = exp_year+"/"+exp_month+"/"+str(exp_day)
    
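                    # add the USCFPlayer to this page's list (the return value)
                    # and to the global list that is saved to the database
                    uscf_players.append(this_player)
                    USCFPlayers.append(this_player)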
                    playercount = playercount + 1
            print "Retrieved", playercount, "players from state", state, "letter", letter
            return uscf_players
    
    # used to hold data related to a USCF Player
    class USCFPlayer:
            pass
    
    # common regexps used by ProcessUSCFInfo
    beginDataRegexp = re.compile(r"<pre>\n<b>.*</b>", re.I | re.DOTALL)
    endDataRegexp = re.compile(r"</pre>")
    perLineRegexp =re.compile(r".{5}\s+(\d{3,4})p?\s+(?:\d{3,4}p?)?\s*\w{2}\s+(?:(?:(\d{2})-(\d{2}))|Life)\s+<A.*USCF/(\d{8}).*/A>(?:\s+.{5}\s+(\d{3,4})\s*(?:\d{3,4}p?)?)?")
    # global list of all players
    USCFPlayers = []
    
    # get a list with all of the valid playerids
    db_conn = MySQL.connect("host_name_here", "user_id", "pass_word")
    db_conn.selectdb("database_name")
    player_id_list_tmp = db_conn.do("SELECT PlayerId FROM Players")
    player_id_list = []
    
    # eliminate all of the singletons
    for pid_singleton in player_id_list_tmp:
            player_id_list.append(pid_singleton[0])
    
    # get a list of the state/letter combinations
    state_letter_list = db_conn.do("SELECT LEFT(LastName, 1) AS Letter, State, CONCAT(LEFT(LastName, 1), State) AS Sorter FROM Players GROUP BY Sorter")
    
    # iterate and process each state/letter combo
    for state_letter in state_letter_list:
            ProcessUSCFInfo(state_letter[0], state_letter[1])
    # dump the whole mess to the database
    print "Trying to save", len(USCFPlayers), "players to database...",
    updated_players = 0
    for uscf_player in USCFPlayers:
            db_conn.do("UPDATE Players SET USCFRating = " + uscf_player.USCFRating + ", ExpDate = '" + uscf_player.ExpDate  + "' WHERE PlayerId = '" + uscf_player.PlayerId + "'")
    
    print "done"
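
The expiration-date handling in the listing above is the easiest part to try in isolation. The following is a minimal sketch in current Python 3 syntax (the listing itself targets Python 1.5.x); the function name `expiration_date` is made up for illustration, while the pivot year of 70 and the "2099/12/31" placeholder for Life memberships are taken from the listing.

```python
import calendar

def expiration_date(exp_month, exp_year):
    """Return 'YYYY/MM/DD' for a two-digit month/year pair.

    A Life membership has no expiration fields (None), so it gets the
    same far-future placeholder date the listing uses.
    """
    if exp_month is None:
        return "2099/12/31"
    # Same pivot as the listing: two-digit years above 70 belong to 19xx.
    century = "19" if int(exp_year) > 70 else "20"
    full_year = century + exp_year
    # monthrange returns (weekday of the 1st, number of days in the month).
    last_day = calendar.monthrange(int(full_year), int(exp_month))[1]
    return "%s/%s/%d" % (full_year, exp_month, last_day)

print(expiration_date("03", "97"))   # 1997/03/31
print(expiration_date("09", "00"))   # 2000/09/30
print(expiration_date(None, None))   # 2099/12/31
```

Memberships run through the end of the printed month, which is why the day comes from `calendar.monthrange` rather than being hard-coded.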
    
    Python is an interpreted, object-oriented, high-level programming language with dynamic semantics. Its high-level built in data structures, combined with dynamic typing and dynamic binding, make it very attractive for Rapid Application Development, as well as for use as a scripting or glue language to connect existing components together. Python's simple, easy to learn syntax emphasizes readability and therefore reduces the cost of program maintenance. Python supports modules and packages, which encourages program modularity and code reuse. The Python interpreter and the extensive standard library are available in source or binary form without charge for all major platforms, and can be freely distributed.

    Often, programmers fall in love with Python because of the increased productivity it provides. Since there is no compilation step, the edit-test-debug cycle is incredibly fast. Debugging Python programs is easy: a bug or bad input will never cause a segmentation fault. Instead, when the interpreter discovers an error, it raises an exception. When the program doesn't catch the exception, the interpreter prints a stack trace. A source level debugger allows inspection of local and global variables, evaluation of arbitrary expressions, setting breakpoints, stepping through the code a line at a time, and so on. The debugger is written in Python itself, testifying to Python's introspective power. On the other hand, often the quickest way to debug a program is to add a few print statements to the source: the fast edit-test-debug cycle makes this simple approach very effective.

    What I Like about Python

    I've written programs in a number of different languages, including Visual Basic, C/C++, Perl, and PHP3. There are some things Python has that make it, in my opinion, substantially more flexible than other languages:

    • Powerful Datatypes and Operations -

      Python has built-in strings, tuples, lists, dictionaries, and more. Want your function to return two values? Return a tuple, an immutable list of values! Want to grab the elements at indices 4 and 5 of list MyList? Use slice notation to write: MyList[4:6] (the end index is not included)! This slice notation works on strings, too, so "Monty Python"[0:5] evaluates to "Monty". Lists can also grow dynamically. You can easily iterate over the elements of a list with the "for" command, and the "in" and "not in" operators let you take advantage of Python's built-in membership search (a simple scan for lists) instead of having to code your own. Very few languages have this type of functionality built in and available as a core part of the language. Most of the time, a special add-on library (such as STL) is required to get all these features.

    • Rapid Development with an Interactive Interpreter -

      Rather than go through the compile/test/run cycle of most traditional programming languages, or even the edit/run cycle of many scripting languages, Python has an incredibly useful interactive interpreter. During the development of the aforementioned application, I pasted a chunk of data into the interpreter, assigned it to a variable, and wrote the string parsing regular expression in about an hour. Whenever I'm curious about the built-in methods of a list, I pop into the interpreter and run dir([]). When I'm not sure exactly how some esoteric feature works, I define a test case and run it. I even build my applications bottom up, importing and testing critical functions in the interpreter before I write the top-level code that uses the functions.

    • Runs on Multiple Platforms and Has the Same Implementation on Multiple Platforms -

      I don't have to extol the virtues of a multi-platform language to you; you have a tremendous amount of flexibility in where you develop and deploy your applications. You can write Python on a Mac and upload it to a Unix server, start out with a Linux server and move up to a Sun Ultra, and so on. But, unlike some "cross-platform" languages like ANSI C++ (which I originally wrote the uscfratingd program in) and PHP3, it supports the same features everywhere because there is only one main implementation of Python in common use. I have written GUI applications with Python that run on Windows and Linux without changing one line of code (more on this in a following article). Can any other language (aside from Perl) claim this type of functionality?

    • Rich Core Library -

      Out of the box, on all platforms, Python programs can use sockets to speak any protocol or use predefined classes to speak HTTP, FTP, SMTP, POP, Telnet and a variety of other Internet protocols. Built-in classes are provided to permit your app to parse XML, HTML, and SGML. Regular expressions, a powerful feature that allows text parsing (look at the perLineRegexp variable in the program for a useful example), were borrowed from Perl and are present in Python on Windows, Mac and Linux. Python/Tk, a moderately powerful GUI framework, is available for Windows, Mac and Unix, and wxPython, a wrapper around the wxWindows C++ library, is available for Unix and Windows and is under development for BeOS and the Macintosh. Overall, Python provides a lot of features for free that might require costly third-party libraries in other languages.
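
The built-in data types described in the list above can be tried in a few lines. This is only an illustrative sketch in current Python 3 syntax; the function `split_name` and the sample values are made up for the example and do not come from the ratings program.

```python
def split_name(uscf_name):
    """Split 'LAST,FIRST' into two values, returned together as a tuple."""
    last, first = uscf_name.split(",", 1)
    return last, first

# Tuple unpacking receives both return values at once.
last_name, first_name = split_name("TATE,JEREMY")

ratings = [373, 401, 385, 1167, 1452, 960, 863]
middle = ratings[4:6]          # indices 4 and 5; the end index is excluded
word = "Monty Python"[0:5]     # slicing works on strings too

ratings.append(1476)           # lists grow on demand
has_960 = 960 in ratings       # membership test (a linear scan for a list)

print(last_name, first_name, middle, word, has_960)
```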

    The Example Explained

    I don't have enough space to provide a complete introduction to Python (check the Python Tutorial for that), but I'll try to explain things briefly as I go. If you've done some sort of programming before, you'll find that Python is extremely easy to learn and lets you do a lot with a small amount of code. To try the code in this article, install a copy of Python for your distribution of Linux. Debian 2.1 users should be able to just type "apt-get install python", and Python 1.5.x is included with RedHat 5.0 or higher and can be installed with glint. Using your favorite text editor (I like VIM), pull up a chair and follow along! Note, in order to run the example program exactly as written, you'll need to create a MySQL table called "Players" like this:

    
    CREATE TABLE Players (
            PlayerId char(8) NOT NULL PRIMARY KEY,
            LastName varchar(50) NOT NULL,
            FirstName varchar(50) NOT NULL,
            USCFRating mediumint NOT NULL,
            State char(2) NOT NULL,
            ExpDate date NOT NULL
    )
    
    And some sample data from the list above:
    
    INSERT INTO Players(PlayerId, LastName, FirstName, State) VALUES('21005567', 'Tarver', 'Nathan', 'TN');
    INSERT INTO Players(PlayerId, LastName, FirstName, State) VALUES('12613391', 'Tashie', 'Daphne', 'TN');
    INSERT INTO Players(PlayerId, LastName, FirstName, State) VALUES('12752592', 'Tate', 'Jeremy', 'TN');
    INSERT INTO Players(PlayerId, LastName, FirstName, State) VALUES('12660161','Patterson','Raphael','MS');
    INSERT INTO Players(PlayerId, LastName, FirstName, State) VALUES('12499243','Pattillo','Billy','MS');
    
    INSERT INTO Players(PlayerId, LastName, FirstName, State) VALUES('12619152','Patton','Sam','MS');
    INSERT INTO Players(PlayerId, LastName, FirstName, State) VALUES('12739657','Payne','Daniel','MS');
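
If you don't have a MySQL server handy, the table and the weekly UPDATE can be tried out with the sqlite3 module from the standard library of current Python versions. This is a stand-in and an assumption on my part, not what the article's program uses: the column types are simplified, the NOT NULL constraints on USCFRating and ExpDate are relaxed, and the values are passed through `?` placeholders instead of being concatenated into the SQL string.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE Players (
           PlayerId   TEXT PRIMARY KEY,
           LastName   TEXT NOT NULL,
           FirstName  TEXT NOT NULL,
           USCFRating INTEGER,
           State      TEXT NOT NULL,
           ExpDate    TEXT
       )"""
)
conn.executemany(
    "INSERT INTO Players (PlayerId, LastName, FirstName, State) VALUES (?, ?, ?, ?)",
    [
        ("21005567", "Tarver", "Nathan", "TN"),
        ("12752592", "Tate", "Jeremy", "TN"),
    ],
)

# Placeholders leave the quoting of the values to the database driver.
conn.execute(
    "UPDATE Players SET USCFRating = ?, ExpDate = ? WHERE PlayerId = ?",
    (367, "1999/11/30", "12752592"),
)

row = conn.execute(
    "SELECT USCFRating, ExpDate FROM Players WHERE PlayerId = ?",
    ("12752592",),
).fetchone()
print(row)   # (367, '1999/11/30')
```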
    

    Defining the Regular Expression

    In developing this program, I started by looking at my data. For a couple of hours, I dabbled in the interpreter like so:

    
    Python 1.5.2 (#0, Sep 13 1999, 09:12:57)  [GCC 2.95.1 19990816 (release)] on linux2
    Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
    >>> import urllib, re, string
    >>> beginDataRegexp =re.compile(r"<pre>\n<b>.*</b>", re.I | re.DOTALL)
    >>> endDataRegexp = re.compile(r"</pre>")
    >>> test_data = urllib.urlopen("http://www.64.com/cgi-bin/ratings.pl?nm=T&st=TN").read()
    >>> len(test_data)
    25286
    

    Thus far, I've pulled in some libraries that give me string, URL downloading, and regular expression functions. Then, I defined two regular expressions. Regular expressions are analogous to keys. Put simplistically, the regular expression is moved down the string one character at a time, similarly to trying a key on a hallway full of doors. When the regular expression matches, the door is opened. The first regular expression looks for the text "<pre>" followed by a newline, followed by a "<b>" pair of tags with something inside it. The second one looks for the end of the "<pre>" tag. After that, I downloaded some test data from the USCF web site to play with. One line of code, that's all it takes! The original C++ version of this program required 11 lines of code to emulate the functionality of just this one! It also had to be linked with the GNOME HTTP library, which wasn't present on my target FreeBSD system.
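
To see what the two expressions carve out without touching the network, here is a small sketch in current Python 3 syntax. The `page` string is a made-up miniature of a ratings page (the header text inside the `<b>` tags is invented); the two patterns are the ones from the session above.

```python
import re

begin_data = re.compile(r"<pre>\n<b>.*</b>", re.I | re.DOTALL)
end_data = re.compile(r"</pre>")

# Hypothetical stand-in for a downloaded page.
page = (
    "<html><body>\n"
    "<pre>\n"
    "<b>Exp.  Rating  St  Name</b>\n"
    "12-97  373p        TN 03-97 <A HREF=/cgi-bin/ratings.pl/USCF/21005567>TARVER,NATHAN</A>\n"
    "10-99  385         TN 11-99 <A HREF=/cgi-bin/ratings.pl/USCF/12752592>TATE,JEREMY</A>\n"
    "</pre>\n"
    "</body></html>\n"
)

begin_match = begin_data.search(page)
end_match = end_data.search(page)

# Keep only the text between the end of the title and the closing tag.
body = page[begin_match.end():end_match.start()]
lines = body.split("\n")

print(len(lines))   # 4
print(lines[1])     # the TARVER line
```

The first and last entries of `lines` are empty strings, because the slice begins and ends on a newline. That is why the first player shows up at index 1 rather than 0.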

    
    >>> beginMatch = beginDataRegexp.search(test_data)
    >>> endMatch = endDataRegexp.search(test_data)
    >>> test_data_lines = string.split(test_data[beginMatch.end():endMatch.start()], "\n")
    >>> len(test_data_lines)
    

    Here, I'm using the regular expressions that I defined earlier to search the string for matches. Then I use string slicing to grab the chunk of data between them, and split that chunk into a list containing the individual lines.

    
    >>> perLineRegexp = re.compile(r".{5}\s+(\d{3,4})p?")
    >>> test_data_lines[1]
    '12-96  523p        TN 10-96 <A HREF=/cgi-bin/ratings.pl/USCF/12659889>TABAKOFF,ADRIAN</A>'
    >>> perLineRegexp.search(test_data_lines[1]).groups()
    ('523',)
    

    Note that I've put \d{3,4} in parentheses. This represents an extremely powerful aspect of regular expressions: grouping! The regular expression parser will save any values that it finds in parentheses, and they can be accessed via the groups method as shown above. The final part of this regular expression is "p?". The USCF uses "p" after a rating to denote it as "provisional", meaning that the player has not yet played 20 games. For our purposes it is not needed, so "p?" tells the regular expression parser "if there is a p there, ignore it and move on."
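
The difference between what the pattern matches and what the group captures is easy to check. A short sketch in current Python 3 syntax, using the same pattern and the same sample line; only the variable names are mine:

```python
import re

rating_re = re.compile(r".{5}\s+(\d{3,4})p?")
line = ("12-96  523p        TN 10-96 "
        "<A HREF=/cgi-bin/ratings.pl/USCF/12659889>TABAKOFF,ADRIAN</A>")

match = rating_re.search(line)
print(match.group(0))   # whole match, including the "p": 12-96  523p
print(match.groups())   # only the parenthesized part: ('523',)
```

The "p" is consumed by the match but sits outside the parentheses, so it never reaches the captured rating.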

    I continued this trial and error sequence, slowly expanding my regular expression until I arrived at something like this:

    >>> perLineRegexp = re.compile(r".{5}\s+(\d{3,4})p?\s+(?:\d{3,4}p?)?\s*\w{2}\s+(?:(?:(\d{2})-(\d{2}))|Life)\s+<A.*USCF/(\d{8}).*/A>(?:\s+.{5}\s+(\d{3,4})\s*(?:\d{3,4}p?)?)?")
    >>> perLineRegexp.match(test_data_lines[1]).groups()
    ('523', '10', '96', '12659889', None)
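
The fifth group is the weekly rating at the far right of a line, and it is None when the line has no newer rating. The sketch below, in current Python 3 syntax, wraps the finished pattern in a small helper; `parse_line` is a name I made up, and the sample lines are the ones shown at the top of the article.

```python
import re

per_line = re.compile(
    r".{5}\s+(\d{3,4})p?\s+(?:\d{3,4}p?)?\s*\w{2}\s+"
    r"(?:(?:(\d{2})-(\d{2}))|Life)\s+<A.*USCF/(\d{8}).*/A>"
    r"(?:\s+.{5}\s+(\d{3,4})\s*(?:\d{3,4}p?)?)?"
)

def parse_line(line):
    """Return (player_id, rating, exp_month, exp_year), or None if the
    line is not a player line."""
    m = per_line.match(line)
    if m is None:
        return None
    rating, exp_month, exp_year, player_id, weekly = m.groups()
    if weekly is not None:
        # A newer rating at the far right wins over the published one.
        rating = weekly
    return player_id, int(rating), exp_month, exp_year

tate = parse_line(
    "10-99  385         TN 11-99 "
    "<A HREF=/cgi-bin/ratings.pl/USCF/12752592>TATE,JEREMY</A>           10-05  367"
)
tarver = parse_line(
    "12-97  373p        TN 03-97 "
    "<A HREF=/cgi-bin/ratings.pl/USCF/21005567>TARVER,NATHAN</A>"
)
print(tate)     # ('12752592', 367, '11', '99')
print(tarver)   # ('21005567', 373, '03', '97')
```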
    

    While regular expressions form a key part of this program, they didn't account for the fivefold performance increase I experienced when porting this program from C++ to Python! The secret lies in the statement: if this_player.PlayerId not in player_id_list: continue. Before, with C++, I was stuck between tough choices: suffer the performance penalty of hitting the database more times than needed (by issuing an UPDATE statement for each row) or take the time to implement a binary search algorithm. Being pressed for time, I chose the former and immediately regretted it. Luckily, Python came along and saved the day, yet again, with its rich, powerful data types built into the language itself.
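
The filtering step can be sketched on its own. On a list, `in` walks the elements one by one, which is plenty for a few hundred players; in a current Python you could also copy the IDs into a set for hash-based lookups. The set is my own variation and not part of the original program, and the ID "99999999" is a made-up player from outside the organization.

```python
# IDs that would come from "SELECT PlayerId FROM Players".
player_id_list = ["21005567", "12613391", "12752592"]
known_ids = set(player_id_list)

# (player_id, rating) pairs as parsed from a downloaded page.
parsed = [("21005567", 373), ("99999999", 1200), ("12752592", 367)]

# Keep only our own players, so one UPDATE is issued per member
# instead of one per row on the page.
ours = [(pid, rating) for pid, rating in parsed if pid in known_ids]
print(ours)   # [('21005567', 373), ('12752592', 367)]
```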

    Also, as another example of Python's flexibility, it can be used as the embedded scripting language for VIM. While writing this article, I grew tired of manually replacing > with &gt;, so I composed the following function and put it in ~/vim-htmlify.py:

     
    # a little macro for VIM that helps when composing HTML documents
    import vim, string, htmlentitydefs
    htmlequivs = {}
    # swap built-in table, we want a dictionary indexed by
    # characters that can't be used.
    for key, value in htmlentitydefs.entitydefs.items():
            if key != "amp":
                    htmlequivs[value] = key
    def htmlify():
            for i in range(0, len(vim.current.range)):
                    cLine = vim.current.range[i]
                    cLine = string.replace(cLine, "&", "&amp;")
                    for badchar in htmlequivs.keys():
                            cLine = string.replace(cLine, badchar, "&" + htmlequivs[badchar] + ";")
                    vim.current.range[i] = cLine
            print len(vim.current.range), "line(s) HTMLified"
    
    
    
    and put this in ~/.vimrc:
    
    pyfile ~/vim-htmlify.py
    map h :py htmlify()<CR>
    
    With the touch of "h", I could convert <b>Foo Bar</b> into &lt;b&gt;Foo Bar&lt;/b&gt;. This is yet another example of the power and ubiquity of Python!
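
Outside of VIM, the same kind of escaping can be tried with the `html` module from the standard library of current Python 3. This is a narrower stand-in for the macro rather than a port of it: with `quote=False`, `html.escape` only replaces &, < and >, whereas the macro walks the whole entity table. The function name `htmlify_lines` is made up for the example.

```python
import html

def htmlify_lines(lines):
    """Return copies of the lines with &, < and > replaced by entities."""
    return [html.escape(line, quote=False) for line in lines]

print(htmlify_lines(["<b>Foo Bar</b>", "AT&T"]))
# ['&lt;b&gt;Foo Bar&lt;/b&gt;', 'AT&amp;T']
```

As in the macro, the ampersand has to be handled before the other characters, otherwise the "&" of every freshly written entity would be escaped a second time; `html.escape` takes care of that ordering internally.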

    I hope I have given you a taste of how Python can be a very effective tool to parse information from the Internet. In this case, this simple program, written in 2 days with no prior knowledge of Python, has saved Memphis Scholastic Chess over 70 hours to date in the 3 months that it has been implemented. Python combines ease of use with the ability to run on multiple platforms and provides a rich library that makes the tasks that are simple in theory (downloading a web page, parsing an HTML file, showing a window on the screen, etc.) simple in practice. I strongly urge you to try out Python. Stay tuned for the next article, wherein I'll create a cross-platform GUI to run on top of our data parsing application!
