  • Writing a simple search engine in Python and MySQL

    Author: 2007-08-25 14:29:28 From:

    I think that a search engine for the contents of a website is a great way to improve it. There is no need to browse through all the pages for something you're looking for; instead, just type a keyword and let a script do all of the work. Anyway, I wrote a simple search engine for this site and I'll explain here how I did it. It is very basic, but easy to improve. For example, instead of indexing the local files of this site you can use the urllib package to index any other site on the internet!
    Try this link and see how it works on my web page.

    The basic things you'll need to do

    • setting up a database with all words and the occurrence of those words (I'll use a MySQL database for this example)
    • script that looks for words and their occurrences and inserts them into the database
    • html search form for the user
    • script that retrieves the user's search word from that database
    • page that displays the results with links to the relevant pages

    Setting up the database

    Create a table in your MySQL database and name it words, which is the table name the scripts below use. We'll need five columns:

    • search_id, INTEGER, NOT NULL, AUTO_INCREMENT, UNSIGNED
    • word, VARCHAR(50), NOT NULL
    • occurrence, INTEGER, NOT NULL, UNSIGNED
    • url, VARCHAR(200), NOT NULL
    • link, VARCHAR(200), NOT NULL


    and set search_id as the PRIMARY KEY
    If you feel uncomfortable setting up a MySQL database, fear not, there are lots of resources on the internet about this subject. Also check the tutorial on this website here. Be sure to put this database table on your webserver and pay attention to the security issues. More about this later...
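The column list above can be written down as a single CREATE TABLE statement. What follows is a minimal sketch, not the author's own setup: the table is called words because that is the name the scripts in this tutorial query, and create_words_table is a hypothetical helper that accepts any DB-API cursor (for example one obtained from MySQLdb).

```python
# MySQL DDL matching the five columns listed in the tutorial.
# The table name "words" is taken from the tutorial's scripts.
CREATE_WORDS_TABLE = """
CREATE TABLE IF NOT EXISTS words (
    search_id  INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
    word       VARCHAR(50)      NOT NULL,
    occurrence INTEGER UNSIGNED NOT NULL,
    url        VARCHAR(200)     NOT NULL,
    link       VARCHAR(200)     NOT NULL,
    PRIMARY KEY (search_id)
)
"""


def create_words_table(cursor):
    """Run the CREATE TABLE statement on a DB-API cursor (hypothetical helper)."""
    cursor.execute(CREATE_WORDS_TABLE)
```

With a MySQLdb connection this would be called as create_words_table(conn.cursor()).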

    The Python indexer

    The following script reads the content of a page and counts how often every word occurs. It was based on code I found here. If you are the one that wrote this code and think that I should not publish it here, please contact me.
    The Python code is:

    import re
    import string
    import MySQLdb

    # open database and make cursor
    conn = MySQLdb.connect(host = "localhost",
                           user = "root",
                           passwd = "***",
                           db = "***")
    cursor = conn.cursor()
    # first empty the table
    cursor.execute('''TRUNCATE TABLE words''')

    # separators: runs of whitespace and punctuation (punctuation is escaped
    # so that characters like ] and ^ are taken literally inside the brackets)
    separators = "[" + string.whitespace + re.escape(string.punctuation) + "]+"

    def makeIndex(myurl, link):
        href = link
        page = myurl
        # open local html file and read it before closing it
        f = open(page, "r")
        lines = f.readlines()
        f.close()

        # count the occurrences of every word
        words = { }
        for line in lines :
            line = line.strip()
            for word in re.split( separators , line ) :
                word = word.lower()
                # keep only tokens that consist entirely of the letters a-z
                if re.match( "^[" + string.ascii_lowercase + "]+$" , word ) :
                    if word in words :
                        words[ word ] += 1
                    else :
                        words[ word ] = 1
        sorted_word_list = sorted( words.keys() )

        # now populate the database
        for word in sorted_word_list :
            cursor.execute('''INSERT INTO words (word, occurrence, url, link) VALUES (%s,%s,%s,%s)''',(word,words[word],page,href))

    # index every page of your website
    makeIndex("yourpage1.html", "<a href=\"yourpage1\">My page 1</a>")
    makeIndex("yourpage2.html", "<a href=\"yourpage2\">My page 2</a>")
    # ...and so on for every page

    # commit the inserts (needed for transactional tables such as InnoDB)
    conn.commit()
    cursor.close()
    conn.close()

    After you have run this script (changing yourpage into your actual pages of course), the table you created earlier is populated with every word on your pages and its number of occurrences. You only need to run this script again once the content of your website has changed. Probably once or twice a day, but it really depends on your individual situation.
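For readers on Python 3, the counting step of the indexer can be sketched on its own, without the database. This is an illustration under assumptions rather than the original script: count_words and SEPARATORS are hypothetical names, and collections.Counter stands in for the hand-written dictionary.

```python
import re
import string
from collections import Counter

# Runs of whitespace and ASCII punctuation separate the words.
SEPARATORS = re.compile(r"[\s" + re.escape(string.punctuation) + "]+")


def count_words(text):
    """Return a Counter mapping each lower-case a-z word to its frequency."""
    counts = Counter()
    for token in SEPARATORS.split(text.lower()):
        # Like the indexer, skip empty strings and tokens with digits
        # or other characters outside a-z.
        if re.fullmatch("[a-z]+", token):
            counts[token] += 1
    return counts


if __name__ == "__main__":
    sample = "The cat, the hat! cat2 THE"
    for word, occurrence in sorted(count_words(sample).items()):
        print(word, occurrence)
```

Like the indexer, this does not strip markup, so the names of HTML tags and attributes in a page (for example href) are counted as words too.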

    The html search form

    Very often you will find a search box at the top of a page, on the left or right side. Put this HTML code somewhere on your page:

    <form action="cgi-bin/search.py">
    <input name="word" type="text" size="14">
    <input value="SerpiaSearch" type="submit">
    </form>

    For more information on how to create HTML forms look here. The key element of this form is the cgi-bin/search.py script; the next paragraph delves into this script.

    The actual search script

    The next script, conveniently called search.py, looks up the value of the form field named "word" in the database:

    import MySQLdb

    # `word` must already hold the search term submitted through the form
    conn = MySQLdb.connect(host = "localhost",
                           user = "searcher",
                           passwd = "***",
                           db = "***")
    cursor = conn.cursor()
    # get results; the parameters are passed as a one-element tuple
    cursor.execute('''SELECT occurrence,url,link FROM words WHERE word=%s ORDER BY occurrence DESC''',(word,))
    result = cursor.fetchall()
    cursor.close()
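The search script uses the variable word without showing where it comes from. A minimal sketch, assuming search.py runs as a CGI program under Python 3 and the form is submitted with GET (the default when a form has no method attribute), reads it from the QUERY_STRING environment variable. get_search_word is a hypothetical helper, not part of the original script.

```python
import os
from urllib.parse import parse_qs


def get_search_word(query_string=None):
    """Return the lower-cased 'word' form field, or '' if it is missing."""
    if query_string is None:
        # A CGI server passes the part of the URL after '?' in this variable.
        query_string = os.environ.get("QUERY_STRING", "")
    fields = parse_qs(query_string)
    # parse_qs maps each field name to a list of values; take the first one.
    return fields.get("word", [""])[0].strip().lower()


if __name__ == "__main__":
    print(get_search_word("word=Python"))
```

The value is lower-cased because the indexer stores every word in lower case; without this, a search for "Python" would never match the stored "python" on a case-sensitive column.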

    The result page

    And finally we will display the results on a page where the searched word is displayed with the number of occurrences and the link to the appropriate page.

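    # print the results as an HTML table (continues search.py)
    print('<table style="text-align: left; width: 100%;"')
    print(' border="1" cellpadding="2" cellspacing="2">')
    print('<tbody> <tr>')
    print('<td style="background-color: #99CCFF;"></td>')
    print('<td style="background-color: #99CCFF;"></td>')
    print('</tr>')
    for row in result:
        # row[0] is the number of occurrences, row[2] the link to the page
        print('<tr>')
        print('      <td> %s </td> <td> %s </td>' % (row[0], row[2]))
        print('</tr>')
    print('</tbody> </table>')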
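The results table can also be assembled as a string before it is sent to the browser, which makes it easier to handle the case where nothing was found. The sketch below is one possible way to do that in Python 3; render_results is a hypothetical function, and it assumes each row is an (occurrence, url, link) tuple as returned by the query in search.py.

```python
from html import escape


def render_results(word, rows):
    """Build an HTML fragment from (occurrence, url, link) rows."""
    if not rows:
        # The search word comes from the user, so escape it before echoing it.
        return "<p>No pages contain '%s'.</p>" % escape(word)
    lines = ['<table border="1" cellpadding="2" cellspacing="2">', "<tbody>"]
    for occurrence, url, link in rows:
        # link holds the HTML anchor stored by the indexer, so it is
        # inserted as-is rather than escaped.
        lines.append("<tr><td>%d</td><td>%s</td></tr>" % (occurrence, link))
    lines.append("</tbody>")
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    demo_rows = [(3, "yourpage1.html", '<a href="yourpage1">My page 1</a>')]
    print(render_results("python", demo_rows))
```

In a CGI script the fragment would be printed after a Content-Type: text/html header and a blank line.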

