Search Engine Project

Archive 2009-2011

Introduction

Supervisors: Philip Bille and Inge Li Gørtz

The overall goal of this project is to develop a scalable, high performance search engine. The main focus is on the algorithmic challenges in compactly representing a large data-set while supporting fast searches on it. The search engine project is inspired by  a project designed by Stephen Alstrup and Theis Rauhe.

The concrete data set for the project is the XML file for the Digital Library and Bibliography Project (DBLP) containing bibliographic information for various publications (books, articles, etc.). We are interested in supporting queries such as “which papers are written by author X”. Eventhough the search engine is built around the DBLP data set, the techniques used in the project applies to any data set.

The project is suitable for BSc. or B. Eng. students with at least basic programming skills and an introductory algorithms course. The project should normally be done in groups of 2-4. Depending on the level and amount of work the credit for the project is between 10 and 20 ECTS.

The project consists of a basic part and an advanced part. All groups are required to first solve the assignments in the basic part. When a group has completed the basic part they will have an initial search engine. After the basic part the group moves to the advanced part. Here, the group can continue in several directions, depending on interests and skills. A non-exhaustive list of suggestions for advanced assignments is given below. The amount of advanced assignments a group must complete depends on the amount of ECTS and the level of the group.

The following sections is a detailed description of the project consisting of the following:

  • Data Files A description of the format of the XML data files used for the search engine.
  • The Program Index1 A basic and very simple search engine which is used as the starting point for the project.
  • The Basic Part Description of the 4 assignments in the basic part.
  • The Advanced Part The (non-exhaustive) list of suggestions for assignments in the advanced part.
  • The Report Requirements and suggestions for the contents of the report.

Data Files

The input files consists of (parts of) the Digital Library and Bibliography Project (DBLP). The files are in a simple XML format containing bibliography information for publications, e.g. book, journal article, conference articles, etc. The following small example demonstrates the format:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>

<article mdate="2004-01-15" key="journals/cacm/Dijkstra68a">
<author>Edsger W. Dijkstra</author>
<title>Letters to the editor: go to statement considered harmful.</title>
<pages>147-148</pages>
<year>1968</year>
<volume>11</volume>
<journal>Commun. ACM</journal>
<number>3</number>
<ee>http://doi.acm.org/10.1145/362947</ee>
<url>db/journals/cacm/cacm11.html#Dijkstra68a</url>
</article>

<article mdate="2003-11-20" key="journals/cacm/BackusBGKMPRSVWWWN63">
<author>John W. Backus</author>
<author>Friedrich L. Bauer</author>
<author>Julien Green</author>
<author>C. Katz</author>
<author>John L. McCarthy</author>
<author>Alan J. Perlis</author>
<author>Heinz Rutishauser</author>
<author>Klaus Samelson</author>
<author>Bernard Vauquois</author>
<author>Joseph Henry Wegstein</author>
<author>Adriaan van Wijngaarden</author>
<author>Michael Woodger</author>
<author>Peter Naur</author>
<title>Revised report on the algorithm language ALGOL 60.</title>
<pages>1-17</pages>
<year>1963</year>
<volume>6</volume>
<journal>Commun. ACM</journal>
<number>1</number>
<ee>http://doi.acm.org/10.1145/366193.366201</ee>
<url>db/journals/cacm/cacm6.html#BackusBGKMPRSVWWWN63</url>
</article>

</dblp>

The example contains, apart from the header and footer in lines 1,2,3, and 41, bibliographic information for two journal articles both published in Communications of the ACM. The first one is the paper “Letters to the editor: go to statement considered harmful.” by Edsger W. Dijkstra from 1968, and the second one is the paper “Revised report on the algorithm language ALGOL 60.” by John W. Backus and 12 more from 1963. Each piece of information in the XML file is identified by opening and closing tags using the special < > characters.

In general, each publication in the XML file is delimited by opening and closing tags indicating the type of publication, which can be either articleinproceedingsproceedingsbookincollectionphdthesismastersthesis, or www. Several pieces of information may or may not be stored within the tags for publication and each information is on a separate. For the basic part, the author and title information will be used.

The following data files are available:

The file dblp.xml contains the complete data file for DBLP from September 27, 2005. The files dblp.xml.XMB are prefixes of the original dblp.xml of <X> megabytes. These files are from the Pizza&Chili corpus which is a standard set of XML files used for benchmarks purposes. The current (and much larger) XML file for DBLP is available from http://dblp.uni-trier.de/xml/.

The Program Index1

In addition to the input files an initial and very simple search engine name Index1 is provided below. You will need this for assignment 1 in the basic part.

import java.io.*;
import java.util.Scanner;

class Index1 {
    XMLItem start;

    private class XMLItem {
	String str;
	XMLItem next;
	XMLItem(String s, XMLItem n) {
	    str = s;
	    next = n;
	}
    }

    public Index1(String filename) {
	String line;
	XMLItem current, tmp;
	try {
	    Scanner input = new Scanner(new File(filename));
	    line = input.nextLine();
	    start = new XMLItem(line, null);
	    current = start;
	    while(input.hasNextLine()) {   // Read all lines of input
		line = input.nextLine();
		tmp = new XMLItem(line, null);
		current.next = tmp;
		current = tmp;
	    }
	    input.close();
	} catch (FileNotFoundException e) {
	    System.out.println("Error reading file " + filename);
	}
    }

    public boolean Search(String author) {
	XMLItem current = start;
	String name;
	while(current != null) {
	    if (current.str.startsWith("<author>")) {
		name = current.str.substring(8, current.str.length()-9); // Extract name without tags
		if(name.equals(author)) return true;
	    }
	    current = current.next;
	}
	return false;
    }

    public static void main(String[] args) {
	System.out.println("Preprocessing " + args[0]);
	Index1 I = new Index1(args[0]);
	Scanner console = new Scanner(System.in);
	for(;;) {
	    System.out.println("Input author name or type exit to stop");
	    String name = console.nextLine();
	    if (name.equals("exit")) break;
	    if (I.Search(name)) System.out.println(name + " exists");
	    else System.out.println(name + " does not exist");
	}
	console.close();
    }
}

Basic Part

The basic part consist of solving the following 4 assignments.

  1. Download and run the program Index1 (You can get the source code from the link in the top right of the listing above).
  2. Modify the search in Index1 to output the titles of all publications written by the specified author.
  3. Modify the construction of the data structure so that a linked list of the authors and their publications is constructed. Specifically, each object in the linked list should contain three fields:
    1. The name of the author
    2. A linked list of publications
    3. A reference to the next item in the list

    The linked list of publications should contain the title of the publication and a reference to the next item in the list. After modifying the data structure you have to also modify the search procedure.

  4. Further modify the data structure from assignment 3 to use a hashtable instead of a linked list of words. You can create the data structure using chained hashing. Hence, each item in the hashtable contains a reference to a linked list of publications.

To solve the above assignments you are not allowed to any packages other than java.iojava.util.Scanner, and java.lang. Consequently, you cannot use any of the built-in data structures such as LinkedList and HashMap. One of the purposes of the project is to learn how to program  such data structures and to further improve and extend them. You are allowed to use the operations on String objects, including the hashCode function. If you have any doubt whether some Java functionality is permitted contact one of the teachers. For the advanced part there are no restrictions on the Java functionality you may use.

To succesfully run some of the larger XML files you may have to increase the size of the maximum space to be used by the Java interpreter using the -Xmx flag. For instance, java -Xmx128m Index1 dblp.xml.50MB sets the maximum space to 128MB.

Advanced Part

The following is a list of suggestion (in no particular order) for assignments in the advanced part.

  • Prefix Search Add support for searching for prefixesFor example, a search for “Donald*” should find the papers of all authors whose name starts with Donald.
  • GUI Design and implement a graphical user interface for the search engine.
  • Space Efficiency Improve the space usage of the program. There are several techniques for this. Ask your teachers for more information. Improving the space will also have a significant effect on the preprocessing and query time due to effects of the memory hierarchy.
  • Boolean Search Implement boolean searches with operators such as . For instance, “Donald E. Knuth AND Vaughan Pratt” should find all papers co-authored by Donald E. Knuth and Vaughan Pratt.
  • Other Data Files Extend the search engine to handle other data files. For instance, you can build a search engine for the The Internet Movie Database (IMDb).
  • Auto-completion Given the prefix of an author suggest possible endings automatically. Particularly relevant when combined with a GUI.
  • Dynamic Indexing Extend the data structure to allow additions and deletions of publications. The challenge here is to obtain a good balance between the time for updates and the space for the index.
  • Property Specific Search Add support for searching for publications with specific properties. For instance, a search for “Donald E. Knuth pubtype:book” should return all books written by Donald E. Knuth.
  • Hash Function Write your own hash function.
  • Ranking When there are multiple possible authors for a search query, as in prefix search, report the results ordered by the rank of the different authors. The rank of an author could be the determined from the number of papers he/she has written.
  • Crawler Write a program to automatically extract information from some source, e.g., the internet, into suitable data files for a search engine.
  • Java Data Structures Implement the data structures for the index using the standard java collection classes.
  • Spelling Suggestion If a search does not match any author it may be because the user misspelled the name. Construct a mechanism to suggest alternatives that almost match the search query. Think the “did you mean?” functionality in Google.
  • Search Statitics Maintain statistical information for all searches to improve search quality, e.g., for ranking and spelling suggestions.
  • Other Find your own challenge.

All (almost) of the assignments in the advanced part can be combined so that an increasingly advanced search engine can be developed.

The Report

The following are the general requirements for the report:

  • You must describe how each search engine works, which data structures are used and how they are used. The description must be in natural language and Java code may only be used to support the description.
  • You must theoretically analyse the complexity of the different search engines. Typically, the  preprocessing time, space usage, and query time are important.
  • You have to experimentally evaluate the performance of the different search engines and compare the results to the theoretical analysis.
  • You have systematically test a small part of the program.

We suggest the following structure for the report:

  • Introduction
  • Description and analysis of the data structures needed throughout the project, e.g., linked lists and hashtables.
  • Description and analysis of Index1.
  • Description and analysis of Index2 (the search engine produced in assignment 2).
  • Description and analysis of Index3.
  • Description and analysis of Index4.
  • Description and analysis of Index5.
  • Experimental evaluation of the search engines.
  • Tests.
  • Conclusion.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.