Search Engine Project

Project Description

Introduction

Supervisors: Philip BilleInge Li Gørtz, Patrick Hagge Cording, Søren Vind

The overall goal of this project is to develop a scalable, high performance search engine. The main focus is on the algorithmic challenges in compactly representing a large data-set while supporting fast searches on it. The search engine project is inspired by a project designed by Stephen Alstrup and Theis Rauhe.

The concrete data set for the project is a snapshot of WikiPedia from 2010 converted into a simple text format. We are interested in supporting queries such as “which documents contain the word X”. Eventhough the search engine is built around the WikiPedia data set, the techniques used in the project applies to any data set.

The project is suitable for BSc. or B. Eng. students with at least basic programming skills and an introductory algorithms course. Students with significantly more skills in algorithms, i.e., advanced algorithms courses at the Msc. level, should discuss specific option and modifications with the teachers before signing up. The project should normally be done in groups of 2-4. Depending on the level and amount of work the credit for the project is between 10 and 20 ECTS.

The project consists of a basic part and an advanced part. All groups are required to first solve the assignments in the basic part. When a group has completed the basic part they will have an initial search engine. After the basic part the group moves to the advanced part. Here, the group can continue in several directions, depending on interests and skills. A non-exhaustive list of suggestions for advanced assignments is given below. The amount of advanced assignments a group must complete depends on the amount of ECTS and the level of the group.

The following sections is a detailed description of the project consisting of the following:

  • Data Files A description of the format of the data files used for the search engine.
  • The Program Index1 A basic and very simple search engine which is used as the starting point for the project.
  • The Basic Part Description of the 4 assignments in the basic part.
  • The Advanced Part The (non-exhaustive) list of suggestions for assignments in the advanced part.
  • The Report Requirements and suggestions for the contents of the report.

Data Files

The input files consists of (prefixes of) a snapshot of WikiPedia from 2010 converted into a simple text format [Shaoul, C. & Westbury C. (2010) The Westbury Lab Wikipedia Corpus, Edmonton, AB: University of Alberta. downloaded from http://www.psych.ualberta.ca/~westburylab/downloads/westburylab.wikicorp.download.html)%5D. The data is released under the Creative Commons Attribution 3.0 Unported License. The following small example demonstrates the format for a file with two documents:

Anarchism.
Anarchism is a political philosophy which considers the state undesirable, unnecessary and harmful, ...
...
---END.OF.DOCUMENT---
Autism.
Autism is a disorder of neural development characterized by impaired social interaction and communication, ...
...
---END.OF.DOCUMENT---

This files contains the two documents “Anarchism” and “Autism”. Each document starts with a title and ends with a special end-of-document marker. The body of document is listed in between (the … indicate removed text).

The following data files are available:

The size of the file WestburyLab.wikicorp.201004.txt is 1.7GB. It contains 990,248,478 words in over 2 million documents. The files WestburyLab.wikicorp.201004_X.txt are prefixes of the original file of size X. All files are compressed with bzip2.

The Program Index1

In addition to the input files an initial and very simple search engine name Index1 is provided below. You will need this for assignment 1 in the basic part.

import java.io.*;
import java.util.Scanner;

class Index1 {

    WikiItem start;

    private class WikiItem {
        String str;
        WikiItem next;

        WikiItem(String s, WikiItem n) {
            str = s;
            next = n;
        }
    }

    public Index1(String filename) {
        String word;
        WikiItem current, tmp;
        try {
            Scanner input = new Scanner(new File(filename), "UTF-8");
            word = input.next();
            start = new WikiItem(word, null);
            current = start;
            while (input.hasNext()) {   // Read all words in input
                word = input.next();
                System.out.println(word);
                tmp = new WikiItem(word, null);
                current.next = tmp;
                current = tmp;
            }
            input.close();
        } catch (FileNotFoundException e) {
            System.out.println("Error reading file " + filename);
        }
    }

    public boolean search(String searchstr) {
        WikiItem current = start;
        while (current != null) {
            if (current.str.equals(searchstr)) {
                return true;
            }
            current = current.next;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("Preprocessing " + args[0]);
        Index1 i = new Index1(args[0]);
        Scanner console = new Scanner(System.in);
        for (;;) {
            System.out.println("Input search string or type exit to stop");
            String searchstr = console.nextLine();
            if (searchstr.equals("exit")) {
                break;
            }
            if (i.search(searchstr)) {
                System.out.println(searchstr + " exists");
            } else {
                System.out.println(searchstr + " does not exist");
            }
        }
        console.close();
    }
}

Basic Part

The basic part consist of solving the following 4 assignments.

  1. Download and run the program Index1 (You can get the source code from the link in the top right of the listing above).
  2. Modify the search in Index1 to output the titles of all documents containing the specified search string.
  3. Modify the construction of the data structure so that a linked list of the possible search strings and the documents they appear in is constructed. Specifically, each object in the linked list should contain three fields:
    1. The search string
    2. A linked list of documents in which search string appears
    3. A reference to the next item in the list

    The linked list of search strings should contain the title of the documents and a reference to the next item in the list. After modifying the data structure you have to also modify the search procedure.

  4. Further modify the data structure from assignment 3 to use a hashtable instead of a linked list of words. You can create the data structure using chained hashing. Hence, each item in the hashtable contains a reference to a linked list of publications.

To solve the above assignments you are not allowed to any packages other than java.iojava.util.Scanner, and java.lang. Consequently, you cannot use any of the built-in data structures such as LinkedList and HashMap. One of the purposes of the project is to learn how to program such data structures and to further improve and extend them. You are allowed to use the operations on String objects, including the hashCode function. If you have any doubt whether some Java functionality is permitted contact one of the teachers. For the advanced part there are no restrictions on the Java functionality you may use.

To succesfully run some of the large files you may have to increase the size of the maximum space to be used by the Java interpreter using the -Xmx flag. For instance, java -Xmx128m Index1 WestburyLab.wikicorp.201004_50MB.txt sets the maximum space to 128MB. Note that even with these options the basic part search engines will not be able to handle the largest files. To do this more advanced techniques are needed.

Advanced Part

The following is a list of suggestion (in no particular order) for assignments in the advanced part.

  • Prefix Search Add support for searching for prefixesFor example, a search for “aut0*” should find all documents containing a word starting with “auto”.
  • GUI Design and implement a graphical user interface for the search engine.
  • Space Efficiency Improve the space usage of the data structures. There are several techniques for this. Ask your teachers for more information. Improving the space will also have a significant effect on the preprocessing and query time due to effects of the memory hierarchy.
  • Compression Further improve the space usage by using advanced compression techniques.
  • Boolean Search Implement boolean searches with operators such as . For instance, “latex AND typesetting” should find all documents containing both latex and typesetting.
  • Ranking Report the result documents ordered by the rank of the different documents. The rank of a document could be the determined from the number of occurrences of the search terms in the document.
  • Auto-completion Given the prefix of an search string suggest possible endings automatically. Particularly relevant when combined with a GUI.
  • Dynamic Indexing Extend the data structure to allow additions and deletions of publications. The challenge here is to obtain a good balance between the time for updates and the space for the index.
  • Full-text Indexing Add support for searching for sequences of words, e.g., “britney spears” should find all documents containing “britney” immediately followed by “spears”.
  • Hash Function Write your own hash function.
  • Crawler Write a program to automatically extract information from some source, e.g., the internet, into suitable data files for a search engine.
  • Java Data Structures Implement the data structures for the index using the standard Java collection classes. Critically compare the performance against your our own.
  • Spelling Suggestion If a search does not match any search string it may be because the user misspelled the search string. Construct a mechanism to suggest alternatives that almost match the search query. Think of the “did you mean?” functionality in Google.
  • Other Find your own challenge.

All (almost) of the assignments in the advanced part can be combined so that an increasingly advanced search engine can be developed.

The Report

The following are the general requirements for the report:

  • You must describe how each search engine works, which data structures are used and how they are used. The description must be in natural language and Java code may only be used to support the description.
  • You must theoretically analyse the complexity of the different search engines. Typically, the preprocessing time, space usage, and query time are important.
  • You have to experimentally evaluate the performance of the different search engines and compare the results to the theoretical analysis.
  • You have systematically test a small part of the program.

We suggest the following structure for the report:

  • Introduction
  • Description and analysis of the data structures needed throughout the project, e.g., linked lists and hashtables.
  • Description and analysis of Index1.
  • Description and analysis of Index2 (the search engine produced in assignment 2).
  • Description and analysis of Index3.
  • Description and analysis of Index4.
  • Description and analysis of Index5.
  • Experimental evaluation of the search engines.
  • Tests.
  • Conclusion.
Follow

Get every new post delivered to your Inbox.

%d bloggers like this: