String Pattern Recognition - ATHENS Course (March 2011)

Project 2: Linear-Time String Matching Algorithms - The Z Algorithm

DEADLINE: March 16th, 2011, 9 p.m.

 

1 Introduction

The goal of this project is to study the theoretical and experimental behaviour of the Z algorithm. In order to do so, this algorithm will test on several data test chosen from different contexts.
More specifically, goals of this project include:

  • Reinforcement of the knowledge acquired by students about the Z algorithm.
  • Understanding of the worst case for the Z algorithm.
  • Testing a string matching algorithm with data close to those found in practice.
  • Actual programming of string matching algorithms.
  • Conducting a computational experiment.
  • Comparison of theoretical and experimental results.
  • Interpretation of results and presentation of conclusions.
  • Writing an academic paper.

2 The Z Algorithm

The second part of this pro ject consists of programming the Z algorithm and checking it out over large files and patterns with wild cards. Below you have a table with a list of patterns and texts.

 

Pattern
Wild Card
Text
Alphabet
Pattern1.txt
No
Text1.txt
Binary
Pattern2.txt
Yes, char=X
Text2.txt
Binary
Pattern3.txt
No
Text3.txt
Binary
Pattern4.txt
No
Text4.txt
DNA
Pattern2.txt
Yes, char=X
Text5.txt
DNA

 

All the files can be found at:

http://www.eui.upm.es/~fmartin/webpgomez/Docencia/Pattern-Recognition/

Solve the SMC problem and after that answer the following questions:

  1. Is pattern P in T ?
  2. How many occurrences of P in T are found? Think carefully before answering this question for the string matching problem with wild cards.
  3. For each pair pattern-text the program should output the percentage of cases (Cases 1, 2a and 2b) it went through. Draw conclusions about those percentages and relate them to theoretical complexity.

 

3 Programming

Implementation of algorithms may be done in any language of student’s choice. However, the language  and its compiler should support certain features in order to be able to run the experiments properly. The choice of C, C++, Maple or the like should be enough. Source code and a .exe file have to be handed over.

4 Written Paper

A paper describing the following points must be handed over.

  • Brief explanation of the algorithms.
  • Brief explanation of the implementations. It can be done by including sufficiently detailed comments in the code.
  • Brief description of the experiment.
  • Interpretation of experimental data.
  • Conclusions. Draw your own conclusions (be creative, but not extravagant or too showy).

The paper has to be written in correct English; it also has to possess clarity of thought. Show me what you know; do not force to search for it through a poorly written paper.

 

5 Grading

The whole project counts 50% of your programming final grade. I will take points off if:

  • There are spelling mistakes.
  • It has irrelevant material. Down with the irrelevant!
  • It lacks clarity of thought.
  • It is lengthy, long-winded or poor in content.
  • Code is not properly commented.
  • Code is not properly structured.
  • Variables have absurd names.
  • There are run-time errors.

 

6 Questions and Office Hours


I am willing to answer your questions about algorithms, complexity or the experiment. I will not answer questions about coding errors as it is my feeling that, at this point, writing error-free code is
your responsibility.

Go to top