Text Algorithms (4cu)
Grade
Project work
Annotation
Schedule and Homework assignments
Lecture notes
Terminology Terminology in Estonian
Literature
Links to web based materials
Other links
Text algorithms research at Univ. of Tartu
Text Algorithms (4cu)
http://www.egeen.ee/u/vilo/edu/2005-06/Text_Algorithms/
Title: Text Algorithms (4cu)
Code: MTAT.03.190
Language: The course will be held in English
Lecturer: Jaak Vilo
Lectures: Wednesday 14.15 - 16.00 week 1-16
Practicals: Wednesday 16.15 - 18.00 week 2-16
Place: J. Liivi 2 - 315
Institute: Department of Computer Science
12 Lectures (24h + 20h individual work)
10 Practicals (20h + 30h individual work)
1 Term paper (10h)
1 Project work (40h)
Preparations for the exam (12h)
Exam ( 4h)
----------------------------------------------------------
Total: 160h (=4cu)
Prerequisites: Understanding of material of "Algorithms and Data Structures".
Elementary programming skills (preferably C, perl, ...)
Grade
MTAT.03.190 Text Algorithms (4cu)
Practicals 40% + bonus points
Term paper 10%
Project work 20%
Exam 30% (Minimum requirement: 50%)
----------------------------------------------------------
Total: 100% ?
Practicals are one of the most important part of your self study. One
needs to do at least 50% of all the exercises. In general I expect almost
100% presence at all practical sessions.
For those who cannot attend, the written versions of the solutions have
to be sent before 16:15 by using the special web form or by e-mail to
Jaak Vilo personal e-mail address.
Project work
Project works will be done in groups (2-3 people) or individually.
The subjects for projects will be proposed or picked by the group.
All subjects must be approved by the lecturer.
This is only a prelimninary list - actual subjects will be finalised
by the end of October.
Project work [8.8.2009 22:46]
.
Annotation
Kursuse käigus tutvustatakse sõnede täpse ja ligikaudse otsimise
algoritme, tekstide indekseerimise meetodeid, andmete pakkimise
meetodeid, andmetes mustrite tuvastamist, tekstide
kaevandamist ja tekstinfo otsinguid.
Schedule and Homework assignments
- Lecture 1 (Sep 1.)
- Lecture 2, Practical 1 [8.8.2009 22:46]
(Sep. 7)
- Lecture 3, Practical 2 [8.8.2009 22:46]
(Sep. 14)
- Lecture 4, Practical 3 [8.8.2009 22:46]
(Sep. 21)
- Lecture 5, Practical 4 [8.8.2009 22:46]
(Sep. 28)
- Lecture 6, Practical 5 [8.8.2009 22:46]
(Oct. 5)
- Lecture 7, Practical 6 [8.8.2009 22:46]
(Oct. 12)
- October 19 - no lecture or practical
- Lecture 8, Practical 7 [8.8.2009 22:46]
(Oct. 26), Distribution of project work topics
- November 2 - no lecture or practical
- Lecture 9, Practical 8 [8.8.2009 22:46]
(Nov. 9)
- Nov. 16 - cancelled
- Lecture 10, Practical 9 [8.8.2009 22:46]
(Nov. 23)
- Lecture 11, Practical 10 (Nov. 30)
- Lecture 12, Practical 11 (Dec. 7th)
- Deadline of project works, Poster session at CS department (Dec. 14th) (??)
- Exam (TBA).
Lecture notes
- Exact search [8.8.2009 22:46]
.
- Brute-force algorithm O(nm)
- Knuth-Morris-Pratt O(m+n)
- Boyer-Moore
- Rabin-Karp
- Shift-OR
- Factor based approaches
- Multiple string search [8.8.2009 22:46]
.
- Aho-Corasick (search of multiple words)
- Commentz-Walter
- Wu and Manber
- 2-D search [8.8.2009 22:46]
.
- 2-D pattern matching
- Bird and Baker
- Zhu and Takaoke
- Kärkkäinen and Ukkonen
- 2D matching with rotations (Amir, Kapah, Tsur)
- Regular expressions and automata [8.8.2009 22:46]
.
- Regular expressions and finite automata
- Creating finite automata from regular expressions (nondeterministic and deterministic)
- Matching of deterministic automatons
- Matching of non-deterministic automatons
- Sequence similarity [8.8.2009 22:46]
.
- Edit distance
- Dynamic programming
- The diagonal method
- Time warping
- Similarity measures for DNA and protein sequences - Smith Waterman etc.
- Approximate search [8.8.2009 22:46]
.
- Approximate search from long text
- q-gram methods - "filter and search"
- Bit-parallel methods - Shift/OR method (ja SHIFT-ADD)
- Suffix trees [8.8.2009 22:46]
- Suffix tree and -trie
- Applications of suffix trees
- Suffix automaton
- Suffix arrays
- PAT-index and OED project
- Modern data strauctures
- Text Compression [8.8.2009 22:46]
.
- datacompression.info
- Shannon-Fano and Huffman codes
- Arithmetic coding
- Lempel-Ziv family of compression techniques
- Burrows-Wheeler
- Algorithmic complexity and Kolmogorov distance
- Information Retrieval [8.8.2009 22:46]
- Information Retrieval
- Mining of sequence data
- Probabilistic motifs and grammars (PWM, HMM, SCFG) [8.8.2009 22:46]
.
- Probabilistic motifs (Position Weight Matrices, Position Specific Scoring Matrices)
- PWM search
- Information content
- Sequence logos
- Stochastic grammars
- HMM - Hidden Markov Models
- SCFG - Stochastic Context Free Grammars
- Pattern Dicovery (L11_Pattern_Discovery.thtml:tbw)
- Pattern discovery concepts
- Different pattern classes
- Gibbs Sampling
- Expectation Maximization
- ...
|
|
|
Terminology Terminology in Estonian
- String Sõne, string
- Stringology Stringoloogia (huhhh...)
- Edit distance Teisenduskaugus
- Brute-force Jõumeetod
- Pattern matching Mustrite sobitamine
Literature
-
Gonzalo Navarro, Mathieu Raffinot
: Flexble Pattern matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences
, Cambridge University Press
, 2002
.
[http://print.google.com/print?ie=UTF-8&q=gonzalo+navarro+pattern+matching&btnG=Search]
[http://portal.acm.org/citation.cfm?id=571024]
[http://www.dcc.uchile.cl/~gnavarro/FPMbook/]
-
Maxime Crochemore, Wojciech Rytter
: Jewels of Stringology: Text Algorithms
, World Scientific Publishing
, 2003
.
[http://www.amazon.co.uk/exec/obidos/ASIN/9810248970/qid=1126651009/sr=1-1/ref=sr_1_0_1/026-2403945-0401256]
-
Dan Gusfield
: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology.
, Cambridge University Press
, 1997
.
-
A. Aho
: Algorithms for finding patterns in strings. In Handbook of Theoretical Computer Science, Vol. A, pp255-300.
, Elsevier
, 1990
.
-
Cormen, T.H., Leiserson, C.E., Rivest, R.L.
: Introduction to Algorithms, Chapter 34 (String Matching).
, MIT Press
, 1990
.
- Durbin, R., Eddy, S., Krogh, A., Mitchison - Biological sequence
analysis - Probabilistic models of proteins and nucleic acids
(Cambridge University Press, 1998)
- Jorma Tarhio, Merkkijonomenetelmät (luentomoniste) ~ 1995
University of Helsinki
- Ricardo Baeza-Yates, Berthier Ribeiro-Neto:
Modern Information Retrieval, Addison Wesley
- Pevzner, P.A. Computational Molecular Biology - An Algorithmic
Approach (MIT Press, 2000)
- Baldi, Pierre, and Brunak, Soren, Bioinformatics - the Machine
Learning Approach 2nd edition, (MIT Press, 2001)
- jt.
|
|
|
Links to web based materials
Web Search
Links and pointers
Stefano Lonardi: CPM pointers
Tekstialgoritmid I, kevad 2003 loengumaterjalid
String Processing Algorithms: Univ. of Helsinki
Combinatorial Pattern Matching (CPM) archives
String Search, TR-92-gas-01, by Graham A. Stephen
ja kohalikud:
[DVI]
[Postscript]
Exact String Matching Algorithms - animations in Java (Christian Charras - Thierry Lecroq)
HANDBOOK OF COMPUTER SCIENCE AND ENGINEERING Chapter 6 Pattern matching and text compression algorithms
[local PDF]
Text
Searching and Processing Maxime Crochemore, Department of
Computer Science King's College London. (A lecture course and materials)
Jesper Larsson - thesis and research on suffix trees
[PDF]
Stefan Kurtz - Foundations of Sequence Analysis
[gzipped Postscript]
Computational Biology - Martin Tompa
The Algorithm Design Manual The CD-ROM (Steven S. Skiena)
Finite state automata utilities
datacompression.info - informatsioon andmete pakkimise kohta
Palindroomid: A Man, a Plan, a Pointless(?) Program
Other links
Text algorithms research at Univ. of Tartu
Text algorithms form one of the cornerstones for the bioinformatics and
practical algoriths research group lead by Dr. Jaak Vilo.