Kursuse kodulehekülg Viimati muudetud: 23.4.2003

Teema 7 - Pakkimismeetodid

  1. Ülesandepüstitus ja sissejuhatus
  2. Code words (Huffman coding)
  3. Run-length encoding
  4. Arithmetic coding
  5. Lempel-Ziv family (compress, gzip, zip, pkzip, ...)
  6. Burrow-Wheeler family (bzip2)
  7. Muid meetode (k.a. pildid)
  8. Kolmogorovi keerukus
  9. Search from compressed texts

Kirjandust


Tagasi

Ülesandepüstitus


Tagasi

Aritmeetiline kodeerimine

Lisakirjandust


Tagasi

Ziv-Lempel meetodid

LZ-77 pere


      Liikuv aken          pakitav osa
....bbaaabbbaaffacbbaa    abbbaabab...

  • Ette piilumise akna pikim prefiks mis algab liikuva akna seest, kodeeritakse [positsioon,pikkus]

  • Naites on selleks [5,6]

  • Kiire!

  • Kasutatakse kaubanduslikes lahendustes (PKZIP, Stacker, DoubleSpace, )

  • Mitu alternatiivset esitust samale stringile (aknas on mitu võimalikku alamsõnet mis sobib)

    Esialgne LZ77

    LZ-78 pere

  • Sõnastik

  • Salvesta sõnesid juba töödeldud sõnumi osast

  • Järgmine sümbol on pikim sõnastiku jada mis sobib töödeldava tekstiga

    LZ78 (Ziv and Lempel 1978)

  • Alguses sõnastikus vaid tühi sõne, indeks 0.

  • Kood on [i,c] - i on viide sõnastikule (sõne u) ja c on järgmine sümbol.

  • Sõne uc viiakse sõnastikku
    
    Näide:
    
    ababcabc  →  [0,a][0,b][1,b][0,c][3,c]
    
    

  • LZW (Welch 84)

  • Kood koosneb vaid indeksitest

  • Alguses sõnastikus iga sümbol

  • Seda uuendatakse nagu LZ78

  • Lahti pakkimisel oht:

  • Sõnastiku esitusviise on palju: seda saab hoida lihtsalt, kompaktselt hoida, paiskatabelis, sorteeritud listis, paisktabeli ja listi kombinatsioonis (gzip), kahendpuus, tries, jne.

    LZJ

  • Sõnastikus kõik töödeldud alla h pikkused jadad, esitatud trie-na.

  • Kodeerides otsi pikimat sobivat prefiksit.

  • Kood on selle tipu aadress

  • Trie juurest on siire iga táhestiku táhega (nagu LZW)

  • Mälu lõppedes eemalda kõik need tipud(harud) mis esinenud vaid korra (va. juure lapsed)

  • Praktikas näiteks h=6, sõnastikus 8192 tippu

    LZFG

  • Efektiivseim LZ-meetod

  • Arendatud LZJ-ist

  • Esita akna jaoks sufiksipuu

  • Kood: puu tipu aadress + kaarelt mitu märki lisaks

  • Lehed ja vahetipud kodeeritakse erinevalt (pikkuse jaotus erinev)

  • lühike kokkulangevus antakse otse...

    Tagasi

    Kontekstimeetodid

    Abrahamsoni sõltuvusmudel

    PPM - Prediction by Partial Matching


    Tagasi

    Burrows-Wheeler block sorting algorithm


    Tagasi

    Muud

    Süntaktiline pakkimine

    Piltide pakkimine


    Tagasi

    Otsimine pakitud tekstidest

    Approximate Matching of Run-Length Compressed Strings
    Veli Mäkinen, Gonzalo Navarro, Esko Ukkonen
    We focus on the problem of approximate matching of strings that have been compressed using run-length encoding. Previous studies have concentrated on the problem of computing the longest common subsequence (LCS) between two strings of length m and n , compressed to m' and n' runs. We extend an existing algorithm for the LCS to the Levenshtein distance achieving O(m'n+n'm) complexity. Furthermore, we extend this algorithm to a weighted edit distance model, where the weights of the three basic edit operations can be chosen arbitrarily. This approach also gives an algorithm for approximate searching of a pattern of m letters (m' runs) in a text of n letters (n' runs) in O(mm'n') time. Then we propose improvements for a greedy algorithm for the LCS, and conjecture that the improved algorithm has O(m'n') expected case complexity. Experimental results are provided to support the conjecture.

    Tagasi

    Kolmogorovi keerukus

    Algorithmic information theory

    (Redirected from Kolmogorov complexity)

    Algorithmic information theory is a field of study which attempts to capture the concept of complexity using tools from theoretical computer science. The chief idea is to define the complexity (or Kolmogorov complexity) of a string as the length of the shortest program which, when run without any input, outputs that string. Strings that can be produced by short programs are considered to be not very complex. This notion is surprisingly deep and can be used to state and prove impossibility results akin to Gödel's incompleteness theorem and Turing's halting problem.

    The field was developed by Andrey Kolmogorov and Gregory Chaitin starting in the late 1960s.

    To formalize the above definition of complexity, one has to specify exactly what types of programs are allowed. Fortunately, it doesn't really matter: one could take a particular notation for Turing machines, or LISP programs, or Pascal programs, or Java virtual machine bytecode. If we agree to measure the lengths of all objects consistently in bits, then the resulting notions of complexity will only differ by a constant factor: if I1(s) and I2(s) are the complexitites of the string s according to two different programming languages L1 and L2, then there are constants C and D (which only depend on the languages chosen, but not on s) such that

    I1(s) ≤ C · I2(s) + D
    Here, D is the length of an interpreter for L2 written in L1, and C describes the overhead of storing programs written in L2 as part of programs in L1.

    In the following, we will fix one definition and simply write I(s) for the complexity of the string s.

    The first surprising result is that I(s) cannot be computed: there is no general algorithm which takes a string s as input and produces the number I(s) as output. The proof is a formalization of the amusing Berry paradox: "Let n be the smallest number that cannot be defined in less than twenty English words. Well, I just defined it in less than twenty English words."

    It is however straightforward to compute upper bounds for I(s): simply compress the string s with some method, implement the corresponding decompressor in the chosen language, concatenate the decompressor to the compressed string, and measure the resulting string's length.

    The next important result is about the randomness of strings. Most strings are complex in the sense that they cannot be significantly compressed: I(s) is not much smaller than |s|, the length of s in bits. The precise statement is as follows: there is a constant K (which depends only on the particular specification of "program" used in the definition of complexity) such that for every n, the probability that a random string s has complexity less than |s| - n is smaller than K 2-n. The proof is a counting argument: you count the programs and the strings, and compare. This theorem is the justification for Mike Goldman's challenge in the comp.compression FAQ:

    I will attach a prize of $5,000 to anyone who successfully meets this challenge. First, the contestant will tell me HOW LONG of a data file to generate. Second, I will generate the data file, and send it to the contestant. Last, the contestant will send me a decompressor and a compressed file, which will together total in size less than the original data file, and which will be able to restore the compressed file to the original state.

    With this offer, you can tune your algorithm to my data. You tell me the parameters of size in advance. All I get to do is arrange the bits within my file according to the dictates of my whim. As a processing fee, I will require an advance deposit of $100 from any contestant. This deposit is 100% refundable if you meet the challenge.

    Now for Chaitin's incompleteness result: though we know that most strings are complex in the above sense, the fact that a specific string is complex can never be proven (if the string's length is above a certain threshold). The precise formalization is as follows. Suppose we fix a particular consistent axiomatic system for the natural numbers, say Peano's axioms. Then there exists a constant L (which only depends on the particular axiomatic system and the choice of definition of complexity) such that there is no string s for which the statement

    I(s) ≥ L
    can be proven within the axiomatic system (even though, as we know, the vast majority of those statements must be true). Again, the proof of this result is a formalization of Berry's paradox.

    Similar ideas are used to prove the properties of Chaitin's constant.


    See also:

    Tagasi
    EOF