Hide menu

Project lab

The lab part of the course is a small project, where you are supposed to implement a couple of compression methods and test these on various types of data. You should also do entropy estimations for the different data.

The work should be done in groups of 1-3 students. Examination is by written report.

Entropy estimation

Make random models for the different test files and estimate entropies for them. You should at least estimate H(Xi), H(Xi|Xi-1) and H(Xi|Xi-1,Xi-2) for all sources.

Source coding

Implement two of the following compression methods and test them by coding all the test files. At least one of the implemented methods should utilize the memory of the sources. You should of course write both coders and decoders for the chosen methods.
  • Huffman coding (static or adaptive code tree)
  • Arithmetic coding (static or adaptive probability model)
  • Lempel-Ziv coding (LZ77, LZSS, LZ78, LZW or another variant)
  • Burrows-Wheeler block transform (requires that you also implement Huffman coding or arithmetic coding as the actual source coder)

Compare your results with the estimated entropies.

You are free to choose any programming language for your implementations. Matlab will work nicely for entropy estimation and for simple source coding, but might be too slow for some coding methods, especially those that involve searching (LZ) or sorting (BWT).


Test data

There are many files to use as test data for your coders. Treat the files as sequences of 8-bit bytes, ie don't make any assumptions of the actual contents of the files.

The Canterbury corpus

A standard test set of 11 files of different types and sizes from The Canterbury corpus.
File Category Size in bytes
alice29.txt English text 152089
asyoulik.txt Shakespeare 125179
cp.html HTML source 24603
fields.c C source 11150
grammar.lsp LISP source 3721
kennedy.xls Excel Spreadsheet 1029744
lcet10.txt Technical writing 426754
plrabn12.txt Poetry 481861
ptt5 CCITT fax test set 513216
sum SPARC Executable 38240
xargs.1 GNU manual page 4227
These files can be found under /site/edu/icg/tsbk08/canterbury/ on ISY's computer system. You can also download the files here.

Large corpus

A standard test set of 3 relatively large files of different types and sizes from The Canterbury corpus.
File Category Size in bytes
E.coli Complete genome of the E. Coli bacterium 4638690
bible.txt The King James version of the bible 4047392
world192.txt The CIA world fact book 1992 2473400
These files can be found under /site/edu/icg/tsbk08/large/ on ISY's computer system. You can also download the files here.


Report

The project is examined by a written report, where you describe how you have solved the problem and what results you achieve. Discuss how file type and file size influences the compression ratio. Include all program code files.

Give the name, person number and email address of all group members.

You can either leave a paper copy of the report directly to Harald, or submit it in electronic form, in PDF format. Send electronic reports by email to Harald.

Lab results will be reported to LADOK at approximately the same time as the exam results. If you are late in submitting your report, you will have to wait until June or August for your points.

Lab times

There are lots of lab times in the schedule, but you don't have to use those times if you don't want to. During the scheduled times, Harald will be available for questions, either in his office or in the computer lab.

Page responsible: Harald Nautsch
Last updated: 2017-06-08