Compress Wikipedia

Wikipedia is the online encyclopedia that everyone can edit.  An offline copy of Wikipedia would be useful for many reasons, such as reading on mobile devices when network service is unavailable, or as an archive in case of a prolonged Internet outage.  Wikipedia stores a lot of data, so an offline copy is only practical if it compresses well.  There is even a contest for solving this problem well.

General-purpose text compression algorithms such as DEFLATE (gzip, which combines LZ77 with Huffman coding) and the Burrows-Wheeler transform (bzip2) do a decent job, but it is possible to do even better.  Standard text compression algorithms limit themselves to linear or near-linear running time so that they stay practical on large inputs.  However, we might be able to do better with a relatively high-order polynomial-time algorithm.  One approach would be to use a polynomial-time approximation scheme for an NP-complete problem that corresponds to text compression.

Project:

  • Reduce hypertext compression to a known approximable NP-complete problem.
  • Implement the approximation algorithm.
  • Compare the compression performance to competitors.
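
For the comparison step, a baseline could be measured along these lines.  This is a minimal sketch and not part of the original project: the helper name compression_report and the sample text are hypothetical, and Python's standard zlib, bz2, and lzma modules stand in for the competitors, since the original does not say which tools would be compared.

```python
import bz2
import lzma
import zlib


def compression_report(data: bytes) -> dict[str, float]:
    """Return compressed size divided by original size for each baseline codec.

    Hypothetical helper for illustration; lower ratios mean better compression.
    """
    if not data:
        raise ValueError("need a non-empty sample")
    codecs = {
        "zlib": lambda b: zlib.compress(b, 9),         # DEFLATE, the algorithm behind gzip
        "bz2": lambda b: bz2.compress(b, 9),           # Burrows-Wheeler transform based
        "lzma": lambda b: lzma.compress(b, preset=9),  # LZMA in the xz container
    }
    return {name: len(compress(data)) / len(data) for name, compress in codecs.items()}


if __name__ == "__main__":
    # Assumed stand-in input; a real run would read an actual Wikipedia dump.
    sample = ("[[Wikipedia]] is the online encyclopedia that everyone can edit. " * 200).encode("utf-8")
    for name, ratio in compression_report(sample).items():
        print(f"{name:5s} {ratio:.4f}")
```

A new compressor would be run on the same input, and its ratio listed next to these.  Running time is worth recording as well, because the proposal deliberately trades speed for compressed size.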