Burrows-Wheeler Transform Part1

1/20/2019

COLT MCANLIS: Inspirtation's a funny thing. I mean, there you are idly going about your day, eating a banana, and boom, some amazing revelation hits you. How fitting is it that all of the technological advances we've had over the past 40 years have been because of these sort of crazy, amazing inspirational thoughts.
Take Burrows-Wheeler Transform, for example. It's a very common compression algorithm for Linux and the web. But even the authors themselves will admit that they don't fully understand how they came up with the algorithm itself.
Huh, that's the trick with inspiration, I suppose. You get greatness from nothingness. But it doesn't really help you if you want to implement the algorithm itself. But fear not, young programmer. I'm here to help.
My name is Colt McAnlis, and this is "Compressor Head." [MUSIC PLAYING] One of funny things about information theory is that whole "theory" part. See, there's some interesting points about the theory that don't always work out right. Take entropy, for example, which generally measures the information content of a data set given in bits.
But one of the issues with entropy is that it fails to take into account the order of the symbols. See, regardless of how I shuffle this data set of 0 through 9 here, it will always have an entropy of 4. But if you've been watching the rest of our shows, you know that order does, in fact, matter.
Take delta coding, for example. If we have this variation of our data set, we can see that the delta-encoded version really doesn't provide us with any lower entropy. However, if we sort this data set, giving us the nice linear number range from 0 to 9, the delta-encoded version produces a much lower entropy than the source stream. Now, in an ideal world, we'd be able to just apply this type of sort to all of our data and end up with a perfectly compressible stream.
That's not actually how it works. See, pure sorts are generally one directional. That is, once you sort it, you can't get it back in its original order without a whole ton of extra data to tell you where everything goes. So we can't purely sort it, but we can get close. See, the Burrows-Wheeler transform, or BWT, will shuffle around the data to cluster together groups of the same symbol as much as it possibly can.
The benefit of this is that once the symbols have been clustered together, they effectively have an ordering, which can make it more compressible for other algorithms that can take advantage of it. Now, remember, this isn't a perfect sorting algorithm. It's more of a rough clustering algorithm. Well, technically, it's a lexographical permutation.
You know what? Anyhow, the point is that BWT has one amazing characteristic. We can encode to and decode from BWT without having to add any significant additional data to our stream. Let's take a look. Go! MAGNUS HYTTSTEN: Whoa!
COLT MCANLIS: We want him talking? The Transform part of Burrows-Wheeler Transform starts by creating a table of shifted permutations for your input stream. So let's take the word "banana" as an example and write that in the first row of our table. Now, on each row under that, let's do a rotational shift of that word to the right one character.

0 Comments

Burrows-Wheeler Transform Part1

Leave a Reply.

Author

Archives

Categories