- The real meat of the story is in the referenced blog post: https://blog.codingconfessions.com/p/how-unix-spell-ran-in-64kb-ram

  TL;DR - If you're short on time, here's the key engineering story (toy Python sketches of each step follow the list):

  - McIlroy's first innovation was a clever linguistics-based stemming algorithm that reduced the dictionary to just 25,000 words while improving accuracy.
  - For fast lookups, he initially used a Bloom filter, perhaps one of its first production uses. Interestingly, Dennis Ritchie provided the implementation. They tuned it to such a low false-positive rate that they could skip actual dictionary lookups.
  - When the dictionary grew to 30,000 words, the Bloom filter approach became impractical, leading to innovative hash-compression techniques.
  - They computed that 27-bit hash codes would keep the collision probability acceptably low, but the codes still needed compression.
  - McIlroy's solution was to store the differences between sorted hash codes, after discovering that these differences follow a geometric distribution.
  - Using Golomb coding, a compression scheme designed for geometric distributions, he achieved 13.60 bits per word, remarkably close to the theoretical minimum of 13.57 bits.
  - Finally, he partitioned the compressed data to speed up lookups, trading a small memory increase (final size ~14 bits per word) for significantly faster performance.
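To make the stemming idea concrete, here's a toy sketch. The suffix rules below are made up for illustration; McIlroy's actual affix analysis was far more elaborate. The point is just that only stems need to be stored, while derived forms are recognized by stripping affixes at lookup time:

```python
# Toy affix stripping (NOT McIlroy's actual rules): instead of storing
# "bake", "baked", "bakes", "baking" separately, store only the stem
# "bake" and derive the rest when checking a word.

SUFFIX_RULES = [
    ("ies", "y"),   # "cities" -> "city"
    ("ing", ""),    # "baking" -> "bak" (then also try re-adding "e")
    ("ed", ""),
    ("es", ""),
    ("s", ""),
]

def candidate_stems(word: str) -> list[str]:
    """Generate plausible stems by stripping common suffixes."""
    stems = [word]
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            base = word[: -len(suffix)] + replacement
            stems.extend([base, base + "e"])  # "bak" -> "bake"
    return stems

def is_spelled_ok(word: str, dictionary: set[str]) -> bool:
    return any(stem in dictionary for stem in candidate_stems(word.lower()))

dictionary = {"bake", "city", "run"}
print([is_spelled_ok(w, dictionary) for w in ["baked", "baking", "cities", "runz"]])
# [True, True, True, False]
```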
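A minimal Bloom filter sketch, assuming generic parameters. The sizes, the number of hash functions, and the MD5-based index derivation here are illustrative choices of mine, not what Ritchie's implementation did:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions set/check k bits in a bit array.

    Never gives false negatives; the false-positive rate is tuned by sizing
    m (bits) and k (hashes) against the expected number of words.
    """

    def __init__(self, m_bits: int, k_hashes: int):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, word: str):
        # Derive k bit positions from one digest (illustrative; the real
        # spell used purpose-built hashing, not MD5).
        digest = hashlib.md5(word.encode()).digest()
        for i in range(self.k):
            chunk = int.from_bytes(digest[2 * i: 2 * i + 4], "big")
            yield chunk % self.m

    def add(self, word: str):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, word: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))

bf = BloomFilter(m_bits=400_000, k_hashes=6)  # sized for ~25,000 words
for w in ["bake", "city", "run"]:
    bf.add(w)
print(bf.might_contain("bake"), bf.might_contain("xyzzy"))  # True False (probably)
```

The false-positive rate falls as you add bits and hash functions, which is the knob they turned until a hit in the filter was trustworthy enough to skip the real dictionary lookup.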
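The 27-bit reasoning, roughly as the post presents it (the back-of-envelope framing here is mine):

```python
# Hash each of n dictionary words into a 27-bit range. A misspelled word
# is wrongly accepted only if its hash collides with some real word's hash.
n = 30_000
space = 2 ** 27             # 134,217,728 possible hash values
false_positive = n / space  # probability a random wrong word collides
print(f"{false_positive:.6f} ~= 1 in {space // n}")
# 0.000224 ~= 1 in 4473

# But storing the codes raw blows the memory budget, hence compression:
raw_bytes = n * 27 / 8
print(f"raw storage: {raw_bytes / 1024:.0f} KiB")  # ~99 KiB, too big for 64 KiB
```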
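A quick simulation of why the differences look geometric: hash codes that are roughly uniform over [0, 2^27) have exponentially decaying gaps between consecutive sorted values, which is a geometric distribution once discretized:

```python
import random

# 30,000 codes drawn uniformly from [0, 2**27): the gaps between
# consecutive sorted codes are ~geometric with mean 2**27 / 30,000 ~= 4474.
random.seed(1)
n, space = 30_000, 2 ** 27
codes = sorted(random.sample(range(space), n))
gaps = [b - a for a, b in zip(codes, codes[1:])]

mean = sum(gaps) / len(gaps)
print(f"mean gap: {mean:.0f} (expected ~{space / n:.0f})")

# Geometric signature: P(gap > t) decays exponentially in t.
for t in (2_000, 4_000, 8_000, 16_000):
    frac = sum(g > t for g in gaps) / len(gaps)
    print(f"P(gap > {t:>6}) ~= {frac:.3f}")
```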
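And a sketch of Golomb coding applied to those gaps. The parameter choice below is the textbook near-optimal one for a geometric source; I'm not claiming it's exactly the parameter spell used, but the bits-per-word figure lands right around the numbers from the post:

```python
import math
import random

def golomb_encode(values, m):
    """Golomb-encode non-negative integers as a bit string:
    quotient in unary, remainder in truncated binary."""
    b = math.ceil(math.log2(m))
    cutoff = (1 << b) - m          # remainders below this use b-1 bits
    out = []
    for x in values:
        q, r = divmod(x, m)
        out.append("1" * q + "0")  # unary quotient
        if r < cutoff:
            out.append(format(r, f"0{b - 1}b"))
        else:
            out.append(format(r + cutoff, f"0{b}b"))
    return "".join(out)

random.seed(1)
n, space = 30_000, 2 ** 27
codes = sorted(random.sample(range(space), n))
gaps = [b - a for a, b in zip(codes, codes[1:])]

# Near-optimal Golomb parameter for geometric data with success prob p:
p = n / space
m = round(-1 / math.log2(1 - p))   # ~= ln(2) * mean gap ~= 3101
bits = golomb_encode(gaps, m)
print(f"m = {m}, {len(bits) / len(gaps):.2f} bits per word")  # ~13.6-13.7

# Entropy of the geometric gap distribution is the theoretical floor:
entropy = (-(1 - p) * math.log2(1 - p) - p * math.log2(p)) / p
print(f"entropy floor: {entropy:.2f} bits per word")  # ~13.57
```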
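Finally, a sketch of the partitioned layout; this is my reconstruction of the idea, not the actual on-disk format. An index of each partition's first code lets a lookup decode one small block instead of the whole compressed stream:

```python
import bisect
import random

BLOCK = 64  # codes per partition (illustrative)

def build(codes):
    codes = sorted(codes)
    index, blocks = [], []
    for i in range(0, len(codes), BLOCK):
        chunk = codes[i:i + BLOCK]
        index.append(chunk[0])  # real spell would also store a bit offset
        # Store gaps between consecutive codes; in spell these would be
        # the Golomb-coded bits rather than Python ints.
        blocks.append([b - a for a, b in zip(chunk, chunk[1:])])
    return index, blocks

def contains(index, blocks, code):
    i = bisect.bisect_right(index, code) - 1
    if i < 0:
        return False
    value = index[i]                 # partition's first code, from the index
    if value == code:
        return True
    for gap in blocks[i]:            # sequential decode of one partition only
        value += gap
        if value == code:
            return True
        if value > code:
            return False
    return False

random.seed(1)
codes = random.sample(range(2 ** 27), 30_000)
index, blocks = build(codes)
print(contains(index, blocks, codes[123]), contains(index, blocks, 42))
# True False
```

The index is the small memory overhead, which is presumably where the final ~14 bits per word (versus 13.60) comes from.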
- For anyone struggling: the Lemmy web interface added a colon to the URL for the blog post link. Here's a clickable version without the colon: https://blog.codingconfessions.com/p/how-unix-spell-ran-in-64kb-ram
- Thanks, and sorry about that! I've removed the colon from near my URL now, just in case.
 
- Thank you 
- The blog post is an incredible read. 
 
- Long article for one sentence of trivia and no info on the algo itself. The death of the internet is upon us.
- Doesn't even name the algorithm, and somehow spells LZMA wrong, despite having just written it out longhand.
- Well, it's PC Gamer.
- [edit] I still can't figure out if they're referencing LZW encoding… the L and Z being the same Lempel and Ziv as in LZMA, but with Welch having a different solution for the rest of the algorithm due to size constraints.
- Probably mostly AI written.
 
- I’d like to imagine they took the short trivia fact and applied the inverse of the compression algorithm to bloat it into something that satisfied the editor. 
 
- If it ain't broke, don't fix it.
- !lemmysilver
- Thank you for voting. You can vote again in 24 hours.
 
- Only 1 GiB of RAM? Moooom! 
- Shut up, Johnny, Voyager's still out there with way less.







