I have a lot of tar and disk image backups, as well as raw photos, that I want to squeeze onto a hard drive for long-term offline archival, and I want to make the most of the drive’s capacity, so I want to compress them at the highest ratio supported by standard tools. I’ve zeroed out the free space in my disk images, so saving an entire image should only take about as much space as the actual files on it. Raw images in my experience can shrink by a third or even half at max compression (and I would assume that’s lossless, since file-level compression can regenerate the original file in its entirety?)
I’ve heard horror stories of compressed files being made completely unextractable by a single corrupted bit, but I don’t know how much of a risk that still is in 2025. Since I plan to leave the hard drive unplugged for long periods, I want the best chance of recovery if something does go wrong.
I also want the files to be extractable with just the standard tools on a Linux/Unix system, since this is my disaster recovery plan and I want to be able to work with it from a Linux live image without installing any extra packages when my server dies. Hence I’m only looking at gz, xz, or bz2.
So out of the three, which is generally considered more stable and corruption resistant when the compression ratio is turned all the way up? Do any of them have the ability to recover from a bit flip or at the very least detect with certainty whether the data is corrupted or not when extracting? Additionally, should I be generating separate checksum files for the original data or do the compressed formats include checksumming themselves?
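For reference, the “separate checksum files” I have in mind would be something like this rough Python sketch (the /mnt/archive path and the .tar.xz pattern are just placeholders), writing sha256sum-compatible sidecars:

```python
import hashlib
from pathlib import Path

ARCHIVE_ROOT = Path("/mnt/archive")  # example path, adjust to the actual mount point

def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so huge images don't need to fit in RAM."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

# Write a .sha256 sidecar next to every compressed archive.
for archive in ARCHIVE_ROOT.rglob("*.tar.xz"):
    digest = sha256_of(archive)
    sidecar = archive.parent / (archive.name + ".sha256")
    # Same "<hash>  <filename>" layout that `sha256sum -c` understands.
    sidecar.write_text(f"{digest}  {archive.name}\n")
    print(f"{archive.name}: {digest}")
```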
Honestly, given that they should be purely compressing data, I would suppose that none of the formats you mentioned has ECC recovery or built-in checksums (but I might be very mistaken on this). I think I’ve only seen that in WinRAR, but also try other GUI tools like 7-Zip and check their features for anything that looks like what you need; if the formats support ECC, then surely 7-Zip will offer you the option.
I just wanted to point out that, no matter what anyone else might say, if you split your data across multiple compressed files, the chances of a single rotten bit wiping out your entire library are much lower, i.e. try to make it so that only a small chunk of your data is lost if something catastrophic happens (rough sketch of what I mean below).
However, if one of your filesystem-relevant bits rots, you may be in for a much longer recovery session.
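To make that concrete, one archive per top-level directory instead of a single giant tarball; a rough Python sketch, with the source/destination paths and the xz preset only as placeholders:

```python
import tarfile
from pathlib import Path

SOURCE = Path("/data/photos")   # example source: one archive per subdirectory
DEST = Path("/mnt/archive")     # example destination on the backup drive

DEST.mkdir(parents=True, exist_ok=True)

for subdir in sorted(p for p in SOURCE.iterdir() if p.is_dir()):
    out = DEST / f"{subdir.name}.tar.xz"
    # preset=9 is xz's maximum compression level; each archive is independent,
    # so one corrupted file only takes its own chunk of the library with it.
    with tarfile.open(out, "w:xz", preset=9) as tar:
        tar.add(subdir, arcname=subdir.name)
    print(f"wrote {out}")
```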
AFAIK none of those formats include any mechanism for error correction. You’d likely need to use a separate program like zfec to generate the extra parity data. Bzip2 and Zstandard are somewhat resistant to errors since they encode in blocks, but in the event of bit rot the entire affected block may still be unrecoverable.
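I haven’t used zfec for this myself, but just to illustrate the parity idea, here’s a toy Python sketch using plain XOR parity: one parity block can rebuild any single missing chunk, provided you know which one is gone (real tools like zfec use Reed-Solomon and can do much better):

```python
from functools import reduce

def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Pretend these are equal-sized chunks of a compressed archive.
chunks = [b"chunk-0!", b"chunk-1!", b"chunk-2!", b"chunk-3!"]
parity = xor_blocks(chunks)          # stored alongside the data chunks

# Later, chunk 2 turns out to be unreadable...
surviving = [c for i, c in enumerate(chunks) if i != 2]
rebuilt = xor_blocks(surviving + [parity])

assert rebuilt == chunks[2]
print("recovered:", rebuilt)
```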
Alternatively, if you’re especially concerned with robustness then it may be more advisable to simply maintain multiple copies across different drives or even to create an off-site backup. Parity bits are helpful but they won’t do you much good if your hard drive crashes or your house catches fire.
Lzip
Compressed files are just as susceptible to bitrot as any other file. The filesystem is where you want to start if you’re talking about archival. Modern checksumming filesystems can detect corruption, and with redundancy they can repair it, so BTRFS or ZFS with a proper configuration is what you’re looking for to keep files from silently getting corrupted.
That being said, if you store something on a medium and then don’t use said medium (lock it in a safe or whatever), the chances you’ll end up with corrupted files approach zero. Bitrot and general file corruption happen as the bits on a disk are shifted around, so by not using that disk, the likelihood of it happening is nearly zero.
Bitrot happens even when sitting around. Magnetic domains flip. SSD cells leak electrons.
Reading and rewriting with an ECC system is the only way to prevent bit rot. It’s particularly critical for SSDs.
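A poor man’s version of that periodic pass, assuming you wrote sha256 sidecar files when creating the archives (as suggested earlier in the thread), is a yearly re-read like this Python sketch; note it only detects damage, it doesn’t repair anything:

```python
import hashlib
from pathlib import Path

ARCHIVE_ROOT = Path("/mnt/archive")  # example mount point of the offline drive

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

bad = []
for sidecar in ARCHIVE_ROOT.rglob("*.sha256"):
    expected, _, name = sidecar.read_text().strip().partition("  ")
    archive = sidecar.parent / name
    if not archive.exists() or sha256_of(archive) != expected:
        bad.append(archive)

# Any mismatch means the copy on this drive can no longer be trusted
# and should be restored from another copy.
print("corrupted or missing:", [str(p) for p in bad] or "none")
```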
Upgrade from compression tools to backup tools. Look into using restic (a tool with dedup, compression, and checksumming) on a filesystem that also checksums and compresses (btrfs/zfs) - that’s probably the most reasonable protection and space saving available. Between restic’s checks and the filesystem, you will know when a bit flips, and that’s when you replace the hardware (restoring from one of your other backups).
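If it helps, here’s roughly what that workflow looks like driven from Python; the repo location and password handling are placeholders, and plain shell works just as well:

```python
import os
import subprocess

# Placeholder repo path and password; use a proper secret store in practice.
env = {**os.environ,
       "RESTIC_REPOSITORY": "/mnt/archive/restic-repo",
       "RESTIC_PASSWORD": "change-me"}

def restic(*args: str) -> None:
    subprocess.run(["restic", *args], env=env, check=True)

restic("init")                                      # one-time repository setup
restic("backup", "--compression", "max", "/data")   # deduplicated, compressed snapshot (needs restic 0.14+ / repo v2)
restic("check", "--read-data")                      # re-reads everything and verifies checksums
```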
Honestly, amazing question. I’ve lost entire 7z archives to minor amounts of bit rot.
Error correction and compression are usually at odds. Error correction relies on redundant data to identify what was corrupted, and it also helps if the error-correction process is run frequently. Storing the drive away offline works against that, and the added redundancy eats into your space gains. You can look into different error-correction software or techniques, e.g. RAID. I recommend following the 3-2-1 backup rule; even if you can’t do every step, doing the ones you can still helps.

Side note: it’s worth investigating which storage brand/medium/grade you want, since some are more resistant than others for long-term vs. short-term use. Even unused storage degrades over time, whether it’s the physical components, the magnetic charge weakening, or the electric charge representing your data. So again, offline all the time isn’t the best; spin it up a couple of times a year, if not more, to make sure errors don’t accumulate.
Sadly, I won’t give specifics because I haven’t tried your use case and I’m not familiar with it, but hopefully the keywords help.
Error correction and compression are usually at odds.
Not really. If your data compresses well, you can easily compress it by 60 or 70%, then add Reed-Solomon forward-error-correction blocks at something like 20% redundancy, and you’d still be ahead overall.
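Back-of-the-envelope, with made-up but plausible numbers (Python just for the arithmetic):

```python
original_gb = 100          # hypothetical archive size before compression
ratio = 0.35               # "compresses well": keeps ~35% of the original
parity_overhead = 0.20     # Reed-Solomon parity at 20% redundancy

stored = original_gb * ratio * (1 + parity_overhead)
print(f"{stored:.0f} GB on disk vs {original_gb} GB uncompressed")  # ~42 GB
```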
There are a lot of smart answers in here, but personally I wouldn’t risk it by using a compressed archive. Disk space is cheap.
This is a money game: how much are you willing to invest?