HN 🔥 90 points · 💬 53 comments

In Praise of bzip: The Achievements of a "Good Old Compression Tool" Worth Passing On to Today's Engineers

signa11
about 12 hours ago

Discussion (10 comments)

0
signa11 (OP) 👍 90
about 12 hours ago

This post pays tribute to bzip, which once played an important role in the world of data compression. Although it has since been superseded by bzip2 and other modern algorithms, the linked article looks back on its design philosophy and historical background and celebrates its value anew.

1
elophanto_agent
about 11 hours ago

bzip2 is the compression algorithm equivalent of that one coworker who does incredible work but nobody ever talks about. meanwhile gzip gets all the credit because it's "good enough"

2
joecool1029
about 11 hours ago

Just use zstd unless you absolutely need to save a tiny bit more space. bzip2 and xz are extremely slow to compress.
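The speed gap being described can be eyeballed with nothing but the Python standard library. zstd itself isn't in the stdlib of most Python versions, so this sketch uses zlib (gzip's codec) as the "fast" stand-in against bzip2; it illustrates the kind of measurement, not the exact zstd-vs-bzip2 numbers:

```python
import bz2
import time
import zlib

# ~880 KB of repetitive text, just to have something measurable.
data = b"the quick brown fox jumps over the lazy dog\n" * 20_000

def bench(name, compress):
    # Time one compression pass and report the resulting size.
    t0 = time.perf_counter()
    out = compress(data)
    ms = (time.perf_counter() - t0) * 1000
    print(f"{name}: {len(data)} -> {len(out)} bytes in {ms:.1f} ms")
    return out

fast = bench("zlib -6 ", lambda d: zlib.compress(d, 6))
slow = bench("bzip2 -9", lambda d: bz2.compress(d, 9))
```

On typical inputs the bz2 pass takes several times longer than the zlib pass, which is the tradeoff the comment is pointing at.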

3
saghm
about 11 hours ago

Early on, the article mentions that xz and zstd have gotten more popular than bzip, and my admittedly naive understanding is that they're considered to have better tradeoffs in terms of compression time and overall space saved. The performance section heavily discusses the encoding performance of gzip and bzip, but unless I'm missing something, the only references to xz or zstd in that section are brief handwaving about the decoding times probably being similar.

My impression is that this article has a lot of technical insight into how bzip compares to gzip, but it fails to actually account for the real cause of bzip's diminished popularity relative to the non-gzip alternatives that it admits have been the more popular choices in recent years.

4
hexxagone
about 10 hours ago

Notice that bzip3 has close to nothing to do with bzip2. It is a different BWT implementation with a different entropy codec, from a different author (as noted in the GitHub description "better and stronger spiritual successor to BZip2").

7
idoubtit
about 8 hours ago

My experience does not match theirs when compressing text and code:

bzip might be suboptimal as a general-purpose compression format, but it’s great for text and code. One might even say the b in bzip stands for “best”.

I've just checked again with a 1GB SQL file. bzip2 -9 shrinks it to 83MB. zstd -19 --long to 52MB.

Others have compressed the Linux kernel and found that bzip2's is about 15% larger than zstd's.
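This kind of size comparison is easy to reproduce with stdlib codecs. zstd isn't available in most stdlibs, so this sketch compares bz2 against lzma (the xz family) on synthetic SQL-ish text; the exact numbers are illustrative, not the commenter's 1 GB measurement:

```python
import bz2
import lzma

# A crude stand-in for "text and code": repetitive SQL-ish lines.
text = "".join(
    f"INSERT INTO t VALUES ({i}, 'row {i % 97}');\n" for i in range(50_000)
).encode()

bz = bz2.compress(text, 9)          # roughly bzip2 -9
xz = lzma.compress(text, preset=9)  # roughly xz -9
print(f"original {len(text):>9} bytes")
print(f"bzip2 -9 {len(bz):>9} bytes")
print(f"xz -9    {len(xz):>9} bytes")
```

On long repetitive inputs like this, the xz family tends to come out ahead because its match window spans far more of the file than bzip2's fixed 900 KB blocks.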

8
thesz
about 6 hours ago

BWT is a prediction by partial match (PPM) in disguise.

Consider "bananarama":

  "abananaram"
  "amabananar"
  "ananaramab"
  "anaramaban"
  "aramabanan"
  "bananarama"
  "mabananara"
  "nanaramaba"
  "naramabana"
  "ramabanana"

The last symbols on each line get context from the first symbols of the same line; that's what the rotation buys you.

But, due to sorting, contexts are not contiguous for the (last) predicted character, and long dependencies are broken. That is why MTF, which implicitly transforms direct symbol statistics into something like Zipfian [1] statistics, encodes BWT's output well.
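The rotation table above, plus the MTF step, fits in a few lines. This is a toy sketch (real bzip2 uses suffix sorting over 900 KB blocks, a full byte alphabet for MTF, and extra bookkeeping to make the transform invertible):

```python
def bwt(s: str) -> str:
    # Sort all rotations of s; the BWT output is the last column.
    rots = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rots)

def mtf(s: str) -> list[int]:
    # Move-to-front over the symbols actually present: recently seen
    # symbols get small indices, so BWT's runs become runs of zeros.
    table = sorted(set(s))
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

print(bwt("bananarama"))       # mrbnnaaaaa
print(mtf(bwt("bananarama")))  # [2, 4, 3, 4, 0, 4, 0, 0, 0, 0]
```

The last column of the sorted table above reads "mrbnnaaaaa", and after MTF it is mostly zeros, exactly the skewed distribution the entropy coder then exploits.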

[1] https://en.wikipedia.org/wiki/Zipf%27s_law

Given that, the author may find PPM*-based compressors more performant compression-wise. The Large Text Compression Benchmark [2] tells us exactly that: a "durilka-bububu" compressor that uses PPM fares better than BWT, by almost a third.

9
eichin
about 6 hours ago

Interesting detail on the algorithm, but it seems to completely miss that if you care about non-streaming performance, there are parallel versions of xz and gzip (pxzip encodes compatible metadata about the break-up points, so that while xz can still decompress the output, pxzip can use as many cores as you let it have). Great for disk-image OS installers, which is why I was benchmarking it in the first place; but that was about 5 years back, and I don't know if those have gotten upstreamed...
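The trick behind those parallel tools is that compression is split into independent blocks that can each run on their own core, with enough framing kept to reassemble the stream. A minimal sketch of that idea with stdlib pieces (an illustration of block-parallel compression in general, not the pxzip format; zlib stands in for the real codec and releases the GIL, so threads genuinely run in parallel):

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def parallel_compress(data: bytes, block_size: int = 1 << 16) -> list[bytes]:
    # Split into independent blocks; each compresses with no shared
    # state, which is what lets every block use its own core.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda b: zlib.compress(b, 9), blocks))

def decompress(blocks: list[bytes]) -> bytes:
    # Decompress each block and reassemble in order.
    return b"".join(zlib.decompress(b) for b in blocks)

data = b"disk image contents " * 100_000
packed = parallel_compress(data)
print(f"{len(data)} bytes -> {sum(len(b) for b in packed)} bytes "
      f"in {len(packed)} blocks")
```

The cost of the split is slightly worse compression, since matches can't cross block boundaries; tools like pigz and pixz make that same tradeoff.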