HN🔥 186
💬 25

Hunting down hard-to-reproduce bugs with probability! "Git bayesect", a Bayesian-inference version of git bisect, has arrived

hauntsaninja
5 days ago

Discussion (10 comments)

0
hauntsaninja OP 🔥 186
5 days ago

A new tool called "Git bayesect" for pinpointing the commit that introduced a hard-to-reproduce (non-deterministic) bug has been published on Hacker News. Regular git bisect uses binary search and requires a clear pass/fail verdict at every step, whereas Git bayesect applies Bayesian inference: even for bugs that only occur occasionally rather than reproducing 100% of the time, it can efficiently narrow in on the statistically suspicious commits. A must-see tool for engineers struggling with flaky tests.

1
hauntsaninja
5 days ago

git bisect works great for tracking down regressions, but relies on the bug presenting deterministically. But what if the bug is non-deterministic? Or worse, your behaviour was always non-deterministic, but something has changed, e.g. your tests went from somewhat flaky to very flaky.

In addition to the repo linked in the title, I also wrote up a little bit of the math behind it here: https://hauntsaninja.github.io/git_bayesect.html
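The core idea the OP describes can be sketched in a few lines. This is a hypothetical toy model, not bayesect's actual code: it assumes a single unknown breaking commit and known failure rates `p_good` (baseline flakiness) and `p_bad` (after the break), then does a plain Bayes update over which commit is the culprit.

```python
def update(posterior, test_idx, failed, p_good=0.1, p_bad=0.9):
    """Bayes update of P(first bad commit = k) after one test run at test_idx.
    If k <= test_idx, the commit under test already contains the bug and
    fails with probability p_bad; otherwise it fails at the baseline p_good."""
    weights = []
    for k, pk in enumerate(posterior):
        p_fail = p_bad if k <= test_idx else p_good
        likelihood = p_fail if failed else 1 - p_fail
        weights.append(pk * likelihood)
    z = sum(weights)
    return [w / z for w in weights]

n = 8
posterior = [1 / n] * n  # uniform prior over which commit broke things
# Simulated run where commit 5 is the true culprit (made-up observations):
for idx, failed in [(3, False), (6, True), (5, True), (4, False), (5, True)]:
    posterior = update(posterior, idx, failed)
best = max(range(n), key=lambda k: posterior[k])
```

Unlike binary search, a single "failed" observation below the suspected point doesn't derail anything; it just shifts probability mass, which is why flaky signals are survivable.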

2
supermdguy
5 days ago

Okay this is really fun and mathematically satisfying. Could even be useful for tough bugs that are technically deterministic, but you might not have precise reproduction steps.

Does it support running a test multiple times to get a probability for a single commit instead of just pass/fail? I guess you’d also need to take into account the number of trials to update the Beta properly.
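On the multiple-trials question: Beta-Bernoulli conjugacy makes the batch update trivial, since n independent runs at one commit just add their pass/fail counts to the Beta parameters. A rough sketch with hypothetical names (not bayesect's API), including the closed-form marginal likelihood the writeup relies on:

```python
import math

def log_beta(a, b):
    """log of the Beta function, via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def marginal_likelihood(alpha, beta, failures, passes):
    """Closed-form marginal likelihood of the observed counts under a
    Beta(alpha, beta) prior on the unknown failure rate (order-independent,
    so the binomial coefficient is omitted)."""
    return math.exp(log_beta(alpha + failures, beta + passes) - log_beta(alpha, beta))

# Conjugate update: 10 runs at one commit, 3 of them failed.
alpha, beta = 1.0, 1.0                      # uniform prior on the failure rate
failures, passes = 3, 7
alpha, beta = alpha + failures, beta + passes
posterior_mean = alpha / (alpha + beta)     # estimated failure rate after 10 runs
```

So "take into account the number of trials" falls out for free: the counts carry exactly the information the number of trials adds.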

3
Retr0id
1 day ago

Super cool!

A related situation I was in recently was where I was trying to bisect a perf regression, but the benchmarks themselves were quite noisy, making it hard to tell whether I was looking at a "good" vs "bad" commit without repeated trials (in practice I just did repeats).

I could pick a threshold and use bayesect as described, but that involves throwing away information. How hard would it be to generalize this to let me plug in a raw benchmark score at each step?
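One natural generalization for the benchmark case is to swap the Bernoulli likelihood for a Gaussian one, so the raw score feeds the update directly instead of being thresholded. This is a hypothetical sketch, not a bayesect feature: it assumes scores cluster around a known "good" mean and "bad" mean with shared noise `sigma`.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def update_with_score(posterior, test_idx, score,
                      mu_good=100.0, mu_bad=120.0, sigma=8.0):
    """Changepoint update from a raw benchmark score instead of pass/fail:
    commits before the regression score ~N(mu_good, sigma), after ~N(mu_bad, sigma)."""
    weights = [pk * normal_pdf(score, mu_bad if k <= test_idx else mu_good, sigma)
               for k, pk in enumerate(posterior)]
    z = sum(weights)
    return [w / z for w in weights]

n = 6
posterior = [1 / n] * n
# Made-up noisy benchmark runs where the regression landed at commit 3:
for idx, score in [(2, 99.0), (4, 118.5), (3, 121.0)]:
    posterior = update_with_score(posterior, idx, score)
```

A score near either mean is strong evidence, while a score in between contributes little, which is exactly the information a hard threshold throws away.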

4
davidkunz
1 day ago

Useful for tests with LLM interactions.

5
SugarReflex
1 day ago

I hope this comment is not out of place, but I am wondering what the application for all this is? How can this help us or what does it teach us or help us prove? I am asking out of genuine curiosity as I barely understand it but I believe it has something to do with probability.

edit: thanks for the responses! I was not even familiar with git bisect before this, so I've got some new things to learn.

6
rs545837
1 day ago

Really fun work, and the writeup on the math is great. The Beta-Bernoulli conjugacy trick making the marginal likelihood closed-form is elegant.

We ran benchmarks comparing bisect vs bayesect across flakiness levels. At 90/10, bisect drops to ~44% accuracy while bayesect holds at ~96%. At 70/30 it's 9% vs 67%. The entropy-minimization selection is key here since naive median splitting converges much slower.

One thing we found: you can squeeze out another 10-15% accuracy by weighting the prior with code structure. Commits that change highly-connected functions (many transitive dependents in the call graph) are more likely culprits than commits touching isolated code. That prior is free, zero test runs needed.

Information-theoretically, the structural prior gives you I_prior bits before running any test, reducing the total tests needed from log2(n)/D_KL to (log2(n) - I_prior)/D_KL. On 1024-commit repos with 80/20 flakiness: 92% accuracy with graph priors vs 85% pure bayesect vs 10% git bisect.
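As a back-of-envelope check of the formula above, assuming D_KL is the divergence (in bits) between the bad and good failure-rate Bernoullis, the numbers for a 1024-commit repo at 80/20 flakiness come out as:

```python
import math

def kl_bernoulli_bits(p, q):
    """D_KL(Bern(p) || Bern(q)) in bits: information per test run that
    distinguishes a bad commit (fail rate p) from a good one (fail rate q)."""
    return p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))

def expected_tests(n_commits, p_bad, p_good, prior_bits=0.0):
    """Rough information-theoretic lower bound: (log2(n) - I_prior) / D_KL."""
    return (math.log2(n_commits) - prior_bits) / kl_bernoulli_bits(p_bad, p_good)

baseline = expected_tests(1024, 0.8, 0.2)         # no structural prior
with_prior = expected_tests(1024, 0.8, 0.2, 2.0)  # assume the graph prior is worth ~2 bits
```

At 80/20 each run carries 1.2 bits, so locating one commit among 1024 needs about 10/1.2 ≈ 8.3 runs; a prior worth 2 bits shaves that to about 6.7. The "2 bits" figure is an illustrative assumption, not a number from the sem benchmarks.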

We're building this into sem (https://github.com/ataraxy-labs/sem), which has an entity dependency graph that provides the structural signal.

7
convexly
1 day ago

Do you expose the posterior probabilities anywhere so you can see how confident it is in the result?

8
IshKebab
1 day ago

Does this tool assume it takes the same amount of time to test two commits once as it does to test one commit twice? Maybe true for interpreted languages, but if you're waiting 15 minutes to compile LLVM you're probably going to want to run your 1 second flaky test more than once. Probably pretty trivial to fix this though?

Great idea anyway!
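One way to model the asymmetric costs raised here, as a hypothetical sketch rather than anything the tool currently does, is to score each candidate action by expected information gained per second, charging the checkout/build cost only when switching commits:

```python
def best_action(candidates, info_gain, current=None,
                build_secs=900.0, test_secs=1.0, batch_sizes=(1, 5, 25)):
    """Pick the (commit, n_runs) pair maximizing expected bits per second.
    Re-running the test at the already-built current commit skips the
    checkout/compile cost; switching commits pays it once up front."""
    best, best_rate = None, -1.0
    for commit in candidates:
        setup = 0.0 if commit == current else build_secs
        for runs in batch_sizes:
            cost = setup + runs * test_secs
            rate = info_gain(commit, runs) / cost
            if rate > best_rate:
                best, best_rate = (commit, runs), rate
    return best

# Toy gain curve: extra runs at the same commit have diminishing returns.
def toy_gain(commit, runs):
    return 1.0 - 0.5 ** runs
```

With a 15-minute build and a 1-second test, this naturally prefers repeated runs at the current commit until the marginal information there is exhausted, which matches the intuition in the comment.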

9
volume_tech
1 day ago

the test-only commit case is an interesting edge for the priors. when bisecting flakiness regressions, commits that only touch test files are often the actual culprit (tightened timing assertion, changed retry logic, etc.) -- but a code-change signal would weight them lower than commits touching production code.