HN🔥 216
💬 104

Firefoxのクラッシュ、実は10%が「ビット反転」のせいだった!物理メモリの不具合に迫る

marvinborner
1日前

ディスカッション (11件)

0
marvinbornerOP🔥 216
1日前

Firefoxのクラッシュデータを詳細に分析した結果、全体の約1割が「ビットフリップ(メモリ内の0と1が予期せず入れ替わる現象)」によって引き起こされていることが明らかになりました。これはソフトウェア側のバグではなく、ハードウェアの劣化や宇宙線の影響といった物理的な要因が、ブラウザの安定性に無視できない影響を与えていることを示しています。エンジニアにとっても、デバッグでは追いきれないハードウェア由来の課題として注目を集めています。

1
thegrim33
1日前

A 5 part thread where they say they're "now 100% positive" the crashes are from bitflips, yet not a single word is spent on how they're supposedly detecting bitflips other than just "we analyze memory"?

2
adonovan
約24時間前

Very interesting. The Go toolchain has an (off by default) telemetry system. For Go 1.23, I added the runtime.SetCrashOutput function and used it to gather field reports containing stack traces for crashes in any running goroutine. Since we enabled it over a year ago in gopls, our LSP server, we have discovered hundreds of bugs.

Even with only about 1 in 1000 users enabling telemetry, it has been an invaluable source of information about crashes. In most cases it is easy to reconstruct a test case that reproduces the problem, and the bug is fixed within an hour. We have fixed dozens of bugs this way. When the cause is not obvious, we "refine" the crash by adding if-statements and assertions so that after the next release we gain one additional bit of information from the stack trace about the state of execution.

However there was always a stubborn tail of field reports that couldn't be explained: corrupt stack pointers, corrupt g registers (the thread-local pointer to the current goroutine), or panics dereferencing a pointer that had just passed a nil check. All of these point to memory corruption.

In theory anything is possible if you abuse unsafe or have a data race, but I audited every use of unsafe in the executable and am convinced they are safe. Proving the absence of data races is harder, but nonetheless races usually exhibit some kind of locality in what variable gets clobbered, and that wasn't the case here.

In some cases we have even seen crashes in non-memory instructions (e.g. MOV ZR, R1), which implicates misexecution: a fault in the CPU (or a bug in the telemetry bookkeeping, I suppose).

As a programmer I've been burned too many times by prematurely blaming the compiler or runtime for mistakes in one's own code, so it took a long time to gain the confidence to suspect the foundations in this case. But I recently did some napkin math (see https://github.com/golang/go/issues/71425#issuecomment-39685... (https://github.com/golang/go/issues/71425#issuecomment-3968582591)) and came to the conclusion that the surprising number of inexplicable field reports--about 10/week among our users--is well within the realm of faulty hardware, especially since our users are overwhelmingly using laptops, which don't have parity memory.

I would love to get definitive confirmation though. I wonder what test the Firefox team runs on memory in their crash reporting software.

3
camkego
約23時間前

It is rumored heavily on HN that when the first employee of Google, Craig Silverstein was asked about his biggest regret, he said: "Not pushing for ECC memory."

4
netcoyote
約19時間前

I've told this story before on HN, but my biz partner at ArenaNet, Mike O'Brien (creator of battle.net) wrote a system in Guild Wars circa 2004 that detected bitflips as part of our bug triage process, because we'd regularly get bug reports from game clients that made no sense.

Every frame (i.e. ~60FPS) Guild Wars would allocate random memory, run math-heavy computations, and compare the results with a table of known values. Around 1 out of 1000 computers would fail this test!

We'd save the test result to the registry and include the result in automated bug reports.

The common causes we discovered for the problem were:

  • overclocked CPU

  • bad memory wait-state configuration

  • underpowered power supply

  • overheating due to under-specced cooling fans or dusty intakes

These problems occurred because Guild Wars was rendering outdoor terrain, and so pushed a lot of polygons compared to many other 3d games of that era (which can clip extensively using binary-space partitioning, portals, etc. that don't work so well for outdoor stuff). So the game caused computers to run hot.

Several years later I learned that Dell computers had larger-than-reasonable analog component problems because Dell sourced the absolute cheapest stuff for their computers; I expect that was also a cause.

And then a few more years on I learned about RowHammer attacks on memory, which was likely another cause -- the math computations we used were designed to hit a memory row quite frequently.

Sometimes I'm amazed that computers even work at all!

Incidentally, my contribution to all this was to write code to launch the browser upon test-failure, and load up a web page telling players to clean out their dusty computer fan-intakes.

5
shevy-java
約3時間前

In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects!

Bold claim. From my gut feeling this must be incorrect; I don't seem to get the same amount of crashes using chromium-based browsers such as thorium.

7
kev009
約2時間前

It's high enough that I would wonder if some systems software issues are mixed in, like rare races in malloc or page table management.

8
Animats
約2時間前

ECC should have become standard around the time memories passed 1GB.

It's seriously annoying that ECC memory is hard to get and expensive, but memory with useless LEDs attached is cheap.

9
aforwardslash
約2時間前

Going to be downvoted, but I call bullshit on this. Bitflips are frequent (and yes ECC is an improvement but does not solve the problem), but not that frequent. One can either assume users that enabled telemetry are an odd bunch with flaky hardware, or the implementation isnt actually detecting bitflips (potentially, as the messages indicate), but a plathora of problems. Having a 1/10 probability a given struct is either processed wrong, parsed wrong or saved wrong would have pretty severe effects in many, many scenarios - from image editing to cad. Also, bitflips on flaky hardware dont choose protection rings - it would also affect the OS routines such as reading/writing to devices and everything else that touches memory. Yup, i've seen plenty of faulty ram systems (many WinME crashes were actually caused by defective ram sticks that would run fine with W98), it doesnt choose browsers or applications.

10
fooker
約1時間前

This seems like the kind of metric that 3 users with 15 year old machines can skew significantly.

Has to be normalized, and outliers eliminated in some consistent manner.