home

Dan Luu

https://danluu.com/atom/index.xml
Recent content on Dan Luu
http://danluu.com/atom.xml


Files are fraught with peril

This is a psuedo-transcript for a talk given at Deconstruct 2019. To make this accessible for people on slow connections as well as people using screen readers, the slides have been replaced by in-line text (the talk has ~120 slides; at an average of 20 kB per slide, that's 2.4 MB. If you think that's trivial, consider that half of Americans still aren't on broadband and the situation is much worse in developing countries.

Let's talk about files! Most developers seem to think that files are easy. Just for example, let's take a look at the top reddit r/programming comments from when Dropbox announced that they were only going to support ext4 on Linux (the most widely used Linux filesystem). For people not familiar with reddit r/programming, I suspect r/programming is the most widely read English langauge programming forum in the world.

The top comment reads:

I'm a bit confused, why do these applications have to support these file systems directly? Doesn't the kernel itself abstract away from having to know the lower level details of how the files themselves are stored?

The only differences I could possibly see between different file systems are file size limitations and permissions, but aren't most modern file systems about on par with each other?

The #2 comment (and the top replies going two levels down) are:

#2: Why does an application care what the filesystem is?

#2: Shouldn't that be abstracted as far as "normal apps" are concerned by the OS?

Reply: It's a leaky abstraction. I'm willing to bet each different FS has its own bugs and its own FS specific fixes in the dropbox codebase. More FS's means more testing to make sure everything works right . . .

2nd level reply: What are you talking about? This is a dropbox, what the hell does it need from the FS? There are dozenz of fssync tools, data transfer tools, distributed storage software, and everything works fine with inotify. What the hell does not work for dropbox exactly?

another 2nd level reply: Sure, but any bugs resulting from should be fixed in the respective abstraction layer, not by re-implementing the whole stack yourself. You shouldn't re-implement unless you don't get the data you need from the abstraction. . . . DropBox implementing FS-specific workarounds and quirks is way overkill. That's like vim providing keyboard-specific workarounds to avoid faulty keypresses. All abstractions are leaky - but if no one those abstractions, nothing will ever get done (and we'd have billions of "operating systems").

In this talk, we're going to look at how file systems differ from each other and other issues we might encounter when writing to files. We're going to look at the file "stack" starting at the top with the file API, which we'll see is nearly impossible to use correctly and that supporting multiple filesystems without corrupting data is much harder than supporting a single filesystem; move down to the filesystem, which we'll see has serious bugs that cause data loss and data corruption; and then we'll look at disks and see that disks can easily corrupt data at a rate five million times greater than claimed in vendor datasheets.

File API

Writing one file

Let's say we want to write a file safely, so that we don't want to get data corruption. For the purposes of this talk, this means we'd like our write to be "atomic" -- our write should either fully complete, or we should be able to undo the write and end up back where we started. Let's look at an example from Pillai et al., OSDI’14.

We have a file that contains the text a foo and we want to overwrite foo with bar so we end up with a bar. We're going to make a number of simplifications. For example, you should probably think of each character we're writing as a sector on disk (or, if you prefer, you can imagine we're using a hypothetical advanced NVM drive). Don't worry if you don't know what that means, I'm just pointing this out to note that this talk is going to contain many simplifications, which I'm not going to call out because we only have twenty-five minutes and the unsimplified version of this talk would probably take about three hours.

To write, we might use the pwrite syscall. This is a function provided by the operating system to let us interact with the filesystem. Our invocation of this syscall looks like:

pwrite(
  [file], 
  “bar”, // data to write
  3,     // write 3 bytes
  2)     // at offset 2

pwrite takes the file we're going to write, the data we want to write, bar, the number of bytes we want to write, 3, and the offset where we're going to start writing, 2. If you're used to using a high-level language, like Python, you might be used to an interface that looks different, but underneath the hood, when you write to a file, it's eventually going to result in a syscall like this one, which is what will actually write the data into a file.

If we just call pwrite like this, we might succeed and get a bar in the output, or we might end up doing nothing and getting a foo, or we might end up with something in between, like a boo, a bor, etc.

What's happening here is that we might crash or lose power when we write. Since pwrite isn't guaranteed to be atomic, if we crash, we can end up with some fraction of the write completing, causing data corruption. One way to avoid this problem is to store an "undo log" that will let us restore corrupted data. Before we're modify the file, we'll make a copy of the data that's going to be modified (into the undo log), then we'll modify the file as normal, and if nothing goes wrong, we'll delete the undo log.

If we crash while we're writing the undo log, that's fine -- we'll see that the undo log isn't complete and we know that we won't have to restore because we won't have started modifying the file yet. If we crash while we're modifying the file, that's also ok. When we try to restore from the crash, we'll see that the undo log is complete and we can use it to recover from data corruption:

creat(/d/log) // Create undo log
write(/d/log, "2,3,foo", 7) // To undo, at offset 2, write 3 bytes, "foo"
pwrite(/d/orig, “bar", 3, 2) // Modify original file as before
unlink(/d/log) // Delete log file

If we're using ext3 or ext4, widely used Linux filesystems, and we're using the mode data=journal (we'll talk about what these modes mean later), here are some possible outcomes we could get:

d/log: "2,3,f"
d/orig: "a foo"

d/log: ""
d/orig: "a foo"

It's possible we'll crash while the log file write is in progress and we'll have an incomplete log file. In the first case above, we know that the log file isn't complete because the file says we should start at offset 2 and write 3 bytes, but only one byte, f, is specified, so the log file must be incomplete. In the second case above, we can tell the log file is incomplete because the undo log format should start with an offset and a length, but we have neither. Either way, since we know that the log file isn't complete, we know that we don't need to restore.

Another possible outcome is something like:

d/log: "2,3,foo"
d/orig: "a boo"

d/log: "2,3,foo"
d/orig: "a bar"

In the first case, the log file is complete we crashed while writing the file. This is fine, since the log file tells us how to restore to a known good state. In the second case, the write completed, but since the log file hasn't been deleted yet, we'll restore from the log file.

If we're using ext3 or ext4 with data=ordered, we might see something like:

d/log: "2,3,fo"
d/orig: "a boo"

d/log: ""
d/orig: "a bor"

With data=ordered, there's no guarantee that the write to the log file and the pwrite that modifies the original file will execute in program order. Instesad, we could get

creat(/d/log) // Create undo log
pwrite(/d/orig, “bar", 3, 2) // Modify file before writing undo log!
write(/d/log, "2,3,foo", 7) // Write undo log
unlink(/d/log) // Delete log file

To prevent this re-ordering, we can use another syscall, fsync. fsync is a barrier (prevents re-ordering) and it flushes caches (which we'll talk about later).

creat(/d/log)
write(/d/log, “2,3,foo”, 7)
fsync(/d/log) // Add fsync to prevent re-ordering
pwrite(/d/orig, “bar”, 3, 2)
fsync(/d/orig) // Add fsync to prevent re-ordering
unlink(/d/log)

This works with ext3 or ext4, data=ordered, but if we use data=writeback, we might see something like:

d/log: "2,3,WAT"
d/orig: "a boo"

Unfortunately, with data=writeback, the write to the log file isn't guaranteed to be atomic and the filesystem metadata that tracks the file length can get updated before we've finished writing the log file, which will make it look like the log file contains whatever bits happened to be on disk where the log file was created. Since the log file exists, when we try to restore after a crash, we may end up "restoring" random garbage into the original file. To prevent this, we can add a checksum (a way of making sure the file is actually valid) to the log file.

creat(/d/log)
write(/d/log,“…[✓∑],foo”,7) // Add checksum to log file to detect incomplete log file
fsync(/d/log)
pwrite(/d/orig, “bar”, 3, 2)
fsync(/d/orig)
unlink(/d/log)

This should work with data=writeback, but we could still see the following:

d/orig: "a boo"

There's no log file! Although we created a file, wrote to it, and then fsync'd it. Unfortunately, there's no guarantee that the directory will actually store the location of the file if we crash. In order to make sure we can easily find the file when we restore from a crash, we need to fsync the parent of the newly created log.

creat(/d/log)
write(/d/log,“…[✓∑],foo”,7)
fsync(/d/log)
fsync(/d) /// fsync parent directory
pwrite(/d/orig, “bar”, 3, 2)
fsync(/d/orig)
unlink(/d/log)

There are a couple more things we should do. We shoud also fsync after we're done (not shown), and we also need to check for errors. These syscalls can return errors and those errors need to be handled appropriately. There's at least one filesystem issue that makes this very difficult, but since that's not an API usage thing per se, we'll look at this again in the Filesystems section.

We've now seen what we have to do to write a file safely. It might be more complicated than we like, but it seems doable -- if someone asks you to write a file in a self-contained way, like an interview question, and you know the appropriate rules, you can probably do it correctly. But what happens if we have to do this as a day-to-day part of our job, where we'd like to write to files safely every time to write to files in a large codebase.

API in practice

Pillai et al., OSDI’14 looked at a bunch of software that writes to files, including things we'd hope write to files safely, like databases and version control systems: Leveldb, LMDB, GDBM, HSQLDB, Sqlite, PostgreSQL, Git, Mercurial, HDFS, Zookeeper. They then wrote a static analysis tool that can find incorrect usage of the file API, things like incorrectly assuming that operations that aren't atomic are actually atomic, incorrectly assuming that operations that can be re-ordered will execute in program order, etc.

When they did this, they found that every single piece of software they tested except for SQLite in one particular mode had at least one bug. This isn't a knock on the developers of this software or the software -- the programmers who work on things like Leveldb, LBDM, etc., know more about filesystems than the vast majority programmers and the software has more rigorous tests than most software. But they still can't use files safely every time! A natural follow-up to this is the question: why the file API so hard to use that even experts make mistakes?

Concurrent programming is hard

There are a number of reasons for this. If you ask people "what are hard problems in programming?", you'll get answers like distributed systems, concurrent programming, security, aligning things with CSS, dates, etc.

And if we look at what mistakes cause bugs when people do concurrent programming, we see bugs come from things like "incorrectly assuming operations are atomic" and "incorrectly assuming operations will execute in program order". These things that make concurrent programming hard also make writing files safely hard -- we saw examples of both of these kinds of bugs in our first example. More generally, many of the same things that make concurrent programming hard are the same things that make writing to files safely hard, so of course we should expect that writing to files is hard!

Another property writing to files safely shares with concurrent programming is that it's easy to write code that has infrequent, non-deterministc failures. With respect to files, people will sometimes say this makes things easier ("I've never noticed data corruption", "your data is still mostly there most of the time", etc.), but if you want to write files safely because you're working on software that shouldn't corrupt data, this makes things more difficult by making it more difficult to tell if your code is really correct.

API inconsistent

As we saw in our first example, even when using one filesystem, different modes may have significantly different behavior. Large parts of the file API look like this, where behavior varies across filesystems or across different modes of the same filesystem. For example, if we look at mainstream filesystems, appends are atomic, except when using ext3 or ext4 with data=writeback, or ext2 in any mode and directory operations can't be re-ordered w.r.t. any other operations, except on btrfs. In theory, we should all read the POSIX spec carefully and make sure all our code is valid according to POSIX, but if they check filesystem behavior at all, people tend to code to what their filesystem does and not some abtract spec.

If we look at one particular mode of one filesystem (ext4 with data=journal), that seems relatively possible to handle safely, but when writing for a variety of filesystems, especially when handling filesystems that are very different from ext3 and ext4, like btrfs, it becomes very difficult for people to write correct code.

Docs unclear

In our first example, we saw that we can get different behavior from using different data= modes. If we look at the manpage (manual) on what these modes mean in ext3 or ext4, we get:

journal: All data is committed into the journal prior to being written into the main filesystem.

ordered: This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal.

writeback: Data ordering is not preserved – data may be written into the main filesystem after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. It guarantees internal filesystem integrity, however it can allow old data to appear in files after a crash and journal recovery.

If you want to know how to use your filesystem safely, and you don't already know what a journaling filesystem is, this definitely isn't going to help you. If you know what a journaling filesystem is, this will give you some hints but it's still not sufficient. It's theoretically possible to figure everything out from reading the source code, but this is pretty impractical for most people who don't already know how the filesystem works.

For English-language documentation, there's lwn.net and the Linux kernel mailing list (LKML). LWN is great, but they can't keep up with everything, so you LKML is the place to go if you want something comprehensive. Here's an example of an exchange on LKML about filesystems:

Dev 1: Personally, I care about metadata consistency, and ext3 documentation suggests that journal protects its integrity. Except that it does not on broken storage devices, and you still need to run fsck there.
Dev 2: as the ext3 authors have stated many times over the years, you still need to run fsck periodically anyway.
Dev 1: Where is that documented?
Dev 2: linux-kernel mailing list archives.
FS dev: Probably from some 6-8 years ago, in e-mail postings that I made.

While the filesystem developers tend to be helpful and they write up informative responses, most people probably don't keep up with the past 6-8 years of LKML.

Performance / correctness conflict

Another issue is that the file API has an inherent conflict between performance and correctness. We noted before that fsync is a barrier (which we can use to enforce ordering) and that it flushes caches. If you've ever worked on the design of a high-performance cache, like a microprocessor cache, you'll probably find the bundling of these two things into a single primitive to be unusual. A reason this is unusual is that flushing caches has a significant performance cost and there are many cases where we want to enforce ordering without paying this performance cost. Bundling these two things into a single primitive forces us to pay the cache flush cost when we only care about ordering.

Chidambaram et al., SOSP’13 looked at the performance cost of this by modifying ext4 to add a barrier mechanism that doesn't flush caches and they found that, if they modified software appropriately and used their barrier operation where a full fsync wasn't necessary, they were able to achieve performance roughly equivalent to ext4 with cache flushing entirely disabled (which is unsafe and can lead to data corruption) without sacrificing safety. However, making your own filesystem and getting it adopted is impractical for most people writing user-level software. Some databases will bypass the filesystem entirely or almost entirely, but this is also impractical for most software.

That's the file API. Now that we've seen that it's extraordinarily difficult to use, let's look at filesystems.

Filesystem

If we want to make sure that filessystems work, one of the most basic tests we could do is to inject errors are the layer below the filesystem to see if the filesystem handles them properly. For example, on a write, we could have the disk fail to write the data and return the appropriate error. If the filesystem drops this error or doesn't handle ths properly, that means we have data loss or data corruption. This is analogous to the kinds of distributed systems faults Kyle Kingsbury talked about in his distributed systems testing talk yesterday (although these kinds of errors are much more straightforward to test).

Prabhakaran et al., SOSP’05 did this and found that, for most filesystems tested, almost all write errors were dropped. The major exception to this was on ReiserFS, which did a pretty good job with all types of errors tested, but ReiserFS isn't really used today for reasons beyond the scope of this talk.

We (Wesley Aptekar-Cassels and I) looked at this again in 2017 and found that things had improved significantly. Most filesystems (other than JFS) could pass these very basic tests on error handling.

Another way to look for errors is to look at filesystems code to see if it handles internal errors correctly. Gunawai et al., FAST’08 did this and found that internal errors were dropped a significant percentage of the time. The technique they used made it difficult to tell if functions that could return many different errors were correctly handling each error, so they also looked at calls to functions that can only return a single error. In those cases, depending on the function, errors were dropped roughly 2/3 to 3/4 of the time, depending on the function.

Wesley and I also looked at this again in 2017 and found significant improvement -- errors for the same functions Gunawi et al. looked at were "only" ignored 1/3 to 2/3 of the time, depending on the function.

Gunawai et al. also looked at comments near these dropped errors and found comments like "Just ignore errors at this point. There is nothing we can do except to try to keep going." (XFS) and "Error, skip block and hope for the best." (ext3).

Now we've seen that while filesystems used to drop even the most basic errors, they now handle then correctly, but there are some code paths where errors can get dropped. For a concrete example of a case where this happens, let's look back at our first example. If we get an error on fsync, unless we have a pretty recent Linux kernel (Q2 2018-ish), there's a pretty good chance that the error will be dropped and it may even get reported to the wrong process!

On recent Linux kernels, there's a good chance the error will be reported (to the correct process, even). Wilcox, PGCon’18 notes that an error on fsync is basically unrecoverable. The details for depending on filesystem -- on XFS and btrfs, modified data that's in the filesystem will get thrown away and there's no way to recover. On ext4, the data isn't thrown away, but it's marked as unmodified, so the filesystem won't try to write it back to disk later, and if there's memory pressure, the data can be thrown out at any time. If you're feeling adventurous, you can try to recover the data before it gets thrown out with various tricks (e.g., by forcing the filesystem to mark it as modified again, or by writing it out to another device, which will force the filesystem to write the data out even though it's marked as unmodified), but there's no guarantee you'll be able to recover the data before it's thrown out. On Linux ZFS, it appears that there's a code path designed to do the right thing, but CPU usage spikes and the system may hang or become unusable.

In general, there isn't a good way to recover from this on Linux. Postgres, MySQL, and MongoDB (widely used databases) will crash themselves and the user is expected to restore from the last checkpoint. Most software will probably just silently lose or corrupt data. And fsync is a relatively good case -- for example, syncfs simply doesn't return errors on Linux at all, leading to silent data loss and data corruption.

BTW, when Craig Ringer first proposed that Postgres should crash on fsync error, the first response on the Postgres dev mailing list was:

Surely you jest . . . If [current behavior of fsync] is actually the case, we need to push back on this kernel brain damage

But after talking through the details, everyone agreed that crashing was the only good option. One of the many unfortunate things is that most disk errors are transient. Since the filesystem discards critical information that's necessary to proceed without data corruption on any error, transient errors that could be retried instead force software to take drastic measures.

And while we've talked about Linux, this isn't unique to Linux. Fsync error handling (and error handling in general) is broken on many different operating systems. At the time Postgres "discovered" the behavior of fsync on Linux, FreeBSD had arguably correct behavior, but OpenBSD and NetBSD behaved the same as Linux (true error status dropped, retrying causes success response, data lost). This has been fixed on OpenBSD and probably some other BSDs, but Linux still basically has the same behavior and you don't have good guarantees that this will work on any random UNIX-like OS.

Now that we've seen that, for many years, filesystems failed to handle errors in some of the most straightforward and simple cases and that there are cases that still aren't handled correctly today, let's look at disks.

Disk

Flushing

We've seen that it's easy to not realize we have to call fsync when we have to call fsync, and that even if we call fsync appropriately, bugs may prevent fsync from actually working. Rajimwale et al., DSN’11 into whether or not disks actually flush when you ask them to flush, assuming everything above the disk works correctly (their paper is actually mostly about something else, they just discuss this briefly at the beginning). Someone from Microsoft anonymously told them "[Some disks] do not allow the file system to force writes to disk properly" and someone from Seagate, a disk manufacturer, told them "[Some disks (though none from us)] do not allow the file system to force writes to disk properly". Bairavasundaram et al., FAST’07 also found the same thing when they looked into disk reliability.

Error rates

We've seen that filessystems sometimes don't handle disk errors correctly. If we want to know how serious this issue is, we should look at the rate at which disks emit errors. Disk datasheets will usually an uncorrectable bit error rate of 1e-14 for consumer HDDs (often called spinning metal or spinning rust disks), 1e-15 for enterprise HDDs, 1e-15 for consumer SSDs, and 1e-16 for enterprise SSDs. This means that, on average, we expect to see one unrecoverable data error every 1e14 bits we read on an HDD.

To get an intuition for what this means in practice, 1TB is now a pretty normal disk size. If we read a full drive once, that's 1e12 bytes, or almost 1e13 bits (technically 8e12 bits), which means we should see, in expectation, one unrecoverable if we buy a 1TB HDD and read the entire disk ten-ish times. Nowadays, we can buy 10TB HDDs, in which case we'd expect to see an error (technically, 8/10th errors) on every read of an entire consumer HDD.

In practice, observed data rates are are significantly higher. Narayanan et al., SYSTOR’16 (Microsoft) observed SSD error rates from 1e-11 to 6e-14, depending on the drive model. Meza et al., SIGMETRICS’15 (FB) observed even worse SSD error rates, 2e-9 to 6e-11 depending on the model of drive. Depending on the type of drive, 2e-9 is 2 gigabits, or 250 MB, 500 thousand to 5 million times worse than stated on datasheets depending on the class of drive.

Bit error rate is arguably a bad metric for disk drives, but this is the metric disk vendors claim, so that's what we have to compare against if we want an apples-to-apples comparison. See Bairavasundaram et al., SIGMETRICS'07, Schroeder et al., FAST'16, and others for other kinds of error rates.

One thing to note is that it's often claimed that SSDs don't have problems with corruption because they use error correcting codes (ECC), which can fix data corruption issues. "Flash banishes the specter of the unrecoverable data error", etc. The thing this misses is that modern high-density flash devices are very unreliable and need ECC to be usable at all. Grupp et al., FAST’12 looked at error rates of the kind of flash the underlies SSDs and found errors rates from 1e-1 to 1e-8. 1e-1 is one error every ten bits, 1e-8 is one error every 100 megabits.

Power loss

Another claim you'll hear is that SSDs are safe against power loss and some types of crashes because they now have "power loss protection" -- there's some mechanism in the SSDs that can hold power for long enough during an outage that the internal SSD cache can be written out safely.

Luke Leighton tested this by buying 6 SSDs that claim to have power loss protection and found that four out of the six models of drive he tested failed (every drive that wasn't an Intel drive). If we look at the details of the tests, when drives fail, it appears to be because they were used in a way that the implementor of power loss protection didn't expect (writing "too fast", although well under the rate at which the drive is capable of writing, or writing "too many" files in parallel). When a drive advertises that it has power loss protection, this appears to mean that someone spent some amount of effort implementing something that will, under some circumstances, prevent data loss or data corruption under power loss. But, as we saw in Kyle's talk yesterday on distributed systems, if you want to make sure that the mechanism actually works, you can't rely on the vendor to do rigorous or perhaps even any semi-serious testing and you have to test it yourself.

Retention

If we look at SSD datasheets, a young-ish drive (one with 90% of its write cycles remaining) will usually be specced to hold data for about ten years after a write. If we look at a worn out drive, one very close to end-of-life, it's specced to retain data for one year to three months, depending on the class of drive. I think people are often surprised to find that it's within spec for a drive to lose data three months after the data is written.

These numbers all come from datasheets and specs, as we've seen, datasheets can be a bit optimistic. On many early SSDs, using up most or all of a drives write cycles would cause the drive to brick itself, so you wouldn't even get the spec'd three month data retention.

Corollaries

Now that we've seen that there are significant problems at every level of the file stack, let's look at a couple things that follow from this.

What to do?

What we should do about this is a big topic, in the time we have left, one thing we can do instead of writing to files is to use databases. If you want something lightweight and simple that you can use in most places you'd use a file, SQLite is pretty good. I'm not saying you should never use files. There is a tradeoff here. But if you have an application where you'd like to reduce the rate of data corruption, considering using a database to store data instead of using files.

FS support

At the start of this talk, we looked at this Dropbox example, where most people thought that there was no reason to remove support for most Linux filesystems because filesystems are all the same. I believe their hand was forced by the way they want to store/use data, which they can only do with ext given how they're doing things (which is arguably a mis-feature), but even if that wasn't the case, perhaps you can see why software that's attempting to sync data to disk reliably and with decent performance might not want to support every single filesystem in the universe for an OS that, for their product, is relatively niche. Maybe it's worth supporting every filesystem for PR reasons and then going through the contortions necessary to avoid data corruption on a per-filesystem basis (you can try coding straight to your reading of the POSIX spec, but as we've seen, that won't save you on Linux), but the PR problem is caused by a misunderstanding.

The other comment we looked at on reddit, and also a common sentiment, is that it's not a program's job to work around bugs in libraries or the OS. But user data gets corrupted regardless of who's "fault" the bug is, and as we've seen, bugs can persist in the filesystem layer for many years. In the case of Linux, most filesystems other than ZFS seem to have decided it's correct behavior to throw away data on fsync error and also not report that the data can't be written (as opposed to FreeBSD or OpenBSD, where most filesystems will at least report an error on subsequent fsyncs if the error isn't resolved). This is arguably a bug and also arguably correct behavior, but either way, if your software doesn't take this into account, you're going to lose or corrupt data. If you want to take the stance that it's not your fault that the filesystem is corrupting data, your users are going to pay the cost for that.

FAQ

While putting this talk to together, I read a bunch of different online discussions about how to write to files safely. For discussions outside of specialized communities (e.g., LKML, the Postgres mailing list, etc.), many people will drop by to say something like "why is everyone making this so complicated? You can do this very easily and completely safely with this one weird trick". Let's look at the most common "one weird trick"s from two thousand internet comments on how to write to disk safely.

Rename

The most frequently mentioned trick is to rename instead of overwriting. If you remember our single-file write example, we made a copy of the data that we wanted to overwrite before modifying the file. The trick here is to do the opposite:

  1. Make a copy of the entire file
  2. Modify the copy
  3. Rename the copy on top of the original file

This trick doesn't work. People seem to think that this is safe becaus the POSIX spec says that rename is atomic, but that only means rename is atomic with respect to normal operation, that doesn't mean it's atomic on crash. This isn't just a theoretical problem; if we look at mainstream Linux filesystems, most have at least one mode where rename isn't atomic on crash. Rename also isn't guaranteed to execute in program order, as people sometimes expect.

The most mainstream exception where rename is atomic on crash is probably btrfs, but even there, it's a bit subtle -- as noted in Bornholt et al., ASPLOS’16, rename is only atomic on crash when renaming to replace an existing file, not when renaming to create a new file. Also, Mohan et al., OSDI’18 found numerous rename atomicity bugs on btrfs, some quite old and some introduced the same year as the paper, so you want not want to rely on this without extensive testing, even if you're writing btrfs specific code.

And even if this worked, the performance of this technique is quite poor.

Append

The second most frequently mentioned trick is to only ever append (instead of sometimes overwriting). This also doesn't work. As noted in Pillai et al., OSDI’14 and Bornholt et al., ASPLOS’16, appends don't guarantee ordering or atomicity and believing that appends are safe is the cause of some bugs.

One weird tricks

We've seen that the most commonly cited simple tricks don't work. Something I find interesting is that, in these discussions, people will drop into a discussion where it's already been explained, often in great detail, why writing to files is harder than someone might naively think, ignore all warnings and explanations and still proceed with their explanation for why it's, in fact, really easy. Even when warned that files are harder than people think, people still think they're easy!

Conclusion

In conclusion, computers don't work (but you probably already know this if you're here at Gary-conf). This talk happened to be about files, but there are many areas we could've looked into where we would've seen similar things.

One thing I'd like to note before we finish is that, IMO, the underlying problem isn't technical. If you look at what huge tech companies do (companies like FB, Amazon, MS, Google, etc.), they often handle writes to disk pretty safely. They'll make sure that they have disks where power loss protection actually work, they'll have patches into the OS and/or other instrumentation to make sure that errors get reported correctly, there will be large distributed storage groups to make sure data is replicated safely, etc. We know how to make this stuff pretty reliable. It's hard, and it takes a lot of time and effort, i.e., a lot of money, but it can be done.

If you ask someone who works on that kind of thing why they spend mind boggling sums of money to ensure (or really, increase the probability of) correctness, you'll often get an answer like "we have a zillion machines and if you do the math on the rate of data corruption, if we didn't do all of this, we'd have data corruption every minute of every day. It would be totally untenable". A huge tech company might have, what, order of ten million machines? The funny thing is, if you do the math for how many consumer machines there are out there and much consumer software runs on unreliable disks, the math is similar. There are many more consumer machines; they're typically operated at much lighter load, but there are enough of them that, if you own a widely used piece of desktop/laptop/workstation software, the math on data corruption is pretty similar. Without "extreme" protections, we should expect to see data corruption all the time.

But if we look at how consumer software works, it's usually quite unsafe with respect to handling data. IMO, the key difference here is that when a huge tech company loses data, whether that's data on who's likely to click on which ads or user emails, the company pays the cost, directly or indirectly and the cost is large enough that it's obviously correct to spend a lot of effort to avoid data loss. But when consumers have data corruption on their own machines, they're mostly not sophisticated enough to know who's at fault, so the company can avoid taking the brunt of the blame. If we have a global optimization function, the math is the same -- of course we should put more effort into protecting data on consumer machines. But if we're a company that's locally optimizing for our own benefit, the math works out differently and maybe it's not worth it to spend a lot of effort on avoiding data corruption.

Yesterday, Ramsey Nasser gave a talk where he made a very compelling case that something was a serious problem, which was followed up by a comment that his proposed solution will have a hard time getting adoption. I agree with both parts -- he discussed an important problem, and it's not clear how solving that problem will make anyone a lot of money, so the problem is likely to go unsolved.

With GDPR, we've seen that regulation can force tech companies to protect people's privacy in a way they're not naturally inclined to do, but regulation is a very big hammer and the unintended consequences can often negate or more than negative the benefits of regulation. When we look at the history of regulations that are designed to force companies to do the right thing, we can see that it's often many years, sometimes decades, before the full impact of the regulation is understood. Designing good regulations is hard, much harder than any of the technical problems we've discussed today.

Acknowledgements

Thanks to Leah Hanson, Gary Bernhardt, Kamal Marhubi, Rebecca Isaacs, Jesse Luehrs, Tom Crayford, Wesley Aptekar-Cassels, Rose Ames, and Benjamin Gilbert for their help with this talk!

Sorry we went so fast. If there's anything you missed you can catch it in the pseudo-transcript at danluu.com/deconstruct-files.

This "transcript" is pretty rough since I wrote it up very quickly this morning before the talk. I'll try to clean it within a few weeks, which will include adding material that was missed, inserting links, fixing typos, adding references that were missed, etc.

Thanks to Anatole Shaw, Jernej Simoncic, @junh1024, and Josh Duff for comments/corrections/discussion on this transcript.

Fri, 12 Jul 2019 00:00:00 +0000


Randomized trial on gender in Overwatch

A recurring discussion in Overwatch (as well as other online games) is whether or not women are treated differently from men. If you do a quick search, you can find hundreds of discussions about this, some of which have well over a thousand comments. These discussions tend to go the same way and involve the same debate every time, with the same points being made on both sides. Just for example, these three threads on reddit that spun out of a single post that have a total of 10.4k comments. On one side, you have people saying "sure, women get trash talked, but I'm a dude and I get trash talked, everyone gets trash talked there's no difference", "I've never seen this, it can't be real", etc., and on the other side you have people saying things like "when I play with my boyfriend, I get accused of being carried by him all the time but the reverse never happens", "people regularly tell me I should play mercy[, a character that's a female healer]", and so on and so forth. In less time than has been spent on a single large discussion, we could just run the experiment, so here it is.

This is the result of playing 339 games in the two main game modes, quick play (QP) and competitive (comp), where roughly half the games were played with a masculine name (where the username was a generic term for a man) and half were played with a feminine name (where the username was a woman's name). I recorded all of the comments made in each of the games and then classified the comments by type. Classes of comments were "sexual/gendered comments", "being told how to play", "insults", and "compliments".

In each game that's included, I decided to include the game (or not) in the experiment before the character selection screen loaded. In games that were included, I used the same character selection algorithm, I wouldn't mute anyone for spamming chat or being a jerk, I didn't speak on voice chat (although I had it enabled), I never sent friend requests, and I was playing outside of a group in order to get matched with 5 random players. When playing normally, I might choose a character I don't know how to use well and I'll mute people who pollute chat with bad comments. There are a lot of games that weren't included in the experiment because I wasn't in a mood to listen to someone rage at their team for fifteen minutes and the procedure I used involved pre-committing to not muting people who do that.

Sexual or sexually charged comments

I thought I'd see more sexual comments when using the feminine name as opposed to the masculine name, but that turned out to not be the case. There was some mention of sex, genitals, etc., in both cases and the rate wasn't obviously different and was actually higher in the masculine condition.

Zero games featured comments were directed specifically at me in the masculine condition and two (out of 184) games in the feminine condition featured comments that were directed at me. Most comments were comments either directed at other players or just general comments to team or game chat.

Examples of typical undirected comments that would occur in either condition include ""my girlfriend keeps sexting me how do I get her to stop?", "going in balls deep", "what a surprise. *strokes dick* [during the post-game highlight]", and "support your local boobies".

The two games that featured sexual comments directed at me had the following comments:

During games not included in the experiment (I generally didn't pay attention to which username I was on when not in the experiment), I also got comments like "send nudes". Anecdotally, there appears to be a different in the rate of these kinds of comments directed at the player, but the rate observed in the experiment is so low that uncertainty intervals around any estimates of the true rate will be similar in both conditions unless we use a strong prior.

The fact that this difference couldn't be observed in 339 games was surprising to me, although it's not inconsistent with McDaniel's thesis, a survey of women who play video games. 339 games probably sounds like a small number to serious gamers, but the only other randomized experiment I know of on this topic (besides this experiment) is Kasumovic et al., which notes that "[w]e stopped at 163 [games] as this is a substantial time effort".

All of the analysis uses the number of games in which a type of comment occured and not tone to avoid having to code comments as having a certain tone in order to avoid possibly injecting bias into the process. Sentiment analysis models, even state-of-the-art ones often return nonsensical results, so this basically has to be done by hand, at least today. With much more data, some kind of sentiment analysis, done with liberal spot checking and re-training of the model, could work, but the total number of comments is so small in this case that it would amount to coding each comment by hand.

Coding comments manually in an unbiased fashion can also be done with a level of blinding, but doing that would probably require getting more people involved (since I see and hear comments while I'm playing) and relying on unpaid or poorly paid labor.

Being told how to play

The most striking, easy to quantify, difference was the rate at which I played games in which people told me how I should play. Since it's unclear how much confidence we should have in the difference if we just look at the raw rates, we'll use a simple statistical model to get the uncertainty interval around the estimates. Since I'm not sure what my belief about this should be, this uses an uninformative prior, so the estimate is close to the actual rate. Anyway, here are the uncertainty intervals a simple model puts on the percent of games where at least one person told me I was playing wrong, that I should change how I'm playing, or that I switch characters:

Cond Est P25 P75
F comp 19 13 25
M comp 6 2 10
F QP 4 3 6
M QP 1 0 2

The experimental conditions in this table are masculine vs. feminine name (M/F) and competitive mode vs quick play (comp/QP). The numbers are percents. Est is the estimate, P25 is the 25%-ile estimate, and P75 is the 75%-ile estimate. Competitive mode and using a feminine name are both correlated with being told how to play. See this post by Andrew Gelman for why you might want to look at the 50% interval instead of the 95% interval.

For people not familiar with overwatch, in competitive mode, you're explicitly told what your ELO-like rating is and you get a badge that reflects your rating. In quick play, you have a rating that's tracked, but it's never directly surfaced to the user and you don't get a badge.

It's generally believed that people are more on edge during competitive play and are more likely to lash out (and, for example, tell you how you should play). The data is consistent with this common belief.

Per above, I didn't want to code tone of messages to avoid bias, so this table only indicates the rate at which people told me I was playing incorrectly or asked that I switch to a different character. The qualitative difference in experience is understated by this table. For example, the one time someone asked me to switch characters in the masculine condition, the request was a one sentence, polite, request ("hey, we're dying too quickly, could we switch [from the standard one primary healer / one off healer setup] to double primary healer or switch our tank to [a tank that can block more damage]?"). When using the feminine name, a typical case would involve 1-4 people calling me human garbage for most of the game and consoling themselves with the idea that the entire reason our team is losing is that I won't change characters.

The simple model we're using indicates that there's probably a difference between both competitive and QP and playing with a masculine vs. a feminine name. However, most published results are pretty bogus, so let's look at reasons this result might be bogus and then you can decide for yourself.

Threats to validity

The biggest issue is that this wasn't a pre-registered trial. I'm obviously not going to go and officially register a trial like this, but I also didn't informally "register" this by having this comparison in mind when I started the experiment. A problem with non-pre-registered trials is that there are a lot of degrees of freedom, both in terms of what we could look at, and in terms of the methodology we used to look at things, so it's unclear if the result is "real" or an artifact of fishing for something that looks interesting. A standard example of this is that, if you look for 100 possible effects, you're likely to find 1 that appears to be statistically significant with p = 0.01.

There are standard techniques to correct for this problem (e.g., Bonferroni correction), but I don't find these convincing because they usually don't capture all of the degrees of freedom that go into a statistical model. An example is that it's common to take a variable and discretize it into a few buckets. There are many ways to do this and you generally won't see papers talk about the impact of this or correct for this in any way, although changing how these buckets are arranged can drastically change the results of a study. Another common knob people can use to manipulate results is curve fitting to an inappropriate curve (often a 2nd a 3rd degree polynomial when a scatterplot shows that's clearly incorrect). Another way to handle this would be to use a more complex model, but I wanted to keep this as simple as possible.

If I wanted to really be convinced on this, I'd want to, at a minimum, re-run this experiment with this exact comparison in mind. As a result, this experiment would need to be replicated to provide more than a preliminary result that is, at best, weak evidence.

One other large class of problem with randomized controlled trials (RCTs) is that, despite randomization, the two arms of the experiment might be different in some way that wasn't randomized. Since Overwatch doesn't allow you to keep changing your name, this experiment was done with two different accounts and these accounts had different ratings in competitive mode. On average, the masculine account had a higher rating due to starting with a higher rating, which meant that I was playing against stronger players and having worse games on the masculine account. In the long run, this will even out, but since most games in this experiment were in QP, this didn't have time to even out in comp. As a result, I had a higher win rate as well as just generally much better games with the feminine account in comp.

With no other information, we might expect that people who are playing worse get told how to play more frequently and people who are playing better should get told how to play less frequently, which would mean that the table above understates the actual difference.

However Kasumovic et al., in a gender-based randomized trial of Halo 3, found that players who were playing poorly were more negative towards women, especially women who were playing well (there's enough statistical manipulation of the data that a statement this concise can only be roughly correct, see study for details). If that result holds, it's possible that I would've gotten fewer people telling me that I'm human garbage and need to switch characters if I was average instead of dominating most of my games in the feminine condition.

If that result generalizes to OW, that would explain something which I thought was odd, which was that a lot of demands to switch and general vitriol came during my best performances with the feminine account. A typical example of this would be a game where we have a 2-2-2 team composition (2 players playing each of the three roles in the game) where my counterpart in the same role ran into the enemy team and died at the beginning of the fight in almost every engagement. I happened to be having a good day and dominated the other team (37-2 in a ten minute comp game, while focusing on protecting our team's healers) while only dying twice, once on purpose as a sacrifice and second time after a stupid blunder. Immediately after I died, someone asked me to switch roles so they could take over for me, but at no point did someone ask the other player in my role to switch despite their total uselesses all game (for OW players this was a Rein who immediately charged into the middle of the enemy team at every opportunity, from a range where our team could not possibly support them; this was Hanamura 2CP, where it's very easy for Rein to set up situations where their team cannot help them). This kind of performance was typical of games where my team jumped on me for playing incorrectly. This isn't to say I didn't have bad games; I had plenty of bad games, but a disproportionate number of the most toxic experiences came when I was having a great game.

I tracked how well I did in games, but this sample doesn't have enough ranty games to do a meaningful statistical analysis of my performance vs. probability of getting thrown under the bus.

Games at different ratings are probably also generally different environments and get different comments, but it's not clear if there are more negative comments at 2000 than 2500 or vice versa. There are a lot of online debates about this; for any rating level other than the very lowest or the very highest ratings, you can find a lot of people who say that the rating band they're in has the highest volume of toxic comments.

Other differences

Here are some things that happened while playing with the feminine name that didn't happen with the masculine name during this experiment or in any game outside of this experiment:

The rate of all these was low enough that I'd have to play many more games to observe something without a huge uncertainty interval.

I didn't accept any friend requests from people I had no interaction with. Anecdotally, some people report people will send sexual comments or berate them after an unsolicited friend request. It's possible that the effect show in the table would be larger if I accepted these friend requests and it couldn't be smaller.

I didn't attempt to classify comments as flirty or not because, unlike the kinds of commments I did classify, this is often somewhat subtle and you could make a good case that any particular comment is or isn't flirting. Without responding (which I didn't do), many of these kinds of comments are ambiguous

Another difference was in the tone of the compliments. The rate of games where I was complimented wasn't too different, but compliments under the masculine condition tended to be short and factual (e.g., someone from the other team saying "no answer for [name of character I was playing]" after a dominant game) and compliments under the feminine condition tended to be more effusive and multiple people would sometimes chime in about how great I was.

Non differences

The rate of complements and the rate of insults in games that didn't include explanations of how I'm playing wrong or how I need to switch characters were similar in both conditions.

Other factors

Some other factors that would be interesting to look at would be time of day, server, playing solo or in a group, specific character choice, being more or less communicative, etc., but it would take a lot more data to be able to get good estimates when adding it more variables. Blizzard should have the data necessary to do analyses like this in aggregate, but they're notoriously private with their data, so someone at Blizzard would have to do the work and then publish it publicly, and they're not really in the habit of doing that kind of thing. If you work at Blizzard and are interested in letting a third party do some analysis on an anonymized data set, let me know and I'd be happy to dig in.

Experimental minutiae

Under both conditions, I avoided ever using voice chat and would call things out in text chat when time permitted. Also under both conditions, I mostly filled in with whatever character class the team needed most, although I'd sometimes pick DPS (in general, DPS are heavily oversubscribed, so you'll rarely play DPS if you don't pick one even when unnecessary).

For quickplay, backfill games weren't counted (backfill games are games where you join after the game started to fill in for a player who left; comp doesn't allow backfills). 6% of QP games were backfills.

These games are from before the "endorsements" patch; most games were played around May 2018. All games were played in "solo q" (with 5 random teammates). In order to avoid correlations between games depending on how long playing sessions were, I quit between games and waited for enough time (since you're otherwise likely to end up in a game with some or many of the same players as before).

The model used probability of a comment happening in a game to avoid the problem that Kasumovic et al. ran into, where a person who's ranting can skew the total number of comments. Kasumovic et al. addressed this by removing outliers, but I really don't like manually reaching in and removing data to adjust results. This could also be addressed by using a more sophisticated model, but a more sophisticated model means more knobs which means more ways for bias to sneak in. Using the number of players who made comments instead would be one way to mitigate this problem, but I think this still isn't ideal because these aren't independent -- when one player starts being negative, this greatly increases the odds that another player in that game will be negative, but just using the number of players makes four games with one negative person the same as one game with four negative people. This can also be accounted for with a slightly more sophisticated model, but that also involves adding more knobs to the model.

Comments on other comments

If I step back from the gender experiment, the most baffling thing to me about Overwatch is the Herculean level of self-delusion adult Overwatch players have (based on the sounds of people's voice, I think that most people who are absolutely livid about the performance of a teammate are adults).

Just for example, if you look at games where someone leaves in the early phase of the game, causing it to get canceled, IME, this is almost always because one player was AFK and eventually timed out. I've seen this happen maybe 40 times, and the vast majority of the time, 1-3 players will rage about how they were crushing the other team and are being unfairly denied a (probable) victory. I have not seen a single case where someone notices that the game was 6v5 the entire time because someone was AFK out of nearly 40 games of people compaining about this.

If this were a field sport, I'd expect players to notice, by the time they're in second grade, that the other team is literally missing a player for the entire game. The median person at my rank (amazingly, above median for all players) who's raging at other players has roughly the same level of game sense as a first grader. And yet, they'll provide detailed (wrong, but detailed) commentary on other people's shortcomings, suggest that someone is "throwing [the game]" if they don't attempt to use a sophisticated strategy that pros have to practice to get right, etc.

Of course, by definition, people can't be aware that they're not aware that the other team is missing a player. However, someone with awareness that poor must get surprised by events in the game all the time (my awareness is substantially better than median for my rank because my technical skill is worse, and I still get blindsided multiple times per game; each time, it's a reminder that I'm not properly tracking what's happening in the game and can't even keep up with what my character should be doing).

It requires a staggering level of willful blindness to be that unaware, get surprised by events all the time, and still think that your understanding of the game is deep enough that you can, while playing your own character and tracking everything you're supposed to track, also track everything someone on your team is supposed to track and understand what's going on deeply enough to provide advice.

This stupefying level of confidence combined with willful blindness also shows up in this survey of people's perception of their own skill level vs. their actual rating, which found that 1% of people thought they were overrated, 32% thought they were rated accurately, and the remaining 67% thought they were underrated. If you're paying attention, you get feedback about how good you are all the time (for example, the game helpfully shows you exactly how you died every time you die), but the majority of players manage to ignore this feedback.

There's nothing wrong with not paying attention to how good you are and playing "just for fun", but these guys who rage at their teammates for bringing them down aren't playing for fun, they're serious. I don't understand how people who are so serious that they'd get angry at strangers for their perceived lack of skill in a video game would choose to ignore all of this information that they could use to get better and choose to yell at teammates (increasing the probability of a loss) instead of fixing the gross errors in their own gameplay (decreasing the probability of a loss).

Appendix: comments / advice to overwatch players

A common complaint, perhaps the most common complaint by people below 2000 SR (roughly 30%-ile) or perhaps 1500 SR (roughly 10%-ile) is that they're in "ELO hell" and are kept down because their teammates are too bad. Based on my experience, I find this to be extremely unlikely.

People often split skill up into "mechanics" and "gamesense". My mechanics are pretty much as bad as it's possible to get. The last game I played seriously was a 90s video game that's basically online asteroids and the last game before that I put any time into was the original SNES super mario kart. As you'd expect from someone who hasn't put significant time into a post-90s video game or any kind of FPS game, my aim and dodging are both atrocious. On top of that, I'm an old dude with slow reflexes and I was able to get to 2500 SR (roughly 60%-ile among players who play "competitive", likely higher among all players) by avoiding a few basic fallacies and blunders despite having approximately zero mechanical skill. If you're also an old dude with basically no FPS experience, you can do the same thing; if you have good reflexes or enough FPS experience to actually aim or dodge, you basically can't be worse mechanically than I am and you can do much better by avoiding a few basic mistakes.

The most common fallacy I see repeated is that you have to play DPS to move out of bronze or gold. The evidence people give for this is that, when a GM streamer plays flex, tank, or healer, they sometimes lose in bronze. I guess the idea is that, because the only way to ensure a 99.9% win rate in bronze is to be a GM level DPS player and play DPS, the best way to maintain a 55% or a 60% win rate is to play DPS, but this doesn't follow.

Healers and tanks are both very powerful in low ranks. Because low ranks feature both poor coordination and relatively poor aim (players with good coordination or aim tend to move up quickly), time-to-kill is very slow compared to higher ranks. As a result, an off healer can tilt the result of a 1v1 (and sometimes even a 2v1) matchup and a primary healer can often determine the result of a 2v1 matchup. Because coordination is poor, most matchups end up being 2v1 or 1v1. The flip side of the lack of coordination is that you'll almost never get help from teammates. It's common to see an enemy player walk into the middle of my team, attack someone, and then walk out while literally no one else notices. If the person being attacked is you, the other healer typically won't notice and will continue healing someone at full health and none of the classic "peel" characters will help or even notice what's happening. That means it's on you to pay attention to your surroundings and watch flank routes to avoid getting murdered.

If you can avoid getting murdered constantly and actually try to heal (as opposed to many healers at low ranks, who will try to kill people or stick to a single character and continue healing them all the time even if they're at full health), you'll outheal a primary healer half the time when playing an off healer and, as a primary healer, you'll usually be able to get 10k-12k healing per 10 min compared to 6k to 8k for most people in Silver (sometimes less if they're playing DPS Moira). That's like having an extra half a healer on your team, which basically makes the game 6.5v6 instead of 6v6. You can still lose a 6.5v6 game, and you'll lose plenty of games, but if you're consistently healing 50% more than a normal healer at your rank, you'll tend to move up even if you get a lot of major things wrong (heal order, healing when that only feeds the other team, etc.).

A corollary to having to watch out for yourself 95% of the time when playing a healer is that, as a character who can peel, you can actually watch out for your teammates and put your team at a significant advantage in 95% of games. As Zarya or Hog, if you just boringly play towards the front of your team, you can basically always save at least one teammate from death in a team fight, and you can often do this 2 or 3 times. Meanwhile, your counterpart on the other team is walking around looking for 1v1 matchups. If they find a good one, they'll probably kill someone, and if they don't (if they run into someone with a mobility skill or a counter like brig or reaper), they won't. Even in the case where they kill someone and you don't do a lot, you still provide as much value as them and, on average, you'll provide more value. A similar thing is true of many DPS characters, although it depends on the character (e.g., McCree is effective as a peeler, at least at the low ranks that I've played in). If you play a non-sniper DPS that isn't suited for peeling, you can find a DPS on your team who's looking for 1v1 fights and turn those fights into 2v1 fights (at low ranks, there's no shortage of these folks on both teams, so there are plenty of 1v1 fights you can control by making them 2v1).

All of these things I've mentioned amount to actually trying to help your team instead of going for flashy PotG setups or trying to dominate the entire team by yourself. If you say this in the abstract, it seems obvious, but most people think they're better than their rating. It doesn't help that OW is designed to make people think they're doing well when they're not and the best way to get "medals" or "play of the game" is to play in a way that severely reduces your odds of actually winning each game.

Outside of obvious gameplay mistakes, the other big thing that loses games is when someone tilts and either starts playing terribly or flips out and says something to enrage someone else on the team, who then starts playing terribly. I don't think you can do much about this directly, but you can make sure you never do it yourself, which means only 5/6 of your team will do this at some base rate, whereas 6/6 of the other team will. Like all of the above, this won't cause you to win all of your games, but everything you do that increases your win rate makes a difference.

Poker players have the right attitude when they talk about leaks. The goal isn't to win every hand, it's to increase your EV by avoiding bad blunders (at high levels, it's about more than avoiding bad blunders, but we're talking about getting out of below median ranks, not becoming GM here). You're going to have terrible games where you get 5 people instalocking DPS. Your odds of winning a game are low, say 10%. If you get mad and pick DPS and reduce your odds even further (say this is to 2%), all that does is create a leak in your win rate during games when your teammates are being silly.

If you gain/lose 25 rating per game for a win or a loss, your average rating change per game is 25 * (W_rate - L_rate) = 25 * (2 * W_rate - 1). Let's say 1/40 games are these silly games where your team decides to go all DPS. The per-game SR difference of trying to win these vs. soft throwing is maybe something like 1/40 * 25 * (2 * 0.08) = 0.1, where the 0.08 is the gap between a 10% win rate and a 2% win rate. That doesn't sound like much and these numbers are just guesses, but everyone outside of very high-level games is full of leaks like these, and they add up. And if you look at a 60% win rate, which is pretty good considering that your influence is limited because you're only one person on a 6 person team, that only translates to an average of 5 SR per game, so it doesn't actually take that many small leaks to really move your average SR gain or loss.
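To put rough numbers on that, here's the same back-of-the-envelope calculation as a tiny R snippet (the 25 SR per game is the figure above; the win rates and the 1-in-40 frequency are my guesses, not measured values):

sr_per_game <- 25                 # SR gained or lost per win or loss
ev <- function(win_rate) sr_per_game * (2 * win_rate - 1)  # expected SR per game

p_silly <- 1 / 40                 # guessed fraction of games with 5 instalocked DPS
# guessed win rates in those games: trying to win (10%) vs. soft throwing (2%)
leak_per_game <- p_silly * (ev(0.10) - ev(0.02))
leak_per_game                     # ~0.1 SR per game, averaged over all games

ev(0.60)                          # a 60% win rate only averages out to 5 SR per game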

Appendix: general comments on online gaming, 20 years ago vs. today

Since I'm unlikely to write another blog post on gaming any time soon, here are some other random thoughts that won't fit with any other post. My last serious experience with online games was with a game from the 90s. Even though I'd heard that things were a lot worse, I was still surprised by it. IRL, the only time I encounter the same level and rate of pointless nastiness in a recreational activity is down at the bridge club (casual bridge games tend to be very nice). When I say pointless nastiness, I mean things like getting angry and then making nasty comments to a teammate mid-game. Even if your "criticism" is correct (and, if you review OW games or bridge hands, you'll see that these kinds of angry comments are almost never correct), this has virtually no chance of getting your partner to change their behavior and it has a pretty good chance of tilting them and making them play worse. If you're trying to win, there's no reason to do this and good reason to avoid this.

If you look at the online commentary for this, it's common to see people blaming kids, but this doesn't match my experience at all. For one thing, when I was playing video games in the 90s, a huge fraction of the online gaming population was made up of kids, and online game communities were nicer than they are today. Saying that "kids nowadays" are worse than kids used to be is a pastime that goes back thousands of years, but it's generally not true and there doesn't seem to be any reason to think that it's true here.

Additionally, this simply doesn't match what I saw. If I just look at comments over audio chat, there were a couple of times when some kids were nasty, but almost all of the comments are from people who sound like adults. Moreover, if I look at when I played games that were bad, a disproportionately large number of those games were late (after 2am eastern time, on the central/east server), where the relative population of adults is larger.

And if we look at bridge, the median age of an ACBL member is in the 70s, with an increase in age of a whopping 0.4 years per year.

Sure, maybe people tend to get more mature as they age, but in any particular activity, that effect seems to be dominated by other factors. I don't have enough data at hand to make a good guess as to what happened, but I'm entertained by the idea that this might have something to do with it:

I’ve said this before, but one of the single biggest culture shocks I’ve ever received was when I was talking to someone about five years younger than I was, and she said “Wait, you play video games? I’m surprised. You seem like way too much of a nerd to play video games. Isn’t that like a fratboy jock thing?”

Appendix: FAQ

Here are some responses to the most common online comments.

Plat? You suck at Overwatch

Yep. But I sucked roughly equally on both accounts (actually somewhat more on the masculine account because it was rated higher and I was playing a bit out of my depth). Also, that's not a question.

This is just a blog post, it's not an academic study, the results are crap.

There's nothing magic about academic papers. I have my name on a few publications, including one that won best paper award at the top conference in its field. My median blog post is more rigorous than my median paper or, for that matter, the median paper that I read.

When I write a paper, I have to deal with co-authors who push for putting in false or misleading material that makes the paper look good and my ability to push back against this has been fairly limited. On my blog, I don't have to deal with that and I can write up results that are accurate (to the best of my ability) even if it makes the result look less interesting or less likely to win an award.

Gamers have always been toxic, that's just nostalgia talking.

If I pull game logs for subspace, this seems to be false. YMMV depending on what games you played, I suppose. FWIW, airmash seems to be the modern version of subspace, and (until the game died), it was much more toxic than subspace even if you just compare on a per-game basis, despite having much smaller games (25 people for a good sized game in airmash, vs. 95 for subspace).

This is totally invalid because you didn't talk on voice chat.

At the ranks I played, not talking on voice was the norm. It would be nice to have talking or not talking on voice chat be an independent variable, but that would require playing even more games to get data for another set of conditions, and if I wasn't going to do that, choosing the condition that's most common doesn't make the entire experiment invalid, IMO.

Some people report that, post "endorsements" patch, talking on voice chat is much more common. I tested this out by playing 20 (non-comp) games just after the "Paris" patch. Three had comments on voice chat. One was someone playing random music clips, one had someone screaming at someone else for playing incorrectly, and one had useful callouts on voice chat. It's possible I'd see something different with more games or in comp, but I don't think it's obvious that voice chat is common for most people after the "endorsements" patch.

Appendix: code and data

If you want to play with this data and model yourself, experiment with different priors, run a posterior predictive check, etc., here's a snippet of R code that embeds the data:

library(brms)
library(modelr)
library(tidybayes)
library(tidyverse)

# Number of games in which a comment happened ("xplain") out of games played,
# broken down by game type (competitive vs. quickplay) and account gender
d <- tribble(
  ~game_type, ~gender, ~xplain, ~games,
  "comp", "female", 7, 35,
  "comp", "male", 1, 23,
  "qp", "female", 6, 149,
  "qp", "male", 2, 132
)

# 0/1 indicator variables for the regression
d <- d %>% mutate(female = ifelse(gender == "female", 1, 0), comp = ifelse(game_type == "comp", 1, 0))

# Binomial (logistic) regression on games-with-a-comment out of games played,
# with weakly informative normal(0, 10) priors on the coefficients
result <-
  brm(data = d, family = binomial,
      xplain | trials(games) ~ female + comp,
      prior = c(set_prior("normal(0,10)", class = "b")),
      iter = 25000, warmup = 500, cores = 4, chains = 4)

The model here is simple enough that I wouldn't expect the version of software used to significantly affect results, but in case you're curious, this was done with brms 2.7.0, rstan 2.18.2, on R 3.5.1.
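If you do fit the model, a few standard brms calls are enough to get a quick read on the results; this is just a minimal sketch (exact output column names can differ slightly across brms versions):

summary(result)                   # posterior summaries for the intercept, female, and comp terms (log-odds scale)

exp(fixef(result)[, "Estimate"])  # point estimates converted to odds ratios

pp_check(result)                  # rough posterior predictive check against the observed counts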

Thanks to Leah Hanson, Sean Talts and Sean's math/stats reading group, Annie Cherkaev, Robert Schuessler, Wesley Aptekar-Cassels, Julia Evans, Paul Gowder, Jonathan Dahan, Bradley Boccuzzi, Akiva Leffert, and one or more anonymous commenters for comments/corrections/discussion.

Tue, 19 Feb 2019 00:00:00 +0000


Fsyncgate: errors on fsync are unrecoverable

This is an archive of the original "fsyncgate" email thread. This is posted here because I wanted a link that would fit on a slide for a talk on file safety, in a mobile-friendly, non-bloated format.

From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Subject:Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date:2018-03-28 02:23:46

Hi all

Some time ago I ran into an issue where a user encountered data corruption after a storage error. PostgreSQL played a part in that corruption by allowing a checkpoint to complete after what should've been a fatal error.

TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means "all writes since the last fsync have hit disk" but we assume it means "all writes since the last SUCCESSFUL fsync have hit disk".

Pg wrote some blocks, which went to OS dirty buffers for writeback. Writeback failed due to an underlying storage error. The block I/O layer and XFS marked the writeback page as failed (AS_EIO), but had no way to tell the app about the failure. When Pg called fsync() on the FD during the next checkpoint, fsync() returned EIO because of the flagged page, to tell Pg that a previous async write failed. Pg treated the checkpoint as failed and didn't advance the redo start position in the control file.

All good so far.

But then we retried the checkpoint, which retried the fsync(). The retry succeeded, because the prior fsync() cleared the AS_EIO bad page flag.

The write never made it to disk, but we completed the checkpoint, and merrily carried on our way. Whoops, data loss.

The clear-error-and-continue behaviour of fsync is not documented as far as I can tell. Nor is fsync() returning EIO unless you have a very new linux man-pages with the patch I wrote to add it. But from what I can see in the POSIX standard we are not given any guarantees about what happens on fsync() failure at all, so we're probably wrong to assume that retrying fsync() is safe.

If the server had been using ext3 or ext4 with errors=remount-ro, the problem wouldn't have occurred because the first I/O error would've remounted the FS and stopped Pg from continuing. But XFS doesn't have that option. There may be other situations where this can occur too, involving LVM and/or multipath, but I haven't comprehensively dug out the details yet.

It proved possible to recover the system by faking up a backup label from before the first incorrectly-successful checkpoint, forcing redo to repeat and write the lost blocks. But ... what a mess.

I posted about the underlying fsync issue here some time ago:

https://stackoverflow.com/q/42434872/398670

but haven't had a chance to follow up about the Pg specifics.

I've been looking at the problem on and off and haven't come up with a good answer. I think we should just PANIC and let redo sort it out by repeating the failed write when it repeats work since the last checkpoint.

The API offered by async buffered writes and fsync offers us no way to find out which page failed, so we can't just selectively redo that write. I think we do know the relfilenode associated with the fd that failed to fsync, but not much more. So the alternative seems to be some sort of potentially complex online-redo scheme where we replay WAL only for the relation on which we had the fsync() error, while otherwise servicing queries normally. That's likely to be extremely error-prone and hard to test, and it's trying to solve a case where on other filesystems the whole DB would grind to a halt anyway.

I looked into whether we can solve it with use of the AIO API instead, but the mess is even worse there - from what I can tell you can't even reliably guarantee fsync at all on all Linux kernel versions.

We already PANIC on fsync() failure for WAL segments. We just need to do the same for data forks at least for EIO. This isn't as bad as it seems because AFAICS fsync only returns EIO in cases where we should be stopping the world anyway, and many FSes will do that for us.

There are rather a lot of pg_fsync() callers. While we could handle this case-by-case for each one, I'm tempted to just make pg_fsync() itself intercept EIO and PANIC. Thoughts?


From:Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Date:2018-03-28 03:53:08

Craig Ringer writes:

TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

Retrying fsync() is not OK at least on Linux. When fsync() returns success it means "all writes since the last fsync have hit disk" but we assume it means "all writes since the last SUCCESSFUL fsync have hit disk".

If that's actually the case, we need to push back on this kernel brain damage, because as you're describing it fsync would be completely useless.

Moreover, POSIX is entirely clear that successful fsync means all preceding writes for the file have been completed, full stop, doesn't matter when they were issued.


From:Michael Paquier <michael(at)paquier(dot)xyz>
Date:2018-03-29 02:30:59

On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:

Craig Ringer writes:

TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

Any callers of pg_fsync in the backend code are careful enough to check the returned status, sometimes doing retries like in mdsync, so what is proposed here would be a regression.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-03-29 02:48:27

On Thu, Mar 29, 2018 at 3:30 PM, Michael Paquier wrote:

On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:

Craig Ringer writes:

TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

Any callers of pg_fsync in the backend code are careful enough to check the returned status, sometimes doing retries like in mdsync, so what is proposed here would be a regression.

Craig, is the phenomenon you described the same as the second issue "Reporting writeback errors" discussed in this article?

https://lwn.net/Articles/724307/

"Current kernels might report a writeback error on an fsync() call, but there are a number of ways in which that can fail to happen."

That's... I'm speechless.


From:Justin Pryzby <pryzby(at)telsasoft(dot)com>
Date:2018-03-29 05:00:31

On Thu, Mar 29, 2018 at 11:30:59AM +0900, Michael Paquier wrote:

On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:

Craig Ringer writes:

TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

Any callers of pg_fsync in the backend code are careful enough to check the returned status, sometimes doing retries like in mdsync, so what is proposed here would be a regression.

The retries are the source of the problem ; the first fsync() can return EIO, and also clears the error causing a 2nd fsync (of the same data) to return success.

(Note, I can see that it might be useful to PANIC on EIO but retry for ENOSPC).

On Thu, Mar 29, 2018 at 03:48:27PM +1300, Thomas Munro wrote:

Craig, is the phenomenon you described the same as the second issue "Reporting writeback errors" discussed in this article? https://lwn.net/Articles/724307/

Worse, the article acknowledges the behavior without apparently suggesting to change it:

"Storing that value in the file structure has an important benefit: it makes it possible to report a writeback error EXACTLY ONCE TO EVERY PROCESS THAT CALLS FSYNC() .... In current kernels, ONLY THE FIRST CALLER AFTER AN ERROR OCCURS HAS A CHANCE OF SEEING THAT ERROR INFORMATION."

I believe I reproduced the problem behavior using dmsetup "error" target, see attached.

strace looks like this:

kernel is Linux 4.10.0-28-generic #32~16.04.2-Ubuntu SMP Thu Jul 20 10:19:48 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

1open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
2write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
3write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
4write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
5write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
6write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
7write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 8192
8write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = 2560
9write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192) = -1 ENOSPC (No space left on device)
10dup(2)                                  = 4
11fcntl(4, F_GETFL)                       = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE)
12brk(NULL)                               = 0x1299000
13brk(0x12ba000)                          = 0x12ba000
14fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
15write(4, "write(1): No space left on devic"..., 34write(1): No space left on device
16) = 34
17close(4)                                = 0
18fsync(3)                                = -1 EIO (Input/output error)
19dup(2)                                  = 4
20fcntl(4, F_GETFL)                       = 0x8402 (flags O_RDWR|O_APPEND|O_LARGEFILE)
21fstat(4, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 2), ...}) = 0
22write(4, "fsync(1): Input/output error\n", 29fsync(1): Input/output error
23) = 29
24close(4)                                = 0
25close(3)                                = 0
26open("/dev/mapper/eio", O_RDWR|O_CREAT, 0600) = 3
27fsync(3)                                = 0
28write(3, "\0", 1)                       = 1
29fsync(3)                                = 0
30exit_group(0)                           = ?

2: EIO isn't seen initially due to writeback page cache;

9: ENOSPC due to small device

18: original IO error reported by fsync, good

25: the original FD is closed

26: ..and file reopened

27: fsync on file with still-dirty data+EIO returns success BAD

10, 19: I'm not sure why there's dup(2), I guess glibc thinks that perror should write to a separate FD (?)

Also note, close() ALSO returned success..which you might think exonerates the 2nd fsync(), but I think may itself be problematic, no? In any case, the 2nd byte certainly never got written to DM error, and the failure status was lost following fsync().

I get the exact same behavior if I break after one write() loop, such as to avoid ENOSPC.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-03-29 05:06:22

On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby wrote:

The retries are the source of the problem ; the first fsync() can return EIO, and also clears the error causing a 2nd fsync (of the same data) to return success.

What I'm failing to grok here is how that error flag even matters, whether it's a single bit or a counter as described in that patch. If write back failed, the page is still dirty. So all future calls to fsync() need to try to flush it again, and (presumably) fail again (unless it happens to succeed this time around).


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-03-29 05:25:51

On 29 March 2018 at 13:06, Thomas Munro wrote:

On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby wrote:

The retries are the source of the problem ; the first fsync() can return EIO, and also clears the error causing a 2nd fsync (of the same data) to return success.

What I'm failing to grok here is how that error flag even matters, whether it's a single bit or a counter as described in that patch. If write back failed, the page is still dirty. So all future calls to fsync() need to try to flush it again, and (presumably) fail again (unless it happens to succeed this time around).

You'd think so. But it doesn't appear to work that way. You can see yourself with the error device-mapper destination mapped over part of a volume.

I wrote a test case here.

https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear.c

I don't pretend the kernel behaviour is sane. And it's possible I've made an error in my analysis. But since I've observed this in the wild, and seen it in a test case, I strongly suspect that what I've described is just what's happening, brain-dead or no.

Presumably the kernel marks the page clean when it dispatches it to the I/O subsystem and doesn't dirty it again on I/O error? I haven't dug that deep on the kernel side. See the stackoverflow post for details on what I found in kernel code analysis.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-03-29 05:32:43

On 29 March 2018 at 10:48, Thomas Munro wrote:

On Thu, Mar 29, 2018 at 3:30 PM, Michael Paquier wrote:

On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:

Craig Ringer writes:

TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

Any callers of pg_fsync in the backend code are careful enough to check the returned status, sometimes doing retries like in mdsync, so what is proposed here would be a regression.

Craig, is the phenomenon you described the same as the second issue "Reporting writeback errors" discussed in this article?

https://lwn.net/Articles/724307/

A variant of it, by the looks.

The problem in our case is that the kernel only tells us about the error once. It then forgets about it. So yes, that seems like a variant of the statement:

"Current kernels might report a writeback error on an fsync() call, but there are a number of ways in which that can fail to happen."

That's... I'm speechless.

Yeah.

It's a bit nuts.

I was astonished when I saw the behaviour, and that it appears undocumented.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-03-29 05:35:47

On 29 March 2018 at 10:30, Michael Paquier wrote:

On Tue, Mar 27, 2018 at 11:53:08PM -0400, Tom Lane wrote:

Craig Ringer writes:

TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

Any callers of pg_fsync in the backend code are careful enough to check the returned status, sometimes doing retries like in mdsync, so what is proposed here would be a regression.

I covered this in my original post.

Yes, we check the return value. But what do we do about it? For fsyncs of heap files, we ERROR, aborting the checkpoint. We'll retry the checkpoint later, which will retry the fsync(). Which will now appear to succeed because the kernel forgot that it lost our writes after telling us the first time. So we do check the error code, which returns success, and we complete the checkpoint and move on.

But we only retried the fsync, not the writes before the fsync.

So we lost data. Or rather, failed to detect that the kernel did so, so our checkpoint was bad and could not be completed.

The problem is that we keep retrying checkpoints without repeating the writes leading up to the checkpoint, and retrying fsync.

I don't pretend the kernel behaviour is sane, but we'd better deal with it anyway.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-03-29 05:58:45

On 28 March 2018 at 11:53, Tom Lane wrote:

Craig Ringer writes:

TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as well to avoid similar lost-page-write issues.

It's not necessary on ext3/ext4 with errors=remount-ro, but that's only because the FS stops us dead in our tracks.

I don't pretend it's sane. The kernel behaviour is IMO crazy. If it's going to lose a write, it should at minimum mark the FD as broken so no further fsync() or anything else can succeed on the FD, and an app that cares about durability must repeat the whole set of work since the prior successful fsync(). Just reporting it once and forgetting it is madness.

But even if we convince the kernel folks of that, how do other platforms behave? And how long before these kernels are out of use? We'd better deal with it, crazy or no.

Please see my StackOverflow post for the kernel-level explanation. Note also the test case link there. https://stackoverflow.com/a/42436054/398670

Retrying fsync() is not OK at least on Linux. When fsync() returns success it means "all writes since the last fsync have hit disk" but we assume it means "all writes since the last SUCCESSFUL fsync have hit disk".

If that's actually the case, we need to push back on this kernel brain damage, because as you're describing it fsync would be completely useless.

It's not useless, it's just telling us something other than what we think it means. The promise it seems to give us is that if it reports an error once, everything after that is useless, so we should throw our toys, close and reopen everything, and redo from the last known-good state.

Though as Tomas posted below, it provides rather weaker guarantees than I thought in some other areas too. See that lwn.net article he linked.

Moreover, POSIX is entirely clear that successful fsync means all preceding writes for the file have been completed, full stop, doesn't matter when they were issued.

I can't find anything that says so to me. Please quote relevant spec.

I'm working from http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html which states that

"The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined. The fsync() function shall not return until the system has completed that action or until an error is detected."

My reading is that POSIX does not specify what happens AFTER an error is detected. It doesn't say that error has to be persistent and that subsequent calls must also report the error. It also says:

"If the fsync() function fails, outstanding I/O operations are not guaranteed to have been completed."

but that doesn't clarify matters much either, because it can be read to mean that once there's been an error reported for some IO operations there's no guarantee those operations are ever completed even after a subsequent fsync returns success.

I'm not seeking to defend what the kernel seems to be doing. Rather, saying that we might see similar behaviour on other platforms, crazy or not. I haven't looked past linux yet, though.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-03-29 12:07:56

On Thu, Mar 29, 2018 at 6:58 PM, Craig Ringer wrote:

On 28 March 2018 at 11:53, Tom Lane wrote:

Craig Ringer writes:

TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as well to avoid similar lost-page-write issues.

I found your discussion with kernel hacker Jeff Layton at https://lwn.net/Articles/718734/ in which he said: "The stackoverflow writeup seems to want a scheme where pages stay dirty after a writeback failure so that we can try to fsync them again. Note that that has never been the case in Linux after hard writeback failures, AFAIK, so programs should definitely not assume that behavior."

The article above that says the same thing a couple of different ways, ie that writeback failure leaves you with pages that are neither written to disk successfully nor marked dirty.

If I'm reading various articles correctly, the situation was even worse before his errseq_t stuff landed. That fixed cases of completely unreported writeback failures due to sharing of PG_error for both writeback and read errors with certain filesystems, but it doesn't address the clean pages problem.

Yeah, I see why you want to PANIC.

Moreover, POSIX is entirely clear that successful fsync means all preceding writes for the file have been completed, full stop, doesn't matter when they were issued.

I can't find anything that says so to me. Please quote relevant spec.

I'm working from http://pubs.opengroup.org/onlinepubs/009695399/functions/fsync.html which states that

"The fsync() function shall request that all data for the open file descriptor named by fildes is to be transferred to the storage device associated with the file described by fildes. The nature of the transfer is implementation-defined. The fsync() function shall not return until the system has completed that action or until an error is detected."

My reading is that POSIX does not specify what happens AFTER an error is detected. It doesn't say that error has to be persistent and that subsequent calls must also report the error. It also says:

FWIW my reading is the same as Tom's. It says "all data for the open file descriptor" without qualification or special treatment after errors. Not "some".

I'm not seeking to defend what the kernel seems to be doing. Rather, saying that we might see similar behaviour on other platforms, crazy or not. I haven't looked past linux yet, though.

I see no reason to think that any other operating system would behave that way without strong evidence... This is openly acknowledged to be "a mess" and "a surprise" in the Filesystem Summit article. I am not really qualified to comment, but from a cursory glance at FreeBSD's vfs_bio.c I think it's doing what you'd hope for... see the code near the comment "Failed write, redirty."


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-03-29 13:15:10

On 29 March 2018 at 20:07, Thomas Munro wrote:

On Thu, Mar 29, 2018 at 6:58 PM, Craig Ringer wrote:

On 28 March 2018 at 11:53, Tom Lane wrote:

Craig Ringer writes:

TL;DR: Pg should PANIC on fsync() EIO return.

Surely you jest.

No. I'm quite serious. Worse, we quite possibly have to do it for ENOSPC as well to avoid similar lost-page-write issues.

I found your discussion with kernel hacker Jeff Layton at https://lwn.net/Articles/718734/ in which he said: "The stackoverflow writeup seems to want a scheme where pages stay dirty after a writeback failure so that we can try to fsync them again. Note that that has never been the case in Linux after hard writeback failures, AFAIK, so programs should definitely not assume that behavior."

The article above that says the same thing a couple of different ways, ie that writeback failure leaves you with pages that are neither written to disk successfully nor marked dirty.

If I'm reading various articles correctly, the situation was even worse before his errseq_t stuff landed. That fixed cases of completely unreported writeback failures due to sharing of PG_error for both writeback and read errors with certain filesystems, but it doesn't address the clean pages problem.

Yeah, I see why you want to PANIC.

In more ways than one ;)

I'm not seeking to defend what the kernel seems to be doing. Rather, saying that we might see similar behaviour on other platforms, crazy or not. I haven't looked past linux yet, though.

I see no reason to think that any other operating system would behave that way without strong evidence... This is openly acknowledged to be "a mess" and "a surprise" in the Filesystem Summit article. I am not really qualified to comment, but from a cursory glance at FreeBSD's vfs_bio.c I think it's doing what you'd hope for... see the code near the comment "Failed write, redirty."

Ok, that's reassuring, but doesn't help us on the platform the great majority of users deploy on :(

"If on Linux, PANIC"

Hrm.


From:Catalin Iacob <iacobcatalin(at)gmail(dot)com>
Date:2018-03-29 16:20:00

On Thu, Mar 29, 2018 at 2:07 PM, Thomas Munro wrote:

I found your discussion with kernel hacker Jeff Layton at https://lwn.net/Articles/718734/ in which he said: "The stackoverflow writeup seems to want a scheme where pages stay dirty after a writeback failure so that we can try to fsync them again. Note that that has never been the case in Linux after hard writeback failures, AFAIK, so programs should definitely not assume that behavior."

And a bit below in the same comments, to this question about PG: "So, what are the options at this point? The assumption was that we can repeat the fsync (which as you point out is not the case), or shut down the database and perform recovery from WAL", the same Jeff Layton seems to agree PANIC is the appropriate response: "Replaying the WAL synchronously sounds like the simplest approach when you get an error on fsync. These are uncommon occurrences for the most part, so having to fall back to slow, synchronous error recovery modes when this occurs is probably what you want to do.". And right after, he confirms the errseq_t patches are about always detecting this, not more: "The main thing I working on is to better guarantee is that you actually get an error when this occurs rather than silently corrupting your data. The circumstances where that can occur require some corner-cases, but I think we need to make sure that it doesn't occur."

Jeff's comments in the pull request that merged errseq_t are worth reading as well: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750

The article above that says the same thing a couple of different ways, ie that writeback failure leaves you with pages that are neither written to disk successfully nor marked dirty.

If I'm reading various articles correctly, the situation was even worse before his errseq_t stuff landed. That fixed cases of completely unreported writeback failures due to sharing of PG_error for both writeback and read errors with certain filesystems, but it doesn't address the clean pages problem.

Indeed, that's exactly how I read it as well (opinion formed independently before reading your sentence above). The errseq_t patches landed in v4.13 by the way, so very recently.

Yeah, I see why you want to PANIC.

Indeed. Even doing that leaves question marks about all the kernel versions before v4.13, which at this point is pretty much everything out there, not even detecting this reliably. This is messy.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-03-29 21:18:14

On Fri, Mar 30, 2018 at 5:20 AM, Catalin Iacob wrote:

Jeff's comments in the pull request that merged errseq_t are worth reading as well: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=088737f44bbf6378745f5b57b035e57ee3dc4750

Wow. It looks like there may be a separate question of when each filesystem adopted this new infrastructure?

Yeah, I see why you want to PANIC.

Indeed. Even doing that leaves question marks about all the kernel versions before v4.13, which at this point is pretty much everything out there, not even detecting this reliably. This is messy.

The pre-errseq_t problems are beyond our control. There's nothing we can do about that in userspace (except perhaps abandon OS-buffered IO, a big project). We just need to be aware that this problem exists in certain kernel versions and be grateful to Layton for fixing it.

The dropped dirty flag problem is something we can and in my view should do something about, whatever we might think about that design choice. As Andrew Gierth pointed out to me in an off-list chat about this, by the time you've reached this state, both PostgreSQL's buffer and the kernel's buffer are clean and might be reused for another block at any time, so your data might be gone from the known universe -- we don't even have the option to rewrite our buffers in general. Recovery is the only option.

Thank you to Craig for chasing this down and +1 for his proposal, on Linux only.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-03-31 13:24:28

On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:

Yeah, I see why you want to PANIC.

Indeed. Even doing that leaves question marks about all the kernel versions before v4.13, which at this point is pretty much everything out there, not even detecting this reliably. This is messy.

There may still be a way to reliably detect this on older kernel versions from userspace, but it will be messy whatsoever. On EIO errors, the kernel will not restore the dirty page flags, but it will flip the error flags on the failed pages. One could mmap() the file in question, obtain the PFNs (via /proc/pid/pagemap) and enumerate those to match the ones with the error flag switched on (via /proc/kpageflags). This could serve at least as a detection mechanism, but one could also further use this info to logically map the pages that failed IO back to the original file offsets, and potentially retry IO just for those file ranges that cover the failed pages. Just an idea, not tested.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-03-31 16:13:09

On 31 March 2018 at 21:24, Anthony Iliopoulos wrote:

On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:

Yeah, I see why you want to PANIC.

Indeed. Even doing that leaves question marks about all the kernel versions before v4.13, which at this point is pretty much everything out there, not even detecting this reliably. This is messy.

There may still be a way to reliably detect this on older kernel versions from userspace, but it will be messy whatsoever. On EIO errors, the kernel will not restore the dirty page flags, but it will flip the error flags on the failed pages. One could mmap() the file in question, obtain the PFNs (via /proc/pid/pagemap) and enumerate those to match the ones with the error flag switched on (via /proc/kpageflags). This could serve at least as a detection mechanism, but one could also further use this info to logically map the pages that failed IO back to the original file offsets, and potentially retry IO just for those file ranges that cover the failed pages. Just an idea, not tested.

That sounds like a huge amount of complexity, with uncertainty as to how it'll behave kernel-to-kernel, for negligible benefit.

I was exploring the idea of doing selective recovery of one relfilenode, based on the assumption that we know the filenode related to the fd that failed to fsync(). We could redo only WAL on that relation. But it fails the same test: it's too complex for a niche case that shouldn't happen in the first place, so it'll probably have bugs, or grow bugs in bitrot over time.

Remember, if you're on ext4 with errors=remount-ro, you get shut down even harder than a PANIC. So we should just use the big hammer here.

I'll send a patch this week.


From:Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Date:2018-03-31 16:38:12

Craig Ringer writes:

So we should just use the big hammer here.

And bitch, loudly and publicly, about how broken this kernel behavior is. If we make enough of a stink maybe it'll get fixed.


From:Michael Paquier <michael(at)paquier(dot)xyz>
Date:2018-04-01 00:20:38

On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:

Craig Ringer writes:

So we should just use the big hammer here.

And bitch, loudly and publicly, about how broken this kernel behavior is. If we make enough of a stink maybe it'll get fixed.

That won't fix anything released already, so as per the information gathered something has to be done anyway. The discussion of this thread is spreading quite a lot actually.

Handling things at a low-level looks like a better plan for the backend. Tools like pg_basebackup and pg_dump also issue fsync's on the data created, we should do an equivalent for them, with some exit() calls in file_utils.c. As of now failures are logged to stderr but not considered fatal.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-01 00:58:22

On Sun, Apr 01, 2018 at 12:13:09AM +0800, Craig Ringer wrote:

On 31 March 2018 at 21:24, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:

On Fri, Mar 30, 2018 at 10:18:14AM +1300, Thomas Munro wrote:

Yeah, I see why you want to PANIC.

Indeed. Even doing that leaves question marks about all the kernel versions before v4.13, which at this point is pretty much everything out there, not even detecting this reliably. This is messy.

There may still be a way to reliably detect this on older kernel versions from userspace, but it will be messy whatsoever. On EIO errors, the kernel will not restore the dirty page flags, but it will flip the error flags on the failed pages. One could mmap() the file in question, obtain the PFNs (via /proc/pid/pagemap) and enumerate those to match the ones with the error flag switched on (via /proc/kpageflags). This could serve at least as a detection mechanism, but one could also further use this info to logically map the pages that failed IO back to the original file offsets, and potentially retry IO just for those file ranges that cover the failed pages. Just an idea, not tested.

That sounds like a huge amount of complexity, with uncertainty as to how it'll behave kernel-to-kernel, for negligible benefit.

Those interfaces have been around since the kernel 2.6 times and are rather stable, but I was merely responding to your original post comment regarding having a way of finding out which page(s) failed. I assume that indeed there would be no benefit, especially since those errors are usually not transient (typically they come from hard medium faults), and although a filesystem could theoretically mask the error by allocating a different logical block, I am not aware of any implementation that currently does that.

I was exploring the idea of doing selective recovery of one relfilenode, based on the assumption that we know the filenode related to the fd that failed to fsync(). We could redo only WAL on that relation. But it fails the same test: it's too complex for a niche case that shouldn't happen in the first place, so it'll probably have bugs, or grow bugs in bitrot over time.

Fully agree, those cases should be sufficiently rare that a complex and possibly non-maintainable solution is not really warranted.

Remember, if you're on ext4 with errors=remount-ro, you get shut down even harder than a PANIC. So we should just use the big hammer here.

I am not entirely sure what you mean here, does Pg really treat write() errors as fatal? Also, the kind of errors that ext4 detects with this option is at the superblock level and govern metadata rather than actual data writes (recall that those are buffered anyway, no actual device IO has to take place at the time of write()).


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-01 01:14:46

On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:

Craig Ringer writes:

So we should just use the big hammer here.

And bitch, loudly and publicly, about how broken this kernel behavior is. If we make enough of a stink maybe it'll get fixed.

It is not likely to be fixed (beyond what has been done already with the manpage patches and errseq_t fixes on the reporting level). The issue is, the kernel needs to deal with hard IO errors at that level somehow, and since those errors typically persist, re-dirtying the pages would not really solve the problem (unless some filesystem remaps the request to a different block, assuming the device is alive). Keeping around dirty pages that cannot possibly be written out is essentially a memory leak, as those pages would stay around even after the application has exited.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-01 18:24:51

On Fri, Mar 30, 2018 at 10:18 AM, Thomas Munro wrote:

... on Linux only.

Apparently I was too optimistic. I had looked only at FreeBSD, which keeps the page around and dirties it so we can retry, but the other BSDs apparently don't (FreeBSD changed that in 1999). From what I can tell from the sources below, we have:

Linux, OpenBSD, NetBSD: retrying fsync() after EIO lies
FreeBSD, Illumos: retrying fsync() after EIO tells the truth

Maybe my drive-by assessment of those kernel routines is wrong and someone will correct me, but I'm starting to think you might be better to assume the worst on all systems. Perhaps a GUC that defaults to panicking, so that users on those rare OSes could turn that off? Even then I'm not sure if the failure mode will be that great anyway or if it's worth having two behaviours. Thoughts?

http://mail-index.netbsd.org/netbsd-users/2018/03/30/msg020576.html
https://github.com/NetBSD/src/blob/trunk/sys/kern/vfs_bio.c#L1059
https://github.com/openbsd/src/blob/master/sys/kern/vfs_bio.c#L867
https://github.com/freebsd/freebsd/blob/master/sys/kern/vfs_bio.c#L2631
https://github.com/freebsd/freebsd/commit/e4e8fec98ae986357cdc208b04557dba55a59266
https://github.com/illumos/illumos-gate/blob/master/usr/src/uts/common/os/bio.c#L441


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-02 15:03:42

On 2 April 2018 at 02:24, Thomas Munro wrote:

Maybe my drive-by assessment of those kernel routines is wrong and someone will correct me, but I'm starting to think you might be better to assume the worst on all systems. Perhaps a GUC that defaults to panicking, so that users on those rare OSes could turn that off? Even then I'm not sure if the failure mode will be that great anyway or if it's worth having two behaviours. Thoughts?

I see little benefit to not just PANICing unconditionally on EIO, really. It shouldn't happen, and if it does, we want to be pretty conservative and adopt a data-protective approach.

I'm rather more worried by doing it on ENOSPC. Which looks like it might be necessary from what I recall finding in my test case + kernel code reading. I really don't want to respond to a possibly-transient ENOSPC by PANICing the whole server unnecessarily.

BTW, the support team at 2ndQ is presently working on two separate issues where ENOSPC resulted in DB corruption, though neither of them involve logs of lost page writes. I'm planning on taking some time tomorrow to write a torture tester for Pg's ENOSPC handling and to verify ENOSPC handling in the test case I linked to in my original StackOverflow post.

If this is just an EIO issue then I see no point doing anything other than PANICing unconditionally.

If it's a concern for ENOSPC too, we should try harder to fail more nicely whenever we possibly can.


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-02 18:13:46

Hi,

On 2018-04-01 03:14:46 +0200, Anthony Iliopoulos wrote:

On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:

Craig Ringer writes:

So we should just use the big hammer here.

And bitch, loudly and publicly, about how broken this kernel behavior is. If we make enough of a stink maybe it'll get fixed.

It is not likely to be fixed (beyond what has been done already with the manpage patches and errseq_t fixes on the reporting level). The issue is, the kernel needs to deal with hard IO errors at that level somehow, and since those errors typically persist, re-dirtying the pages would not really solve the problem (unless some filesystem remaps the request to a different block, assuming the device is alive).

Throwing away the dirty pages and persisting the error seems a lot more reasonable. Then provide a fcntl (or whatever) extension that can clear the error status in the few cases that the application that wants to gracefully deal with the case.

Keeping around dirty pages that cannot possibly be written out is essentially a memory leak, as those pages would stay around even after the application has exited.

Why do dirty pages need to be kept around in the case of persistent errors? I don't think the lack of automatic recovery in that case is what anybody is complaining about. It's that the error goes away and there's no reasonable way to separate out such an error from some potential transient errors.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-02 18:53:20

On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:

Hi,

On 2018-04-01 03:14:46 +0200, Anthony Iliopoulos wrote:

On Sat, Mar 31, 2018 at 12:38:12PM -0400, Tom Lane wrote:

Craig Ringer writes:

So we should just use the big hammer here.

And bitch, loudly and publicly, about how broken this kernel behavior is. If we make enough of a stink maybe it'll get fixed.

It is not likely to be fixed (beyond what has been done already with the manpage patches and errseq_t fixes on the reporting level). The issue is, the kernel needs to deal with hard IO errors at that level somehow, and since those errors typically persist, re-dirtying the pages would not really solve the problem (unless some filesystem remaps the request to a different block, assuming the device is alive).

Throwing away the dirty pages and persisting the error seems a lot more reasonable. Then provide a fcntl (or whatever) extension that can clear the error status in the few cases where the application wants to gracefully deal with the case.

Given precisely that the dirty pages which cannot be written out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).

The error reporting is thus consistent with the intended semantics (which are sadly not properly documented). Repeated calls to fsync() simply do not imply that the kernel will retry to writeback the previously-failed pages, so the application needs to be aware of that. Persisting the error at the fsync() level would essentially mean moving application policy into the kernel.
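
For readers who want to see what these check-and-clear semantics look like at the call level, here's a small C illustration; an actual I/O error can't be provoked portably, so the failure path is only described in the comments:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("data.bin", O_WRONLY | O_CREAT, 0644);
        if (fd < 0)
            return 1;
        (void) write(fd, "x", 1);

        if (fsync(fd) != 0)
            fprintf(stderr, "first fsync failed: %s\n", strerror(errno));
        /* Under the semantics described above (Linux >= 4.13), if that first
         * call returned EIO, the failed pages have already been marked clean
         * and the error has been consumed, so this second call can return 0
         * even though the data never reached the disk. */
        if (fsync(fd) == 0)
            puts("second fsync reports success");

        return close(fd);
    }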


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-02 19:32:45

On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:

On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:

Throwing away the dirty pages and persisting the error seems a lot more reasonable. Then provide a fcntl (or whatever) extension that can clear the error status in the few cases where the application wants to gracefully deal with the case.

Given precisely that the dirty pages which cannot be written out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).

Meh^2.

"no reason" - except that there's absolutely no way to know what state the data is in. And that your application needs explicit handling of such failures. And that one FD might be used in a lots of different parts of the application, that fsyncs in one part of the application might be an ok failure, and in another not. Requiring explicit actions to acknowledge "we've thrown away your data for unknown reason" seems entirely reasonable.

The error reporting is thus consistent with the intended semantics (which are sadly not properly documented). Repeated calls to fsync() simply do not imply that the kernel will retry to writeback the previously-failed pages, so the application needs to be aware of that.

Which isn't what I've suggested.

Persisting the error at the fsync() level would essentially mean moving application policy into the kernel.

Meh.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-02 20:38:06

On Mon, Apr 02, 2018 at 12:32:45PM -0700, Andres Freund wrote:

On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:

On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:

Throwing away the dirty pages and persisting the error seems a lot more reasonable. Then provide a fcntl (or whatever) extension that can clear the error status in the few cases where the application wants to gracefully deal with the case.

Given precisely that the dirty pages which cannot be written out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).

Meh^2.

"no reason" - except that there's absolutely no way to know what state the data is in. And that your application needs explicit handling of such failures. And that one FD might be used in a lots of different parts of the application, that fsyncs in one part of the application might be an ok failure, and in another not. Requiring explicit actions to acknowledge "we've thrown away your data for unknown reason" seems entirely reasonable.

As long as fsync() indicates an error on its first invocation, the application is fully aware that data has been lost between this point in time and the last call to fsync(). Persisting this error any further does not change this or add any new info - on the contrary, it adds confusion, as subsequent write()s and fsync()s on other pages can succeed but will be reported as failures.

The application will need to deal with that first error irrespective of subsequent return codes from fsync(). Conceptually every fsync() invocation demarcates an epoch for which it reports potential errors, so the caller needs to take responsibility for that particular epoch.

Callers that are not affected by the potential outcome of fsync() and do not react to errors have no reason for calling it in the first place (and thus masking failure from subsequent callers that may indeed care).


From:Stephen Frost <sfrost(at)snowman(dot)net>
Date:2018-04-02 20:58:08

Greetings,

Anthony Iliopoulos (ailiop(at)altatus(dot)com) wrote:

On Mon, Apr 02, 2018 at 12:32:45PM -0700, Andres Freund wrote:

On 2018-04-02 20:53:20 +0200, Anthony Iliopoulos wrote:

On Mon, Apr 02, 2018 at 11:13:46AM -0700, Andres Freund wrote:

Throwing away the dirty pages and persisting the error seems a lot more reasonable. Then provide a fcntl (or whatever) extension that can clear the error status in the few cases where the application wants to gracefully deal with the case.

Given precisely that the dirty pages which cannot be written out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).

Meh^2.

"no reason" - except that there's absolutely no way to know what state the data is in. And that your application needs explicit handling of such failures. And that one FD might be used in a lots of different parts of the application, that fsyncs in one part of the application might be an ok failure, and in another not. Requiring explicit actions to acknowledge "we've thrown away your data for unknown reason" seems entirely reasonable.

As long as fsync() indicates an error on its first invocation, the application is fully aware that data has been lost between this point in time and the last call to fsync(). Persisting this error any further does not change this or add any new info - on the contrary, it adds confusion, as subsequent write()s and fsync()s on other pages can succeed but will be reported as failures.

fsync() doesn't reflect the status of given pages, however, it reflects the status of the file descriptor, and as such the file, on which it's called. This notion that fsync() is actually only responsible for the changes which were made to a file since the last fsync() call is pure foolishness. If we were able to pass a list of pages or data ranges to fsync() for it to verify they're on disk then perhaps things would be different, but we can't, all we can do is ask to "please flush all the dirty pages associated with this file descriptor, which represents this file we opened, to disk, and let us know if you were successful."

Give us a way to ask "are these specific pages written out to persistent storage?" and we would certainly be happy to use it, and to repeatedly try to flush out pages which weren't synced to disk due to some transient error, and to track those cases and make sure that we don't incorrectly assume that they've been transferred to persistent storage.

The application will need to deal with that first error irrespective of subsequent return codes from fsync(). Conceptually every fsync() invocation demarcates an epoch for which it reports potential errors, so the caller needs to take responsibility for that particular epoch.

We do deal with that error- by realizing that it failed and later retrying the fsync(), which is when we get back an "all good! everything with this file descriptor you've opened is sync'd!" and happily expect that to be truth, when, in reality, it's an unfortunate lie and there are still pages associated with that file descriptor which are, in reality, dirty and not sync'd to disk.

Consider two independent programs where the first one writes to a file and then calls the second one whose job it is to go out and fsync(), perhaps async from the first, those files. Is the second program supposed to go write to each page that the first one wrote to, in order to ensure that all the dirty bits are set so that the fsync() will actually return if all the dirty pages are written?

Callers that are not affected by the potential outcome of fsync() and do not react to errors have no reason for calling it in the first place (and thus masking failure from subsequent callers that may indeed care).

Reacting on an error from an fsync() call could, based on how it's documented and actually implemented in other OS's, mean "run another fsync() to see if the error has resolved itself." Requiring that to mean "you have to go dirty all of the pages you previously dirtied to actually get a subsequent fsync() to do anything" is really just not reasonable- a given program may have no idea what was written to previously nor any particular reason to need to know, on the expectation that the fsync() call will flush any dirty pages, as it's documented to do.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-02 23:05:44

Hi Stephen,

On Mon, Apr 02, 2018 at 04:58:08PM -0400, Stephen Frost wrote:

fsync() doesn't reflect the status of given pages, however, it reflects the status of the file descriptor, and as such the file, on which it's called. This notion that fsync() is actually only responsible for the changes which were made to a file since the last fsync() call is pure foolishness. If we were able to pass a list of pages or data ranges to fsync() for it to verify they're on disk then perhaps things would be different, but we can't, all we can do is ask to "please flush all the dirty pages associated with this file descriptor, which represents this file we opened, to disk, and let us know if you were successful."

Give us a way to ask "are these specific pages written out to persistent storage?" and we would certainly be happy to use it, and to repeatedly try to flush out pages which weren't synced to disk due to some transient error, and to track those cases and make sure that we don't incorrectly assume that they've been transferred to persistent storage.

Indeed fsync() is simply a rather blunt instrument and a narrow legacy interface but further changing its established semantics (no matter how unreasonable they may be) is probably not the way to go.

Would using sync_file_range() be helpful? Potential errors would only apply to pages that cover the requested file ranges. There are a few caveats though:

(a) it still messes with the top-level error reporting so mixing it with callers that use fsync() and do care about errors will produce the same issue (clearing the error status).

(b) the error-reporting granularity is coarse (failure reporting applies to the entire requested range so you still don't know which particular pages/file sub-ranges failed writeback)

(c) the same "report and forget" semantics apply to repeated invocations of the sync_file_range() call, so again action will need to be taken upon first error encountered for the particular ranges.

The application will need to deal with that first error irrespective of subsequent return codes from fsync(). Conceptually every fsync() invocation demarcates an epoch for which it reports potential errors, so the caller needs to take responsibility for that particular epoch.

We do deal with that error- by realizing that it failed and later retrying the fsync(), which is when we get back an "all good! everything with this file descriptor you've opened is sync'd!" and happily expect that to be truth, when, in reality, it's an unfortunate lie and there are still pages associated with that file descriptor which are, in reality, dirty and not sync'd to disk.

It really turns out that this is not how the fsync() semantics work though, exactly because of the nature of the errors: even if the kernel retained the dirty bits on the failed pages, retrying to persist them on the same disk location would simply fail. Instead the kernel opts for marking those pages clean (since there is no other recovery strategy), and reporting once to the caller who can potentially deal with it in some manner. It is sadly a bad and undocumented convention.

Consider two independent programs where the first one writes to a file and then calls the second one whose job it is to go out and fsync(), perhaps async from the first, those files. Is the second program supposed to go write to each page that the first one wrote to, in order to ensure that all the dirty bits are set so that the fsync() will actually return if all the dirty pages are written?

I think what you have in mind are the semantics of sync() rather than fsync(), but as long as an application needs to ensure data are persisted to storage, it needs to retain those data in its heap until fsync() is successful instead of discarding them and relying on the kernel after write(). The pattern should be roughly like: write() -> fsync() -> free(), rather than write() -> free() -> fsync(). For example, if a partition gets full upon fsync(), then the application has a chance to persist the data in a different location, while the kernel cannot possibly make this decision and recover.
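
The write() -> fsync() -> free() ordering described here can be made concrete with a short C sketch; the paths and the fallback volume are made up:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Report success only if both write() and fsync() succeed, so the caller
     * still owns the data and can retry elsewhere on failure. */
    static int persist(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0) {
            int saved = errno;
            close(fd);
            errno = saved;
            return -1;
        }
        return close(fd);
    }

    int main(void)
    {
        char *buf = strdup("important record\n");
        if (buf == NULL)
            return 1;
        int ok = persist("/data/record", buf, strlen(buf)) == 0;
        if (!ok && errno == ENOSPC)   /* e.g. the partition filled up */
            ok = persist("/spare-volume/record", buf, strlen(buf)) == 0;
        if (ok)
            free(buf); /* write() -> fsync() -> free(); on failure we still hold the data */
        return ok ? 0 : 1;
    }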

Callers that are not affected by the potential outcome of fsync() and do not react to errors have no reason for calling it in the first place (and thus masking failure from subsequent callers that may indeed care).

Reacting on an error from an fsync() call could, based on how it's documented and actually implemented in other OS's, mean "run another fsync() to see if the error has resolved itself." Requiring that to mean "you have to go dirty all of the pages you previously dirtied to actually get a subsequent fsync() to do anything" is really just not reasonable- a given program may have no idea what was written to previously nor any particular reason to need to know, on the expectation that the fsync() call will flush any dirty pages, as it's documented to do.

I think we are conflating a few issues here: having the OS kernel being responsible for error recovery (so that subsequent fsync() would fix the problems) is one. This clearly is a design which most kernels have not really adopted for reasons outlined above (although having the FS layer recovering from hard errors transparently is open for discussion from what it seems [1]). Now, there is the issue of granularity of error reporting: userspace could benefit from a fine-grained indication of failed pages (or file ranges). Another issue is that of reporting semantics (report and clear), which is also a design choice made to avoid having higher-resolution error tracking and the corresponding memory overheads [1].

[1] https://lwn.net/Articles/718734/


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-02 23:23:24

On 2018-04-03 01:05:44 +0200, Anthony Iliopoulos wrote:

Would using sync_file_range() be helpful? Potential errors would only apply to pages that cover the requested file ranges. There are a few caveats though:

To quote sync_file_range(2):

   Warning
       This system call is extremely dangerous and should not be used in portable programs. None of these operations writes out the
       file's metadata. Therefore, unless the application is strictly performing overwrites of already-instantiated disk blocks, there
       are no guarantees that the data will be available after a crash. There is no user interface to know if a write is purely an
       overwrite. On filesystems using copy-on-write semantics (e.g., btrfs) an overwrite of existing allocated blocks is impossible.
       When writing into preallocated space, many filesystems also require calls into the block allocator, which this system call does
       not sync out to disk. This system call does not flush disk write caches and thus does not provide any data integrity on systems
       with volatile disk write caches.

Given the lack of metadata safety that seems entirely a no go. We use sfr(2), but only to force the kernel's hand around writing back earlier without throwing away cache contents.
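
For context, the sfr(2) usage described here is roughly the following kind of hint: it starts writeback for a range early but deliberately promises nothing about durability. This is a sketch, not PostgreSQL's code, and the file name is made up:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Ask the kernel to start writing back a dirty range now so that a later
     * fsync() has less work to do. This does not wait for completion, does not
     * sync metadata, and does not flush the disk write cache: it is only a
     * scheduling hint, never a durability guarantee. */
    static void hint_writeback(int fd, off_t offset, off_t nbytes)
    {
        (void) sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
    }

    int main(void)
    {
        int fd = open("table-file", O_WRONLY | O_CREAT, 0644); /* hypothetical file */
        if (fd < 0)
            return 1;
        char page[8192] = {0};
        (void) write(fd, page, sizeof page);
        hint_writeback(fd, 0, sizeof page); /* kick writeback off early */
        int rc = fsync(fd);                 /* the actual durability point */
        close(fd);
        return rc == 0 ? 0 : 1;
    }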

The application will need to deal with that first error irrespective of subsequent return codes from fsync(). Conceptually every fsync() invocation demarcates an epoch for which it reports potential errors, so the caller needs to take responsibility for that particular epoch.

We do deal with that error- by realizing that it failed and later retrying the fsync(), which is when we get back an "all good! everything with this file descriptor you've opened is sync'd!" and happily expect that to be truth, when, in reality, it's an unfortunate lie and there are still pages associated with that file descriptor which are, in reality, dirty and not sync'd to disk.

It really turns out that this is not how the fsync() semantics work though

Except on freebsd and solaris, and perhaps others.

, exactly because of the nature of the errors: even if the kernel retained the dirty bits on the failed pages, retrying to persist them on the same disk location would simply fail.

That's not guaranteed at all, think NFS.

Instead the kernel opts for marking those pages clean (since there is no other recovery strategy), and reporting once to the caller who can potentially deal with it in some manner. It is sadly a bad and undocumented convention.

It's broken behaviour justified post facto with the only rationale that was available, which explains why it's so unconvincing. You could just say "this ship has sailed, and it's too onerous to change because xxx" and this'd be a done deal. But claiming this is reasonable behaviour is ridiculous.

Again, you could just continue to error for this fd and still throw away the data.

Consider two independent programs where the first one writes to a file and then calls the second one whose job it is to go out and fsync(), perhaps async from the first, those files. Is the second program supposed to go write to each page that the first one wrote to, in order to ensure that all the dirty bits are set so that the fsync() will actually return if all the dirty pages are written?

I think what you have in mind are the semantics of sync() rather than fsync()

If you open the same file with two fds, and write with one, and fsync with another that's definitely supposed to work. And sync() isn't a realistic replacement in any sort of way because it's obviously systemwide, and thus entirely and completely unsuitable. Nor does it have any sort of better error reporting behaviour, does it?


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-02 23:27:35

On 3 April 2018 at 07:05, Anthony Iliopoulos wrote:

Hi Stephen,

On Mon, Apr 02, 2018 at 04:58:08PM -0400, Stephen Frost wrote:

fsync() doesn't reflect the status of given pages, however, it reflects the status of the file descriptor, and as such the file, on which it's called. This notion that fsync() is actually only responsible for the changes which were made to a file since the last fsync() call is pure foolishness. If we were able to pass a list of pages or data ranges to fsync() for it to verify they're on disk then perhaps things would be different, but we can't, all we can do is ask to "please flush all the dirty pages associated with this file descriptor, which represents this file we opened, to disk, and let us know if you were successful."

Give us a way to ask "are these specific pages written out to persistent storage?" and we would certainly be happy to use it, and to repeatedly try to flush out pages which weren't synced to disk due to some transient error, and to track those cases and make sure that we don't incorrectly assume that they've been transferred to persistent storage.

Indeed fsync() is simply a rather blunt instrument and a narrow legacy interface but further changing its established semantics (no matter how unreasonable they may be) is probably not the way to go.

They're undocumented and extremely surprising semantics that are arguably a violation of the POSIX spec for fsync(), or at least a surprising interpretation of it.

So I don't buy this argument.

It really turns out that this is not how the fsync() semantics work though, exactly because of the nature of the errors: even if the kernel retained the dirty bits on the failed pages, retrying to persist them on the same disk location would simply fail.

might simply fail.

It depends on why the error occurred.

I originally identified this behaviour on a multipath system. Multipath defaults to "throw the writes away, nobody really cares anyway" on error. It seems to figure a higher level will retry, or the application will receive the error and retry.

(See no_path_retry in multipath config. AFAICS the default is insanely dangerous and only suitable for specialist apps that understand the quirks; you should use no_path_retry=queue).

Instead the kernel opts for marking those pages clean (since there is no other recovery strategy),

and reporting once to the caller who can potentially deal with it in some manner. It is sadly a bad and undocumented convention.

It could mark the FD.

It's not just undocumented, it's a slightly creative interpretation of the POSIX spec for fsync.

Consider two independent programs where the first one writes to a file and then calls the second one whose job it is to go out and fsync(), perhaps async from the first, those files. Is the second program supposed to go write to each page that the first one wrote to, in order to ensure that all the dirty bits are set so that the fsync() will actually return if all the dirty pages are written?

I think what you have in mind are the semantics of sync() rather than fsync(), but as long as an application needs to ensure data are persisted to storage, it needs to retain those data in its heap until fsync() is successful instead of discarding them and relying on the kernel after write().

This is almost exactly what we tell application authors using PostgreSQL: the data isn't written until you receive a successful commit confirmation, so you'd better not forget it.

We provide applications with clear boundaries so they can know exactly what was, and was not, written. I guess the argument from the kernel is that the same is true: whatever was written since the last successful fsync is potentially lost and must be redone.

But the fsync behaviour is utterly undocumented and dubiously standard.

I think we are conflating a few issues here: having the OS kernel being responsible for error recovery (so that subsequent fsync() would fix the problems) is one. This clearly is a design which most kernels have not really adopted for reasons outlined above

[citation needed]

What do other major platforms do here? The post above suggests it's a bit of a mix of behaviours.

Now, there is the issue of granularity of error reporting: userspace could benefit from a fine-grained indication of failed pages (or file ranges).

Yep. I looked at AIO in the hopes that, if we used AIO, we'd be able to map a sync failure back to an individual AIO write.

But it seems AIO just adds more problems and fixes none. Flush behaviour with AIO from what I can tell is inconsistent version to version and generally unhelpful. The kernel should really report such sync failures back to the app on its AIO write mapping, but it seems nothing of the sort happens.


From:Christophe Pettus <xof(at)thebuild(dot)com>
Date:2018-04-03 00:03:39

On Apr 2, 2018, at 16:27, Craig Ringer wrote:

They're undocumented and extremely surprising semantics that are arguably a violation of the POSIX spec for fsync(), or at least a surprising interpretation of it.

Even accepting that (I personally go with surprising over violation, as if my vote counted), it is highly unlikely that we will convince every kernel team to declare "What fools we've been!" and push a change... and even if they did, PostgreSQL can look forward to many years of running on kernels with the broken semantics. Given that, I think the PANIC option is the soundest one, as unappetizing as it is.


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-03 00:05:09

On April 2, 2018 5:03:39 PM PDT, Christophe Pettus wrote:

On Apr 2, 2018, at 16:27, Craig Ringer wrote:

They're undocumented and extremely surprising semantics that are arguably a violation of the POSIX spec for fsync(), or at least a surprising interpretation of it.

Even accepting that (I personally go with surprising over violation, as if my vote counted), it is highly unlikely that we will convince every kernel team to declare "What fools we've been!" and push a change... and even if they did, PostgreSQL can look forward to many years of running on kernels with the broken semantics. Given that, I think the PANIC option is the soundest one, as unappetizing as it is.

Don't we pretty much already have agreement in that? And Craig is the main proponent of it?


From:Christophe Pettus <xof(at)thebuild(dot)com>
Date:2018-04-03 00:07:41

On Apr 2, 2018, at 17:05, Andres Freund wrote:

Don't we pretty much already have agreement in that? And Craig is the main proponent of it?

For sure on the second sentence; the first was not clear to me.


From:Peter Geoghegan <pg(at)bowt(dot)ie>
Date:2018-04-03 00:48:00

On Mon, Apr 2, 2018 at 5:05 PM, Andres Freund wrote:

Even accepting that (I personally go with surprising over violation, as if my vote counted), it is highly unlikely that we will convince every kernel team to declare "What fools we've been!" and push a change... and even if they did, PostgreSQL can look forward to many years of running on kernels with the broken semantics. Given that, I think the PANIC option is the soundest one, as unappetizing as it is.

Don't we pretty much already have agreement in that? And Craig is the main proponent of it?

I wonder how bad it will be in practice if we PANIC. Craig said "This isn't as bad as it seems because AFAICS fsync only returns EIO in cases where we should be stopping the world anyway, and many FSes will do that for us". It would be nice to get more information on that.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-03 01:29:28

On Tue, Apr 3, 2018 at 3:03 AM, Craig Ringer wrote:

I see little benefit to not just PANICing unconditionally on EIO, really. It shouldn't happen, and if it does, we want to be pretty conservative and adopt a data-protective approach.

I'm rather more worried by doing it on ENOSPC. Which looks like it might be necessary from what I recall finding in my test case + kernel code reading. I really don't want to respond to a possibly-transient ENOSPC by PANICing the whole server unnecessarily.

Yeah, it'd be nice to give an administrator the chance to free up some disk space after ENOSPC is reported, and stay up. Running out of space really shouldn't take down the database without warning! The question is whether the data remains in cache and marked dirty, so that retrying is a safe option (since it's potentially gone from our own buffers, so if the OS doesn't have it the only place your committed data can definitely still be found is the WAL... recovery time). Who can tell us? Do we need a per-filesystem answer? Delayed allocation is a somewhat filesystem-specific thing, so maybe. Interestingly, there don't seem to be many operating systems that can report ENOSPC from fsync(), based on a quick scan through some documentation:

POSIX, AIX, HP-UX, FreeBSD, OpenBSD, NetBSD: no
Illumos/Solaris, Linux, macOS: yes

I don't know if macOS really means it or not; it just tells you to see the errors for read(2) and write(2). By the way, speaking of macOS, I was curious to see if the common BSD heritage would show here. Yeah, somewhat. It doesn't appear to keep buffers on writeback error, if this is the right code [1].

[1] https://github.com/apple/darwin-xnu/blob/master/bsd/vfs/vfs_bio.c#L2695


From:Robert Haas <robertmhaas(at)gmail(dot)com>
Date:2018-04-03 02:54:26

On Mon, Apr 2, 2018 at 2:53 PM, Anthony Iliopoulos wrote:

Given precisely that the dirty pages which cannot be written out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).

Like other people here, I think this is 100% unreasonable, starting with "the dirty pages which cannot be written out are practically thrown away". Who decided that was OK, and on the basis of what wording in what specification? I think it's always unreasonable to throw away the user's data. If the writes are going to fail, then let them keep on failing every time. That wouldn't cause any data loss, because we'd never be able to checkpoint, and eventually the user would have to kill the server uncleanly, and that would trigger recovery.

Also, this really does make it impossible to write reliable programs. Imagine that, while the server is running, somebody runs a program which opens a file in the data directory, calls fsync() on it, and closes it. If the fsync() fails, postgres is now borked and has no way of being aware of the problem. If we knew, we could PANIC, but we'll never find out, because the unrelated process ate the error. This is exactly the sort of ill-considered behavior that makes fcntl() locking nearly useless.

Even leaving that aside, a PANIC means a prolonged outage on a production system - it could easily take tens of minutes or longer to run recovery. So saying "oh, just do that" is not really an answer. Sure, we can do it, but it's like trying to lose weight by intentionally eating a tapeworm. Now, it's possible to shorten the checkpoint_timeout so that recovery runs faster, but then performance drops because data has to be fsync()'d more often instead of getting buffered in the OS cache for the maximum possible time. We could also dodge this issue in another way: suppose that when we write a page out, we don't consider it really written until fsync() succeeds. Then we wouldn't need to PANIC if an fsync() fails; we could just re-write the page. Unfortunately, this would also be terrible for performance, for pretty much the same reasons: letting the OS cache absorb lots of dirty blocks and do write-combining is necessary for good performance.

The error reporting is thus consistent with the intended semantics (which are sadly not properly documented). Repeated calls to fsync() simply do not imply that the kernel will retry to writeback the previously-failed pages, so the application needs to be aware of that. Persisting the error at the fsync() level would essentially mean moving application policy into the kernel.

I might accept this argument if I accepted that it was OK to decide that an fsync() failure means you can forget that the write() ever happened in the first place, but it's hard to imagine an application that wants that behavior. If the application didn't care about whether the bytes really got to disk or not, it would not have called fsync() in the first place. If it does care, reporting the error only once is never an improvement.


From:Peter Geoghegan <pg(at)bowt(dot)ie>
Date:2018-04-03 03:45:30

On Mon, Apr 2, 2018 at 7:54 PM, Robert Haas wrote:

Also, this really does make it impossible to write reliable programs. Imagine that, while the server is running, somebody runs a program which opens a file in the data directory, calls fsync() on it, and closes it. If the fsync() fails, postgres is now borked and has no way of being aware of the problem. If we knew, we could PANIC, but we'll never find out, because the unrelated process ate the error. This is exactly the sort of ill-considered behavior that makes fcntl() locking nearly useless.

I fear that the conventional wisdom from the Kernel people is now "you should be using O_DIRECT for granular control". The LWN article Thomas linked (https://lwn.net/Articles/718734) cites Ted Ts'o:

"Monakhov asked why a counter was needed; Layton said it was to handle multiple overlapping writebacks. Effectively, the counter would record whether a writeback had failed since the file was opened or since the last fsync(). Ts'o said that should be fine; applications that want more information should use O_DIRECT. For most applications, knowledge that an error occurred somewhere in the file is all that is necessary; applications that require better granularity already use O_DIRECT."


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-03 10:35:39

Hi Robert,

On Mon, Apr 02, 2018 at 10:54:26PM -0400, Robert Haas wrote:

On Mon, Apr 2, 2018 at 2:53 PM, Anthony Iliopoulos wrote:

Given precisely that the dirty pages which cannot be written out are practically thrown away, the semantics of fsync() (after the 4.13 fixes) are essentially correct: the first call indicates that a writeback error indeed occurred, while subsequent calls have no reason to indicate an error (assuming no other errors occurred in the meantime).

Like other people here, I think this is 100% unreasonable, starting with "the dirty pages which cannot be written out are practically thrown away". Who decided that was OK, and on the basis of what wording in what specification? I think it's always unreasonable to

If you insist on strict conformance to POSIX, indeed the linux glibc configuration and associated manpage are probably wrong in stating that _POSIX_SYNCHRONIZED_IO is supported. The implementation matches the flexibility allowed by not supporting SIO. There's a long history of brokenness between linux and posix, and I think there was never an intention of conforming to the standard.

throw away the user's data. If the writes are going to fail, then let them keep on failing every time. That wouldn't cause any data loss, because we'd never be able to checkpoint, and eventually the user would have to kill the server uncleanly, and that would trigger recovery.

I believe (as tried to explain earlier) there is a certain assumption being made that the writer and original owner of data is responsible for dealing with potential errors in order to avoid data loss (which should be only of interest to the original writer anyway). It would be very questionable for the interface to persist the error while subsequent writes and fsyncs to different offsets may as well go through. Another process may need to write into the file and fsync and, being unaware of those newly introduced semantics, is now faced with EIO because some unrelated previous process failed some earlier writes and did not bother to clear the error for those writes. In a similar scenario where the second process is aware of the new semantics, it would naturally go ahead and clear the global error in order to proceed with its own write()+fsync(), which would essentially amount to the same problematic semantics you have now.

Also, this really does make it impossible to write reliable programs. Imagine that, while the server is running, somebody runs a program which opens a file in the data directory, calls fsync() on it, and closes it. If the fsync() fails, postgres is now borked and has no way of being aware of the problem. If we knew, we could PANIC, but we'll never find out, because the unrelated process ate the error. This is exactly the sort of ill-considered behavior that makes fcntl() locking nearly useless.

Fully agree, and the errseq_t fixes have dealt exactly with the issue of making sure that the error is reported to all file descriptors that happen to be open at the time of error. But I think one would have a hard time defending a modification to the kernel where this is further extended to cover cases where:

process A does write() on some file offset which fails writeback, fsync() gets EIO and exit()s.

process B does write() on some other offset which succeeds writeback, but fsync() gets EIO due to (uncleared) failures of earlier process.

This would be a highly user-visible change of semantics from edge-triggered to level-triggered behavior.

dodge this issue in another way: suppose that when we write a page out, we don't consider it really written until fsync() succeeds. Then

That's the only way to think about fsync() guarantees unless you are on a kernel that keeps retrying to persist dirty pages. Assuming such a model, after repeated and unrecoverable hard failures the process would have to explicitly inform the kernel to drop the dirty pages. All the process could do at that point is read back to userspace the dirty/failed pages and attempt to rewrite them at a different place (which is currently possible too). Most applications would not bother though to inform the kernel and drop the permanently failed pages; and thus someone eventually would hit the case that a large number of failed writeback pages are running his server out of memory, at which point people will complain that those semantics are completely unreasonable.

we wouldn't need to PANIC if an fsync() fails; we could just re-write the page. Unfortunately, this would also be terrible for performance, for pretty much the same reasons: letting the OS cache absorb lots of dirty blocks and do write-combining is necessary for good performance.

Not sure I understand this case. The application may indeed re-write a bunch of pages that have failed and proceed with fsync(). The kernel will deal with combining the writeback of all the re-written pages. But further the necessity of combining for performance really depends on the exact storage medium. At the point you start caring about write-combining, the kernel community will naturally redirect you to use DIRECT_IO.

The error reporting is thus consistent with the intended semantics (which are sadly not properly documented). Repeated calls to fsync() simply do not imply that the kernel will retry to writeback the previously-failed pages, so the application needs to be aware of that. Persisting the error at the fsync() level would essentially mean moving application policy into the kernel.

I might accept this argument if I accepted that it was OK to decide that an fsync() failure means you can forget that the write() ever happened in the first place, but it's hard to imagine an application that wants that behavior. If the application didn't care about whether the bytes really got to disk or not, it would not have called fsync() in the first place. If it does care, reporting the error only once is never an improvement.

Again, conflating two separate issues, that of buffering and retrying failed pages and that of error reporting. Yes it would be convenient for applications not to have to care at all about recovery of failed write-backs, but at some point they would have to face this issue one way or another (I am assuming we are always talking about hard failures, other kinds of failures are probably already being dealt with transparently at the kernel level).

As for the reporting, it is also unreasonable to effectively signal and persist an error on a file-wide granularity while it pertains to subsets of that file and other writes can go through, but I am repeating myself.

I suppose that if the check-and-clear semantics are problematic for Pg, one could suggest a kernel patch that opts-in to a level-triggered reporting of fsync() on a per-descriptor basis, which seems to be non-intrusive and probably sufficient to cover your expected use-case.


From:Greg Stark <stark(at)mit(dot)edu>
Date:2018-04-03 11:26:05

On 3 April 2018 at 11:35, Anthony Iliopoulos wrote:

Hi Robert,

Fully agree, and the errseq_t fixes have dealt exactly with the issue of making sure that the error is reported to all file descriptors that happen to be open at the time of error. But I think one would have a hard time defending a modification to the kernel where this is further extended to cover cases where:

process A does write() on some file offset which fails writeback, fsync() gets EIO and exit()s.

process B does write() on some other offset which succeeds writeback, but fsync() gets EIO due to (uncleared) failures of earlier process.

Surely that's exactly what process B would want? If it calls fsync and gets a success and later finds out that the file is corrupt and didn't match what was in memory it's not going to be happy.

This seems like an attempt to co-opt fsync for a new and different purpose for which it's poorly designed. It's not an async error reporting mechanism for writes. It would be useless as that as any process could come along and open your file and eat the errors for writes you performed. An async error reporting mechanism would have to document which writes it was giving errors for and give you ways to control that.

The semantics described here are useless for everyone. For a program needing to know the error status of the writes it executed, it doesn't know which writes are included in which fsync call. For a program using fsync for its original intended purpose of guaranteeing that all writes are synced to disk, it no longer has any guarantee at all.

This would be a highly user-visible change of semantics from edge-triggered to level-triggered behavior.

It was always documented as level-triggered. This edge-triggered concept is a complete surprise to application writers.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-03 13:36:47

On Tue, Apr 03, 2018 at 12:26:05PM +0100, Greg Stark wrote:

On 3 April 2018 at 11:35, Anthony Iliopoulos wrote:

Hi Robert,

Fully agree, and the errseq_t fixes have dealt exactly with the issue of making sure that the error is reported to all file descriptors that happen to be open at the time of error. But I think one would have a hard time defending a modification to the kernel where this is further extended to cover cases where:

process A does write() on some file offset which fails writeback, fsync() gets EIO and exit()s.

process B does write() on some other offset which succeeds writeback, but fsync() gets EIO due to (uncleared) failures of earlier process.

Surely that's exactly what process B would want? If it calls fsync and gets a success and later finds out that the file is corrupt and didn't match what was in memory it's not going to be happy.

You can't possibly make this assumption. Process B may be reading and writing to completely disjoint regions from those of process A, and as such not really caring about earlier failures, only wanting to ensure its own writes go all the way through. But even if it did care, the file interfaces make no transactional guarantees. Even without fsync() there is nothing preventing process B from reading dirty pages from process A, and based on their content proceed to its own business and write/persist new data, while process A further modifies the not-yet-flushed pages in-memory before flushing. In this case you'd need explicit synchronization/locking between the processes anyway, so why would fsync() be an exception?

This seems like an attempt to co-opt fsync for a new and different purpose for which it's poorly designed. It's not an async error reporting mechanism for writes. It would be useless as that as any process could come along and open your file and eat the errors for writes you performed. An async error reporting mechanism would have to document which writes it was giving errors for and give you ways to control that.

The errseq_t fixes deal with that; errors will be reported to any process that has an open fd, irrespective of who is the actual caller of the fsync() that may have induced errors. This is anyway required as the kernel may evict dirty pages on its own by doing writeback and as such there needs to be a way to report errors on all open fds.

The semantics described here are useless for everyone. For a program needing to know the error status of the writes it executed, it doesn't know which writes are included in which fsync call. For a program

If EIO persists between invocations until explicitly cleared, a process cannot possibly make any decision as to if it should clear the error and proceed or some other process will need to leverage that without coordination, or which writes actually failed for that matter. We would be back to the case of requiring explicit synchronization between processes that care about this, in which case the processes may as well synchronize over calling fsync() in the first place.

Having an opt-in persisting EIO per-fd would practically be a form of "contract" between "cooperating" processes anyway.

But instead of deconstructing and debating the semantics of the current mechanism, why not come up with the ideal desired form of error reporting/tracking granularity etc., and see how this may be fitted into kernels as a new interface.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-03 14:29:10

On 3 April 2018 at 10:54, Robert Haas wrote:

I think it's always unreasonable to throw away the user's data.

Well, we do that. If a txn aborts, all writes in the txn are discarded.

I think that's perfectly reasonable. Though we also promise an all or nothing effect, we make exceptions even there.

The FS doesn't offer transactional semantics, but the fsync behaviour can be interpreted kind of similarly.

I don't agree with it, but I don't think it's as wholly unreasonable as all that. I think leaving it undocumented is absolutely gobsmacking, and it's dubious at best, but it's not totally insane.

If the writes are going to fail, then let them keep on failing every time.

Like we do, where we require an explicit rollback.

But POSIX may pose issues there, it doesn't really define any interface for that AFAIK. Unless you expect the app to close() and re-open() the file. Replacing one nonstandard issue with another may not be a win.

That wouldn't cause any data loss, because we'd never be able to checkpoint, and eventually the user would have to kill the server uncleanly, and that would trigger recovery.

Yep. That's what I expected to happen on unrecoverable I/O errors. Because, y'know, unrecoverable.

I was stunned to learn it's not so. And I'm even more amazed to learn that ext4's errors=remount-ro apparently doesn't concern itself with mere user data, and may exhibit the same behaviour - I need to rerun my test case on it tomorrow.

Also, this really does make it impossible to write reliable programs.

In the presence of multiple apps interacting on the same file, yes. I think that's a little bit of a stretch though.

For a single app, you can recover by remembering and redoing all the writes you did.

Sucks if your app wants to have multiple processes working together on a file without some kind of journal or WAL, relying on fsync() alone, mind you. But at least we have WAL.

Hrm. I wonder how this interacts with wal_level=minimal.

Even leaving that aside, a PANIC means a prolonged outage on a production system - it could easily take tens of minutes or longer to run recovery. So saying "oh, just do that" is not really an answer. Sure, we can do it, but it's like trying to lose weight by intentionally eating a tapeworm. Now, it's possible to shorten the checkpoint_timeout so that recovery runs faster, but then performance drops because data has to be fsync()'d more often instead of getting buffered in the OS cache for the maximum possible time.

It's also spikier. Users have more issues with latency with short, frequent checkpoints.

We could also dodge this issue in another way: suppose that when we write a page out, we don't consider it really written until fsync() succeeds. Then we wouldn't need to PANIC if an fsync() fails; we could just re-write the page. Unfortunately, this would also be terrible for performance, for pretty much the same reasons: letting the OS cache absorb lots of dirty blocks and do write-combining is necessary for good performance.

Our double-caching is already plenty bad enough anyway, as well.

(Ideally I want to be able to swap buffers between shared_buffers and the OS buffer-cache. Almost like a 2nd level of buffer pinning. When we write out a block, we transfer ownership to the OS. Yeah, I'm dreaming. But we'd sure need to be able to trust the OS not to just forget the block then!)

The error reporting is thus consistent with the intended semantics (which are sadly not properly documented). Repeated calls to fsync() simply do not imply that the kernel will retry to writeback the previously-failed pages, so the application needs to be aware of that. Persisting the error at the fsync() level would essentially mean moving application policy into the kernel.

I might accept this argument if I accepted that it was OK to decide that an fsync() failure means you can forget that the write() ever happened in the first place, but it's hard to imagine an application that wants that behavior. If the application didn't care about whether the bytes really got to disk or not, it would not have called fsync() in the first place. If it does care, reporting the error only once is never an improvement.

Many RDBMSes do just that. It's hardly behaviour unique to the kernel. They report an ERROR on a statement in a txn then go on with life, merrily forgetting that anything was ever wrong.

I agree with PostgreSQL's stance that this is wrong. We require an explicit rollback (or ROLLBACK TO SAVEPOINT) to restore the session to a usable state. This is good.

But we're the odd one out there. Almost everyone else does much like what fsync() does on Linux, report the error and forget it.

In any case, we're not going to get anyone to backpatch a fix for this into all kernels, so we're stuck working around it.

I'll do some testing with ENOSPC tomorrow, propose a patch, report back.


From:Greg Stark <stark(at)mit(dot)edu>
Date:2018-04-03 14:37:30

On 3 April 2018 at 14:36, Anthony Iliopoulos wrote:

If EIO persists between invocations until explicitly cleared, a process cannot possibly make any decision as to if it should clear the error

I still don't understand what "clear the error" means here. The writes still haven't been written out. We don't care about tracking errors, we just care whether all the writes to the file have been flushed to disk. By "clear the error" you mean throw away the dirty pages and revert part of the file to some old data? Why would anyone ever want that?

But instead of deconstructing and debating the semantics of the current mechanism, why not come up with the ideal desired form of error reporting/tracking granularity etc., and see how this may be fitted into kernels as a new interface.

Because Postgres is portable software that won't be able to use some Linux-specific interface. And doesn't really need any granular error reporting system anyways. It just needs to know when all writes have been synced to disk.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-03 16:52:07

On Tue, Apr 03, 2018 at 03:37:30PM +0100, Greg Stark wrote:

On 3 April 2018 at 14:36, Anthony Iliopoulos wrote:

If EIO persists between invocations until explicitly cleared, a process cannot possibly make any decision as to if it should clear the error

I still don't understand what "clear the error" means here. The writes still haven't been written out. We don't care about tracking errors, we just care whether all the writes to the file have been flushed to disk. By "clear the error" you mean throw away the dirty pages and revert part of the file to some old data? Why would anyone ever want that?

It means that the responsibility of recovering the data is passed back to the application. The writes may never be able to be written out. How would a kernel deal with that? Either discard the data (and have the writer acknowledge that) or buffer the data until reboot and simply risk going OOM. It's not what someone would want, but rather what they need to deal with, one way or the other. At least at the application level there's a fighting chance of restoring to a consistent state. The kernel does not have that opportunity.

But instead of deconstructing and debating the semantics of the current mechanism, why not come up with the ideal desired form of error reporting/tracking granularity etc., and see how this may be fitted into kernels as a new interface.

Because Postgres is portable software that won't be able to use some Linux-specific interface. And doesn't really need any granular error

I don't really follow this argument; Pg is admittedly using non-portable interfaces (e.g. sync_file_range()). While it's nice to avoid platform-specific hacks, expecting that the POSIX semantics will be consistent across systems is simply a 90's pipe dream. While it would be lovely to have really consistent interfaces for application writers, this is simply not going to happen any time soon.

And since those problematic semantics of fsync() appear to be prevalent in other systems as well that are not likely to be changed, you cannot rely on the preconception that once buffers are handed over to the kernel you have a guarantee that they will eventually be persisted no matter what. (Why even bother having fsync() in that case? The kernel would eventually evict and write back dirty pages anyway. The point of reporting the error back to the application is to give it a chance to recover - the kernel could repeat "fsync()" itself internally if this would solve anything).

reporting system anyways. It just needs to know when all writes have been synced to disk.

Well, it does know when some writes have not been synced to disk, exactly because the responsibility is passed back to the application. I do realize this puts more burden back on the application, but what would a viable alternative be? Would you rather have a kernel that risks periodically going OOM due to this design decision?


From:Robert Haas <robertmhaas(at)gmail(dot)com>
Date:2018-04-03 21:47:01

On Tue, Apr 3, 2018 at 6:35 AM, Anthony Iliopoulos wrote:

Like other people here, I think this is 100% unreasonable, starting with "the dirty pages which cannot be written out are practically thrown away". Who decided that was OK, and on the basis of what wording in what specification? I think it's always unreasonable to

If you insist on strict conformance to POSIX, indeed the linux glibc configuration and associated manpage are probably wrong in stating that _POSIX_SYNCHRONIZED_IO is supported. The implementation matches the flexibility allowed by not supporting SIO. There's a long history of brokenness between linux and posix, and I think there was never an intention of conforming to the standard.

Well, then the man page probably shouldn't say CONFORMING TO 4.3BSD, POSIX.1-2001, which on the first system I tested, it did. Also, the summary should be changed from the current "fsync, fdatasync - synchronize a file's in-core state with storage device" by adding ", possibly by randomly undoing some of the changes you think you made to the file".

I believe (as tried to explain earlier) there is a certain assumption being made that the writer and original owner of data is responsible for dealing with potential errors in order to avoid data loss (which should be only of interest to the original writer anyway). It would be very questionable for the interface to persist the error while subsequent writes and fsyncs to different offsets may as well go through.

No, that's not questionable at all. fsync() doesn't take any argument saying which part of the file you care about, so the kernel is entirely not entitled to assume it knows to which writes a given fsync() call was intended to apply.

Another process may need to write into the file and fsync and, being unaware of those newly introduced semantics, is now faced with EIO because some unrelated previous process failed some earlier writes and did not bother to clear the error for those writes. In a similar scenario where the second process is aware of the new semantics, it would naturally go ahead and clear the global error in order to proceed with its own write()+fsync(), which would essentially amount to the same problematic semantics you have now.

I don't deny that it's possible that somebody could have an application which is utterly indifferent to the fact that earlier modifications to a file failed due to I/O errors, but is A-OK with that as long as later modifications can be flushed to disk, but I don't think that's a normal thing to want.

Also, this really does make it impossible to write reliable programs. Imagine that, while the server is running, somebody runs a program which opens a file in the data directory, calls fsync() on it, and closes it. If the fsync() fails, postgres is now borked and has no way of being aware of the problem. If we knew, we could PANIC, but we'll never find out, because the unrelated process ate the error. This is exactly the sort of ill-considered behavior that makes fcntl() locking nearly useless.

Fully agree, and the errseq_t fixes have dealt exactly with the issue of making sure that the error is reported to all file descriptors that happen to be open at the time of error.

Well, in PostgreSQL, we have a background process called the checkpointer which is the process that normally does all of the fsync() calls but only a subset of the write() calls. The checkpointer does not, however, necessarily have every file open all the time, so these fixes aren't sufficient to make sure that the checkpointer ever sees an fsync() failure. What you have (or someone has) basically done here is made an undocumented assumption about which file descriptors might care about a particular error, but it just so happens that PostgreSQL has never conformed to that assumption. You can keep on saying the problem is with our assumptions, but it doesn't seem like a very good guess to me to suppose that we're the only program that has ever made them. The documentation for fsync() gives zero indication that it's edge-triggered, and so complaining that people wouldn't like it if it became level-triggered seems like an ex post facto justification for a poorly-chosen behavior: they probably think (as we did prior to a week ago) that it already is.
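
To make the "edge-triggered" behaviour being argued about here concrete, here's a deliberately simplified userspace model of check-and-clear fsync() error reporting. This is an illustration only, not the kernel's errseq_t implementation; the struct and field names are invented for the sketch.

/* Illustrative model of "check and clear" (edge-triggered) fsync() error
 * reporting.  Not kernel code: a simplification to show why a second
 * fsync() can succeed after the first one reported a writeback error. */
#include <stdio.h>

struct mapping  { unsigned wb_err; };                 /* per-file error count */
struct filedesc { struct mapping *m; unsigned seen; };

/* An asynchronous writeback failure bumps the file-wide error count. */
static void writeback_error(struct mapping *m) { m->wb_err++; }

/* fsync() reports an error only if the count advanced since this
 * descriptor last looked, then records that the error has been seen. */
static int do_fsync(struct filedesc *f)
{
    if (f->m->wb_err != f->seen) {
        f->seen = f->m->wb_err;
        return -1;                                    /* EIO, reported once */
    }
    return 0;
}

int main(void)
{
    struct mapping file = { 0 };
    struct filedesc fd = { &file, file.wb_err };
    writeback_error(&file);                           /* writeback fails    */
    printf("first fsync:  %d\n", do_fsync(&fd));      /* -1: error reported */
    printf("second fsync: %d\n", do_fsync(&fd));      /*  0: edge consumed  */
    return 0;
}

Level-triggered reporting, by contrast, would keep returning the error on every fsync() until something explicitly cleared it, which is what the PostgreSQL side of the thread expected.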

Not sure I understand this case. The application may indeed re-write a bunch of pages that have failed and proceed with fsync(). The kernel will deal with combining the writeback of all the re-written pages. But further the necessity of combining for performance really depends on the exact storage medium. At the point you start caring about write-combining, the kernel community will naturally redirect you to use DIRECT_IO.

Well, the way PostgreSQL works today, we typically run with say 8GB of shared_buffers even if the system memory is, say, 200GB. As pages are evicted from our relatively small cache to the operating system, we track which files need to be fsync()'d at checkpoint time, but we don't hold onto the blocks. Until checkpoint time, the operating system is left to decide whether it's better to keep caching the dirty blocks (thus leaving less memory for other things, but possibly allowing write-combining if the blocks are written again) or whether it should clean them to make room for other things. This means that only a small portion of the operating system memory is directly managed by PostgreSQL, while allowing the effective size of our cache to balloon to some very large number if the system isn't under heavy memory pressure.

Now, I hear the DIRECT_IO thing and I assume we're eventually going to have to go that way: Linux kernel developers seem to think that "real men use O_DIRECT" and so if other forms of I/O don't provide useful guarantees, well that's our fault for not using O_DIRECT. That's a political reason, not a technical reason, but it's a reason all the same.

Unfortunately, that is going to add a huge amount of complexity, because if we ran with shared_buffers set to a large percentage of system memory, we couldn't allocate large chunks of memory for sorts and hash tables from the operating system any more. We'd have to allocate it from our own shared_buffers because that's basically all the memory there is and using substantially more might run the system out entirely. So it's a huge, huge architectural change. And even once it's done it is in some ways inferior to what we are doing today -- true, it gives us superior control over writeback timing, but it also makes PostgreSQL play less nicely with other things running on the same machine, because now PostgreSQL has a dedicated chunk of whatever size it has, rather than using some portion of the OS buffer cache that can grow and shrink according to memory needs both of other parts of PostgreSQL and other applications on the system.

I suppose that if the check-and-clear semantics are problematic for Pg, one could suggest a kernel patch that opts-in to a level-triggered reporting of fsync() on a per-descriptor basis, which seems to be non-intrusive and probably sufficient to cover your expected use-case.

That would certainly be better than nothing.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-03 23:59:27

On Tue, Apr 3, 2018 at 1:29 PM, Thomas Munro wrote:

Interestingly, there don't seem to be many operating systems that can report ENOSPC from fsync(), based on a quick scan through some documentation:

POSIX, AIX, HP-UX, FreeBSD, OpenBSD, NetBSD: no
Illumos/Solaris, Linux, macOS: yes

Oops, reading comprehension fail. POSIX yes (since issue 5), via the note that read() and write()'s error conditions can also be returned.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-04 00:56:37

On Tue, Apr 3, 2018 at 05:47:01PM -0400, Robert Haas wrote:

Well, in PostgreSQL, we have a background process called the checkpointer which is the process that normally does all of the fsync() calls but only a subset of the write() calls. The checkpointer does not, however, necessarily have every file open all the time, so these fixes aren't sufficient to make sure that the checkpointer ever sees an fsync() failure.

There has been a lot of focus in this thread on the workflow:

write() -> blocks remain in kernel memory -> fsync() -> panic?

But what happens in this workflow:

write() -> kernel syncs blocks to storage -> fsync()

Is fsync() going to see a "kernel syncs blocks to storage" failure?

There was already discussion that if the fsync() causes the "syncs blocks to storage", fsync() will only report the failure once, but will it see any failure in the second workflow? There is indication that a failed write to storage reports back an error once and clears the dirty flag, but do we know it keeps things around long enough to report an error to a future fsync()?

You would think it does, but I have to ask since our fsync() assumptions have been wrong for so long.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-04 01:54:50

On Wed, Apr 4, 2018 at 12:56 PM, Bruce Momjian wrote:

There has been a lot of focus in this thread on the workflow:

    write() -> blocks remain in kernel memory -> fsync() -> panic?

But what happens in this workflow:

    write() -> kernel syncs blocks to storage -> fsync()

Is fsync() going to see a "kernel syncs blocks to storage" failure?

There was already discussion that if the fsync() causes the "syncs blocks to storage", fsync() will only report the failure once, but will it see any failure in the second workflow? There is indication that a failed write to storage reports back an error once and clears the dirty flag, but do we know it keeps things around long enough to report an error to a future fsync()?

You would think it does, but I have to ask since our fsync() assumptions have been wrong for so long.

I believe there were some problems of that nature (with various twists, based on other concurrent activity and possibly different fds), and those problems were fixed by the errseq_t system developed by Jeff Layton in Linux 4.13. Call that "bug #1".

The second issue is that the pages are marked clean after the error is reported, so further attempts to fsync() the data (in our case for a new attempt to checkpoint) will be futile but appear successful. Call that "bug #2", with the proviso that some people apparently think it's reasonable behaviour and not a bug. At least there is a plausible workaround for that: namely the nuclear option proposed by Craig.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-04 02:05:19

On Wed, Apr 4, 2018 at 01:54:50PM +1200, Thomas Munro wrote:

On Wed, Apr 4, 2018 at 12:56 PM, Bruce Momjian wrote:

There has been a lot of focus in this thread on the workflow:

    write() -> blocks remain in kernel memory -> fsync() -> panic?

But what happens in this workflow:

    write() -> kernel syncs blocks to storage -> fsync()

Is fsync() going to see a "kernel syncs blocks to storage" failure?

There was already discussion that if the fsync() causes the "syncs blocks to storage", fsync() will only report the failure once, but will it see any failure in the second workflow? There is indication that a failed write to storage reports back an error once and clears the dirty flag, but do we know it keeps things around long enough to report an error to a future fsync()?

You would think it does, but I have to ask since our fsync() assumptions have been wrong for so long.

I believe there were some problems of that nature (with various twists, based on other concurrent activity and possibly different fds), and those problems were fixed by the errseq_t system developed by Jeff Layton in Linux 4.13. Call that "bug #1".

So all our non-cutting-edge Linux systems are vulnerable and there is no workaround Postgres can implement? Wow.

The second issue is that the pages are marked clean after the error is reported, so further attempts to fsync() the data (in our case for a new attempt to checkpoint) will be futile but appear successful. Call that "bug #2", with the proviso that some people apparently think it's reasonable behaviour and not a bug. At least there is a plausible workaround for that: namely the nuclear option proposed by Craig.

Yes, that one I understood.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-04 02:14:28

On Tue, Apr 3, 2018 at 10:05:19PM -0400, Bruce Momjian wrote:

On Wed, Apr 4, 2018 at 01:54:50PM +1200, Thomas Munro wrote:

I believe there were some problems of that nature (with various twists, based on other concurrent activity and possibly different fds), and those problems were fixed by the errseq_t system developed by Jeff Layton in Linux 4.13. Call that "bug #1".

So all our non-cutting-edge Linux systems are vulnerable and there is no workaround Postgres can implement? Wow.

Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file after the failure and try to fsync a write that happened before the failure.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-04 02:40:16

On 4 April 2018 at 05:47, Robert Haas wrote:

Now, I hear the DIRECT_IO thing and I assume we're eventually going to have to go that way: Linux kernel developers seem to think that "real men use O_DIRECT" and so if other forms of I/O don't provide useful guarantees, well that's our fault for not using O_DIRECT. That's a political reason, not a technical reason, but it's a reason all the same.

I looked into buffered AIO a while ago, by the way, and just ... hell no. Run, run as fast as you can.

The trouble with direct I/O is that it pushes a lot of work back on PostgreSQL regarding knowledge of the storage subsystem, I/O scheduling, etc. It's absurd to have the kernel do this, unless you want it reliable, in which case you bypass it and drive the hardware directly.

We'd need pools of writer threads to deal with all the blocking I/O. It'd be such a nightmare. Hey, why bother having a kernel at all, except for drivers?


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-04 02:44:22

On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian wrote:

On Tue, Apr 3, 2018 at 10:05:19PM -0400, Bruce Momjian wrote:

On Wed, Apr 4, 2018 at 01:54:50PM +1200, Thomas Munro wrote:

I believe there were some problems of that nature (with various twists, based on other concurrent activity and possibly different fds), and those problems were fixed by the errseq_t system developed by Jeff Layton in Linux 4.13. Call that "bug #1".

So all our non-cutting-edge Linux systems are vulnerable and there is no workaround Postgres can implement? Wow.

Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file after the failure and try to fsync a write that happened before the failure.

I'm not sure of anything. I can see that it's designed to report errors since the last fsync() of the file (presumably via any fd), which sounds like the desired behaviour:

https://github.com/torvalds/linux/blob/master/mm/filemap.c#L682

When userland calls fsync (or something like nfsd does the equivalent), we want to report any writeback errors that occurred since the last fsync (or since the file was opened if there haven't been any).

But I'm not sure what the lifetime of the passed-in "file" and more importantly "file->f_wb_err" is. Specifically, what happens to it if no one has the file open at all, between operations? It is reference counted, see fs/file_table.c. I don't know enough about it to comment.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-04 05:29:28

On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro wrote:

On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian wrote:

Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file after the failure and try to fsync a write that happened before the failure.

I'm not sure of anything. I can see that it's designed to report errors since the last fsync() of the file (presumably via any fd), which sounds like the desired behaviour:

[..]

Scratch that. Whenever you open a file descriptor you can't see any preceding errors at all, because:

/* Ensure that we skip any errors that predate opening of the file */
f->f_wb_err = filemap_sample_wb_err(f->f_mapping);

https://github.com/torvalds/linux/blob/master/fs/open.c#L752

Our whole design is based on being able to open, close and reopen files at will from any process, and in particular to fsync() from a different process that didn't inherit the fd but instead opened it later. But it looks like that might be able to eat errors that occurred during asynchronous writeback (when there was nobody to report them to), before you opened the file?
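
A minimal sketch of that open-close-reopen pattern, assuming a device whose asynchronous writeback fails after the writer exits (on a healthy disk this just succeeds; it only illustrates the call sequence, and /tmp/datafile is a stand-in path):

/* "Backend" writes and closes without fsync; writeback fails while nobody
 * has the file open; the "checkpointer" opens it later and fsync()s.
 * Under the open()-time sampling above, that fsync() can return 0 even
 * though the data never reached storage. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/datafile";
    char buf[8192];
    memset(buf, 'x', sizeof buf);

    if (fork() == 0) {                      /* backend: write, close, exit */
        int wfd = open(path, O_WRONLY | O_CREAT, 0644);
        write(wfd, buf, sizeof buf);        /* dirties the page cache      */
        close(wfd);                         /* no fsync here               */
        _exit(0);
    }
    wait(NULL);

    /* ... time passes; asynchronous writeback of those pages fails ... */

    int cfd = open(path, O_WRONLY);         /* checkpointer: opens later   */
    printf("fsync: %d\n", fsync(cfd));      /* 0: the error predates open()*/
    close(cfd);
    return 0;
}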

If so I'm not sure how that can possibly be considered to be an implementation of _POSIX_SYNCHRONIZED_IO: "the fsync() function shall force all currently queued I/O operations associated with the file indicated by file descriptor fildes to the synchronized I/O completion state." Note "the file", not "this file descriptor + copies", and without reference to when you opened it.

But I'm not sure what the lifetime of the passed-in "file" and more importantly "file->f_wb_err" is.

It's really inode->i_mapping->wb_err's lifetime that I should have been asking about there, not file->f_wb_err, but I see now that that question is irrelevant due to the above.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-04 06:00:21

On 4 April 2018 at 13:29, Thomas Munro wrote:

On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro wrote:

On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian wrote:

Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file after the failure and try to fsync a write that happened before the failure.

I'm not sure of anything. I can see that it's designed to report errors since the last fsync() of the file (presumably via any fd), which sounds like the desired behaviour:

[..]

Scratch that. Whenever you open a file descriptor you can't see any preceding errors at all, because:

/* Ensure that we skip any errors that predate opening of the file */ f->f_wb_err = filemap_sample_wb_err(f->f_mapping);

https://github.com/torvalds/linux/blob/master/fs/open.c#L752

Our whole design is based on being able to open, close and reopen files at will from any process, and in particular to fsync() from a different process that didn't inherit the fd but instead opened it later. But it looks like that might be able to eat errors that occurred during asynchronous writeback (when there was nobody to report them to), before you opened the file?

Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?

I'll see if I can expand my testcase for that. I'm presently dockerizing it to make it easier for others to use, but that turns out to be a major pain when using devmapper etc. Docker in privileged mode doesn't seem to play nice with device-mapper.

Does that mean that the ONLY ways to do reliable I/O are:

?


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-04 07:32:04

On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer wrote:

On 4 April 2018 at 13:29, Thomas Munro wrote:

/* Ensure that we skip any errors that predate opening of the file */ f->f_wb_err = filemap_sample_wb_err(f->f_mapping);

[...]

Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?

Predates the opening of the file by the process that calls fsync(). Yeah, it sure looks that way based on the above code fragment. Does anyone know better?

Does that mean that the ONLY ways to do reliable I/O are:

I suppose you could come up with some crazy complicated IPC scheme to make sure that the checkpointer always has an fd older than any writes to be flushed, with some fallback strategy for when it can't take any more fds.

I haven't got any good ideas right now.

As a bit of an aside, I gather that when you resize files (think truncating/extending relation files) you still need to call fsync() even if you read/write all data with O_DIRECT, to make it flush the filesystem meta-data. I have no idea if that could also be affected by eaten writeback errors.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-04 07:51:53

On 4 April 2018 at 14:00, Craig Ringer wrote:

On 4 April 2018 at 13:29, Thomas Munro wrote:

On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro wrote:

On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian wrote:

Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file after the failure and try to fsync a write that happened before the failure.

I'm not sure of anything. I can see that it's designed to report errors since the last fsync() of the file (presumably via any fd), which sounds like the desired behaviour:

[..]

Scratch that. Whenever you open a file descriptor you can't see any preceding errors at all, because:

/* Ensure that we skip any errors that predate opening of the file */ f->f_wb_err = filemap_sample_wb_err(f->f_mapping);

https://github.com/torvalds/linux/blob/master/fs/open.c#L752

Our whole design is based on being able to open, close and reopen files at will from any process, and in particular to fsync() from a different process that didn't inherit the fd but instead opened it later. But it looks like that might be able to eat errors that occurred during asynchronous writeback (when there was nobody to report them to), before you opened the file?

Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?

I'll see if I can expand my testcase for that. I'm presently dockerizing it to make it easier for others to use, but that turns out to be a major pain when using devmapper etc. Docker in privileged mode doesn't seem to play nice with device-mapper.

Done, you can find it in https://github.com/ringerc/scrapcode/tree/master/testcases/fsync-error-clear now.

Warning, this runs a Docker container in privileged mode on your system, and it uses devicemapper. Read it before you run it, and while I've tried to keep it safe, beware that it might eat your system.

For now it tests only xfs and EIO. Other FSs should be easy enough.

I haven't added coverage for multi-processing yet, but given what you found above, I should. I'll probably just system() a copy of the same proc with instructions to only fsync(). I'll do that next.

I haven't worked out a reliable way to trigger ENOSPC on fsync() yet, when mapping without the error hole. It happens sometimes but I don't know why; it almost always happens on write() instead. I know it can happen on nfs, but I'm hoping for a saner example than that to test with. ext4 and xfs do delayed allocation but eager reservation, so it shouldn't happen to them.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-04 13:49:38

On Wed, Apr 4, 2018 at 07:32:04PM +1200, Thomas Munro wrote:

On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer wrote:

On 4 April 2018 at 13:29, Thomas Munro wrote:

/* Ensure that we skip any errors that predate opening of the file */ f->f_wb_err = filemap_sample_wb_err(f->f_mapping);

[...]

Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?

Predates the opening of the file by the process that calls fsync(). Yeah, it sure looks that way based on the above code fragment. Does anyone know better?

Uh, just to clarify, what is new here is that it is ignoring any errors that happened before the open(). It is not ignoring write()'s that happened but have not been written to storage before the open().

FYI, pg_test_fsync has always tested the ability to fsync() write()s from other processes:

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
    write, fsync, close                5360.341 ops/sec     187 usecs/op
    write, close, fsync                4785.240 ops/sec     209 usecs/op

Those two numbers should be similar. I added this as a check to make sure the behavior we were relying on was working. I never tested sync errors though.
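
The two sequences being compared are roughly the following (a sketch rather than pg_test_fsync's actual code; error handling is omitted and the path is a placeholder). If the second takes about as long as the first, the fsync() on the freshly opened descriptor really is flushing data written through the earlier one.

/* Sketch of the two sequences pg_test_fsync times. */
#include <fcntl.h>
#include <unistd.h>

static void write_fsync_close(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    write(fd, buf, len);
    fsync(fd);                  /* sync via the same descriptor      */
    close(fd);
}

static void write_close_fsync(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    write(fd, buf, len);
    close(fd);
    fd = open(path, O_WRONLY);  /* a different, later descriptor ... */
    fsync(fd);                  /* ... should still flush that data  */
    close(fd);
}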

I think the fundamental issue is that we always assumed that writes to the kernel that could not be written to storage would remain in the kernel until they succeeded, and that fsync() would report their existence.

I can understand why kernel developers don't want to keep failed sync buffers in memory, and once they are gone we lose reporting of their failure. Also, if the kernel is going to not retry the syncs, how long should it keep reporting the sync failure? To the first fsync that happens after the failure? How long should it continue to record the failure? What if no fsync() ever happens, which is likely for non-Postgres workloads? I think once they decided to discard failed syncs and not retry them, the fsync behavior we are complaining about was almost required.

Our only option might be to tell administrators to closely watch for kernel write failure messages, and then restore or failover. :-(

The last time I remember being this surprised about storage was in the early Postgres years when we learned that just because the BSD file system uses 8k pages doesn't mean those are atomically written to storage. We knew the operating system wrote the data in 8k chunks to storage but:

This is why pre-page images are written to WAL, which is what full_page_writes controls.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-04 13:53:01

On Wed, Apr 4, 2018 at 10:40:16AM +0800, Craig Ringer wrote:

The trouble with direct I/O is that it pushes a lot of work back on PostgreSQL regarding knowledge of the storage subsystem, I/O scheduling, etc. It's absurd to have the kernel do this, unless you want it reliable, in which case you bypass it and drive the hardware directly.

We'd need pools of writer threads to deal with all the blocking I/O. It'd be such a nightmare. Hey, why bother having a kernel at all, except for drivers?

I believe this is how Oracle views the kernel, so there is precedent for this approach, though I am not advocating it.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-04 14:00:15

On 4 April 2018 at 15:51, Craig Ringer wrote:

On 4 April 2018 at 14:00, Craig Ringer wrote:

On 4 April 2018 at 13:29, Thomas Munro wrote:

On Wed, Apr 4, 2018 at 2:44 PM, Thomas Munro wrote:

On Wed, Apr 4, 2018 at 2:14 PM, Bruce Momjian wrote:

Uh, are you sure it fixes our use-case? From the email description it sounded like it only reported fsync errors for every open file descriptor at the time of the failure, but the checkpoint process might open the file after the failure and try to fsync a write that happened before the failure.

I'm not sure of anything. I can see that it's designed to report errors since the last fsync() of the file (presumably via any fd), which sounds like the desired behaviour:

[..]

Scratch that. Whenever you open a file descriptor you can't see any preceding errors at all, because:

/* Ensure that we skip any errors that predate opening of the file */ f->f_wb_err = filemap_sample_wb_err(f->f_mapping);

https://github.com/torvalds/linux/blob/master/fs/open.c#L752

Our whole design is based on being able to open, close and reopen files at will from any process, and in particular to fsync() from a different process that didn't inherit the fd but instead opened it later. But it looks like that might be able to eat errors that occurred during asynchronous writeback (when there was nobody to report them to), before you opened the file?

Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?

I'll see if I can expand my testcase for that. I'm presently dockerizing it to make it easier for others to use, but that turns out to be a major pain when using devmapper etc. Docker in privileged mode doesn't seem to play nice with device-mapper.

Done, you can find it in https://github.com/ringerc/scrapcode/tree/master/testcases/fsync-error-clear now.

Update. Now supports multiple FSes.

I've tried xfs, jfs, ext3, ext4, even vfat. All behave the same on EIO. Didn't try zfs-on-linux or other platforms yet.

Still working on getting ENOSPC on fsync() rather than write(). Kernel code reading suggests this is possible, but all the above FSes reserve space eagerly on write() even if they do delayed allocation of the actual storage, so it doesn't seem to happen at least in my simple single-process test.

I'm not overly inclined to complain about a fsync() succeeding after a write() error. That seems reasonable enough, the kernel told the app at the time of the failure. What else is it going to do? I don't personally even object hugely to the current fsync() behaviour if it were, say, DOCUMENTED and conformant to the relevant standards, though not giving us any sane way to find out the affected file ranges makes it drastically harder to recover sensibly.

But what's come out since on this thread, that we cannot even rely on fsync() giving us an EIO once when it loses our data, because:

... that's really troubling. I thought we could at least fix this by PANICing on EIO, and was mostly worried about ENOSPC. But now it seems we can't even do that and expect reliability. So what the @#$ are we meant to do?

It's the error reporting issues around closing and reopening files with outstanding buffered I/O that's really going to hurt us here. I'll be expanding my test case to cover that shortly.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-04 14:09:09

On 4 April 2018 at 22:00, Craig Ringer wrote:

It's the error reporting issues around closing and reopening files with outstanding buffered I/O that's really going to hurt us here. I'll be expanding my test case to cover that shortly.

Also, just to be clear, this is not in any way confined to xfs and/or lvm as I originally thought it might be.

Nor is ext3/ext4's errors=remount-ro protective. data_err=abort doesn't help either (so what does it do?).

What bewilders me is that running with data=journal doesn't seem to be safe either. WTF?

[26438.846111] EXT4-fs (dm-0): mounted filesystem with journalled data mode. Opts: errors=remount-ro,data_err=abort,data=journal
[26454.125319] EXT4-fs warning (device dm-0): ext4_end_bio:323: I/O error 10 writing to inode 12 (offset 0 size 0 starting block 59393)
[26454.125326] Buffer I/O error on device dm-0, logical block 59393
[26454.125337] Buffer I/O error on device dm-0, logical block 59394
[26454.125343] Buffer I/O error on device dm-0, logical block 59395
[26454.125350] Buffer I/O error on device dm-0, logical block 59396

and splat, there goes your data anyway.

It's possible that this is in some way related to using the device-mapper "error" target and a loopback device in testing. But I don't really see how.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-04 14:25:47

On Wed, Apr 4, 2018 at 10:09:09PM +0800, Craig Ringer wrote:

On 4 April 2018 at 22:00, Craig Ringer wrote:

It's the error reporting issues around closing and reopening files with
outstanding buffered I/O that's really going to hurt us here. I'll be
expanding my test case to cover that shortly.

Also, just to be clear, this is not in any way confined to xfs and/or lvm as I originally thought it might be.

Nor is ext3/ext4's errors=remount-ro protective. data_err=abort doesn't help either (so what does it do?).

Anthony Iliopoulos reported in this thread that errors=remount-ro is only affected by metadata writes.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-04 14:42:18

On 4 April 2018 at 22:25, Bruce Momjian wrote:

On Wed, Apr 4, 2018 at 10:09:09PM +0800, Craig Ringer wrote:

On 4 April 2018 at 22:00, Craig Ringer wrote:

It's the error reporting issues around closing and reopening files with
outstanding buffered I/O that's really going to hurt us here. I'll be
expanding my test case to cover that shortly.

Also, just to be clear, this is not in any way confined to xfs and/or lvm as I originally thought it might be.

Nor is ext3/ext4's errors=remount-ro protective. data_err=abort doesn't help either (so what does it do?).

Anthony Iliopoulos reported in this thread that errors=remount-ro is only affected by metadata writes.

Yep, I gathered. I was referring to data_err.


From:Antonis Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-04 15:23:31

On Wed, Apr 4, 2018 at 4:42 PM, Craig Ringer wrote:

On 4 April 2018 at 22:25, Bruce Momjian wrote:

On Wed, Apr 4, 2018 at 10:09:09PM +0800, Craig Ringer wrote:

On 4 April 2018 at 22:00, Craig Ringer wrote:

It's the error reporting issues around closing and reopening files with outstanding buffered I/O that's really going to hurt us here. I'll be expanding my test case to cover that shortly.

Also, just to be clear, this is not in any way confined to xfs and/or lvm as I originally thought it might be.

Nor is ext3/ext4's errors=remount-ro protective. data_err=abort doesn't help either (so what does it do?).

Anthony Iliopoulos reported in this thread that errors=remount-ro is only affected by metadata writes.

Yep, I gathered. I was referring to data_err.

As far as I recall data_err=abort pertains to the jbd2 handling of potential writeback errors. Jbd2 will internally attempt to drain the data upon txn commit (and it's even kind enough to restore the EIO at the address space level, which otherwise would get eaten).

When data_err=abort is set, then jbd2 forcibly shuts down the entire journal, with the error being propagated upwards to ext4. I am not sure at which point this would be manifested to userspace and how, but in principle any subsequent fs operations would get some filesystem error due to the journal being down (I would assume similar to remounting the fs read-only).

Since you are using data=journal, I would indeed expect to see something more than what you saw in dmesg.

I can have a look later, I plan to also respond to some of the other interesting issues that you guys raised in the thread.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-04 15:23:51

On 4 April 2018 at 21:49, Bruce Momjian wrote:

On Wed, Apr 4, 2018 at 07:32:04PM +1200, Thomas Munro wrote:

On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer wrote:

On 4 April 2018 at 13:29, Thomas Munro wrote:

/* Ensure that we skip any errors that predate opening of the file */ f->f_wb_err = filemap_sample_wb_err(f->f_mapping);

[...]

Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel will deliberately hide writeback errors that predate our fsync() call from us?

Predates the opening of the file by the process that calls fsync(). Yeah, it sure looks that way based on the above code fragment. Does anyone know better?

Uh, just to clarify, what is new here is that it is ignoring any errors that happened before the open(). It is not ignoring write()'s that happened but have not been written to storage before the open().

FYI, pg_test_fsync has always tested the ability to fsync() write()s from other processes:

    Test if fsync on non-write file descriptor is honored:
    (If the times are similar, fsync() can sync data written on a different
    descriptor.)
        write, fsync, close                5360.341 ops/sec     187 usecs/op
        write, close, fsync                4785.240 ops/sec     209 usecs/op

Those two numbers should be similar. I added this as a check to make sure the behavior we were relying on was working. I never tested sync errors though.

I think the fundamental issue is that we always assumed that writes to the kernel that could not be written to storage would remain in the kernel until they succeeded, and that fsync() would report their existence.

I can understand why kernel developers don't want to keep failed sync buffers in memory, and once they are gone we lose reporting of their failure. Also, if the kernel is going to not retry the syncs, how long should it keep reporting the sync failure?

Ideally until the app tells it not to.

But there's no standard API for that.

The obvious answer seems to be "until the FD is closed". But we just discussed how Pg relies on being able to open and close files freely. That may not be as reasonable a thing to do as we thought it was when you consider error reporting. What's the kernel meant to do? How long should it remember "I had an error while doing writeback on this file"? Should it flag the file metadata and remember across reboots? Obviously not, but where does it stop? Tell the next program that does an fsync() and forget? How could it associate a dirty buffer on a file with no open FDs with any particular program at all? And what if the app did a write then closed the file and went away, never to bother to check the file again, like most apps do?

Some I/O errors are transient (network issue, etc). Some are recoverable with some sort of action, like disk space issues, but may take a long time before an admin steps in. Some are entirely unrecoverable (disk 1 in striped array is on fire) and there's no possible recovery. Currently we kind of hope the kernel will deal with figuring out which is which and retrying. Turns out it doesn't do that so much, and I don't think the reasons for that are wholly unreasonable. We may have been asking too much.

That does leave us in a pickle when it comes to the checkpointer and opening/closing FDs. I don't know what the "right" thing for the kernel to do from our perspective even is here, but the best I can come up with is actually pretty close to what it does now. Report the fsync() error to the first process that does an fsync() since the writeback error if one has occurred, then forget about it. Ideally I'd have liked it to mark all FDs pointing to the file with a flag to report EIO on next fsync too, but it turns out that won't even help us due to our opening and closing behaviour, so we're going to have to take responsibility for handling and communicating that ourselves, preventing checkpoint completion if any backend gets an fsync error. Probably by PANICing. Some extra work may be needed to ensure reliable ordering and stop checkpoints completing if their fsync() succeeds due to a recent failed fsync() on a normal backend that hasn't PANICed or where the postmaster hasn't noticed yet.
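
A rough sketch of the "PANIC on fsync() failure" handling being discussed, in plain C rather than PostgreSQL's actual error-reporting machinery: treat an fsync() failure during a checkpoint as fatal rather than retryable, so crash recovery replays the WAL instead of trusting a later fsync() that may falsely report success.

/* Sketch only: promote fsync() failure to a crash instead of retrying.
 * In PostgreSQL terms this would be a PANIC; here it's abort(). */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static void checkpoint_fsync_or_die(int fd, const char *path)
{
    if (fsync(fd) != 0) {
        /* Do not retry: the error may never be reported again and the
         * dirty pages may already have been marked clean and dropped. */
        fprintf(stderr, "PANIC: could not fsync %s: %s\n",
                path, strerror(errno));
        abort();                /* crash so WAL replay redoes the writes */
    }
}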

Our only option might be to tell administrators to closely watch for kernel write failure messages, and then restore or failover. :-(

Speaking of, there's not necessarily any lost page write error in the logs AFAICS. My tests often just show "Buffer I/O error on device dm-0, logical block 59393" or the like.


From:Gasper Zejn <zejn(at)owca(dot)info>
Date:2018-04-04 17:23:58

On 04. 04. 2018 15:49, Bruce Momjian wrote:

I can understand why kernel developers don't want to keep failed sync buffers in memory, and once they are gone we lose reporting of their failure. Also, if the kernel is going to not retry the syncs, how long should it keep reporting the sync failure? To the first fsync that happens after the failure? How long should it continue to record the failure? What if no fsync() ever happens, which is likely for non-Postgres workloads? I think once they decided to discard failed syncs and not retry them, the fsync behavior we are complaining about was almost required.

Ideally the kernel would keep its data for as little time as possible. With fsync, it doesn't really know which process is interested in knowing about a write error; it just assumes the caller will know how to deal with it. The most unfortunate issue is that there's no way to get information about a write error.

Thinking aloud - couldn't/shouldn't a write error also be a file system event reported by inotify? Admittedly that's only a thing on Linux, but still.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-04 17:51:03

On Wed, Apr 4, 2018 at 11:23:51PM +0800, Craig Ringer wrote:

On 4 April 2018 at 21:49, Bruce Momjian wrote:

I can understand why kernel developers don't want to keep failed sync buffers in memory, and once they are gone we lose reporting of their failure. Also, if the kernel is going to not retry the syncs, how long should it keep reporting the sync failure?

Ideally until the app tells it not to.

But there's no standard API for that.

You would almost need an API that registers before the failure that you care about sync failures, and that you plan to call fsync() to gather such information. I am not sure how you would allow more than the first fsync() to see the failure unless you added another API to clear the fsync failure, but I don't see the point since the first fsync() might call that clear function. How many applications are going to know there is another application that cares about the failure? Not many.

Currently we kind of hope the kernel will deal with figuring out which is which and retrying. Turns out it doesn't do that so much, and I don't think the reasons for that are wholly unreasonable. We may have been asking too much.

Agreed.

Our only option might be to tell administrators to closely watch for kernel write failure messages, and then restore or failover. :-(

Speaking of, there's not necessarily any lost page write error in the logs AFAICS. My tests often just show "Buffer I/O error on device dm-0, logical block 59393" or the like.

I assume that is the kernel logs. I am thinking the kernel logs have to be monitored, but how many administrators do that? The other issue I think you are pointing out is how is the administrator going to know this is a Postgres file? I guess any sync error to a device that contains Postgres has to assume Postgres is corrupted. :-(


see explicit treatment of retrying, though I'm not entirely sure if the retry flag is set just for async write-back), and apparently unlike every other kernel I've tried to grok so far (things descended from ancestral BSD but not descended from FreeBSD, with macOS/Darwin apparently in the first category for this purpose).

Here's a new ticket in the NetBSD bug database for this stuff:

http://gnats.netbsd.org/53152

As mentioned in that ticket and by Andres earlier in this thread, keeping the page dirty isn't the only strategy that would work and may be problematic in different ways (it tells the truth but floods your cache with unflushable stuff until eventually you force unmount it and your buffers are eventually invalidated after ENXIO errors? I don't know.). I have no qualified opinion on that. I just know that we need a way for fsync() to tell the truth about all preceding writes or our checkpoints are busted.

*We mmap() + msync() in pg_flush_data() if you don't have sync_file_range(), and I see now that that is probably not a great idea on ZFS because you'll finish up double-buffering (or is that triple-buffering?), flooding your page cache with transient data. Oops. That is off-topic and not relevant for the checkpoint correctness topic of this thread though, since pg_flush_data() is advisory only.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-04 22:14:24

On Thu, Apr 5, 2018 at 9:28 AM, Thomas Munro wrote:

On Thu, Apr 5, 2018 at 2:00 AM, Craig Ringer wrote:

I've tried xfs, jfs, ext3, ext4, even vfat. All behave the same on EIO. Didn't try zfs-on-linux or other platforms yet.

While contemplating what exactly it would do (not sure),

See manual for failmode=wait | continue | panic. Even "continue" returns EIO to all new write requests, so they apparently didn't bother to supply an 'eat-my-data-but-tell-me-everything-is-fine' mode. Figures.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-05 07:09:57

Summary to date:

It's worse than I thought originally, because:

So the checkpointer cannot trust that a successful fsync() means ... a successful fsync().

Also, it's been reported to me off-list that anyone on the system calling sync(2) or the sync shell command will also generally consume the write error, causing us not to see it when we fsync(). The same is true for /proc/sys/vm/drop_caches. I have not tested these yet.

There's some level of agreement that we should PANIC on fsync() errors, at least on Linux, but likely everywhere. But we also now know it's insufficient to be fully protective.

I previously thought that errors=remount-ro was a sufficient safeguard. It isn't. There doesn't seem to be anything that is, for ext3, ext4, btrfs or xfs.

It's not clear to me yet why data_err=abort isn't sufficient in data=ordered or data=writeback mode on ext3 or ext4; needs more digging. (In my test tools that's: make FSTYPE=ext4 MKFSOPTS="" MOUNTOPTS="errors=remount-ro,data_err=abort,data=journal" as of the current version d7fe802ec). AFAICS that's because data_err=abort only affects data=ordered, not data=journal. If you use data=ordered, you at least get retries of the same write failing. This post https://lkml.org/lkml/2008/10/10/80 added the option and has some explanation, but doesn't explain why it doesn't affect data=journal.

zfs is probably not affected by the issues, per Thomas Munro. I haven't run my test scripts on it yet because my kernel doesn't have zfs support and I'm prioritising the multi-process / open-and-close issues.

So far none of the FSes and options I've tried exhibit the behaviour I actually want, which is to make the fs readonly or inaccessible on I/O error.

ENOSPC doesn't seem to be a concern during normal operation of major file systems (ext3, ext4, btrfs, xfs) because they reserve space before returning from write(). But if a buffered write does manage to fail due to ENOSPC we'll definitely see the same problems. This makes ENOSPC on NFS a potentially data corrupting condition since NFS doesn't preallocate space before returning from write().

I think what we really need is a block-layer fix, where an I/O error flips the block device into read-only mode, as if blockdev --setro had been used. Though I'd settle for a kernel panic, frankly. I don't think anybody really wants this, but I'd rather either of those to silent data loss.

I'm currently tweaking my test to close and reopen the file between each write() and fsync(), and to support running with nfs.

I've also just found the device-mapper "flakey" driver, which looks fantastic for simulating unreliable I/O with intermittent faults. I've been using the "error" target in a mapping, which lets me remap some of the device to always error, but "flakey" looks very handy for actual PostgreSQL testing.

For the sake of Google, these are errors known to be associated with the problem:

ext4, and ext3 mounted with ext4 driver:

[42084.327345] EXT4-fs warning (device dm-0): ext4_end_bio:323: I/O error 10 writing to inode 12 (offset 0 size 0 starting block 59393)
[42084.327352] Buffer I/O error on device dm-0, logical block 59393

xfs:

[42193.771367] XFS (dm-0): writeback error on sector 118784
[42193.784477] XFS (dm-0): writeback error on sector 118784

jfs: (nil, silence in the kernel logs)

You should also beware of "lost page write" or "lost write" errors.

From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-05 08:46:08

On 5 April 2018 at 15:09, Craig Ringer wrote:

Also, it's been reported to me off-list that anyone on the system calling sync(2) or the sync shell command will also generally consume the write error, causing us not to see it when we fsync(). The same is true for /proc/sys/vm/drop_caches. I have not tested these yet.

I just confirmed this with a tweak to the test that

records the file position
close()s the fd
sync()s
open()s the file
lseek()s back to the recorded position

This causes the test to completely ignore the I/O error, which is not reported to it at any time.
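
Transcribed into C, that tweak looks roughly like the following. It's a sketch of the call order only: you need a device that fails writeback (e.g. a device-mapper error target) to actually observe the swallowed error, and /tmp/datafile is a stand-in path.

/* write, remember position, close, sync(), reopen, seek back, fsync.
 * On a device that fails writeback, sync() consumes the error and the
 * final fsync() returns 0. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[8192];
    memset(buf, 'x', sizeof buf);

    int fd = open("/tmp/datafile", O_WRONLY | O_CREAT, 0644);
    write(fd, buf, sizeof buf);             /* dirty the page cache       */
    off_t pos = lseek(fd, 0, SEEK_CUR);     /* record the file position   */
    close(fd);                              /* close the fd               */

    sync();                                 /* writeback fails here, with
                                               nobody left to tell        */

    fd = open("/tmp/datafile", O_WRONLY);   /* reopen the file            */
    lseek(fd, pos, SEEK_SET);               /* seek back to the position  */
    printf("fsync: %d\n", fsync(fd));       /* 0: the error was eaten     */
    close(fd);
    return 0;
}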

Fair enough, really, when you look at it from the kernel's point of view. What else can it do? Nobody has the file open. It'd have to mark the file itself as bad somehow. But that's pretty bad for our robustness AFAICS.

There's some level of agreement that we should PANIC on fsync() errors, at least on Linux, but likely everywhere. But we also now know it's insufficient to be fully protective.

If dirty writeback fails between our close() and re-open() I see the same behaviour as with sync(). To test that I set dirty_writeback_centisecs and dirty_expire_centisecs to 1 and added a usleep(3*100*1000) between close() and open(). (It's still plenty slow). So sync() is a convenient way to simulate something other than our own fsync() writing out the dirty buffer.

If I omit the sync() then we get the error reported by fsync() once when we re open() the file and fsync() it, because the buffers weren't written out yet, so the error wasn't generated until we re-open()ed the file. But I doubt that'll happen much in practice because dirty writeback will get to it first so the error will be seen and discarded before we reopen the file in the checkpointer.

In other words, it looks like even with a new kernel with the error reporting bug fixes, if I understand how the backends and checkpointer interact when it comes to file descriptors, we're unlikely to notice I/O errors and fail a checkpoint. We may notice I/O errors if a backend does its own eager writeback for large I/O operations, or if the checkpointer fsync()s a file before the kernel's dirty writeback gets around to trying to flush the pages that will fail.

I haven't tested anything with multiple processes / multiple FDs yet, where we keep one fd open while writing on another.

But at this point I don't see any way to make Pg reliably detect I/O errors and fail a checkpoint then redo and retry. To even fix this by PANICing like I proposed originally, we need to know we have to PANIC.

AFAICS it's completely unsafe to write(), close(), open() and fsync() and expect that the fsync() makes any promises about the write(). Which if I read Pg's low level storage code right, makes it completely unable to reliably detect I/O errors.

When you put it that way, it sounds fair enough too. How long is the kernel meant to remember that there was a write error on the file triggered by a write initiated by some seemingly unrelated process, some unbounded time ago, on a since-closed file?

But it seems to put Pg on the fast track to O_DIRECT.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-05 19:33:14

On Thu, Apr 5, 2018 at 03:09:57PM +0800, Craig Ringer wrote:

ENOSPC doesn't seem to be a concern during normal operation of major file systems (ext3, ext4, btrfs, xfs) because they reserve space before returning from write(). But if a buffered write does manage to fail due to ENOSPC we'll definitely see the same problems. This makes ENOSPC on NFS a potentially data corrupting condition since NFS doesn't preallocate space before returning from write().

This does explain why NFS has a reputation for unreliability for Postgres.


From:Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>
Date:2018-04-05 23:37:42

Note: as I've brought up in another thread, it turns out that PG is not handling fsync errors correctly even when the OS does do the right thing (discovered by testing on FreeBSD).


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-06 01:27:05

On 6 April 2018 at 07:37, Andrew Gierth wrote:

Note: as I've brought up in another thread, it turns out that PG is not handling fsync errors correctly even when the OS does do the right thing (discovered by testing on FreeBSD).

Yikes. For other readers, the related thread for this is

Meanwhile, I've extended my test to run postgres on a deliberately faulty volume and confirmed my results there.

2018-04-06 01:11:40.555 UTC [58] LOG:  checkpoint starting: immediate force wait
2018-04-06 01:11:40.567 UTC [58] ERROR:  could not fsync file "base/12992/16386": Input/output error
2018-04-06 01:11:40.655 UTC [66] ERROR:  checkpoint request failed
2018-04-06 01:11:40.655 UTC [66] HINT:  Consult recent messages in the server log for details.
2018-04-06 01:11:40.655 UTC [66] STATEMENT:  CHECKPOINT



Checkpoint failed with checkpoint request failed
HINT:  Consult recent messages in the server log for details.



Retrying



2018-04-06 01:11:41.568 UTC [58] LOG:  checkpoint starting: immediate force wait
2018-04-06 01:11:41.614 UTC [58] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.000 s, total=0.046 s; sync files=3, longest=0.000 s, average=0.000 s; distance=2727 kB, estimate=2779 kB

Given your report, now I have to wonder if we even reissued the fsync() at all this time. 'perf' time. OK, with

sudo perf record -e syscalls:sys_enter_fsync,syscalls:sys_exit_fsync -a
sudo perf script

I see the failed fsync, then the same fd being fsync()d without error on the next checkpoint, which succeeds.

        postgres  9602 [003] 72380.325817: syscalls:sys_enter_fsync: fd: 0x00000005
        postgres  9602 [003] 72380.325931: syscalls:sys_exit_fsync: 0xfffffffffffffffb
        ...
        postgres  9602 [000] 72381.336767: syscalls:sys_enter_fsync: fd: 0x00000005
        postgres  9602 [000] 72381.336840: syscalls:sys_exit_fsync: 0x0

... and Pg continues merrily on its way without realising it lost data:

[72379.834872] XFS (dm-0): writeback error on sector 118752
[72380.324707] XFS (dm-0): writeback error on sector 118688

In this test I set things up so the checkpointer would see the first fsync() error. But if I make checkpoints less frequent, the bgwriter aggressive, and kernel dirty writeback aggressive, it should be possible to have the failure go completely unobserved too. I'll try that next, because we've already largely concluded that the solution to the issue above is to PANIC on fsync() error. But if we don't see the error at all we're in trouble.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-06 02:53:56

On Fri, Apr 6, 2018 at 1:27 PM, Craig Ringer wrote:

On 6 April 2018 at 07:37, Andrew Gierth wrote:

Note: as I've brought up in another thread, it turns out that PG is not handling fsync errors correctly even when the OS does do the right thing (discovered by testing on FreeBSD).

Yikes. For other readers, the related thread for this is

Yeah. That's really embarrassing, especially after beating up on various operating systems all week. It's also an independent issue -- let's keep that on the other thread and get it fixed.

I see the failed fsync, then the same fd being fsync()d without error on the next checkpoint, which succeeds.

    postgres  9602 [003] 72380.325817: syscalls:sys_enter_fsync: fd: 0x00000005
    postgres  9602 [003] 72380.325931: syscalls:sys_exit_fsync: 0xfffffffffffffffb
    ...
    postgres  9602 [000] 72381.336767: syscalls:sys_enter_fsync: fd: 0x00000005
    postgres  9602 [000] 72381.336840: syscalls:sys_exit_fsync: 0x0

... and Pg continues merrily on its way without realising it lost data:

[72379.834872] XFS (dm-0): writeback error on sector 118752
[72380.324707] XFS (dm-0): writeback error on sector 118688

In this test I set things up so the checkpointer would see the first fsync() error. But if I make checkpoints less frequent, the bgwriter aggressive, and kernel dirty writeback aggressive, it should be possible to have the failure go completely unobserved too. I'll try that next, because we've already largely concluded that the solution to the issue above is to PANIC on fsync() error. But if we don't see the error at all we're in trouble.

I suppose you only see errors because the file descriptors linger open in the virtual file descriptor cache, which is a matter of luck depending on how many relation segment files you touched. One thing you could try to confirm our understanding of the Linux 4.13+ policy would be to hack PostgreSQL so that it reopens the file descriptor every time in mdsync(). See attached.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-06 03:20:22

On 6 April 2018 at 10:53, Thomas Munro wrote:

On Fri, Apr 6, 2018 at 1:27 PM, Craig Ringer wrote:

On 6 April 2018 at 07:37, Andrew Gierth wrote:

Note: as I've brought up in another thread, it turns out that PG is not handling fsync errors correctly even when the OS does do the right thing (discovered by testing on FreeBSD).

Yikes. For other readers, the related thread for this is news-spur.riddles.org.uk

Yeah. That's really embarrassing, especially after beating up on various operating systems all week. It's also an independent issue -- let's keep that on the other thread and get it fixed.

I see the failed fsync, then the same fd being fsync()d without error on the next checkpoint, which succeeds.

    postgres  9602 [003] 72380.325817: syscalls:sys_enter_fsync: fd: 0x00000005
    postgres  9602 [003] 72380.325931: syscalls:sys_exit_fsync: 0xfffffffffffffffb
    ...
    postgres  9602 [000] 72381.336767: syscalls:sys_enter_fsync: fd: 0x00000005
    postgres  9602 [000] 72381.336840: syscalls:sys_exit_fsync: 0x0

... and Pg continues merrily on its way without realising it lost data:

[72379.834872] XFS (dm-0): writeback error on sector 118752
[72380.324707] XFS (dm-0): writeback error on sector 118688

In this test I set things up so the checkpointer would see the first fsync() error. But if I make checkpoints less frequent, the bgwriter aggressive, and kernel dirty writeback aggressive, it should be possible to have the failure go completely unobserved too. I'll try that next, because we've already largely concluded that the solution to the issue above is to PANIC on fsync() error. But if we don't see the error at all we're in trouble.

I suppose you only see errors because the file descriptors linger open in the virtual file descriptor cache, which is a matter of luck depending on how many relation segment files you touched.

In this case I think it's because the kernel didn't get around to doing the writeback before the eagerly forced checkpoint fsync()'d it. Or we didn't even queue it for writeback from our own shared_buffers until just before we fsync()'d it. After all, it's a contrived test case that tries to reproduce the issue rapidly with big writes and frequent checkpoints.

So the checkpointer had the relation open to fsync() it, and it was the checkpointer's fsync() that did writeback on the dirty page and noticed the error.

If the kernel had done the writeback before the checkpointer opened the relation to fsync() it, we might not have seen the error at all - though as you note this depends on the file descriptor cache. You can see the silent-error behaviour in my standalone test case where I confirmed the post-4.13 behaviour. (I'm on 4.14 here).

I can try to reproduce it with postgres too, but it not only requires closing and reopening the FDs, it also requires forcing writeback before opening the fd. To make it occur in a practical timeframe I have to make my kernel writeback settings insanely aggressive and/or call sync() before re-open()ing. I don't really think it's worth it, since I've confirmed the behaviour already with the simpler test in standalone/ in the test repo. To try it yourself, clone

https://github.com/ringerc/scrapcode

and in the master branch

cd testcases/fsync-error-clear
less README
make REOPEN=reopen standalone-run

See https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear/standalone/fsync-error-clear.c#L118 .

I've pushed the postgres test to that repo too; "make postgres-run".

You'll need docker, and be warned, it's using privileged docker containers and messing with dmsetup.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-08 02:16:07

So, what can we actually do about this new Linux behaviour?

Idea 1:

Maybe it could be made to work, but sheesh, that seems horrible. Is there some simpler idea along these lines that could make sure that fsync() is only ever called on file descriptors that were opened before all unflushed writes, or file descriptors cloned from such file descriptors?

Idea 2:

Give up, complain that this implementation is defective and unworkable, both on POSIX-compliance grounds and on POLA grounds, and campaign to get it fixed more fundamentally (actual details left to the experts, no point in speculating here, but we've seen a few approaches that work on other operating systems including keeping buffers dirty and marking the whole filesystem broken/read-only).

Idea 3:

Give up on buffered IO and develop an O_SYNC | O_DIRECT based system ASAP.

Any other ideas?

For a while I considered suggesting an idea which I now think doesn't work. I thought we could try asking for a new fcntl interface that spits out the wb_err counter. Call it an opaque error token or something. Then we could store it in our fsync queue and safely close the file. Check again before fsync()ing, and if we ever see a different value, PANIC because it means a writeback error happened while we weren't looking. Sadly I think it doesn't work because AIUI inodes are not pinned in kernel memory when no one has the file open and there are no dirty buffers, so I think the counters could go away and be reset. Perhaps you could keep inodes pinned by keeping the associated buffers dirty after an error (like FreeBSD), but if you did that you'd have solved the problem already and wouldn't really need the wb_err system at all. Is there some other idea along these lines that could work?


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-08 02:33:37

On Sun, Apr 8, 2018 at 02:16:07PM +1200, Thomas Munro wrote:

So, what can we actually do about this new Linux behaviour?

Idea 1:

Maybe it could be made to work, but sheesh, that seems horrible. Is there some simpler idea along these lines that could make sure that fsync() is only ever called on file descriptors that were opened before all unflushed writes, or file descriptors cloned from such file descriptors?

Idea 2:

Give up, complain that this implementation is defective and unworkable, both on POSIX-compliance grounds and on POLA grounds, and campaign to get it fixed more fundamentally (actual details left to the experts, no point in speculating here, but we've seen a few approaches that work on other operating systems including keeping buffers dirty and marking the whole filesystem broken/read-only).

Idea 3:

Give up on buffered IO and develop an O_SYNC | O_DIRECT based system ASAP.

Idea 4 would be for people to assume their database is corrupt if their server logs report any I/O error on the file systems Postgres uses.


From:Christophe Pettus <xof(at)thebuild(dot)com>
Date:2018-04-08 02:37:47

On Apr 7, 2018, at 19:33, Bruce Momjian wrote: Idea 4 would be for people to assume their database is corrupt if their server logs report any I/O error on the file systems Postgres uses.

Pragmatically, that's where we are right now. The best answer in this bad situation is (a) fix the error, then (b) replay from a checkpoint before the error occurred, but it appears we can't even guarantee that a PostgreSQL process will be the one to see the error.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-08 03:27:45

On 8 April 2018 at 10:16, Thomas Munro wrote:

So, what can we actually do about this new Linux behaviour?

Yeah, I've been cooking over that myself.

More below, but here's an idea #5: decide InnoDB has the right idea, and go to using a single massive blob file, or a few giant blobs.

We have a storage abstraction that makes this way, way less painful than it should be.

We can virtualize relfilenodes into storage extents in relatively few big files. We could use sparse regions to make the addressing more convenient, but that makes copying and backup painful, so I'd rather not.

Even one file per tablespace for persistent relation heaps, another for indexes, another for each fork type.

That way we can use something like your #1 (which is what I was also thinking about then rejecting previously), but reduce the pain by reducing the FD count drastically so exhausting FDs stops being a problem.

Previously I was leaning toward what you've described here:

Maybe it could be made to work, but sheesh, that seems horrible. Is there some simpler idea along these lines that could make sure that fsync() is only ever called on file descriptors that were opened before all unflushed writes, or file descriptors cloned from such file descriptors?

... and got stuck on "yuck, that's awful".

I was assuming we'd force early checkpoints if the checkpointer hit its fd limit, but that's even worse.

We'd need to urgently do away with segmented relations, and partitions would start to become a hindrance.

Even then it's going to be an unworkable nightmare with heavily partitioned systems, systems that use schema-sharding, etc. And it'll mean we need to play with process limits and, often, system wide limits on FDs. I imagine the performance implications won't be pretty.

Idea 2:

Give up, complain that this implementation is defective and unworkable, both on POSIX-compliance grounds and on POLA grounds, and campaign to get it fixed more fundamentally (actual details left to the experts, no point in speculating here, but we've seen a few approaches that work on other operating systems including keeping buffers dirty and marking the whole filesystem broken/read-only).

This appears to be what SQLite does AFAICS.

https://www.sqlite.org/atomiccommit.html

though it has the huge luxury of a single writer, so it's probably only subject to the original issue not the multiprocess / checkpointer issues we face.

Idea 3:

Give up on buffered IO and develop an O_SYNC | O_DIRECT based system ASAP.

That seems to be what the kernel folks will expect. But that's going to KILL performance. We'll need writer threads to have any hope of it not totally sucking, because otherwise simple things like updating a heap tuple and two related indexes will incur enormous disk latencies.

But I suspect it's the path forward.

Goody.
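
For concreteness, "O_SYNC | O_DIRECT" means bypassing the page cache entirely, so errors come back from write() itself, but the buffer, length, and file offset must be block-aligned and every write waits on the device. A minimal Linux-specific sketch (the file name is made up, and the 4096-byte alignment is an assumption that happens to satisfy most devices):

#define _GNU_SOURCE               /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* O_DIRECT: bypass the page cache; O_SYNC: write() returns only after
     * the data has reached the device, so errors surface immediately. */
    int fd = open("some_file", O_WRONLY | O_CREAT | O_DIRECT | O_SYNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* Direct I/O requires aligned buffers, lengths, and offsets. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) return 1;
    memset(buf, 'x', 4096);

    if (write(fd, buf, 4096) != 4096)
        perror("write");          /* an I/O error shows up right here */

    free(buf);
    close(fd);
    return 0;
}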

Any other ideas?

For a while I considered suggesting an idea which I now think doesn't work. I thought we could try asking for a new fcntl interface that spits out the wb_err counter. Call it an opaque error token or something. Then we could store it in our fsync queue and safely close the file. Check again before fsync()ing, and if we ever see a different value, PANIC because it means a writeback error happened while we weren't looking. Sadly I think it doesn't work because AIUI inodes are not pinned in kernel memory when no one has the file open and there are no dirty buffers, so I think the counters could go away and be reset. Perhaps you could keep inodes pinned by keeping the associated buffers dirty after an error (like FreeBSD), but if you did that you'd have solved the problem already and wouldn't really need the wb_err system at all. Is there some other idea along these lines that could work?

I think our underlying data syncing concept is fundamentally broken, and it's not really the kernel's fault.

We assume that we can safely:

procA: open()
procA: write()
procA: close()

... some long time later, unbounded as far as the kernel is concerned ...

procB: open()
procB: fsync()
procB: close()
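
As a minimal C sketch of this pattern (not PostgreSQL code; "some_file" is a made-up path and most error handling is omitted):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void write_side(void)              /* "procA" */
{
    int fd = open("some_file", O_WRONLY | O_CREAT, 0644);
    const char buf[] = "dirty data";
    write(fd, buf, sizeof buf - 1);       /* lands in the page cache only */
    close(fd);                            /* no flush has been requested  */
}

static void fsync_side(void)              /* "procB", possibly much later */
{
    int fd = open("some_file", O_WRONLY);
    /* If writeback failed before this open(), post-4.13 Linux may report
     * nothing here: this fsync() can return 0 and the data is simply gone. */
    if (fsync(fd) != 0)
        perror("fsync");
    close(fd);
}

int main(void)
{
    write_side();
    /* ... the kernel may attempt, and fail, writeback in between ... */
    fsync_side();
    return 0;
}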

If the kernel does writeback in the middle, how on earth is it supposed to know we expect to reopen the file and check back later?

Should it just remember "this file had an error" forever, and tell every caller? In that case how could we recover? We'd need some new API to say "yeah, ok already, I'm redoing all my work since the last good fsync() so you can clear the error flag now". Otherwise it'd keep reporting an error after we did redo to recover, too.

I never really clicked to the fact that we closed relations with pending buffered writes, left them closed, then reopened them to fsync. That's .... well, the kernel isn't the only thing doing crazy things here.

Right now I think we're at option (4): If you see anything that smells like a write error in your kernel logs, hard-kill postgres with -m immediate (do NOT let it do a shutdown checkpoint). If it did a checkpoint since the logs, fake up a backup label to force redo to start from the last checkpoint before the error. Otherwise, it's safe to just let it start up again and do redo again.

Fun times.

This also means AFAICS that running Pg on NFS is extremely unsafe, you MUST make sure you don't run out of disk. Because the usual safeguard of space reservation against ENOSPC in fsync doesn't apply to NFS. (I haven't tested this with nfsv3 in sync,hard,nointr mode yet, maybe that's safe, but I doubt it). The same applies to thin-provisioned storage. Just. Don't.

This helps explain various reports of corruption in Docker and various other tools that use various sorts of thin provisioning. If you hit ENOSPC in fsync(), bye bye data.


From:Peter Geoghegan <pg(at)bowt(dot)ie>
Date:2018-04-08 03:37:06

On Sat, Apr 7, 2018 at 8:27 PM, Craig Ringer wrote:

More below, but here's an idea #5: decide InnoDB has the right idea, and go to using a single massive blob file, or a few giant blobs.

We have a storage abstraction that makes this way, way less painful than it should be.

We can virtualize relfilenodes into storage extents in relatively few big files. We could use sparse regions to make the addressing more convenient, but that makes copying and backup painful, so I'd rather not.

Even one file per tablespace for persistent relation heaps, another for indexes, another for each fork type.

I'm not sure that we can do that now, since it would break the new "Optimize btree insertions for common case of increasing values" optimization. (I did mention this before it went in.)

I've asked Pavan to at least add a note to the nbtree README that explains the high level theory behind the optimization, as part of post-commit clean-up. I'll ask him to say something about how it might affect extent-based storage, too.


From:Christophe Pettus <xof(at)thebuild(dot)com>
Date:2018-04-08 03:46:17

On Apr 7, 2018, at 20:27, Craig Ringer wrote:

Right now I think we're at option (4): If you see anything that smells like a write error in your kernel logs, hard-kill postgres with -m immediate (do NOT let it do a shutdown checkpoint). If it did a checkpoint since the logs, fake up a backup label to force redo to start from the last checkpoint before the error. Otherwise, it's safe to just let it start up again and do redo again.

Before we spiral down into despair and excessive alcohol consumption, this is basically the same situation as a checksum failure or some other kind of uncorrected media-level error. The bad part is that we have to find out from the kernel logs rather than from PostgreSQL directly. But this does not strike me as otherwise significantly different from, say, an infrequently-accessed disk block reporting an uncorrectable error when we finally get around to reading it.


From:Andreas Karlsson <andreas(at)proxel(dot)se>
Date:2018-04-08 09:41:06

On 04/08/2018 05:27 AM, Craig Ringer wrote:

More below, but here's an idea #5: decide InnoDB has the right idea, and go to using a single massive blob file, or a few giant blobs.

FYI: MySQL has by default one file per table these days. The old approach with one massive file was a maintenance headache so they changed the default some releases ago.

https://dev.mysql.com/doc/refman/8.0/en/innodb-multiple-tablespaces.html


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-08 10:30:31

On 8 April 2018 at 11:46, Christophe Pettus wrote:

On Apr 7, 2018, at 20:27, Craig Ringer wrote:

Right now I think we're at option (4): If you see anything that smells like a write error in your kernel logs, hard-kill postgres with -m immediate (do NOT let it do a shutdown checkpoint). If it did a checkpoint since the logs, fake up a backup label to force redo to start from the last checkpoint before the error. Otherwise, it's safe to just let it start up again and do redo again.

Before we spiral down into despair and excessive alcohol consumption, this is basically the same situation as a checksum failure or some other kind of uncorrected media-level error. The bad part is that we have to find out from the kernel logs rather than from PostgreSQL directly. But this does not strike me as otherwise significantly different from, say, an infrequently-accessed disk block reporting an uncorrectable error when we finally get around to reading it.

I don't entirely agree - because it affects ENOSPC, I/O errors on thin provisioned storage, I/O errors on multipath storage, etc. (I identified the original issue on a thin provisioned system that ran out of backing space, mangling PostgreSQL in a way that made no sense at the time).

These are way more likely than bit flips or other storage level corruption, and things that we previously expected to detect and fail gracefully for.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-08 10:31:24

On 8 April 2018 at 17:41, Andreas Karlsson wrote:

On 04/08/2018 05:27 AM, Craig Ringer wrote:

More below, but here's an idea #5: decide InnoDB has the right idea, and go to using a single massive blob file, or a few giant blobs.

FYI: MySQL has by default one file per table these days. The old approach with one massive file was a maintenance headache so they changed the default some releases ago.

https://dev.mysql.com/doc/refman/8.0/en/innodb-multiple-tablespaces.html

Huh, thanks for the update.

We should see how they handle reliable flushing and see if they've looked into it. If they haven't, we should give them a heads-up, and if they have, let's learn from them.


From:Christophe Pettus <xof(at)thebuild(dot)com>
Date:2018-04-08 16:38:03

On Apr 8, 2018, at 03:30, Craig Ringer wrote:

These are way more likely than bit flips or other storage level corruption, and things that we previously expected to detect and fail gracefully for.

This is definitely bad, and it explains a few otherwise-inexplicable corruption issues we've seen. (And great work tracking it down!) I think it's important not to panic, though; PostgreSQL doesn't have a reputation for horrible data integrity. I'm not sure it makes sense to do a major rearchitecting of the storage layer (especially with pluggable storage coming along) to address this. While the failure modes are more common, the solution (a PITR backup) is one that an installation should have anyway against media failures.


From:Greg Stark <stark(at)mit(dot)edu>
Date:2018-04-08 21:23:21

On 8 April 2018 at 04:27, Craig Ringer wrote:

On 8 April 2018 at 10:16, Thomas Munro wrote:

If the kernel does writeback in the middle, how on earth is it supposed to know we expect to reopen the file and check back later?

Should it just remember "this file had an error" forever, and tell every caller? In that case how could we recover? We'd need some new API to say "yeah, ok already, I'm redoing all my work since the last good fsync() so you can clear the error flag now". Otherwise it'd keep reporting an error after we did redo to recover, too.

There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep track of any errors. We just need fsync to accurately report whether all the buffers in the file have been written out. When you call fsync again the kernel needs to initiate i/o on all the dirty buffers and block until they complete successfully. If they complete successfully then nobody cares whether they had some failure in the past when i/o was initiated at some point in the past.

The problem is not that errors aren't being tracked correctly. The problem is that dirty buffers are being marked clean when they haven't been written out. They consider dirty filesystem buffers when there's hardware failure preventing them from being written "a memory leak".

As long as any error means the kernel has discarded writes then there's no real hope of any reliable operation through that interface.

Going to DIRECTIO is basically recognizing this: that the kernel filesystem buffer provides no reliable interface, so we need to reimplement it ourselves in user space.

It's rather disheartening. Aside from having to do all that work we have the added barrier that we don't have as much information about the hardware as the kernel has. We don't know where raid stripes begin and end, how big the memory controller buffers are or how to tell when they're full or empty or how to flush them. etc etc. We also don't know what else is going on on the machine.


From:Christophe Pettus <xof(at)thebuild(dot)com>
Date:2018-04-08 21:28:43

On Apr 8, 2018, at 14:23, Greg Stark wrote:

They consider dirty filesystem buffers when there's hardware failure preventing them from being written "a memory leak".

That's not an irrational position. File system buffers are not dedicated memory for file system caching; they're being used for that because no one has a better use for them at that moment. If an inability to flush them to disk meant that they suddenly became pinned memory, a large copy operation to a yanked USB drive could result in the system having no more allocatable memory. I guess in theory that they could swap them, but swapping out a file system buffer in hopes that sometime in the future it could be properly written doesn't seem very architecturally sound to me.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-08 21:47:04

On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:

On 8 April 2018 at 04:27, Craig Ringer wrote:

On 8 April 2018 at 10:16, Thomas Munro wrote:

If the kernel does writeback in the middle, how on earth is it supposed to know we expect to reopen the file and check back later?

Should it just remember "this file had an error" forever, and tell every caller? In that case how could we recover? We'd need some new API to say "yeah, ok already, I'm redoing all my work since the last good fsync() so you can clear the error flag now". Otherwise it'd keep reporting an error after we did redo to recover, too.

There is no spoon^H^H^H^H^Herror flag. We don't need fsync to keep track of any errors. We just need fsync to accurately report whether all the buffers in the file have been written out. When you call fsync

Instead, fsync() reports when some of the buffers have not been written out, due to reasons outlined before. As such it may make some sense to maintain some tracking regarding errors even after marking failed dirty pages as clean (in fact it has been proposed, but this introduces memory overhead).

again the kernel needs to initiate i/o on all the dirty buffers and block until they complete successfully. If they complete successfully then nobody cares whether they had some failure in the past when i/o was initiated at some point in the past.

The question is, what should the kernel and application do in cases where this is simply not possible (according to freebsd that keeps dirty pages around after failure, for example, -EIO from the block layer is a contract for unrecoverable errors so it is pointless to keep them dirty). You'd need a specialized interface to clear-out the errors (and drop the dirty pages), or potentially just remount the filesystem.

The problem is not that errors aren't being tracked correctly. The problem is that dirty buffers are being marked clean when they haven't been written out. They consider dirty filesystem buffers when there's hardware failure preventing them from being written "a memory leak".

As long as any error means the kernel has discarded writes then there's no real hope of any reliable operation through that interface.

This does not necessarily follow. Whether the kernel discards writes or not would not really help (see above). It is more a matter of proper "reporting contract" between userspace and kernel, and tracking would be a way for facilitating this vs. having a more complex userspace scheme (as described by others in this thread) where synchronization for fsync() is required in a multi-process application.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-08 22:29:16

On Sun, Apr 8, 2018 at 09:38:03AM -0700, Christophe Pettus wrote:

On Apr 8, 2018, at 03:30, Craig Ringer wrote:

These are way more likely than bit flips or other storage level corruption, and things that we previously expected to detect and fail gracefully for.

This is definitely bad, and it explains a few otherwise-inexplicable corruption issues we've seen. (And great work tracking it down!) I think it's important not to panic, though; PostgreSQL doesn't have a reputation for horrible data integrity. I'm not sure it makes sense to do a major rearchitecting of the storage layer (especially with pluggable storage coming along) to address this. While the failure modes are more common, the solution (a PITR backup) is one that an installation should have anyway against media failures.

I think the big problem is that we don't have any way of stopping Postgres at the time the kernel reports the errors to the kernel log, so we are then returning potentially incorrect results and committing transactions that might be wrong or lost. If we could stop Postgres when such errors happen, at least the administrator could fix the problem or fail over to a standby.

A crazy idea would be to have a daemon that checks the logs and stops Postgres when it seems something is wrong.


From:Christophe Pettus <xof(at)thebuild(dot)com>
Date:2018-04-08 23:10:24

On Apr 8, 2018, at 15:29, Bruce Momjian wrote: I think the big problem is that we don't have any way of stopping Postgres at the time the kernel reports the errors to the kernel log, so we are then returning potentially incorrect results and committing transactions that might be wrong or lost.

Yeah, it's bad. In the short term, the best advice to installations is to monitor their kernel logs for errors (which very few do right now), and make sure they have a backup strategy which can encompass restoring from an error like this. Even Craig's smart fix of patching the backup label to recover from a previous checkpoint doesn't do much good if we don't have WAL records back that far (or one of the required WAL records also took a hit).

In the longer term... O_DIRECT seems like the most plausible way out of this, but that might be popular with people running on file systems or OSes that don't have this issue. (Setting aside the daunting prospect of implementing that.)


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-08 23:16:25

On 2018-04-08 18:29:16 -0400, Bruce Momjian wrote:

On Sun, Apr 8, 2018 at 09:38:03AM -0700, Christophe Pettus wrote:

On Apr 8, 2018, at 03:30, Craig Ringer wrote:

These are way more likely than bit flips or other storage level corruption, and things that we previously expected to detect and fail gracefully for.

This is definitely bad, and it explains a few otherwise-inexplicable corruption issues we've seen. (And great work tracking it down!) I think it's important not to panic, though; PostgreSQL doesn't have a reputation for horrible data integrity. I'm not sure it makes sense to do a major rearchitecting of the storage layer (especially with pluggable storage coming along) to address this. While the failure modes are more common, the solution (a PITR backup) is one that an installation should have anyway against media failures.

I think the big problem is that we don't have any way of stopping Postgres at the time the kernel reports the errors to the kernel log, so we are then returning potentially incorrect results and committing transactions that might be wrong or lost. If we could stop Postgres when such errors happen, at least the administrator could fix the problem or fail over to a standby.

A crazy idea would be to have a daemon that checks the logs and stops Postgres when it seems something is wrong.

I think the danger presented here is far smaller than some of the statements in this thread might make one think. In all likelihood, once you've got an IO error that kernel level retries don't fix, your database is screwed. Whether fsync reports that or not is really somewhat besides the point. We don't panic that way when getting IO errors during reads either, and they're more likely to be persistent than errors during writes (because remapping on storage layer can fix issues, but not during reads).

There's a lot of not so great things here, but I don't think there's any need to panic.

We should fix things so that reported errors are treated with crash recovery, and for the rest I think there's very fair arguments to be made that that's far outside postgres's remit.

I think there's pretty good reasons to go to direct IO where supported, but error handling doesn't strike me as a particularly good reason for the move.


From:Christophe Pettus <xof(at)thebuild(dot)com>
Date:2018-04-08 23:27:57

On Apr 8, 2018, at 16:16, Andres Freund wrote: We don't panic that way when getting IO errors during reads either, and they're more likely to be persistent than errors during writes (because remapping on storage layer can fix issues, but not during reads).

There is a distinction to be drawn there, though, because we immediately pass an error back to the client on a read, but a write problem in this situation can be masked for an extended period of time.

That being said...

There's a lot of not so great things here, but I don't think there's any need to panic.

No reason to panic, yes. We can assume that if this was a very big persistent problem, it would be much more widely reported. It would, however, be good to find a way to get the error surfaced back up to the client in a way that is not just monitoring the kernel logs.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-09 01:31:56

On 9 April 2018 at 05:28, Christophe Pettus wrote:

On Apr 8, 2018, at 14:23, Greg Stark wrote:

They consider dirty filesystem buffers when there's hardware failure preventing them from being written "a memory leak".

That's not an irrational position. File system buffers are not dedicated memory for file system caching; they're being used for that because no one has a better use for them at that moment. If an inability to flush them to disk meant that they suddenly became pinned memory, a large copy operation to a yanked USB drive could result in the system having no more allocatable memory. I guess in theory that they could swap them, but swapping out a file system buffer in hopes that sometime in the future it could be properly written doesn't seem very architecturally sound to me.

Yep.

Another example is a write to an NFS or iSCSI volume that goes away forever. What if the app keeps write()ing in the hopes it'll come back, and by the time the kernel starts reporting EIO for write(), it's already saddled with a huge volume of dirty writeback buffers it can't get rid of because someone, one day, might want to know about them?

You could make the argument that it's OK to forget if the entire file system goes away. But actually, why is that ok? What if it's remounted again? That'd be really bad too, for someone expecting write reliability.

You can coarsen from dirty buffer tracking to marking the FD(s) bad, but what if there's no FD to mark because the file isn't open at the moment?

You can mark the inode cache entry and pin it, I guess. But what if your app triggered I/O errors over vast numbers of small files? Again, the kernel's left holding the ball.

It doesn't know if/when an app will return to check. It doesn't know how long to remember the failure for. It doesn't know when all interested clients have been informed and it can treat the fault as cleared/repaired, either, so it'd have to keep on reporting EIO for PostgreSQL's own writes and fsyncs() indefinitely, even once we do recovery.

The only way it could avoid that would be to keep the dirty writeback pages around and flagged bad, then clear the flag when a new write() replaces the same file range. I can't imagine that being practical.

Blaming the kernel for this sure is the easy way out.

But IMO we cannot rationally expect the kernel to remember error state forever for us, then forget it when we expect, all without actually telling it anything about our activities or even that we still exist and are still interested in the files/writes. We've closed the files and gone away.

Whatever we do, it's likely going to have to involve not doing that anymore.

Even if we can somehow convince the kernel folks to add a new interface for us that reports I/O errors to some listener, like an inotify/fnotify/dnotify/whatever-it-is-today-notify extension reporting errors in buffered async writes, we won't be able to rely on having it for 5-10 years, and only on Linux.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-09 01:35:06

On 9 April 2018 at 06:29, Bruce Momjian wrote:

I think the big problem is that we don't have any way of stopping Postgres at the time the kernel reports the errors to the kernel log, so we are then returning potentially incorrect results and committing transactions that might be wrong or lost.

Right.

Specifically, we need a way to ask the kernel at checkpoint time "was everything written to [this set of files] flushed successfully since the last time I asked, no matter who did the writing and no matter how the writes were flushed?"

If the result is "no" we PANIC and redo. If the hardware/volume is screwed, the user can fail over to a standby, do PITR, etc.

But we don't have any way to ask that reliably at present.


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-09 01:55:10

Hi,

On 2018-04-08 16:27:57 -0700, Christophe Pettus wrote:

On Apr 8, 2018, at 16:16, Andres Freund wrote: We don't panic that way when getting IO errors during reads either, and they're more likely to be persistent than errors during writes (because remapping on storage layer can fix issues, but not during reads).

There is a distinction to be drawn there, though, because we immediately pass an error back to the client on a read, but a write problem in this situation can be masked for an extended period of time.

Only if you're "lucky" enough that your clients actually read that data, and then you're somehow able to figure out across the whole stack that these 0.001% of transactions that fail are due to IO errors. Or you also need to do log analysis.

If you want to solve things like that you need regular reads of all your data, including verifications etc.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-09 02:00:41

On 9 April 2018 at 07:16, Andres Freund wrote:

I think the danger presented here is far smaller than some of the statements in this thread might make one think.

Clearly it's not happening a huge amount or we'd have a lot of noise about Pg eating people's data, people shouting about how unreliable it is, etc. We don't. So it's not some earth shattering imminent threat to everyone's data. It's gone unnoticed, or the root cause unidentified, for a long time.

I suspect we've written off a fair few issues in the past as "it's bad hardware" when actually the hardware fault was the trigger for a Pg/kernel interaction bug. And blamed containers for things that weren't really the container's fault. But even so, if it were happening tons, we'd hear more noise.

I've already been very surprised when I learned that PostgreSQL completely ignores wholly absent relfilenodes. Specifically, if you unlink() a relation's backing relfilenode while Pg is down and that file has writes pending in the WAL, we merrily re-create it with uninitialized pages and go on our way. As Andres pointed out in an offlist discussion, redo isn't a consistency check, and it's not obliged to fail in such cases. We can say "well, don't do that then" and define away file losses from FS corruption etc. as not our problem; the lower levels we expect to take care of this have failed.

We have to look at what checkpoints are and are not supposed to promise, and whether this is a problem we just define away as "not our problem, the lower level failed, we're not obliged to detect this and fail gracefully."

We can choose to say that checkpoints are required to guarantee crash/power loss safety ONLY and do not attempt to protect against I/O errors of any sort. In fact, I think we should likely amend the documentation for release versions to say just that.

In all likelihood, once you've got an IO error that kernel level retries don't fix, your database is screwed.

Your database is going to be down or have interrupted service. It's possible you may have some unreadable data. This could result in localised damage to one or more relations. That could affect FK relationships, indexes, all sorts. If you're really unlucky you might lose something critical like pg_clog/ contents.

But in general your DB should be repairable/recoverable even in those cases.

And in many failure modes there's no reason to expect any data loss at all, like:

  • Local disk fills up (seems to be safe already due to space reservation at write() time)
  • Thin-provisioned storage backing local volume iSCSI or paravirt block device fills up
  • NFS volume fills up

Except for the ENOSPC on NFS, all the rest of the cases can be handled by expecting the kernel to retry forever and not return until the block is written or we reach the heat death of the universe. And NFS, well...

Part of the trouble is that the kernel won't retry forever in all these cases, and doesn't seem to have a way to ask it to in all cases.

And if the user hasn't configured it for the right behaviour in terms of I/O error resilience, we don't find out about it.

So it's not the end of the world, but it'd sure be nice to fix.

Whether fsync reports that or not is really somewhat besides the point. We don't panic that way when getting IO errors during reads either, and they're more likely to be persistent than errors during writes (because remapping on storage layer can fix issues, but not during reads).

That's because reads don't make promises about what's committed and synced. I think that's quite different.

We should fix things so that reported errors are treated with crash recovery, and for the rest I think there's very fair arguments to be made that that's far outside postgres's remit.

Certainly for current versions.

I think we need to think about a more robust path in future. But it's certainly not "stop the world" territory.

The docs need an update to indicate that we explicitly disclaim responsibility for I/O errors on async writes, and that the kernel and I/O stack must be configured never to give up on buffered writes. If it does, that's not our problem anymore.


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-09 02:06:12

On 2018-04-09 10:00:41 +0800, Craig Ringer wrote:

I suspect we've written off a fair few issues in the past as "it's bad hardware" when actually the hardware fault was the trigger for a Pg/kernel interaction bug. And blamed containers for things that weren't really the container's fault. But even so, if it were happening tons, we'd hear more noise.

Agreed on that, but I think that's FAR more likely to be things like multixacts, index structure corruption due to logic bugs etc.

I've already been very surprised when I learned that PostgreSQL completely ignores wholly absent relfilenodes. Specifically, if you unlink() a relation's backing relfilenode while Pg is down and that file has writes pending in the WAL, we merrily re-create it with uninitialized pages and go on our way. As Andres pointed out in an offlist discussion, redo isn't a consistency check, and it's not obliged to fail in such cases. We can say "well, don't do that then" and define away file losses from FS corruption etc. as not our problem; the lower levels we expect to take care of this have failed.

And it'd be a really bad idea to behave differently.

And in many failure modes there's no reason to expect any data loss at all, like:

That definitely should be treated separately.

Those should be the same as the above.

I think we need to think about a more robust path in future. But it's certainly not "stop the world" territory.

I think you're underestimating the complexity of doing that by at least two orders of magnitude.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-09 03:15:01

On 9 April 2018 at 10:06, Andres Freund wrote:

And in many failure modes there's no reason to expect any data loss at all, like:

  • Local disk fills up (seems to be safe already due to space reservation at write() time)

That definitely should be treated separately.

It is, because all the FSes I looked at reserve space before returning from write(), even if they do delayed allocation. So they won't fail with ENOSPC at fsync() time or silently due to lost errors on background writeback. Otherwise we'd be hearing a LOT more noise about this.

  • Thin-provisioned storage backing local volume iSCSI or paravirt block device fills up
  • NFS volume fills up

Those should be the same as the above.

Unfortunately, they aren't.

AFAICS NFS doesn't reserve space with the other end before returning from write(), even if mounted with the sync option. So we can get ENOSPC lazily when the buffer writeback fails due to a full backing file system. This then travels the same paths as EIO: we fsync(), ERROR, retry, appear to succeed, and carry on with life losing the data. Or we never hear about the error in the first place.

(There's a proposed extension that'd allow this, see https://tools.ietf.org/html/draft-iyer-nfsv4-space-reservation-ops-02#page-5, but I see no mention of it in fs/nfs. All the reserve_space / xdr_reserve_space stuff seems to be related to space in protocol messages at a quick read.)
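
A minimal sketch of the difference (not a test of any particular NFS implementation; "file_on_nfs" is a made-up path standing in for a file on a nearly full NFS mount):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("file_on_nfs", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[8192];
    memset(buf, 'x', sizeof buf);

    /* On a local filesystem a full disk normally shows up here, because
     * space is reserved at write() time ... */
    if (write(fd, buf, sizeof buf) < 0)
        perror("write");

    /* ... but on NFS the write() above can succeed and ENOSPC only appear
     * when the dirty pages are pushed to the server, e.g. at fsync(). */
    if (fsync(fd) != 0)
        perror("fsync");

    close(fd);
    return 0;
}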

Thin provisioned storage could vary a fair bit depending on the implementation. But the specific failure case I saw, prompting this thread, was on a volume using the stack:

xfs -> lvm2 -> multipath -> ??? -> SAN

(the HBA/iSCSI/whatever was not recorded by the looks, but IIRC it was iSCSI. I'm checking.)

The SAN ran out of space. Due to use of thin provisioning, Linux thought there was plenty of space on the volume; LVM thought it had plenty of physical extents free and unallocated, XFS thought there was tons of free space, etc. The space exhaustion manifested as I/O errors on flushes of writeback buffers.

The logs were like this:

kernel: sd 2:0:0:1: [sdd] Unhandled sense code
kernel: sd 2:0:0:1: [sdd]
kernel: Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: sd 2:0:0:1: [sdd]
kernel: Sense Key : Data Protect [current]
kernel: sd 2:0:0:1: [sdd]
kernel: Add. Sense: Space allocation failed write protect
kernel: sd 2:0:0:1: [sdd] CDB:
kernel: Write(16): **HEX-DATA-CUT-OUT**
kernel: Buffer I/O error on device dm-0, logical block 3098338786
kernel: lost page write due to I/O error on dm-0
kernel: Buffer I/O error on device dm-0, logical block 3098338787

The immediate cause was that Linux's multipath driver didn't seem to recognise the sense code as retryable, so it gave up and reported it to the next layer up (LVM). LVM and XFS both seem to think that the lower layer is responsible for retries, so they toss the write away, and tell any interested writers if they feel like it, per discussion upthread.

In this case Pg did get the news and reported fsync() errors on checkpoints, but it only reported an error once per relfilenode. Once it ran out of failed relfilenodes to cause the checkpoint to ERROR, it "completed" a "successful" checkpoint and kept on running until the resulting corruption started to manifest itself and it segfaulted some time later. As we've now learned, there's no guarantee we'd even get the news about the I/O errors at all.

WAL was on a separate volume that didn't run out of room immediately, so we didn't PANIC on WAL write failure and prevent the issue.

In this case if Pg had PANIC'd (and been able to guarantee to get the news of write failures reliably), there'd have been no corruption and no data loss despite the underlying storage issue.

If, prior to seeing this, you'd asked me "will my PostgreSQL database be corrupted if my thin-provisioned volume runs out of space" I'd have said "Surely not. PostgreSQL won't be corrupted by running out of disk space, it orders writes carefully and forces flushes so that it will recover gracefully from write failures."

Except not. I was very surprised.

BTW, it also turns out that the default for multipath is to give up on errors anyway; see the queue_if_no_path and no_path_retry options. (Hint: run PostgreSQL with no_path_retry=queue.) That's a sane default if you use O_DIRECT|O_SYNC, and otherwise pretty much a data-eating setup.

I regularly see rather a lot of multipath systems, iSCSI systems, SAN backed systems, etc. I think we need to be pretty clear that we expect them to retry indefinitely, and if they report an I/O error we cannot reliably handle it. We need to patch Pg to PANIC on any fsync() failure and document that Pg won't notice some storage failure modes that might otherwise be considered nonfatal or transient, so very specific storage configuration and testing is required. (Not that anyone will do it). Also warn against running on NFS even with "hard,sync,nointr".

It'd be interesting to have a tool that tested error handling, allowing people to do iSCSI plug-pull tests, that sort of thing. But as far as I can tell nobody ever tests their storage stack anyway, so I don't plan on writing something that'll never get used.

I think we need to think about a more robust path in future. But it's certainly not "stop the world" territory.

I think you're underestimating the complexity of doing that by at least two orders of magnitude.

Oh, it's just a minor total rewrite of half Pg, no big deal ;)

I'm sure that no matter how big I think it is, I'm still underestimating it.

The most workable option IMO would be some sort of fnotify/dnotify/whatever that reports all I/O errors on a volume. Some kind of error reporting handle we can keep open on a volume level that we can check for each volume/tablespace after we fsync() everything to see if it all really worked. If we PANIC if that gives us a bad answer, and PANIC on fsync errors, we guard against the great majority of these sorts of should-be-transient-if-the-kernel-didn't-give-up-and-throw-away-our-data errors.

Even then, good luck getting those events from an NFS volume in which the backing volume experiences an issue.

And it's kind of moot because AFAICS no such interface exists.


From:Greg Stark <stark(at)mit(dot)edu>
Date:2018-04-09 08:45:40

On 8 April 2018 at 22:47, Anthony Iliopoulos wrote:

On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:

On 8 April 2018 at 04:27, Craig Ringer wrote:

On 8 April 2018 at 10:16, Thomas Munro

The question is, what should the kernel and application do in cases where this is simply not possible (according to freebsd that keeps dirty pages around after failure, for example, -EIO from the block layer is a contract for unrecoverable errors so it is pointless to keep them dirty). You'd need a specialized interface to clear-out the errors (and drop the dirty pages), or potentially just remount the filesystem.

Well firstly that's not necessarily the question. ENOSPC is not an unrecoverable error. And even unrecoverable errors for a single write doesn't mean the write will never be able to succeed in the future. But secondly doesn't such an interface already exist? When the device is dropped any dirty pages already get dropped with it. What's the point in dropping them but keeping the failing device?

But just to underline the point. "pointless to keep them dirty" is exactly backwards from the application's point of view. If the error writing to persistent media really is unrecoverable then it's all the more critical that the pages be kept so the data can be copied to some other device. The last thing user space expects to happen is if the data can't be written to persistent storage then also immediately delete it from RAM. (And the really last thing user space expects is for this to happen and return no error.)


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-09 10:50:41

On Mon, Apr 09, 2018 at 09:45:40AM +0100, Greg Stark wrote:

On 8 April 2018 at 22:47, Anthony Iliopoulos wrote:

On Sun, Apr 08, 2018 at 10:23:21PM +0100, Greg Stark wrote:

On 8 April 2018 at 04:27, Craig Ringer wrote:

On 8 April 2018 at 10:16, Thomas Munro

The question is, what should the kernel and application do in cases where this is simply not possible (according to freebsd that keeps dirty pages around after failure, for example, -EIO from the block layer is a contract for unrecoverable errors so it is pointless to keep them dirty). You'd need a specialized interface to clear-out the errors (and drop the dirty pages), or potentially just remount the filesystem.

Well firstly that's not necessarily the question. ENOSPC is not an unrecoverable error. And even unrecoverable errors for a single write doesn't mean the write will never be able to succeed in the future.

To make things a bit simpler, let us focus on EIO for the moment. The contract between the block layer and the filesystem layer is assumed to be that when an EIO is propagated up to the fs, you may assume that all possibilities for recovering have been exhausted in lower layers of the stack. Mind you, I am not claiming that this contract is either documented or necessarily respected (in fact there have been studies on the error propagation and handling of the block layer, see [1]). Let us assume that this is the design contract though (which appears to be the case across a number of open-source kernels), and if not - it's a bug. In this case, indeed the specific write()s will never be able to succeed in the future, at least not as long as the BIOs are allocated to the specific failing LBAs.

But secondly doesn't such an interface already exist? When the device is dropped any dirty pages already get dropped with it. What's the point in dropping them but keeping the failing device?

I think there are degrees of failure. There are certainly cases where one may encounter localized unrecoverable medium errors (specific to certain LBAs) that are non-maskable from the block layer and below. That does not mean that the device is dropped at all, so it does make sense to continue all other operations to all other regions of the device that are functional. In cases of total device failure, then the filesystem will prevent you from proceeding anyway.

But just to underline the point. "pointless to keep them dirty" is exactly backwards from the application's point of view. If the error writing to persistent media really is unrecoverable then it's all the more critical that the pages be kept so the data can be copied to some other device. The last thing user space expects to happen is if the data can't be written to persistent storage then also immediately delete it from RAM. (And the really last thing user space expects is for this to happen and return no error.)

Right. This implies, though, that apart from the kernel having to keep around the dirtied-but-unrecoverable pages for an unbounded time, there would further need to be an interface for obtaining the exact failed pages so that you can read them back. This in turn means that there needs to be an association between the fsync() caller and the specific dirtied pages that the caller intends to drain (for which we'd need an fsync_range(), among other things). BTW, currently the failed writebacks are not dropped from memory, but rather marked clean. They could be lost though due to memory pressure or due to explicit request (e.g. proc drop_caches), unless mlocked.

There is a clear responsibility of the application to keep its buffers around until a successful fsync(). The kernels do report the error (albeit with all the complexities of dealing with the interface), at which point the application may not assume that the write()s were ever even buffered in the kernel page cache in the first place.
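
A minimal sketch of that contract from the application's side (not PostgreSQL code; the names, path, and retry limit are made up):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Keep our own copy of the data until fsync() reports success, and re-write
 * from that copy if it doesn't. */
static int durable_write(int fd, const void *buf, size_t len)
{
    for (int attempt = 0; attempt < 3; attempt++) {
        if (pwrite(fd, buf, len, 0) != (ssize_t) len)
            continue;          /* buf is still ours; try again            */
        if (fsync(fd) == 0)
            return 0;          /* only now is it safe to discard buf      */
        /* fsync() failed. The dirty pages may already have been marked
         * clean, so a bare fsync() retry could falsely succeed; because we
         * still hold buf we can re-dirty the pages and try again instead. */
    }
    return -1;                 /* caller still owns buf and can save it elsewhere */
}

int main(void)
{
    int fd = open("some_file", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return 1;
    const char data[] = "must not be lost";
    if (durable_write(fd, data, sizeof data - 1) != 0)
        fprintf(stderr, "write not durable; data still held in memory\n");
    close(fd);
    return 0;
}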

What you seem to be asking for is the capability of dropping buffers over the (kernel) fence and indemnifying the application from any further responsibility, i.e. a hard assurance that either the kernel will persist the pages or it will keep them around till the application recovers them asynchronously, the filesystem is unmounted, or the system is rebooted.

[1] https://www.usenix.org/legacy/event/fast08/tech/full_papers/gunawi/gunawi.pdf


From:Geoff Winkless <pgsqladmin(at)geoff(dot)dj>
Date:2018-04-09 12:03:28

On 9 April 2018 at 11:50, Anthony Iliopoulos wrote:

What you seem to be asking for is the capability of dropping buffers over the (kernel) fence and indemnifying the application from any further responsibility, i.e. a hard assurance that either the kernel will persist the pages or it will keep them around till the application recovers them asynchronously, the filesystem is unmounted, or the system is rebooted.

That seems like a perfectly reasonable position to take, frankly.

The whole point of an Operating System should be that you can do exactly that. As a developer I should be able to call write() and fsync() and know that if both calls have succeeded then the result is on disk, no matter what another application has done in the meantime. If that's a "difficult" problem then that's the OS's problem, not mine. If the OS doesn't do that, it's _not_ doing its job.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-09 12:16:38

On 9 April 2018 at 18:50, Anthony Iliopoulos wrote:

There is a clear responsibility of the application to keep its buffers around until a successful fsync(). The kernels do report the error (albeit with all the complexities of dealing with the interface), at which point the application may not assume that the write()s were ever even buffered in the kernel page cache in the first place.

What you seem to be asking for is the capability of dropping buffers over the (kernel) fence and indemnifying the application from any further responsibility, i.e. a hard assurance that either the kernel will persist the pages or it will keep them around till the application recovers them asynchronously, the filesystem is unmounted, or the system is rebooted.

That's what Pg appears to assume now, yes.

Whether that's reasonable is a whole different topic.

I'd like a middle ground where the kernel lets us register our interest and tells us if it lost something, without us having to keep eight million FDs open for some long period. "Tell us about anything that happens under pgdata/" or an inotify-style per-directory-registration option. I'd even say that's ideal.

In the mean time, I propose that we fsync() on close() before we age FDs out of the LRU on backends. Yes, that will hurt throughput and cause stalls, but we don't seem to have many better options. At least it'll only flush what we actually wrote to the OS buffers not what we may have in shared_buffers. If the bgwriter does the same thing, we should be 100% safe from this problem on 4.13+, and it'd be trivial to make it a GUC much like the fsync or full_page_writes options that people can turn off if they know the risks / know their storage is safe / don't care.
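
A sketch of that idea (a hypothetical helper, not PostgreSQL's actual md.c code; the file name is made up and abort() stands in for PANIC):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Before the backend's fd cache lets go of a descriptor, fsync() it so that
 * any writeback error for data this process wrote is reported to this
 * process, rather than being missed by whoever reopens the file later. */
static void close_cached_fd(int fd, const char *path)
{
    if (fsync(fd) != 0) {
        fprintf(stderr, "PANIC: fsync of %s failed on close\n", path);
        abort();
    }
    close(fd);
}

int main(void)
{
    int fd = open("some_relation_segment", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return 1;
    close_cached_fd(fd, "some_relation_segment");
    return 0;
}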

Some keen person could later optimise it by adding an fsync worker thread pool in backends, so we don't block the main thread. Frankly that might be a nice thing to have in the checkpointer anyway. But it's out of scope for fixing this in durability terms.

I'm partway through a patch that makes fsync panic on errors now. Once that's done, the next step will be to force fsync on close() in md and see how we go with that.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-09 12:31:27

On Mon, Apr 09, 2018 at 01:03:28PM +0100, Geoff Winkless wrote:

On 9 April 2018 at 11:50, Anthony Iliopoulos wrote:

What you seem to be asking for is the capability of dropping buffers over the (kernel) fence and indemnifying the application from any further responsibility, i.e. a hard assurance that either the kernel will persist the pages or it will keep them around till the application recovers them asynchronously, the filesystem is unmounted, or the system is rebooted.

That seems like a perfectly reasonable position to take, frankly.

Indeed, as long as you are willing to ignore the consequences of this design decision: mainly, how you would recover memory when no application is interested in clearing the error. At which point other applications with different priorities will find this position rather unreasonable since there can be no way out of it for them. Good luck convincing any OS kernel upstream to go with this design.

The whole point of an Operating System should be that you can do exactly that. As a developer I should be able to call write() and fsync() and know that if both calls have succeeded then the result is on disk, no matter what another application has done in the meantime. If that's a "difficult" problem then that's the OS's problem, not mine. If the OS doesn't do that, it's _not_ doing its job.

No OS kernel that I know of provides any promises for atomicity of a write()+fsync() sequence, unless one is using O_SYNC. It doesn't provide you with isolation either, as this is delegated to userspace, where processes that share a file should coordinate accordingly.

It's not a difficult problem, but rather the kernels provide a common denominator of possible interfaces and designs that could accommodate a wider range of potential application scenarios for which the kernel cannot possibly anticipate requirements. There have been plenty of experimental works for providing a transactional (ACID) filesystem interface to applications. On the opposite end, there have been quite a few commercial databases that completely bypass the kernel storage stack. But I would assume it is reasonable to figure out something between those two extremes that can work in a "portable" fashion.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-09 12:54:16

On Mon, Apr 09, 2018 at 08:16:38PM +0800, Craig Ringer wrote:

I'd like a middle ground where the kernel lets us register our interest and tells us if it lost something, without us having to keep eight million FDs open for some long period. "Tell us about anything that happens under pgdata/" or an inotify-style per-directory-registration option. I'd even say that's ideal.

I see what you are saying. So basically you'd always maintain the notification descriptor open, where the kernel would inject events related to writeback failures of files under watch (potentially enriched to contain info regarding the exact failed pages and the file offset they map to). The kernel wouldn't even have to maintain per-page bits to trace the errors, since they will be consumed by the process that reads the events (or discarded, when the notification fd is closed).

Assuming this would be possible, wouldn't Pg still need to deal with synchronizing writers and related issues (since this would be merely a notification mechanism, not something that prevents any process from continuing), which I understand would be rather intrusive for the current Pg multi-process design?

But other than that, this interface could in principle be implemented similarly in the BSDs via kqueue(), I suppose, to provide what you need.


From:Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Date:2018-04-09 13:33:18

On 04/09/2018 02:31 PM, Anthony Iliopoulos wrote:

On Mon, Apr 09, 2018 at 01:03:28PM +0100, Geoff Winkless wrote:

On 9 April 2018 at 11:50, Anthony Iliopoulos wrote:

What you seem to be asking for is the capability of dropping buffers over the (kernel) fence and indemnifying the application from any further responsibility, i.e. a hard assurance that either the kernel will persist the pages or it will keep them around till the application recovers them asynchronously, the filesystem is unmounted, or the system is rebooted.

That seems like a perfectly reasonable position to take, frankly.

Indeed, as long as you are willing to ignore the consequences of this design decision: mainly, how you would recover memory when no application is interested in clearing the error. At which point other applications with different priorities will find this position rather unreasonable since there can be no way out of it for them.

Sure, but the question is whether the system can reasonably operate after some of the writes failed and the data got lost. Because if it can't, then recovering the memory is rather useless. It might be better to stop the system in that case, forcing the system administrator to resolve the issue somehow (fail-over to a replica, perform recovery from the last checkpoint, ...).

We already have dirty_bytes and dirty_background_bytes, for example. I don't see why there couldn't be another limit defining how much dirty data to allow before blocking writes altogether. I'm sure it's not that simple, but you get the general idea - do not allow using all available memory because of writeback issues, but don't throw the data away in case it's just a temporary issue.

Good luck convincing any OS kernel upstream to go with this design.

Well, there seem to be kernels that seem to do exactly that already. At least that's how I understand what this thread says about FreeBSD and Illumos, for example. So it's not an entirely insane design, apparently.

The question is whether the current design makes it any easier for user-space developers to build reliable systems. We have tried using it, and unfortunately the answers seems to be "no" and "Use direct I/O and manage everything on your own!"

The whole point of an Operating System should be that you can do exactly that. As a developer I should be able to call write() and fsync() and know that if both calls have succeeded then the result is on disk, no matter what another application has done in the meantime. If that's a "difficult" problem then that's the OS's problem, not mine. If the OS doesn't do that, it's _not_ doing its job.

No OS kernel that I know of provides any promises for atomicity of a write()+fsync() sequence, unless one is using O_SYNC. It doesn't provide you with isolation either, as this is delegated to userspace, where processes that share a file should coordinate accordingly.

We can (and do) take care of the atomicity and isolation. Implementation of those parts is obviously very application-specific, and we have WAL and locks for that purpose. I/O on the other hand seems to be a generic service provided by the OS - at least that's how we saw it until now.

It's not a difficult problem, but rather the kernels provide a common denominator of possible interfaces and designs that could accommodate a wider range of potential application scenarios for which the kernel cannot possibly anticipate requirements. There have been plenty of experimental works for providing a transactional (ACID) filesystem interface to applications. On the opposite end, there have been quite a few commercial databases that completely bypass the kernel storage stack. But I would assume it is reasonable to figure out something between those two extremes that can work in a "portable" fashion.

Users ask us about this quite often, actually. The question is usually about "RAW devices" and performance, but ultimately it boils down to buffered vs. direct I/O. So far our answer was we rely on kernel to do this reliably, because they know how to do that correctly and we simply don't have the manpower to implement it (portable, reliable, handling different types of storage, ...).

One has to wonder how many applications actually use this correctly, considering PostgreSQL cares about data durability/consistency so much and yet we've been misunderstanding how it works for 20+ years.
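
As context for the knobs Tomas mentions, dirty_bytes and dirty_background_bytes are ordinary sysctl files on Linux (a value of 0 means the corresponding *_ratio setting is in effect instead). A minimal sketch for inspecting them, not code from the thread:

    #include <stdio.h>

    static long read_knob(const char *path)
    {
        FILE *f = fopen(path, "r");
        long v = -1;
        if (f != NULL) {
            if (fscanf(f, "%ld", &v) != 1)
                v = -1;
            fclose(f);
        }
        return v;
    }

    int main(void)
    {
        /* 0 means the kernel is using dirty_ratio / dirty_background_ratio instead. */
        printf("vm.dirty_bytes            = %ld\n", read_knob("/proc/sys/vm/dirty_bytes"));
        printf("vm.dirty_background_bytes = %ld\n", read_knob("/proc/sys/vm/dirty_background_bytes"));
        return 0;
    }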


From:Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Date:2018-04-09 13:42:35

On 04/09/2018 12:29 AM, Bruce Momjian wrote:

A crazy idea would be to have a daemon that checks the logs and stops Postgres when it seems something is wrong.

That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how that would work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem), so parsing them reliably seems rather difficult. And we probably don't want to PANIC after an I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.


From:Abhijit Menon-Sen <ams(at)2ndQuadrant(dot)com>
Date:2018-04-09 13:47:03

At 2018-04-09 15:42:35 +0200, tomas(dot)vondra(at)2ndquadrant(dot)com wrote:

On 04/09/2018 12:29 AM, Bruce Momjian wrote:

A crazy idea would be to have a daemon that checks the logs and stops Postgres when it seems something is wrong.

That doesn't seem like a very practical way.

Not least because Craig's tests showed that you can't rely on always getting an error message in the logs.


From:Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Date:2018-04-09 13:54:19

On 04/09/2018 04:00 AM, Craig Ringer wrote:

On 9 April 2018 at 07:16, Andres Freund <andres(at)anarazel(dot)de

I think the danger presented here is far smaller than some of the statements in this thread might make one think.

Clearly it's not happening a huge amount or we'd have a lot of noise about Pg eating people's data, people shouting about how unreliable it is, etc. We don't. So it's not some earth shattering imminent threat to everyone's data. It's gone unnoticed, or the root cause unidentified, for a long time.

Yeah, it clearly isn't the case that everything we do suddenly got pointless. It's fairly annoying, though.

I suspect we've written off a fair few issues in the past as "it's bad hardware" when actually the hardware fault was the trigger for a Pg/kernel interaction bug. And blamed containers for things that weren't really the container's fault. But even so, if it were happening tons, we'd hear more noise.

Right. Write errors are fairly rare, and we've probably ignored a fair number of cases demonstrating this issue. It kinda reminds me of the wisdom that not seeing planes with bullet holes in the engine does not mean engines don't need armor [1].

[1] https://medium.com/@penguinpress/an-excerpt-from-how-not-to-be-wrong-by-jordan-ellenberg-664e708cfc3d


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-09 14:22:06

On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:

We already have dirty_bytes and dirty_background_bytes, for example. I don't see why there couldn't be another limit defining how much dirty data to allow before blocking writes altogether. I'm sure it's not that simple, but you get the general idea - do not allow using all available memory because of writeback issues, but don't throw the data away in case it's just a temporary issue.

Sure, there could be knobs for limiting how much memory such "zombie" pages may occupy. Not sure how helpful it would be in the long run since this tends to be highly application-specific, and for something with a large data footprint one would end up tuning this accordingly in a system-wide manner. This has the potential to leave other applications running in the same system with very little memory, in cases where for example original application crashes and never clears the error. Apart from that, further interfaces would need to be provided for actually dealing with the error (again assuming non-transient issues that may not be fixed transparently and that temporary issues are taken care of by lower layers of the stack).

Well, there seem to be kernels that seem to do exactly that already. At least that's how I understand what this thread says about FreeBSD and Illumos, for example. So it's not an entirely insane design, apparently.

It is reasonable, but even FreeBSD has a big fat comment right there (since 2017), mentioning that there can be no recovery from EIO at the block layer and this needs to be done differently. No idea how an application running on top of either FreeBSD or Illumos would actually recover from this error (and clear it out), other than remounting the fs in order to force dropping of the relevant pages. It does, though, provide a persistent error indication that would allow Pg to simply and reliably panic. But again this does not necessarily play well with other applications that may be using the filesystem reliably at the same time, and are now faced with EIO even though their own writes were successfully persisted.

Ideally, you'd want a (potentially persistent) indication of error localized to a file region (mapping the corresponding failed writeback pages). NetBSD is already implementing fsync_ranges(), which could be a step in the right direction.

One has to wonder how many applications actually use this correctly, considering PostgreSQL cares about data durability/consistency so much and yet we've been misunderstanding how it works for 20+ years.

I would expect it would be very few, potentially those that have a very simple process model (e.g. embedded DBs that can abort a txn on fsync() EIO). I think that durability is a rather complex cross-layer issue which has been grossly misunderstood similarly in the past (e.g. see [1]). It seems that both the OS and DB communities greatly benefit from a periodic reality check, and I see this as an opportunity for strengthening the IO stack in an end-to-end manner.

[1] https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf
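
On Linux, the closest existing relative of the range-based interface mentioned above is sync_file_range(2), which starts or waits for writeback of a byte range but, per its man page, makes no durability guarantees and is not a replacement for fsync(). A minimal sketch of how it's called (the helper and the 8 kB range are illustrative):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>

    /* Kick off and wait for writeback of the first 8 kB of an open fd.
     * This is NOT a durability barrier: device caches are not flushed
     * and error reporting is weaker than fsync(). */
    static int writeback_range(int fd)
    {
        unsigned int flags = SYNC_FILE_RANGE_WAIT_BEFORE |
                             SYNC_FILE_RANGE_WRITE |
                             SYNC_FILE_RANGE_WAIT_AFTER;
        if (sync_file_range(fd, 0, 8192, flags) != 0) {
            perror("sync_file_range");
            return -1;
        }
        return 0;
    }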


From:Greg Stark <stark(at)mit(dot)edu>
Date:2018-04-09 15:29:36

On 9 April 2018 at 15:22, Anthony Iliopoulos wrote:

On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:

Sure, there could be knobs for limiting how much memory such "zombie" pages may occupy. Not sure how helpful it would be in the long run since this tends to be highly application-specific, and for something with a large data footprint one would end up tuning this accordingly in a system-wide manner.

Surely this is exactly what the kernel is there to manage. It has to control how much memory is allowed to be full of dirty buffers in the first place to ensure that the system won't get memory starved if it can't clean them fast enough. That isn't even about persistent hardware errors. Even when the hardware is working perfectly it can only flush buffers so fast. The whole point of the kernel is to abstract away shared resources. It's not like user space has any better view of the situation here. If Postgres implemented all this in DIRECT_IO it would have exactly the same problem only with less visibility into what the rest of the system is doing. If every application implemented its own buffer cache we would be back in the same boat only with a fragmented memory allocation.

This has the potential to leave other applications running in the same system with very little memory, in cases where for example original application crashes and never clears the error.

I still think we're speaking two different languages. There's no application anywhere that's going to "clear the error". The application has done the writes and if it's calling fsync it wants to wait until the filesystem can arrange for the write to be persisted. If the application could manage without the persistence then it wouldn't have called fsync.

The only way to "clear out" the error would be by having the writes succeed. There's no reason to think that wouldn't be possible sometime. The filesystem could remap blocks or an administrator could replace degraded raid device components. The only thing Postgres could do to recover would be create a new file and move the data (reading from the dirty buffer in memory!) to a new file anyways so we would "clear the error" by just no longer calling fsync on the old file.

We always read fsync as a simple write barrier. That's what the documentation promised and it's what Postgres always expected. It sounds like the kernel implementors looked at it as some kind of communication channel to communicate status reports for specific writes back to user-space. That's a much more complex problem and would have an entirely different interface. I think this is why we're having so much difficulty communicating.

It is reasonable, but even FreeBSD has a big fat comment right there (since 2017), mentioning that there can be no recovery from EIO at the block layer and this needs to be done differently. No idea how an application running on top of either FreeBSD or Illumos would actually recover from this error (and clear it out), other than remounting the fs in order to force dropping of the relevant pages. It does, though, provide a persistent error indication that would allow Pg to simply and reliably panic. But again this does not necessarily play well with other applications that may be using the filesystem reliably at the same time, and are now faced with EIO even though their own writes were successfully persisted.

Well if they're writing to the same file that had a previous error I doubt there are many applications that would be happy to consider their writes "persisted" when the file was corrupt. Ironically the earlier discussion quoted talked about how applications that wanted more granular communication would be using O_DIRECT -- but what we have is fsync trying to be too granular such that it's impossible to get any strong guarantees about anything with it.

One has to wonder how many applications actually use this correctly, considering PostgreSQL cares about data durability/consistency so much and yet we've been misunderstanding how it works for 20+ years.

I would expect it would be very few, potentially those that have a very simple process model (e.g. embedded DBs that can abort a txn on fsync() EIO).

Honestly I don't think there's any way to use the current interface to implement reliable operation. Even that embedded database using a single process and keeping every file open all the time (which means file descriptor limits limit its scalability) can end up with silent corruption whenever some other process like a backup program comes along and calls fsync (or even sync?).


From:Robert Haas <robertmhaas(at)gmail(dot)com>
Date:2018-04-09 16:45:00

On Mon, Apr 9, 2018 at 8:16 AM, Craig Ringer wrote:

In the mean time, I propose that we fsync() on close() before we age FDs out of the LRU on backends. Yes, that will hurt throughput and cause stalls, but we don't seem to have many better options. At least it'll only flush what we actually wrote to the OS buffers not what we may have in shared_buffers. If the bgwriter does the same thing, we should be 100% safe from this problem on 4.13+, and it'd be trivial to make it a GUC much like the fsync or full_page_writes options that people can turn off if they know the risks / know their storage is safe / don't care.

Ouch. If a process exits -- say, because the user typed \q into psql -- then you're talking about potentially calling fsync() on a really large number of file descriptors, flushing many gigabytes of data to disk. And it may well be that you never actually wrote any data to any of those file descriptors -- those writes could have come from other backends. Or you may have written a little bit of data through those FDs, but there could be lots of other data that you end up flushing incidentally. Perfectly innocuous things like starting up a backend, running a few short queries, and then having that backend exit suddenly turn into something that could have a massive system-wide performance impact.

Also, if a backend ever manages to exit without running through this code, or writes any dirty blocks afterward, then this still fails to fix the problem completely. I guess that's probably avoidable -- we can put this late in the shutdown sequence and PANIC if it fails.

I have a really tough time believing this is the right way to solve the problem. We suffered for years because of ext3's desire to flush the entire page cache whenever any single file was fsync()'d, which was terrible. Eventually ext4 became the norm, and the problem went away. Now we're going to deliberately insert logic to do a very similar kind of terrible thing because the kernel developers have decided that fsync() doesn't have to do what it says on the tin? I grant that there doesn't seem to be a better option, but I bet we're going to have a lot of really unhappy users if we do this.
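
For concreteness, the fsync-before-close behavior Craig proposes boils down to a wrapper along these lines (a sketch, not PostgreSQL code; abort() stands in for a PANIC):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Flush a descriptor before letting go of it, and treat a writeback
     * error as fatal instead of silently dropping it. */
    static void fsync_then_close(int fd)
    {
        if (fsync(fd) != 0) {
            perror("fsync");
            abort();            /* stand-in for PostgreSQL's PANIC */
        }
        if (close(fd) != 0)
            perror("close");
    }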


From:"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Date:2018-04-09 17:26:24

On 04/09/2018 09:45 AM, Robert Haas wrote:

On Mon, Apr 9, 2018 at 8:16 AM, Craig Ringer wrote:

In the mean time, I propose that we fsync() on close() before we age FDs out of the LRU on backends. Yes, that will hurt throughput and cause stalls, but we don't seem to have many better options. At least it'll only flush what we actually wrote to the OS buffers not what we may have in shared_buffers. If the bgwriter does the same thing, we should be 100% safe from this problem on 4.13+, and it'd be trivial to make it a GUC much like the fsync or full_page_writes options that people can turn off if they know the risks / know their storage is safe / don't care.

I have a really tough time believing this is the right way to solve the problem. We suffered for years because of ext3's desire to flush the entire page cache whenever any single file was fsync()'d, which was terrible. Eventually ext4 became the norm, and the problem went away. Now we're going to deliberately insert logic to do a very similar kind of terrible thing because the kernel developers have decided that fsync() doesn't have to do what it says on the tin? I grant that there doesn't seem to be a better option, but I bet we're going to have a lot of really unhappy users if we do this.

I don't have a better option but whatever we do, it should be an optional (GUC) change. We have plenty of YEARS of people not noticing this issue and Robert's correct, if we go back to an era of things like stalls it is going to look bad on us no matter how we describe the problem.


From:Gasper Zejn <zejn(at)owca(dot)info>
Date:2018-04-09 18:02:21

On 09. 04. 2018 15:42, Tomas Vondra wrote:

On 04/09/2018 12:29 AM, Bruce Momjian wrote:

A crazy idea would be to have a daemon that checks the logs and stops Postgres when it seems something is wrong.

That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how that would work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem), so parsing them reliably seems rather difficult. And we probably don't want to PANIC after an I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.


For a bit less (or more) crazy idea, I'd imagine creating a Linux kernel module with kprobe/kretprobe capturing the file passed to fsync or even byte range within file and corresponding return value shouldn't be that hard. Kprobe has been a part of Linux kernel for a really long time, and from first glance it seems like it could be backported to 2.6 too.

Then you could have stable log messages or implement some kind of "fsync error log notification" via whatever is the most sane way to get this out of kernel.

If the kernel is new enough and has eBPF support (seems like >=4.4), using bcc-tools[1] should enable you to write a quick script to get exactly that info via perf events[2].

Obviously, that's a stopgap solution ...

[1] https://github.com/iovisor/bcc [2] https://blog.yadutaf.fr/2016/03/30/turn-any-syscall-into-event-introducing-ebpf-kernel-probes/


From:Mark Dilger <hornschnorter(at)gmail(dot)com>
Date:2018-04-09 18:29:42

On Apr 9, 2018, at 10:26 AM, Joshua D. Drake wrote:

We have plenty of YEARS of people not noticing this issue

I disagree. I have noticed this problem, but blamed it on other things. For over five years now, I have had to tell customers not to use thin provisioning, and I have had to add code to postgres to refuse to perform inserts or updates if the disk volume is more than 80% full. I have lost count of the number of customers who are running an older version of the product (because they refuse to upgrade) and come back with complaints that they ran out of disk and now their database is corrupt. All this time, I have been blaming this on virtualization and thin provisioning.


From:Robert Haas <robertmhaas(at)gmail(dot)com>
Date:2018-04-09 19:02:11

On Mon, Apr 9, 2018 at 12:45 PM, Robert Haas wrote:

Ouch. If a process exits -- say, because the user typed \q into psql -- then you're talking about potentially calling fsync() on a really large number of file descriptors, flushing many gigabytes of data to disk. And it may well be that you never actually wrote any data to any of those file descriptors -- those writes could have come from other backends. Or you may have written a little bit of data through those FDs, but there could be lots of other data that you end up flushing incidentally. Perfectly innocuous things like starting up a backend, running a few short queries, and then having that backend exit suddenly turn into something that could have a massive system-wide performance impact.

Also, if a backend ever manages to exit without running through this code, or writes any dirty blocks afterward, then this still fails to fix the problem completely. I guess that's probably avoidable -- we can put this late in the shutdown sequence and PANIC if it fails.

I have a really tough time believing this is the right way to solve the problem. We suffered for years because of ext3's desire to flush the entire page cache whenever any single file was fsync()'d, which was terrible. Eventually ext4 became the norm, and the problem went away. Now we're going to deliberately insert logic to do a very similar kind of terrible thing because the kernel developers have decided that fsync() doesn't have to do what it says on the tin? I grant that there doesn't seem to be a better option, but I bet we're going to have a lot of really unhappy users if we do this.

What about the bug we fixed in https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=2ce439f3379aed857517c8ce207485655000fc8e ? Say somebody does something along the lines of:

ps uxww | grep postgres | grep -v grep | awk '{print $2}' | xargs kill -9

...and then restarts postgres. Craig's proposal wouldn't cover this case, because there was no opportunity to run fsync() after the first crash, and there's now no way to go back and fsync() any stuff we didn't fsync() before, because the kernel may have already thrown away the error state, or may lie to us and tell us everything is fine (because our new fd wasn't opened early enough). I can't find the original discussion that led to that commit right now, so I'm not exactly sure what scenarios we were thinking about. But I think it would at least be a problem if full_page_writes=off or if you had previously started the server with fsync=off and now wish to switch to fsync=on after completing a bulk load or similar. Recovery can read a page, see that it looks OK, and continue, and then a later fsync() failure can revert that page to an earlier state and now your database is corrupted -- and there's absolutely no way to detect this because write() gives you the new page contents later, fsync() doesn't feel obliged to tell you about the error because your fd wasn't opened early enough, and eventually the write can be discarded and you'll revert back to the old page version with no errors ever being reported anywhere.

Another consequence of this behavior is that initdb -S is never reliable, so pg_rewind's use of it doesn't actually fix the problem it was intended to solve. It also means that initdb itself isn't crash-safe, since the data file changes are made by the backend but initdb itself is doing the fsyncs, and initdb has no way of knowing what files the backend is going to create and therefore can't -- even theoretically -- open them first.

What's being presented to us as the API contract that we should expect from buffered I/O is that if you open a file and read() from it, call fsync(), and get no error, the kernel may nevertheless decide that some previous write that it never managed to flush can't be flushed, and then revert the page to the contents it had at some point in the past. That's more or less equivalent to letting a malicious adversary randomly overwrite database pages with plausible-looking but incorrect contents without notice and hoping you can still build a reliable system. You can avoid the problem if you can always open an fd for every file you want to modify before it's written and hold on to it until after it's fsync'd, but that's pretty hard to guarantee in the face of kill -9.

I think the simplest technological solution to this problem is to rewrite the entire backend and all supporting processes to use O_DIRECT everywhere. To maintain adequate performance, we'll have to write a complete I/O scheduling system inside PostgreSQL. Also, since we'll now have to make shared_buffers much larger -- since we'll no longer be benefiting from the OS cache -- we'll need to replace the use of malloc() with an allocator that pulls from shared_buffers. Plus, as noted, we'll need to totally rearchitect several of our critical frontend tools. Let's freeze all other development for the next year while we work on that, and put out a notice that Linux is no longer a supported platform for any existing release. Before we do that, we might want to check whether fsync() actually writes the data to disk in a usable way even with O_DIRECT. If not, we should just de-support Linux entirely as a hopelessly broken and unsupportable platform.
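
One reason the O_DIRECT-everywhere route floated above is so invasive shows up even in a toy example: the buffer, file offset, and transfer size generally all have to be aligned to the storage's logical block size (4096 below is an assumption; the real value is device-dependent), and an fsync() is still needed for metadata and device caches. A minimal sketch, not thread code:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("scratch.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) {   /* O_DIRECT wants aligned buffers */
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }
        memset(buf, 'x', 4096);

        if (write(fd, buf, 4096) != 4096) { perror("write"); return 1; }
        if (fsync(fd) != 0) { perror("fsync"); return 1; } /* still needed for metadata / device cache */

        free(buf);
        return close(fd) != 0;
    }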


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-09 19:13:14

Hi,

On 2018-04-09 15:02:11 -0400, Robert Haas wrote:

I think the simplest technological solution to this problem is to rewrite the entire backend and all supporting processes to use O_DIRECT everywhere. To maintain adequate performance, we'll have to write a complete I/O scheduling system inside PostgreSQL. Also, since we'll now have to make shared_buffers much larger -- since we'll no longer be benefiting from the OS cache -- we'll need to replace the use of malloc() with an allocator that pulls from shared_buffers. Plus, as noted, we'll need to totally rearchitect several of our critical frontend tools. Let's freeze all other development for the next year while we work on that, and put out a notice that Linux is no longer a supported platform for any existing release. Before we do that, we might want to check whether fsync() actually writes the data to disk in a usable way even with O_DIRECT. If not, we should just de-support Linux entirely as a hopelessly broken and unsupportable platform.

Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as are some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.

We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.


From:Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Date:2018-04-09 19:22:58

On 04/09/2018 08:29 PM, Mark Dilger wrote:

On Apr 9, 2018, at 10:26 AM, Joshua D. Drake wrote: We have plenty of YEARS of people not noticing this issue

I disagree. I have noticed this problem, but blamed it on other things. For over five years now, I have had to tell customers not to use thin provisioning, and I have had to add code to postgres to refuse to perform inserts or updates if the disk volume is more than 80% full. I have lost count of the number of customers who are running an older version of the product (because they refuse to upgrade) and come back with complaints that they ran out of disk and now their database is corrupt. All this time, I have been blaming this on virtualization and thin provisioning.

Yeah. There's a big difference between not noticing an issue because it does not happen very often vs. attributing it to something else. If we had the ability to revisit past data corruption cases, we would probably discover a fair number of cases caused by this.

The other thing we probably need to acknowledge is that the environment changes significantly - things like thin provisioning are likely to get even more common, increasing the incidence of these issues.


From:Peter Geoghegan <pg(at)bowt(dot)ie>
Date:2018-04-09 19:25:33

On Mon, Apr 9, 2018 at 12:13 PM, Andres Freund wrote:

Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as are some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.

+1

We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.

Right. We seem to be implicitly assuming that there is a big difference between a problem in the storage layer that we could in principle detect, but don't, and any other problem in the storage layer. I've read articles claiming that technologies like SMART are not really reliable in a practical sense [1], so it seems to me that there is reason to doubt that this gap is all that big.

That said, I suspect that the problems with running out of disk space are serious practical problems. I have personally scoffed at stories involving Postgres database corruption that gets attributed to running out of disk space. Looks like I was dead wrong.

[1] https://danluu.com/file-consistency/ -- "Filesystem correctness"


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-09 19:26:21

On Mon, Apr 09, 2018 at 04:29:36PM +0100, Greg Stark wrote:

Honestly I don't think there's any way to use the current interface to implement reliable operation. Even that embedded database using a single process and keeping every file open all the time (which means file descriptor limits limit its scalability) can end up with silent corruption whenever some other process like a backup program comes along and calls fsync (or even sync?).

That is indeed true (sync would induce fsync on open inodes and clear the error), and that's a nasty bug that apparently went unnoticed for a very long time. Hopefully the errseq_t linux 4.13 fixes deal with at least this issue, but similar fixes need to be adopted by many other kernels (all those that mark failed pages as clean).

I honestly do not expect that keeping around the failed pages will be an acceptable change for most kernels, and as such the recommendation will probably be to coordinate in userspace for the fsync().

What about having buffered IO with implied fsync() atomicity via O_SYNC? This would probably necessitate some helper threads that mask the latency and present an async interface to the rest of PG, but sounds less intrusive than going for DIO.
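
For reference, the O_SYNC variant being discussed looks like this: each successful write() only returns once the data and required metadata have reached stable storage, so writeback errors surface at write() time rather than being deferred, which is also why the next reply objects to it on performance grounds. A minimal sketch, not thread code:

    #include <fcntl.h>
    #include <stdio.h>

    /* Every write() through this descriptor behaves as if followed by
     * an fsync() of the data just written. */
    int open_sync(const char *path)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
        if (fd < 0)
            perror("open(O_SYNC)");
        return fd;
    }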


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-09 19:29:16

On 2018-04-09 21:26:21 +0200, Anthony Iliopoulos wrote:

What about having buffered IO with implied fsync() atomicity via O_SYNC?

You're kidding, right? We could also just add sleep(30)'s all over the tree, and hope that that'll solve the problem. There's a reason we don't permanently fsync everything. Namely that it'll be way too slow.


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-09 19:37:03

On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos wrote:

I honestly do not expect that keeping around the failed pages will be an acceptable change for most kernels, and as such the recommendation will probably be to coordinate in userspace for the fsync().

Why is that required? You could very well just keep per-inode information around about fatal failures that occurred. Report errors until that bit is explicitly cleared. Yes, that keeps some memory around until unmount if nobody clears it. But it's orders of magnitude less, and results in usable semantics.


From:Justin Pryzby <pryzby(at)telsasoft(dot)com>
Date:2018-04-09 19:41:19

On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:

You could make the argument that it's OK to forget if the entire file system goes away. But actually, why is that ok?

I was going to say that it'd be okay to clear the error flag on umount, since any opened files would prevent unmounting; but then I realized we need to consider the case of close()ing all FDs and then opening them later... in another process.

I was going to say that's fine for postgres, since it chdir()s into its basedir, but actually not fine for nondefault tablespaces..

On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:

notification descriptor open, where the kernel would inject events related to writeback failures of files under watch (potentially enriched to contain info regarding the exact failed pages and the file offset they map to).

For postgres that'd require backend processes to open() a file such that, following its close(), any writeback errors are "signalled" to the checkpointer process...


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-09 19:44:31

On Mon, Apr 09, 2018 at 12:29:16PM -0700, Andres Freund wrote:

On 2018-04-09 21:26:21 +0200, Anthony Iliopoulos wrote:

What about having buffered IO with implied fsync() atomicity via O_SYNC?

You're kidding, right? We could also just add sleep(30)'s all over the tree, and hope that that'll solve the problem. There's a reason we don't permanently fsync everything. Namely that it'll be way too slow.

I am assuming you can apply the same principle of selectively using O_SYNC at times and places that you'd currently actually call fsync().

Also assuming that you'd want to have a backwards-compatible solution for all those kernels that don't keep the pages around, irrespective of future fixes. Short of loading a kernel module and dealing with the problem directly, the only other available options seem to be either O_SYNC, O_DIRECT or ignoring the issue.


From:Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Date:2018-04-09 19:47:44

On 04/09/2018 04:22 PM, Anthony Iliopoulos wrote:

On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:

We already have dirty_bytes and dirty_background_bytes, for example. I don't see why there couldn't be another limit defining how much dirty data to allow before blocking writes altogether. I'm sure it's not that simple, but you get the general idea - do not allow using all available memory because of writeback issues, but don't throw the data away in case it's just a temporary issue.

Sure, there could be knobs for limiting how much memory such "zombie" pages may occupy. Not sure how helpful it would be in the long run since this tends to be highly application-specific, and for something with a large data footprint one would end up tuning this accordingly in a system-wide manner. This has the potential to leave other applications running in the same system with very little memory, in cases where for example original application crashes and never clears the error. Apart from that, further interfaces would need to be provided for actually dealing with the error (again assuming non-transient issues that may not be fixed transparently and that temporary issues are taken care of by lower layers of the stack).

I don't quite see how this is any different from other possible issues when running multiple applications on the same system. One application can generate a lot of dirty data, reaching dirty_bytes and forcing the other applications on the same host to do synchronous writes.

Of course, you might argue that is a temporary condition - it will resolve itself once the dirty pages get written to storage. In case of an I/O issue, it is a permanent impact - it will not resolve itself unless the I/O problem gets fixed.

Not sure what interfaces would need to be written? Possibly something that says "drop dirty pages for these files" after the application gets killed or something. That makes sense, of course.

Well, there seem to be kernels that seem to do exactly that already. At least that's how I understand what this thread says about FreeBSD and Illumos, for example. So it's not an entirely insane design, apparently.

It is reasonable, but even FreeBSD has a big fat comment right there (since 2017), mentioning that there can be no recovery from EIO at the block layer and this needs to be done differently. No idea how an application running on top of either FreeBSD or Illumos would actually recover from this error (and clear it out), other than remounting the fs in order to force dropping of the relevant pages. It does, though, provide a persistent error indication that would allow Pg to simply and reliably panic. But again this does not necessarily play well with other applications that may be using the filesystem reliably at the same time, and are now faced with EIO even though their own writes were successfully persisted.

In my experience when you have a persistent I/O error on a device, it likely affects all applications using that device. So unmounting the fs to clear the dirty pages seems like an acceptable solution to me.

I don't see what else the application should do? In a way I'm suggesting applications don't really want to be responsible for recovering (cleanup of dirty pages etc.). We're more than happy to hand that over to the kernel, e.g. because each kernel will do that differently. What we however do want is reliable information about fsync outcome, which we need to properly manage WAL, checkpoints etc.

Ideally, you'd want a (potentially persistent) indication of error localized to a file region (mapping the corresponding failed writeback pages). NetBSD is already implementing fsync_ranges(), which could be a step in the right direction.

One has to wonder how many applications actually use this correctly, considering PostgreSQL cares about data durability/consistency so much and yet we've been misunderstanding how it works for 20+ years.

I would expect it would be very few, potentially those that have a very simple process model (e.g. embedded DBs that can abort a txn on fsync() EIO). I think that durability is a rather complex cross-layer issue which has been grossly misunderstood similarly in the past (e.g. see [1]). It seems that both the OS and DB communities greatly benefit from a periodic reality check, and I see this as an opportunity for strengthening the IO stack in an end-to-end manner.

Right. What I was getting to is that perhaps the current fsync() behavior is not very practical for building actual applications.


[1] https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf

Thanks. The paper looks interesting.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-09 19:51:12

On Mon, Apr 09, 2018 at 12:37:03PM -0700, Andres Freund wrote:

On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos wrote:

I honestly do not expect that keeping around the failed pages will be an acceptable change for most kernels, and as such the recommendation will probably be to coordinate in userspace for the fsync().

Why is that required? You could very well just keep per-inode information around about fatal failures that occurred. Report errors until that bit is explicitly cleared. Yes, that keeps some memory around until unmount if nobody clears it. But it's orders of magnitude less, and results in usable semantics.

As discussed before, I think this could be acceptable, especially if you pair it with an opt-in mechanism (only applications that care to deal with this will have to), and would give it a shot.

Still need a way to deal with all other systems and prior kernel releases that are eating fsync() writeback errors even over sync().


From:Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Date:2018-04-09 19:54:05

On 04/09/2018 09:37 PM, Andres Freund wrote:

On April 9, 2018 12:26:21 PM PDT, Anthony Iliopoulos wrote:

I honestly do not expect that keeping around the failed pages will be an acceptable change for most kernels, and as such the recommendation will probably be to coordinate in userspace for the fsync().

Why is that required? You could very well just keep per-inode information around about fatal failures that occurred. Report errors until that bit is explicitly cleared. Yes, that keeps some memory around until unmount if nobody clears it. But it's orders of magnitude less, and results in usable semantics.

Isn't the expectation that when a fsync call fails, the next one will retry writing the pages in the hope that it succeeds?

Of course, it's also possible to do what you suggested, and simply mark the inode as failed. In which case the next fsync can't possibly retry the writes (e.g. after freeing some space on a thin-provisioned system), but we'd get a reliable failure mode.


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-09 19:59:34

On 2018-04-09 14:41:19 -0500, Justin Pryzby wrote:

On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:

You could make the argument that it's OK to forget if the entire file system goes away. But actually, why is that ok?

I was going to say that it'd be okay to clear the error flag on umount, since any opened files would prevent unmounting; but then I realized we need to consider the case of close()ing all FDs and then opening them later... in another process.

On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:

notification descriptor open, where the kernel would inject events related to writeback failures of files under watch (potentially enriched to contain info regarding the exact failed pages and the file offset they map to).

For postgres that'd require backend processes to open() a file such that, following its close(), any writeback errors are "signalled" to the checkpointer process...

I don't think that's as hard as some people argued in this thread. We could very well open a pipe in postmaster with the write end open in each subprocess, and the read end open only in checkpointer (and postmaster, but unused there). Whenever closing a file descriptor that was dirtied in the current process, send it over the pipe to the checkpointer. The checkpointer then can receive all those file descriptors (making sure it's not above the fd limit, fsync()ing and close()ing to make room if necessary). The biggest complication would presumably be to deduplicate the received file descriptors for the same file, without losing track of any errors.

Even better, we could do so via a dedicated worker. That'd quite possibly end up as a performance benefit.

I was going to say that's fine for postgres, since it chdir()s into its basedir, but actually not fine for nondefault tablespaces..

I think it'd be fair to open PG_VERSION of all created tablespaces. Would require some hangups to signal checkpointer (or whichever process) to do so when creating one, but it shouldn't be too hard. Some people would complain because they can't do some nasty hacks anymore, but it'd also save people's butts by preventing them from accidentally unmounting.
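
Handing descriptors to the checkpointer the way Andres sketches is, in practice, done over a Unix-domain socket with an SCM_RIGHTS control message rather than a plain pipe. A minimal sender looks roughly like this (a sketch, not PostgreSQL code; the receiving side uses recvmsg() symmetrically):

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Send one open file descriptor to the process on the other end of a
     * Unix-domain socket (e.g. one half of socketpair(AF_UNIX, ...)). */
    static int send_fd(int sock, int fd_to_send)
    {
        char dummy = 'F';
        struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };

        union {
            struct cmsghdr hdr;
            char buf[CMSG_SPACE(sizeof(int))];
        } control;
        memset(&control, 0, sizeof(control));

        struct msghdr msg;
        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = control.buf;
        msg.msg_controllen = sizeof(control.buf);

        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

        if (sendmsg(sock, &msg, 0) < 0) {
            perror("sendmsg");
            return -1;
        }
        return 0;
    }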


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-09 20:04:20

Hi,

On 2018-04-09 21:54:05 +0200, Tomas Vondra wrote:

Isn't the expectation that when a fsync call fails, the next one will retry writing the pages in the hope that it succeeds?

Some people expect that, I personally don't think it's a useful expectation.

We should just deal with this by crash-recovery. The big problem I see is that you always need to keep a file descriptor open for pretty much any file written to inside and outside of postgres, to be guaranteed to see errors. And that'd solve that. Even if retrying would work, I'd advocate for that (I've done so in the past, and I've written code in pg that panics on fsync failure...).

What we'd need to do however is to clear that bit during crash recovery... Which is interesting from a policy perspective. Could be that other apps wouldn't want that.

I also wonder if we couldn't just somewhere read each relevant mounted filesystem's errseq value. Whenever checkpointer notices before finishing a checkpoint that it has changed, do a crash restart.


From:Mark Dilger <hornschnorter(at)gmail(dot)com>
Date:2018-04-09 20:25:54

On Apr 9, 2018, at 12:13 PM, Andres Freund wrote:

Hi,

On 2018-04-09 15:02:11 -0400, Robert Haas wrote:

I think the simplest technological solution to this problem is to rewrite the entire backend and all supporting processes to use O_DIRECT everywhere. To maintain adequate performance, we'll have to write a complete I/O scheduling system inside PostgreSQL. Also, since we'll now have to make shared_buffers much larger -- since we'll no longer be benefiting from the OS cache -- we'll need to replace the use of malloc() with an allocator that pulls from shared_buffers. Plus, as noted, we'll need to totally rearchitect several of our critical frontend tools. Let's freeze all other development for the next year while we work on that, and put out a notice that Linux is no longer a supported platform for any existing release. Before we do that, we might want to check whether fsync() actually writes the data to disk in a usable way even with O_DIRECT. If not, we should just de-support Linux entirely as a hopelessly broken and unsupportable platform.

Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as are some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.

We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.

I was reading this thread up until now as meaning that the standby could receive corrupt WAL data and become corrupted. That seems a much bigger problem than merely having the master become corrupted in some unrecoverable way. It is a long standing expectation that serious hardware problems on the master can result in the master needing to be replaced. But there has not been an expectation that the one or more standby servers would be taken down along with the master, leaving all copies of the database unusable. If this bug corrupts the standby servers, too, then it is a whole different class of problem than the one folks have come to expect.

Your comment reads as if this is a problem isolated to whichever server has the problem, and will not get propagated to other servers. Am I reading that right?

Can anybody clarify this for non-core-hacker folks following along at home?


From:Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Date:2018-04-09 20:30:00

On 04/09/2018 10:04 PM, Andres Freund wrote:

Hi,

On 2018-04-09 21:54:05 +0200, Tomas Vondra wrote:

Isn't the expectation that when a fsync call fails, the next one will retry writing the pages in the hope that it succeeds?

Some people expect that, I personally don't think it's a useful expectation.

Maybe. I'd certainly prefer automated recovery from temporary I/O issues (like a full disk on thin provisioning) without the database crashing and restarting. But I'm not sure it's worth the effort.

And most importantly, it's rather delusional to think the kernel developers are going to be enthusiastic about that approach ...

We should just deal with this by crash-recovery. The big problem I see is that you always need to keep a file descriptor open for pretty much any file written to inside and outside of postgres, to be guaranteed to see errors. And that'd solve that. Even if retrying would work, I'd advocate for that (I've done so in the past, and I've written code in pg that panics on fsync failure...).

Sure. And it's likely way less invasive from kernel perspective.

What we'd need to do however is to clear that bit during crash recovery... Which is interesting from a policy perspective. Could be that other apps wouldn't want that.

IMHO it'd be enough if a remount clears it.

I also wonder if we couldn't just somewhere read each relevant mounted filesystem's errseq value. Whenever checkpointer notices before finishing a checkpoint that it has changed, do a crash restart.

Hmmmm, that's an interesting idea, and it's about the only thing that would help us on older kernels. There's a wb_err in address_space, but that's at the inode level. Not sure if there's something at the fs level.


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-09 20:34:15

Hi,

On 2018-04-09 13:25:54 -0700, Mark Dilger wrote:

I was reading this thread up until now as meaning that the standby could receive corrupt WAL data and become corrupted.

I don't see that as a real problem here. For one the problematic scenarios shouldn't readily apply, for another WAL is checksummed.

There's the problem that a new basebackup would potentially become corrupted however. And similarly pg_rewind.

Note that I'm not saying that we and/or linux shouldn't change anything. Just that the apocalypse isn't here.

Your comment reads as if this is a problem isolated to whichever server has the problem, and will not get propagated to other servers. Am I reading that right?

I think that's basically right. There's cases where corruption could get propagated, but they're not straightforward.


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-09 20:37:31

Hi,

On 2018-04-09 22:30:00 +0200, Tomas Vondra wrote:

Maybe. I'd certainly prefer automated recovery from temporary I/O issues (like a full disk on thin provisioning) without the database crashing and restarting. But I'm not sure it's worth the effort.

Oh, I agree on that one. But that's more a question of how we force the kernel's hand on allocating disk space. In most cases the kernel allocates the disk space immediately, even if delayed allocation is in effect. For the cases where that's not the case (if there are current ones, rather than just past bugs), we should be able to make sure that's not an issue by pre-zeroing the data and/or using fallocate.
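
The pre-allocation Andres mentions can be done with posix_fallocate(), which (on most filesystems) reserves the blocks up front so that running out of space is reported at file-creation time instead of during background writeback. A minimal sketch, not PostgreSQL code; the helper name and segment size are illustrative:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Reserve space for a file segment at creation time so that running
     * out of disk shows up here, where it can be handled, rather than
     * later during background writeback. */
    static int create_preallocated(const char *path, off_t size)
    {
        int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd < 0) {
            perror("open");
            return -1;
        }
        int err = posix_fallocate(fd, 0, size);   /* returns an errno value, not -1 */
        if (err != 0) {
            fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
            close(fd);
            unlink(path);
            return -1;
        }
        return fd;
    }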


From:Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Date:2018-04-09 20:43:03

On 04/09/2018 10:25 PM, Mark Dilger wrote:

On Apr 9, 2018, at 12:13 PM, Andres Freund wrote:

Hi,

On 2018-04-09 15:02:11 -0400, Robert Haas wrote:

I think the simplest technological solution to this problem is to rewrite the entire backend and all supporting processes to use O_DIRECT everywhere. To maintain adequate performance, we'll have to write a complete I/O scheduling system inside PostgreSQL. Also, since we'll now have to make shared_buffers much larger -- since we'll no longer be benefiting from the OS cache -- we'll need to replace the use of malloc() with an allocator that pulls from shared_buffers. Plus, as noted, we'll need to totally rearchitect several of our critical frontend tools. Let's freeze all other development for the next year while we work on that, and put out a notice that Linux is no longer a supported platform for any existing release. Before we do that, we might want to check whether fsync() actually writes the data to disk in a usable way even with O_DIRECT. If not, we should just de-support Linux entirely as a hopelessly broken and unsupportable platform.

Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as are some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.

We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.

I was reading this thread up until now as meaning that the standby could receive corrupt WAL data and become corrupted. That seems a much bigger problem than merely having the master become corrupted in some unrecoverable way. It is a long standing expectation that serious hardware problems on the master can result in the master needing to be replaced. But there has not been an expectation that the one or more standby servers would be taken down along with the master, leaving all copies of the database unusable. If this bug corrupts the standby servers, too, then it is a whole different class of problem than the one folks have come to expect.

Your comment reads as if this is a problem isolated to whichever server has the problem, and will not get propagated to other servers. Am I reading that right?

Can anybody clarify this for non-core-hacker folks following along at home?

That's a good question. I don't see any guarantee it'd be isolated to the master node. Consider this example:

(0) checkpoint happens on the primary

(1) a page gets modified, a full-page gets written to WAL

(2) the page is written out to page cache

(3) writeback of that page fails (and gets discarded)

(4) we attempt to modify the page again, but we read the stale version

(5) we modify the stale version, writing the change to WAL

The standby will get the full-page, and then a WAL from the stale page version. That doesn't seem like a story with a happy end, I guess. But I might be easily missing some protection built into the WAL ...


From:Mark Dilger <hornschnorter(at)gmail(dot)com>
Date:2018-04-09 20:55:29

On Apr 9, 2018, at 1:43 PM, Tomas Vondra wrote:

On 04/09/2018 10:25 PM, Mark Dilger wrote:

On Apr 9, 2018, at 12:13 PM, Andres Freund wrote:

Hi,

On 2018-04-09 15:02:11 -0400, Robert Haas wrote:

I think the simplest technological solution to this problem is to rewrite the entire backend and all supporting processes to use O_DIRECT everywhere. To maintain adequate performance, we'll have to write a complete I/O scheduling system inside PostgreSQL. Also, since we'll now have to make shared_buffers much larger -- since we'll no longer be benefiting from the OS cache -- we'll need to replace the use of malloc() with an allocator that pulls from shared_buffers. Plus, as noted, we'll need to totally rearchitect several of our critical frontend tools. Let's freeze all other development for the next year while we work on that, and put out a notice that Linux is no longer a supported platform for any existing release. Before we do that, we might want to check whether fsync() actually writes the data to disk in a usable way even with O_DIRECT. If not, we should just de-support Linux entirely as a hopelessly broken and unsupportable platform.

Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as are some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.

We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.

I was reading this thread up until now as meaning that the standby could receive corrupt WAL data and become corrupted. That seems a much bigger problem than merely having the master become corrupted in some unrecoverable way. It is a long standing expectation that serious hardware problems on the master can result in the master needing to be replaced. But there has not been an expectation that the one or more standby servers would be taken down along with the master, leaving all copies of the database unusable. If this bug corrupts the standby servers, too, then it is a whole different class of problem than the one folks have come to expect.

Your comment reads as if this is a problem isolated to whichever server has the problem, and will not get propagated to other servers. Am I reading that right?

Can anybody clarify this for non-core-hacker folks following along at home?

That's a good question. I don't see any guarantee it'd be isolated to the master node. Consider this example:

(0) checkpoint happens on the primary

(1) a page gets modified, a full-page gets written to WAL

(2) the page is written out to page cache

(3) writeback of that page fails (and gets discarded)

(4) we attempt to modify the page again, but we read the stale version

(5) we modify the stale version, writing the change to WAL

The standby will get the full-page, and then a WAL from the stale page version. That doesn't seem like a story with a happy end, I guess. But I might be easily missing some protection built into the WAL ...

I can also imagine a master and standby that are similarly provisioned, and thus hit an out of disk error at around the same time, resulting in corruption on both, even if not the same corruption. When choosing to have one standby, or two standbys, or ten standbys, one needs to be able to assume a certain amount of statistical independence between failures on one server and failures on another. If they are tightly correlated dependent variables, then the conclusion that the probability of all nodes failing simultaneously is vanishingly small becomes invalid.

From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-09 21:08:29

Hi,

On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:

I can also imagine a master and standby that are similarly provisioned, and thus hit an out of disk error at around the same time, resulting in corruption on both, even if not the same corruption.

I think it's a grave mistake conflating ENOSPC issues (which we should solve by making sure there's always enough space pre-allocated), with EIO type errors. The problem is different, the solution is different.


From:Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Date:2018-04-09 21:25:52

On 04/09/2018 11:08 PM, Andres Freund wrote:

Hi,

On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:

I can also imagine a master and standby that are similarly provisioned, and thus hit an out of disk error at around the same time, resulting in corruption on both, even if not the same corruption.

I think it's a grave mistake conflating ENOSPC issues (which we should solve by making sure there's always enough space pre-allocated), with EIO type errors. The problem is different, the solution is different.

In any case, that certainly does not count as data corruption spreading from the master to standby.


From:Mark Dilger <hornschnorter(at)gmail(dot)com>
Date:2018-04-09 21:33:29

On Apr 9, 2018, at 2:25 PM, Tomas Vondra wrote:

On 04/09/2018 11:08 PM, Andres Freund wrote:

Hi,

On 2018-04-09 13:55:29 -0700, Mark Dilger wrote:

I can also imagine a master and standby that are similarly provisioned, and thus hit an out of disk error at around the same time, resulting in corruption on both, even if not the same corruption.

I think it's a grave mistake conflating ENOSPC issues (which we should solve by making sure there's always enough space pre-allocated), with EIO type errors. The problem is different, the solution is different.

I'm happy to take your word for that.

In any case, that certainly does not count as data corruption spreading from the master to standby.

Maybe not from the point of view of somebody looking at the code. But a user might see it differently. If the data being loaded into the master and getting replicated to the standby "causes" both to get corrupt, then it seems like corruption spreading. I put "causes" in quotes because there is some argument to be made about "correlation does not prove cause" and so forth, but it still feels like causation from an arms length perspective. If there is a pattern of standby servers tending to fail more often right around the time that the master fails, you'll have a hard time comforting users, "hey, it's not technically causation." If loading data into the master causes the master to hit ENOSPC, and replicating that data to the standby causes the standby to hit ENOSPC, and if the bug around ENOSPC has not been fixed, then this looks like corruption spreading.

I'm certainly planning on taking a hard look at the disk allocation on my standby servers right soon now.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-09 22:33:16

On Tue, Apr 10, 2018 at 2:22 AM, Anthony Iliopoulos wrote:

On Mon, Apr 09, 2018 at 03:33:18PM +0200, Tomas Vondra wrote:

Well, there seem to be kernels that seem to do exactly that already. At least that's how I understand what this thread says about FreeBSD and Illumos, for example. So it's not an entirely insane design, apparently.

It is reasonable, but even FreeBSD has a big fat comment right there (since 2017), mentioning that there can be no recovery from EIO at the block layer and this needs to be done differently. No idea how an application running on top of either FreeBSD or Illumos would actually recover from this error (and clear it out), other than remounting the fs in order to force dropping of relevant pages. It does provide though indeed a persistent error indication that would allow Pg to simply reliably panic. But again this does not necessarily play well with other applications that may be using the filesystem reliably at the same time, and are now faced with EIO while their own writes succeed to be persisted.

Right. For anyone interested, here is the change you mentioned, and an interesting one that came a bit earlier last year:

Retrying may well be futile, but at least future fsync() calls won't report success bogusly. There may of course be more space-efficient ways to represent that state as the comment implies, while never lying to the user -- perhaps involving filesystem level or (pinned) inode level errors that stop all writes until unmounted. Something tells me they won't resort to flakey fsync() error reporting.

I wonder if anyone can tell us what Windows, AIX and HPUX do here.

[1] https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf

Very interesting, thanks.


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-10 00:32:20

On Tue, Apr 10, 2018 at 10:33 AM, Thomas Munro wrote:

I wonder if anyone can tell us what Windows, AIX and HPUX do here.

I created a wiki page to track what we know (or think we know) about fsync() on various operating systems:

https://wiki.postgresql.org/wiki/Fsync_Errors

If anyone has more information or sees mistakes, please go ahead and edit it.


From:Andreas Karlsson <andreas(at)proxel(dot)se>
Date:2018-04-10 00:41:10

On 04/09/2018 02:16 PM, Craig Ringer wrote:

I'd like a middle ground where the kernel lets us register our interest and tells us if it lost something, without us having to keep eight million FDs open for some long period. "Tell us about anything that happens under pgdata/" or an inotify-style per-directory-registration option. I'd even say that's ideal.

Could there be a risk of a race condition here where fsync incorrectly returns success before we get the notification of that something went wrong?


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-10 01:44:59

On 10 April 2018 at 03:59, Andres Freund wrote:

On 2018-04-09 14:41:19 -0500, Justin Pryzby wrote:

On Mon, Apr 09, 2018 at 09:31:56AM +0800, Craig Ringer wrote:

You could make the argument that it's OK to forget if the entire file system goes away. But actually, why is that ok?

I was going to say that it'd be okay to clear error flag on umount, since any opened files would prevent unmounting; but, then I realized we need to consider the case of close()ing all FDs then opening them later..in another process.

On Mon, Apr 09, 2018 at 02:54:16PM +0200, Anthony Iliopoulos wrote:

notification descriptor open, where the kernel would inject events related to writeback failures of files under watch (potentially enriched to contain info regarding the exact failed pages and the file offset they map to).

For postgres that'd require backend processes to open() a file such that, following its close(), any writeback errors are "signalled" to the checkpointer process...

I don't think that's as hard as some people argued in this thread. We could very well open a pipe in postmaster with the write end open in each subprocess, and the read end open only in checkpointer (and postmaster, but unused there). Whenever closing a file descriptor that was dirtied in the current process, send it over the pipe to the checkpointer. The checkpointer then can receive all those file descriptors (making sure it's not above the limit, fsync(), close()ing to make room if necessary). The biggest complication would presumably be to deduplicate the received file descriptors for the same file, without losing track of any errors.

Yep. That'd be a cheaper way to do it, though it wouldn't work on Windows. Though we don't know how Windows behaves here at all yet.

Prior discussion upthread had the checkpointer open()ing a file at the same time as a backend, before the backend writes to it. But passing the fd when the backend is done with it would be better.

We'd need a way to dup() the fd and pass it back to a backend when it needed to reopen it sometimes, or just make sure to keep the oldest copy of the fd when a backend reopens multiple times, but that's no biggie.

We'd still have to fsync() out early in the checkpointer if we ran out of space in our FD list, and initscripts would need to change our ulimit or we'd have to do it ourselves in the checkpointer. But neither seems insurmountable.

FWIW, I agree that this is a corner case, but it's getting to be a pretty big corner with the spread of overcommitted, deduplicating SANs, cloud storage, etc. Not all I/O errors indicate permanent hardware faults, disk failures, etc, as I outlined earlier. I'm very curious to know what AWS EBS's error semantics are, and other cloud network block stores. (I posted on Amazon forums https://forums.aws.amazon.com/thread.jspa?threadID=279274&tstart=0 but nothing so far).

I'm also not particularly inclined to trust that all file systems will always reliably reserve space without having some cases where they'll fail writeback on space exhaustion.

So we don't need to panic and freak out, but it's worth looking at the direction the storage world is moving in, and whether this will become a bigger issue over time.
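
The pattern Justin and Andres are describing above -- one process write()s and close()s a file, and a different process later open()s and fsync()s it -- is the core of the problem. Below is a minimal sketch of that pattern; this is my own illustration, not code from the thread, and the filename is made up. On kernels that clear the error state when writeback fails, a failure that happens between the close() and the later open() may never be reported to the final fsync().

    /* Sketch of the write-in-one-process, fsync-in-another pattern discussed
     * in this thread. If writeback fails between the close() and the later
     * open()+fsync(), older kernels may report success here anyway. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        if (fork() == 0) {
            /* "backend": dirty the file through the page cache, then close it. */
            int fd = open("datafile", O_CREAT | O_WRONLY, 0644);
            if (fd < 0 || write(fd, "dirty page", 10) != 10)
                perror("backend write");
            close(fd);                /* dirty data may still be unwritten here */
            _exit(0);
        }
        wait(NULL);

        /* "checkpointer": reopen the file later and fsync it. A writeback
         * error that hit in the meantime may or may not be visible here,
         * depending on kernel version -- which is the crux of the thread. */
        int fd = open("datafile", O_WRONLY);
        if (fd < 0) { perror("open"); return 1; }
        if (fsync(fd) != 0)
            perror("fsync");
        close(fd);
        return 0;
    }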


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-10 01:52:21

On Tue, Apr 10, 2018 at 1:44 PM, Craig Ringer wrote:

On 10 April 2018 at 03:59, Andres Freund wrote:

I don't think that's as hard as some people argued in this thread. We could very well open a pipe in postmaster with the write end open in each subprocess, and the read end open only in checkpointer (and postmaster, but unused there). Whenever closing a file descriptor that was dirtied in the current process, send it over the pipe to the checkpointer. The checkpointer then can receive all those file descriptors (making sure it's not above the limit, fsync(), close()ing to make room if necessary). The biggest complication would presumably be to deduplicate the received file descriptors for the same file, without losing track of any errors.

Yep. That'd be a cheaper way to do it, though it wouldn't work on Windows. Though we don't know how Windows behaves here at all yet.

Prior discussion upthread had the checkpointer open()ing a file at the same time as a backend, before the backend writes to it. But passing the fd when the backend is done with it would be better.

How would that interlock with concurrent checkpoints?

I can see how to make that work if the share-fd-or-fsync-now logic happens in smgrwrite() when called by FlushBuffer() while you hold io_in_progress, but not if you defer it to some random time later.
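
For readers who haven't done this kind of fd handoff before: on Linux you generally can't pass a file descriptor over a plain pipe; the usual mechanism for what Andres describes is an SCM_RIGHTS control message over a Unix domain socket (e.g. a socketpair() set up by the postmaster). The sketch below is my own illustration of that mechanism, not the thread's patch or PostgreSQL code; the "backend"/"checkpointer" roles and the filename are stand-ins. The kernel duplicates the descriptor into the receiving process, so the checkpointer's fsync() operates on the same open file description that saw the writes.

    /* Minimal sketch (not PostgreSQL code): a "backend" passes a dirty file's
     * descriptor to a "checkpointer" with SCM_RIGHTS over a Unix socketpair. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void send_fd(int sock, int fd)
    {
        char dummy = 'x';                       /* must send at least one data byte */
        struct iovec iov = { &dummy, 1 };
        char cbuf[CMSG_SPACE(sizeof(int))];
        memset(cbuf, 0, sizeof(cbuf));
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                              .msg_control = cbuf, .msg_controllen = sizeof(cbuf) };
        struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
        c->cmsg_level = SOL_SOCKET;
        c->cmsg_type  = SCM_RIGHTS;             /* ask the kernel to dup the fd */
        c->cmsg_len   = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(c), &fd, sizeof(int));
        if (sendmsg(sock, &msg, 0) < 0) { perror("sendmsg"); exit(1); }
    }

    static int recv_fd(int sock)
    {
        char dummy;
        struct iovec iov = { &dummy, 1 };
        char cbuf[CMSG_SPACE(sizeof(int))];
        struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                              .msg_control = cbuf, .msg_controllen = sizeof(cbuf) };
        int fd;
        if (recvmsg(sock, &msg, 0) < 0) { perror("recvmsg"); exit(1); }
        struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
        if (c == NULL || c->cmsg_type != SCM_RIGHTS) { fprintf(stderr, "no fd received\n"); exit(1); }
        memcpy(&fd, CMSG_DATA(c), sizeof(int));
        return fd;
    }

    int main(void)
    {
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) { perror("socketpair"); return 1; }

        if (fork() == 0) {                      /* "backend": dirty a file, hand off its fd */
            int fd = open("datafile", O_CREAT | O_WRONLY, 0644);
            if (fd < 0 || write(fd, "hello", 5) != 5) { perror("backend write"); _exit(1); }
            send_fd(sv[0], fd);
            close(fd);                          /* our copy; the checkpointer's copy survives */
            _exit(0);
        }

        int fd = recv_fd(sv[1]);                /* "checkpointer": fsync the received fd */
        if (fsync(fd) != 0)
            perror("fsync");                    /* writeback errors are reportable on this fd */
        close(fd);
        wait(NULL);
        return 0;
    }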


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-10 01:54:30

On 10 April 2018 at 04:25, Mark Dilger wrote:

I was reading this thread up until now as meaning that the standby could receive corrupt WAL data and become corrupted.

Yes, it can, but not directly through the first error.

What can happen is that we think a block got written when it didn't.

If our in memory state diverges from our on disk state, we can make subsequent WAL writes based on that wrong information. But that's actually OK, since the standby will have replayed the original WAL correctly.

I think the only time we'd run into trouble is if we evict the good (but not written out) data from s_b and the fs buffer cache, then later read in the old version of a block we failed to overwrite. Data checksums (if enabled) might catch it unless the write left the whole block stale. In that case we might generate a full page write with the stale block and propagate that over WAL to the standby.

So I'd say standbys are relatively safe - very safe if the issue is caught promptly, and less so over time. But AFAICS WAL-based replication (physical or logical) is not a perfect defense for this.

However, remember, if your storage system is free of any sort of overprovisioning, is on a non-network file system, and doesn't use multipath (or sets it up right) this issue is exceptionally unlikely to affect you.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-10 01:59:03

On 10 April 2018 at 04:37, Andres Freund wrote:

Hi,

On 2018-04-09 22:30:00 +0200, Tomas Vondra wrote:

Maybe. I'd certainly prefer automated recovery from temporary I/O issues (like full disk on thin-provisioning) without the database crashing and restarting. But I'm not sure it's worth the effort.

Oh, I agree on that one. But that's more a question of how we force the kernel's hand on allocating disk space. In most cases the kernel allocates the disk space immediately, even if delayed allocation is in effect. For the cases where that's not the case (if there are current ones, rather than just past bugs), we should be able to make sure that's not an issue by pre-zeroing the data and/or using fallocate.

Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.

EXT4 and XFS don't allocate until later; they do it by performing actual writes to FS metadata, initializing disk blocks, etc. So we won't notice errors that are only detectable at the actual time of allocation, like thin provisioning problems, until after write() returns and we face the same writeback issues.

So I reckon you're safe from space-related issues if you're not on NFS (and whyyy would you do that?) and not thinly provisioned. I'm sure there are other corner cases, but I don't see any reason to expect space-exhaustion-related corruption problems on a sensible FS backed by a sensible block device. I haven't tested things like quotas, verified how reliable space reservation is under concurrency, etc as yet.
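
Andres's suggestion above of "pre-zeroing the data and/or using fallocate" would look roughly like the sketch below. This is my own hedged illustration, not PostgreSQL code; the filename and segment size are made up. posix_fallocate() either reserves real blocks for the requested range or fails up front with an error code such as ENOSPC (note that it returns the error number rather than setting errno).

    /* Sketch: reserve real disk blocks for a file segment up front with
     * posix_fallocate(), so ENOSPC surfaces here rather than at writeback
     * time. Illustrative only; the file name and size are made up. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("relation_segment", O_CREAT | O_RDWR, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* posix_fallocate returns an error number directly (it does not set errno). */
        int err = posix_fallocate(fd, 0, 1024 * 1024 * 1024);  /* 1 GB segment */
        if (err != 0) {
            fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
            return 1;                 /* e.g. ENOSPC: refuse to proceed before any write */
        }

        /* ... normal write()/fsync() traffic follows, with space already reserved ... */
        close(fd);
        return 0;
    }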


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-10 02:00:59

On April 9, 2018 6:59:03 PM PDT, Craig Ringer wrote:

On 10 April 2018 at 04:37, Andres Freund wrote:

Hi,

On 2018-04-09 22:30:00 +0200, Tomas Vondra wrote:

Maybe. I'd certainly prefer automated recovery from temporary I/O issues (like full disk on thin-provisioning) without the database crashing and restarting. But I'm not sure it's worth the effort.

Oh, I agree on that one. But that's more a question of how we force the kernel's hand on allocating disk space. In most cases the kernel allocates the disk space immediately, even if delayed allocation is in effect. For the cases where that's not the case (if there are current ones, rather than just past bugs), we should be able to make sure that's not an issue by pre-zeroing the data and/or using fallocate.

Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.

EXT4 and XFS don't allocate until later; they do it by performing actual writes to FS metadata, initializing disk blocks, etc. So we won't notice errors that are only detectable at the actual time of allocation, like thin provisioning problems, until after write() returns and we face the same writeback issues.

So I reckon you're safe from space-related issues if you're not on NFS (and whyyy would you do that?) and not thinly provisioned. I'm sure there are other corner cases, but I don't see any reason to expect space-exhaustion-related corruption problems on a sensible FS backed by a sensible block device. I haven't tested things like quotas, verified how reliable space reservation is under concurrency, etc as yet.

How's that not solved by pre-zeroing and/or fallocate as I suggested above?


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-10 02:02:48

On 10 April 2018 at 08:41, Andreas Karlsson wrote:

On 04/09/2018 02:16 PM, Craig Ringer wrote:

I'd like a middle ground where the kernel lets us register our interest and tells us if it lost something, without us having to keep eight million FDs open for some long period. "Tell us about anything that happens under pgdata/" or an inotify-style per-directory-registration option. I'd even say that's ideal.

Could there be a risk of a race condition here where fsync incorrectly returns success before we get the notification of that something went wrong?

We'd examine the notification queue only once all our checkpoint fsync()s had succeeded, and before we updated the control file to advance the redo position.

I'm intrigued by the suggestion upthread of using a kprobe or similar to achieve this. It's a horrifying unportable hack that'd make kernel people cry, and I don't know if we have any way to flush buffered probe data to be sure we really get the news in time, but it's a cool idea too.


From:Michael Paquier <michael(at)paquier(dot)xyz>
Date:2018-04-10 05:04:13

On Mon, Apr 09, 2018 at 03:02:11PM -0400, Robert Haas wrote:

Another consequence of this behavior is that initdb -S is never reliable, so pg_rewind's use of it doesn't actually fix the problem it was intended to solve. It also means that initdb itself isn't crash-safe, since the data file changes are made by the backend but initdb itself is doing the fsyncs, and initdb has no way of knowing what files the backend is going to create and therefore can't -- even theoretically -- open them first.

And pg_basebackup. And pg_dump. And pg_dumpall. Anything using initdb -S or fsync_pgdata would enter in those waters.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-10 05:37:19

On 10 April 2018 at 13:04, Michael Paquier wrote:

On Mon, Apr 09, 2018 at 03:02:11PM -0400, Robert Haas wrote:

Another consequence of this behavior is that initdb -S is never reliable, so pg_rewind's use of it doesn't actually fix the problem it was intended to solve. It also means that initdb itself isn't crash-safe, since the data file changes are made by the backend but initdb itself is doing the fsyncs, and initdb has no way of knowing what files the backend is going to create and therefore can't -- even theoretically -- open them first.

And pg_basebackup. And pg_dump. And pg_dumpall. Anything using initdb -S or fsync_pgdata would enter in those waters.

... but only if they hit an I/O error or they're on a FS that doesn't reserve space and hit ENOSPC.

It still does 99% of the job. It still flushes all buffers to persistent storage and maintains write ordering. It may not detect and report failures to the user how we'd expect it to, yes, and that's not great. But it's hardly 'throw up our hands and give up' territory either. Also, at least for initdb, we can make initdb fsync() its own files before close(). Annoying but hardly the end of the world.


From:Michael Paquier <michael(at)paquier(dot)xyz>
Date:2018-04-10 06:10:21

On Tue, Apr 10, 2018 at 01:37:19PM +0800, Craig Ringer wrote:

On 10 April 2018 at 13:04, Michael Paquier wrote:

And pg_basebackup. And pg_dump. And pg_dumpall. Anything using initdb -S or fsync_pgdata would enter in those waters.

... but only if they hit an I/O error or they're on a FS that doesn't reserve space and hit ENOSPC.

Sure.

It still does 99% of the job. It still flushes all buffers to persistent storage and maintains write ordering. It may not detect and report failures to the user how we'd expect it to, yes, and that's not great. But it's hardly 'throw up our hands and give up' territory either. Also, at least for initdb, we can make initdb fsync() its own files before close(). Annoying but hardly the end of the world.

Well, I think that there is place for improving reporting of failure in file_utils.c for frontends, or at worst have an exit() for any kind of critical failures equivalent to a PANIC.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-10 12:15:15

On 10 April 2018 at 14:10, Michael Paquier wrote:

Well, I think that there is place for improving reporting of failure in file_utils.c for frontends, or at worst have an exit() for any kind of critical failures equivalent to a PANIC.

Yup.

In the mean time, speaking of PANIC, here's the first cut patch to make Pg panic on fsync() failures. I need to do some closer review and testing, but it's presented here for anyone interested.

I intentionally left some failures as ERROR not PANIC, where the entire operation is done as a unit, and an ERROR will cause us to retry the whole thing.

For example, when we fsync() a temp file before we move it into place, there's no point panicing on failure, because we'll discard the temp file on ERROR and retry the whole thing.

I've verified that it works as expected with some modifications to the test tool I've been using (pushed).

The main downside is that if we panic in redo, we don't try again. We throw our toys and shut down. But arguably if we get the same I/O error again in redo, that's the right thing to do anyway, and quite likely safer than continuing to ERROR on checkpoints indefinitely.

Patch attached.

To be clear, this patch only deals with the issue of us retrying fsyncs when it turns out to be unsafe. This does NOT address any of the issues where we won't find out about writeback errors at all.

Attachment: v1-0001-PANIC-when-we-detect-a-possible-fsync-I-O-error-i.patch (text/x-patch, 10.3 KB)
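
The attached patch isn't reproduced here, but stripped of PostgreSQL machinery, the policy it implements looks roughly like the sketch below; this is my own illustration with made-up filenames, not the actual patch. The idea: treat any fsync() failure as fatal and crash into WAL recovery, because by the time fsync() has failed the kernel may already have dropped the dirty pages, so a retried fsync() can "succeed" without the data ever reaching disk.

    /* Sketch of the "panic on fsync failure" policy, stripped of PostgreSQL
     * specifics: never treat a failed fsync() as retryable, because the
     * kernel may already have discarded the dirty pages it failed to write. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static void checkpoint_fsync(int fd, const char *path)
    {
        if (fsync(fd) == 0)
            return;                               /* data is durable as far as we can tell */

        /* EIO/ENOSPC here means some writeback since the last successful
         * fsync() may be lost; retrying can return success without the data
         * ever reaching disk, so crash and recover from WAL instead. */
        fprintf(stderr, "PANIC: fsync of %s failed: %s\n", path, strerror(errno));
        abort();
    }

    int main(void)
    {
        int fd = open("datafile", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, "some page\n", 10) != 10) { perror("write"); return 1; }
        checkpoint_fsync(fd, "datafile");
        close(fd);
        return 0;
    }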


From:Robert Haas <robertmhaas(at)gmail(dot)com>
Date:2018-04-10 15:15:46

On Mon, Apr 9, 2018 at 3:13 PM, Andres Freund wrote:

Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as is some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.

Well, I admit that I wasn't entirely serious about that email, but I wasn't entirely not-serious either. If you can't reliably find out whether the contents of the file on disk are the same as the contents that the kernel is giving you when you call read(), then you are going to have a heck of a time building a reliable system. If the kernel developers are determined to insist on these semantics (and, admittedly, I don't know whether that's the case - I've only read Anthony's remarks), then I don't really see what we can do except give up on buffered I/O (or on Linux).

We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.

I think that reliable error reporting is more than "nice" -- I think it's essential. The only argument for the current Linux behavior that has been so far advanced on this thread, at least as far as I can see, is that if it kept retrying the buffers forever, it would be pointless and might run the machine out of memory, so we might as well discard them. But previous comments have already illustrated that the kernel is not really up against a wall there -- it could put individual inodes into a permanent failure state when it discards their dirty data, as you suggested, or it could do what others have suggested, and what I think is better, which is to put the whole filesystem into a permanent failure state that can be cleared by remounting the FS. That could be done on an as-needed basis -- if the number of dirty buffers you're holding onto for some filesystem becomes too large, put the filesystem into infinite-fail mode and discard them all. That behavior would be pretty easy for administrators to understand and would resolve the entire problem here provided that no PostgreSQL processes survived the eventual remount.

I also don't really know what we mean by an "unresolvable" error. If the drive is beyond all hope, then it doesn't really make sense to talk about whether the database stored on it is corrupt. In general we can't be sure that we'll even get an error - e.g. the system could be idle and the drive could be on fire. Maybe this is the case you meant by "it'd be nice if we could report it reliably". But at least in my experience, that's typically not what's going on. You get some I/O errors and so you remount the filesystem, or reboot, or rebuild the array, or ... something. And then the errors go away and, at that point, you want to run recovery and continue using your database. In this scenario, it matters quite a bit what the error reporting was like during the period when failures were occurring. In particular, if the database was allowed to think that it had successfully checkpointed when it didn't, you're going to start recovery from the wrong place.

I'm going to shut up now because I'm telling you things that you obviously already know, but this doesn't sound like a "near irresolvable corner case". When the storage goes bonkers, either PostgreSQL and the kernel can interact in such a way that a checkpoint can succeed without all of the relevant data getting persisted, or they don't. It sounds like right now they do, and I'm not really clear that we have a reasonable idea how to fix that. It does not sound like a PANIC is sufficient.


From:Robert Haas <robertmhaas(at)gmail(dot)com>
Date:2018-04-10 15:28:07

On Tue, Apr 10, 2018 at 1:37 AM, Craig Ringer wrote:

... but only if they hit an I/O error or they're on a FS that doesn't reserve space and hit ENOSPC.

It still does 99% of the job. It still flushes all buffers to persistent storage and maintains write ordering. It may not detect and report failures to the user how we'd expect it to, yes, and that's not great. But it's hardly 'throw up our hands and give up' territory either. Also, at least for initdb, we can make initdb fsync() its own files before close(). Annoying but hardly the end of the world.

I think we'd need every child postgres process started by initdb to do that individually, which I suspect would slow down initdb quite a lot. Now admittedly for anybody other than a PostgreSQL developer that's only a minor issue, and our regression tests mostly run with fsync=off anyway. But I have a strong suspicion that our assumptions about how fsync() reports errors are baked into an awful lot of parts of the system, and by the time we get unbaking them I think it's going to be really surprising if we haven't done real harm to overall system performance.

BTW, I took a look at the MariaDB source code to see whether they've got this problem too and it sure looks like they do. os_file_fsync_posix() retries the fsync in a loop with an 0.2 second sleep after each retry. It warns after 100 failures and fails an assertion after 1000 failures. It is hard to understand why they would have written the code this way unless they expect errors reported by fsync() to continue being reported until the underlying condition is corrected. But, it looks like they wouldn't have the problem that we do with trying to reopen files to fsync() them later -- I spot checked a few places where this code is invoked and in all of those it looks like the file is already expected to be open.


From:Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Date:2018-04-10 15:40:05

Hi Robert,

On Tue, Apr 10, 2018 at 11:15:46AM -0400, Robert Haas wrote:

On Mon, Apr 9, 2018 at 3:13 PM, Andres Freund wrote:

Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as is some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.

Well, I admit that I wasn't entirely serious about that email, but I wasn't entirely not-serious either. If you can't reliably find out whether the contents of the file on disk are the same as the contents that the kernel is giving you when you call read(), then you are going to have a heck of a time building a reliable system. If the kernel developers are determined to insist on these semantics (and, admittedly, I don't know whether that's the case - I've only read Anthony's remarks), then I don't really see what we can do except give up on buffered I/O (or on Linux).

I think it would be interesting to get in touch with some of the respective Linux kernel maintainers and open up this topic for more detailed discussions. LSF/MM'18 is upcoming and it would have been the perfect opportunity but it's past the CFP deadline. It may still be worth contacting the organizers to bring forward the issue, and see if there is a chance to have someone from Pg invited for further discussions.


From:Greg Stark <stark(at)mit(dot)edu>
Date:2018-04-10 16:38:27

On 9 April 2018 at 11:50, Anthony Iliopoulos wrote:

On Mon, Apr 09, 2018 at 09:45:40AM +0100, Greg Stark wrote:

On 8 April 2018 at 22:47, Anthony Iliopoulos wrote:

To make things a bit simpler, let us focus on EIO for the moment. The contract between the block layer and the filesystem layer is assumed to be that of, when an EIO is propagated up to the fs, then you may assume that all possibilities for recovering have been exhausted in lower layers of the stack.

Well Postgres is using the filesystem. The interface between the block layer and the filesystem may indeed need to be more complex, I wouldn't know.

But I don't think "all possibilities" is a very useful concept. Neither layer here is going to be perfect. They can only promise that all possibilities that have actually been implemented have been exhausted. And even among those only to the degree they can be done automatically within the engineering tradeoffs and constraints. There will always be cases like thin provisioned devices that an operator can expand, or degraded raid arrays that can be repaired after a long operation and so on. A network device can't be sure whether a remote server may eventually come back or not and have to be reconfigured by a human or system automation tool to point to the new server or new network configuration.

Right. This implies though that apart from the kernel having to keep around the dirtied-but-unrecoverable pages for an unbounded time, that there's further an interface for obtaining the exact failed pages so that you can read them back.

No, the interface we have is fsync which gives us that information with the granularity of a single file. The database could in theory recognize that fsync is not completing on a file and read that file back and write it to a new file. More likely we would implement a feature Oracle has of writing key files to multiple devices. But currently in practice that's not what would happen, what would happen would be a human would recognize that the database has stopped being able to commit and there are hardware errors in the log and would stop the database, take a backup, and restore onto a new working device. The current interface is that there's one error and then Postgres would pretty much have to say, "sorry, your database is corrupt and the data is gone, restore from your backups". Which is pretty dismal.

There is a clear responsibility of the application to keep its buffers around until a successful fsync(). The kernels do report the error (albeit with all the complexities of dealing with the interface), at which point the application may not assume that the write()s were ever even buffered in the kernel page cache in the first place.

Postgres cannot just store the entire database in RAM. It writes things to the filesystem all the time. It calls fsync only when it needs a write barrier to ensure consistency. That's only frequent on the transaction log to ensure it's flushed before data modifications and then periodically to checkpoint the data files. The amount of data written between checkpoints can be arbitrarily large and Postgres has no idea how much memory is available as filesystem buffers or how much i/o bandwidth is available or other memory pressure there is. What you're suggesting is that the application should have to babysit the filesystem buffer cache and reimplement all of it in user-space because the filesystem is free to throw away any data any time it chooses?

The current interface to throw away filesystem buffer cache is unmount. It sounds like the kernel would like a more granular way to discard just part of a device which makes a lot of sense in the age of large network block devices. But I don't think just saying that the filesystem buffer cache is now something every application needs to re-implement in user-space really helps with that, they're going to have the same problems to solve.


From:Greg Stark <stark(at)mit(dot)edu>
Date:2018-04-10 16:54:40

On 10 April 2018 at 02:59, Craig Ringer wrote:

Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.

I'm kind of puzzled by this. Surely NFS servers store the data in the filesystem using write(2) or the in-kernel equivalent? So if the server is backed by a filesystem where write(2) preallocates space surely the NFS server must behave as if it's preallocating as well? I would expect NFS to provide basically the same set of possible failures as the underlying filesystem (as long as you don't enable nosync of course).


From:"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Date:2018-04-10 18:58:37

-hackers,

I reached out to the Linux ext4 devs, here is tytso(at)mit(dot)edu response:

""" Hi Joshua,

This isn't actually an ext4 issue, but a long-standing VFS/MM issue.

There are going to be multiple opinions about what the right thing to do. I'll try to give as unbiased a description as possible, but certainly some of this is going to be filtered by my own biases no matter how careful I can be.

First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentially be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations in the time since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.

Which is why after a while, one can get quite paranoid and assume that the only way you can guarantee data robustness is to store multiple copies and/or use erasure encoding, with some of the copies or shards written to geographically diverse data centers.

Secondly, I think it's fair to say that the vast majority of the companies who require data robustness, and are either willing to pay $$$ to an enterprise distro company like Red Hat, or command a large enough paying customer base that they can afford to dictate terms to an enterprise distro, or hire a consultant such as Christoph, or have their own staffed Linux kernel teams, have tended to use O_DIRECT. So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.

Next, the reason why fsync() has the behaviour that it does is that one of the most common cases of I/O storage errors in buffered use cases, certainly as seen by the community distros, is the user who pulls out a USB stick while it is in use. In that case, if there are dirtied pages in the page cache, the question is what can you do? Sooner or later the writes will time out, and if you leave the pages dirty, then it effectively becomes a permanent memory leak. You can't unmount the file system --- that requires writing out all of the pages such that the dirty bit is turned off. And if you don't clear the dirty bit on an I/O error, then they can never be cleaned. You can't even re-insert the USB stick; the re-inserted USB stick will get a new block device. Worse, when the USB stick was pulled, it will have suffered a power drop, and see above about what could happen after a power drop for non-power fail certified flash devices --- it goes double for the cheap sh*t USB sticks found in the checkout aisle of Micro Center.

So this is the explanation for why Linux handles I/O errors by clearing the dirty bit after reporting the error up to user space. And why there is no eagerness to solve the problem simply by "don't clear the dirty bit". For every one Postgres installation that might have a better recovery after an I/O error, there's probably a thousand clueless Fedora and Ubuntu users who will have a much worse user experience after a USB stick pull happens.

I can think of things that could be done --- for example, it could be switchable on a per-block device basis (or maybe a per-mount basis) whether or not the dirty bit gets cleared after the error is reported to userspace. And perhaps there could be a new unmount flag that causes all dirty pages to be wiped out, which could be used to recover after a permanent loss of the block device. But the question is who is going to invest the time to make these changes? If there is a company who is willing to pay to commission this work, it's almost certainly soluble. Or if a company which has a kernel team on staff is willing to direct an engineer to work on it, it certainly could be solved. But again, of the companies who have client code where we care about robustness and proper handling of failed disk drives, and which have a kernel team on staff, pretty much all of the ones I can think of (e.g., Oracle, Google, etc.) use O_DIRECT and they don't try to make buffered writes and error reporting via fsync(2) work well.

In general these companies want low-level control over buffer cache eviction algorithms, which drives them towards the design decision of effectively implementing the page cache in userspace, and using O_DIRECT reads/writes.

If you are aware of a company who is willing to pay to have a new kernel feature implemented to meet your needs, we might be able to refer you to a company or a consultant who might be able to do that work. Let me know off-line if that's the case...

- Ted

"""


From:"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Date:2018-04-10 19:51:01

-hackers,

The thread is picking up over on the ext4 list. They don't update their archives as often as we do, so I can't link to the discussion. What would be the preferred method of sharing the info?

Thanks,


From:"Joshua D(dot) Drake" <jd(at)commandprompt(dot)com>
Date:2018-04-10 20:57:34

On 04/10/2018 12:51 PM, Joshua D. Drake wrote:

-hackers,

The thread is picking up over on the ext4 list. They don't update their archives as often as we do, so I can't link to the discussion. What would be the preferred method of sharing the info?

Thanks to Anthony for this link:

http://lists.openwall.net/linux-ext4/2018/04/10/33

It isn't quite real time but it keeps things close enough.


From:Jonathan Corbet <corbet(at)lwn(dot)net>
Date:2018-04-11 12:05:27

On Tue, 10 Apr 2018 17:40:05 +0200 Anthony Iliopoulos wrote:

LSF/MM'18 is upcoming and it would have been the perfect opportunity but it's past the CFP deadline. It may still be worth contacting the organizers to bring forward the issue, and see if there is a chance to have someone from Pg invited for further discussions.

FWIW, it is my current intention to be sure that the development community is at least aware of the issue by the time LSFMM starts.

The event is April 23-25 in Park City, Utah. I bet that room could be found for somebody from the postgresql community, should there be somebody who would like to represent the group on this issue. Let me know if an introduction or advocacy from my direction would be helpful.


From:Greg Stark <stark(at)mit(dot)edu>
Date:2018-04-11 12:23:49

On 10 April 2018 at 19:58, Joshua D. Drake wrote:

You can't unmount the file system --- that requires writing out all of the pages such that the dirty bit is turned off.

I always wondered why Linux didn't implement umount -f. It's been in BSD since forever and it's a major annoyance that it's missing in Linux. Even without leaking memory it still leaks other resources, causes confusion and awkward workarounds in UI and automation software.


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-11 14:29:09

Hi,

On 2018-04-11 06:05:27 -0600, Jonathan Corbet wrote:

The event is April 23-25 in Park City, Utah. I bet that room could be found for somebody from the postgresql community, should there be somebody who would like to represent the group on this issue. Let me know if an introduction or advocacy from my direction would be helpful.

If that room can be found, I might be able to make it. Being in SF, I'm probably the physically closest PG dev involved in the discussion.

Thanks for chiming in,


From:Jonathan Corbet <corbet(at)lwn(dot)net>
Date:2018-04-11 14:40:31

On Wed, 11 Apr 2018 07:29:09 -0700 Andres Freund wrote:

If that room can be found, I might be able to make it. Being in SF, I'm probably the physically closest PG dev involved in the discussion.

OK, I've dropped the PC a note; hopefully you'll be hearing from them.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-17 21:19:53

On Tue, Apr 10, 2018 at 05:54:40PM +0100, Greg Stark wrote:

On 10 April 2018 at 02:59, Craig Ringer wrote:

Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.

I'm kind of puzzled by this. Surely NFS servers store the data in the filesystem using write(2) or the in-kernel equivalent? So if the server is backed by a filesystem where write(2) preallocates space surely the NFS server must behave as if it's preallocating as well? I would expect NFS to provide basically the same set of possible failures as the underlying filesystem (as long as you don't enable nosync of course).

I don't think the write is sent to the NFS server at the time of the write, so while the NFS side would reserve the space, it might not get the write request until after we return write success to the process.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-17 21:29:17

On Mon, Apr 9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:

On 04/09/2018 12:29 AM, Bruce Momjian wrote:

A crazy idea would be to have a daemon that checks the logs and stops Postgres when it sees something wrong.

That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how would that work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem) so parsing it reliably seems rather difficult. And we probably don't want to PANIC after I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.

My more-considered crazy idea is to have a postgresql.conf setting like archive_command that allows the administrator to specify a command that will be run after fsync but before the checkpoint is marked as complete. While we can have write flush errors before fsync and never see the errors during fsync, we will not have write flush errors after fsync that are associated with previous writes.

The script should check for I/O or space-exhaustion errors and return false in that case, in which case we can either stop, or stop and crash recover. We could have an exit of 1 do the former, and an exit of 2 do the latter.

Also, if we are relying on WAL, we have to make sure WAL is actually safe with fsync, and I am betting only the O_DIRECT methods actually are safe:

    #wal_sync_method = fsync                # the default is the first option
                                            # supported by the operating system:
                                            #   open_datasync
                                     -->    #   fdatasync (default on Linux)
                                     -->    #   fsync
                                     -->    #   fsync_writethrough
                                            #   open_sync

I am betting the marked wal_sync_method methods are not safe since there is time between the write and fsync.
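
Bruce's distinction here is that with the write-then-fsync methods there is a window between the two calls in which an error report can be lost, whereas with open_datasync/open_sync the flush happens as part of the write() itself, so any failure is reported by that same call. A sketch of that style of WAL write follows; this is my own illustration, the filename is made up, and on Linux O_DSYNC behaves roughly as if each write() were followed by an fdatasync().

    /* Sketch of the "open_datasync"-style approach: opening the WAL file with
     * O_DSYNC makes each write() block until the data is durably written, so
     * an I/O error is reported by that same write() rather than by a later
     * fsync() whose error state could be lost in between. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* "wal_segment" is a made-up name for illustration. */
        int fd = open("wal_segment", O_CREAT | O_WRONLY | O_DSYNC, 0600);
        if (fd < 0) { perror("open"); return 1; }

        const char record[] = "a WAL record";
        if (write(fd, record, sizeof(record)) != (ssize_t) sizeof(record)) {
            /* The failure is tied to this specific write; there is no separate
             * fsync() whose error report could be cleared before we see it. */
            perror("write");
            return 1;
        }

        close(fd);
        return 0;
    }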


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-17 21:32:45

On Mon, Apr 9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:

On 04/09/2018 12:29 AM, Bruce Momjian wrote:

A crazy idea would be to have a daemon that checks the logs and stops Postgres when it sees something wrong.

That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how would that work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem) so parsing it reliably seems rather difficult. And we probably don't want to PANIC after I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.

Replying to your specific case, I am not sure how we would use a script to check for I/O errors/space-exhaustion if the postgres user doesn't have access to it. Does O_DIRECT work in such container cases?


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-17 21:34:53

On 2018-04-17 17:29:17 -0400, Bruce Momjian wrote:

Also, if we are relying on WAL, we have to make sure WAL is actually safe with fsync, and I am betting only the O_DIRECT methods actually are safe:

    > #wal_sync_method = fsync                # the default is the first option
    >                                         # supported by the operating system:
    >                                         #   open_datasync
    >                                  -->    #   fdatasync (default on Linux)
    >                                  -->    #   fsync
    >                                  -->    #   fsync_writethrough
    >                                         #   open_sync

I am betting the marked wal_sync_method methods are not safe since there is time between the write and fsync.

Hm? That's not really the issue though? One issue is that retries are not necessarily safe in buffered IO, the other that fsync might not report an error if the fd was closed and opened.

O_DIRECT is only used if wal archiving or streaming isn't used, which makes it pretty useless anyway.


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-17 21:41:42

On 2018-04-17 17:32:45 -0400, Bruce Momjian wrote:

On Mon, Apr 9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:

That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how would that work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem) so parsing it reliably seems rather difficult. And we probably don't want to PANIC after I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.

You can certainly have access to the kernel log in containers. I'd assume such a script wouldn't check various system logs but instead tail /dev/kmsg or such. Otherwise the variance between installations would be too big.

There aren't that many different types of error messages and they don't change that often. If we'd just detect errors for the most common FSs we'd probably be good. Detecting a few general storage layer messages wouldn't be that hard either; most things have been unified over the last ~8-10 years.

Replying to your specific case, I am not sure how we would use a script to check for I/O errors/space-exhaustion if the postgres user doesn't have access to it.

Not sure what you mean?

Space exhaustion can be checked when allocating space, FWIW. We'd just need to use posix_fallocate et al.

Does O_DIRECT work in such container cases?

Yes.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-17 21:49:42

On Mon, Apr 9, 2018 at 12:25:33PM -0700, Peter Geoghegan wrote:

On Mon, Apr 9, 2018 at 12:13 PM, Andres Freund wrote:

Let's lower the pitchforks a bit here. Obviously a grand rewrite is absurd, as is some of the proposed ways this is all supposed to work. But I think the case we're discussing is much closer to a near irresolvable corner case than anything else.

+1

We're talking about the storage layer returning an irresolvable error. You're hosed even if we report it properly. Yes, it'd be nice if we could report it reliably. But that doesn't change the fact that what we're doing is ensuring that data is safely fsynced unless storage fails, in which case it's not safely fsynced anyway.

Right. We seem to be implicitly assuming that there is a big difference between a problem in the storage layer that we could in principle detect, but don't, and any other problem in the storage layer. I've read articles claiming that technologies like SMART are not really reliable in a practical sense [1], so it seems to me that there is reason to doubt that this gap is all that big.

That said, I suspect that the problems with running out of disk space are serious practical problems. I have personally scoffed at stories involving Postgres database corruption that gets attributed to running out of disk space. Looks like I was dead wrong.

Yes, I think we need to look at user expectations here.

If the device has a hardware write error, it is true that it is good to detect it, and it might be permanent or temporary, e.g. NAS/NFS. The longer the error persists, the more likely the user will expect corruption. However, right now, any length outage could cause corruption, and it will not be reported in all cases.

Running out of disk space is also something you don't expect to corrupt your database --- you expect it to only prevent future writes. It seems NAS/NFS and any thin provisioned storage will have this problem, and again, not always reported.

So, our initial action might just be to educate users that write errors can cause silent corruption, and out-of-space errors on NAS/NFS and any thin provisioned storage can cause corruption.

Kernel logs (not just Postgres logs) should be monitored for these issues and fail-over/recovering might be necessary.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-18 09:52:22

On Tue, Apr 17, 2018 at 02:34:53PM -0700, Andres Freund wrote:

On 2018-04-17 17:29:17 -0400, Bruce Momjian wrote:

Also, if we are relying on WAL, we have to make sure WAL is actually safe with fsync, and I am betting only the O_DIRECT methods actually are safe:

> > #wal_sync_method = fsync                # the default is the first option
> >                                         # supported by the operating system:
> >                                         #   open_datasync
> >                                  -->    #   fdatasync (default on Linux)
> >                                  -->    #   fsync
> >                                  -->    #   fsync_writethrough
> >                                         #   open_sync

I am betting the marked wal_sync_method methods are not safe since there is time between the write and fsync.

Hm? That's not really the issue though? One issue is that retries are not necessarily safe in buffered IO, the other that fsync might not report an error if the fd was closed and opened.

Well, we have been focusing on the delay between backend or checkpoint writes and checkpoint fsyncs. My point is that we have the same problem in doing a write, then fsync for the WAL. Yes, the delay is much shorter, but the issue still exists. I realize that newer Linux kernels will not have the problem since the file descriptor remains open, but the problem exists with older/common Linux kernels.

O_DIRECT is only used if wal archiving or streaming isn't used, which makes it pretty useless anyway.

Uh, don't 'open_datasync' and 'open_sync' fsync as part of the write, meaning we can't lose the error report like we can with the others?


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-18 10:04:30

On 18 April 2018 at 05:19, Bruce Momjian wrote:

On Tue, Apr 10, 2018 at 05:54:40PM +0100, Greg Stark wrote:

On 10 April 2018 at 02:59, Craig Ringer wrote:

Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.

I'm kind of puzzled by this. Surely NFS servers store the data in the filesystem using write(2) or the in-kernel equivalent? So if the server is backed by a filesystem where write(2) preallocates space surely the NFS server must behave as if it's preallocating as well? I would expect NFS to provide basically the same set of possible failures as the underlying filesystem (as long as you don't enable nosync of course).

I don't think the write is sent to the NFS server at the time of the write, so while the NFS side would reserve the space, it might not get the write request until after we return write success to the process.

It should be sent if you're using sync mode.

From my reading of the docs, if you're using async mode you're already open to so many potential corruptions you might as well not bother.

I need to look into this more re NFS and expand the tests I have to cover that properly.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-18 10:19:28

On 10 April 2018 at 20:15, Craig Ringer wrote:

On 10 April 2018 at 14:10, Michael Paquier wrote:

Well, I think that there is place for improving reporting of failure in file_utils.c for frontends, or at worst have an exit() for any kind of critical failures equivalent to a PANIC.

Yup.

In the mean time, speaking of PANIC, here's the first cut patch to make Pg panic on fsync() failures. I need to do some closer review and testing, but it's presented here for anyone interested.

I intentionally left some failures as ERROR not PANIC, where the entire operation is done as a unit, and an ERROR will cause us to retry the whole thing.

For example, when we fsync() a temp file before we move it into place, there's no point panicing on failure, because we'll discard the temp file on ERROR and retry the whole thing.

I've verified that it works as expected with some modifications to the test tool I've been using (pushed).

The main downside is that if we panic in redo, we don't try again. We throw our toys and shut down. But arguably if we get the same I/O error again in redo, that's the right thing to do anyway, and quite likely safer than continuing to ERROR on checkpoints indefinitely.

Patch attached.

To be clear, this patch only deals with the issue of us retrying fsyncs when it turns out to be unsafe. This does NOT address any of the issues where we won't find out about writeback errors at all.

Thinking about this some more, it'll definitely need a GUC to force it to continue despite a potential hazard. Otherwise we go backwards from the status quo if we're in a position where uptime is vital and correctness problems can be tolerated or repaired later. Kind of like zero_damaged_pages, we'll need some sort of continue_after_fsync_errors.

Without that, we'll panic once, enter redo, and if the problem persists we'll panic in redo and exit the startup process. That's not going to help users.

I'll amend the patch accordingly as time permits.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-18 11:46:15

On Wed, Apr 18, 2018 at 06:04:30PM +0800, Craig Ringer wrote:

On 18 April 2018 at 05:19, Bruce Momjian wrote:

On Tue, Apr 10, 2018 at 05:54:40PM +0100, Greg Stark wrote:

On 10 April 2018 at 02:59, Craig Ringer wrote:

Nitpick: In most cases the kernel reserves disk space immediately, before returning from write(). NFS seems to be the main exception here.

I'm kind of puzzled by this. Surely NFS servers store the data in the filesystem using write(2) or the in-kernel equivalent? So if the server is backed by a filesystem where write(2) preallocates space surely the NFS server must behave as if it's preallocating as well? I would expect NFS to provide basically the same set of possible failures as the underlying filesystem (as long as you don't enable nosync of course).

I don't think the write is sent to the NFS server at the time of the write, so while the NFS side would reserve the space, it might not get the write request until after we return write success to the process.

It should be sent if you're using sync mode.

From my reading of the docs, if you're using async mode you're already open to so many potential corruptions you might as well not bother.

I need to look into this more re NFS and expand the tests I have to cover that properly.

So, if sync mode passes the write to NFS, and NFS pre-reserves write space, and throws an error on reservation failure, that means that NFS will not corrupt a cluster on out-of-space errors.

So, what about thin provisioning? I can understand sharing free space among file systems, but once a write arrives I assume it reserves the space. Is the problem that many thin provisioning systems don't have a sync mode, so you can't force the write to appear on the device before an fsync?


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-18 11:56:57

On Tue, Apr 17, 2018 at 02:41:42PM -0700, Andres Freund wrote:

On 2018-04-17 17:32:45 -0400, Bruce Momjian wrote:

On Mon, Apr 9, 2018 at 03:42:35PM +0200, Tomas Vondra wrote:

That doesn't seem like a very practical way. It's better than nothing, of course, but I wonder how would that work with containers (where I think you may not have access to the kernel log at all). Also, I'm pretty sure the messages do change based on kernel version (and possibly filesystem) so parsing it reliably seems rather difficult. And we probably don't want to PANIC after I/O error on an unrelated device, so we'd need to understand which devices are related to PostgreSQL.

You can certainly have access to the kernel log in containers. I'd assume such a script wouldn't check various system logs but instead tail /dev/kmsg or such. Otherwise the variance between installations would be too big.

I was thinking 'dmesg', but the result is similar.

There aren't that many different types of error messages and they don't change that often. If we'd just detect errors for the most common FSs we'd probably be good. Detecting a few general storage layer messages wouldn't be that hard either, most things have been unified over the last ~8-10 years.

It is hard to know exactly what the message format should be for each operating system because it is hard to generate them on demand, and we would need to filter based on Postgres devices.

The other issue is that once you see a message during a checkpoint and exit, you don't want to see that message again after the problem has been fixed and the server restarted. The simplest solution is to save the output of the last check and look for only new entries. I am attaching a script I run every 15 minutes from cron that emails me any unexpected kernel messages.

I am thinking we would need a contrib module with sample scripts for various operating systems.

Replying to your specific case, I am not sure how we would use a script to check for I/O errors/space-exhaustion if the postgres user doesn't have access to it.

Not sure what you mean?

Space exhaustion can be checked when allocating space, FWIW. We'd just need to use posix_fallocate et al.

I was asking about cases where permissions prevent viewing of kernel messages. I think you can view them in containers, but in virtual machines you might not have access to the host operating system's kernel messages, and that might be where they are.

    Attachment: dmesg_check (text/plain, 574 bytes)
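
For a rough sense of what this kind of monitoring looks for, here is a hedged sketch, in C rather than shell; it is not the attached dmesg_check script, the error substrings are guesses rather than an exhaustive or version-stable list, and reading /dev/kmsg requires appropriate permissions.

    /* Sketch only: follow /dev/kmsg and print records that look like I/O
     * errors. Substrings below are guesses, not a complete list. */
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/kmsg", O_RDONLY);
        if (fd < 0) {
            perror("open /dev/kmsg");
            return 1;
        }
        char rec[8192];
        for (;;) {
            /* Each read() returns exactly one log record. */
            ssize_t n = read(fd, rec, sizeof(rec) - 1);
            if (n < 0) {
                if (errno == EPIPE)   /* we fell behind; records were overwritten */
                    continue;
                perror("read /dev/kmsg");
                return 1;
            }
            rec[n] = '\0';
            if (strstr(rec, "I/O error") || strstr(rec, "Buffer I/O error") ||
                strstr(rec, "lost page write") || strstr(rec, "No space left"))
                fputs(rec, stdout);
        }
    }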

From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-18 12:45:53

On 18 April 2018 at 19:46, Bruce Momjian wrote:

So, if sync mode passes the write to NFS, and NFS pre-reserves write space, and throws an error on reservation failure, that means that NFS will not corrupt a cluster on out-of-space errors.

Yeah. I need to verify in a concrete test case.

The thing is that write() is allowed to be asynchronous anyway. Most file systems choose to implement eager reservation of space, but it's not mandated. AFAICS that's largely a historical accident to keep applications happy, because FSes used to allocate the space at write() time too, and when they moved to delayed allocations, apps tended to break too easily unless they at least reserved space. NFS would have to do a round-trip on write() to reserve space.

The Linux man pages (http://man7.org/linux/man-pages/man2/write.2.html) say:

A successful return from write() does not make any guarantee that data has been committed to disk. On some filesystems, including NFS, it does not even guarantee that space has successfully been reserved for the data. In this case, some errors might be delayed until a future write(2), fsync(2), or even close(2). The only way to be sure is to call fsync(2) after you are done writing all your data.

... and I'm inclined to believe it when it refuses to make guarantees. Especially lately.
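
In userspace terms, the man page's advice looks something like the sketch below (a minimal illustration under those assumptions, not PostgreSQL code): every one of write(), fsync(), and close() has to be checked, because any of them may be the first place a delayed error shows up.

    /* Minimal sketch: surface delayed errors from write()/fsync()/close().
     * Real code would log errno and decide whether to retry or abort. */
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    static int write_durably(const char *path, const char *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        if (fd < 0)
            return -1;

        while (len > 0) {
            ssize_t n = write(fd, buf, len);
            if (n < 0) {
                if (errno == EINTR)
                    continue;
                close(fd);               /* already failing; ignore close result */
                return -1;
            }
            buf += n;
            len -= (size_t)n;
        }

        /* write() succeeding does not mean space was reserved (e.g. on NFS);
         * this is where a delayed ENOSPC or EIO is supposed to appear. */
        if (fsync(fd) < 0) {
            close(fd);
            return -1;
        }

        /* On some filesystems (NFS, quota), errors can also surface at close(). */
        return close(fd);
    }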

So, what about thin provisioning? I can understand sharing free space among file systems

Most thin provisioning is done at the block level, not file system level. So the FS is usually unaware it's on a thin-provisioned volume. Usually the whole kernel is unaware, because the thin provisioning is done on the SAN end or by a hypervisor. But the same sort of thing may be done via LVM - see lvmthin. For example, you may make 100 different 1TB ext4 FSes, each on 1TB iSCSI volumes backed by SAN with a total of 50TB of concrete physical capacity. The SAN is doing block mapping and only allocating storage chunks to a given volume when the FS has written blocks to every previous free block in the previous storage chunk. It may also do things like block de-duplication, compression of storage chunks that aren't written to for a while, etc.

The idea is that when the SAN's actual physically allocated storage gets to 40TB, it starts telling you to go buy another rack of storage so you don't run out. You don't have to resize volumes, resize file systems, etc. All the storage space admin is centralized on the SAN and storage team, and your sysadmins, DBAs and app devs are none the wiser. You buy storage when you need it, not when the DBA demands they need a 200% free space margin just in case. Whether or not you agree with this philosophy or think it's sensible is kind of moot, because it's an extremely widespread model, and servers you work on may well be backed by thin provisioned storage even if you don't know it.

Think of it as a bit like VM overcommit, for storage. You can malloc() as much memory as you like and everything's fine until you try to actually use it. Then you go to dirty a page, no free pages are available, and boom.

The thing is, the SAN (or LVM) doesn't have any idea about the FS's internal in-memory free space counter and its space reservations. Nor does it understand any FS metadata. All it cares about is "has this LBA ever been written to by the FS?". If so, it must make sure backing storage for it exists. If not, it won't bother.

Most FSes only touch the blocks on dirty writeback, or sometimes lazily as part of delayed allocation. So if your SAN is running out of space and there's 100MB free, each of your 100 FSes may have decremented its freelist by 2MB and be happily promising more space to apps on write() because, well, as far as they know they're only 50% full. When they all do dirty writeback and flush to storage, kaboom, there's nowhere to put some of the data.

I don't know if posix_fallocate is a sufficient safeguard either. You'd have to actually force writes to each page through to the backing storage to know for sure the space existed. Yes, the docs say

After a successful call to posix_fallocate(), subsequent writes to bytes in the specified range are guaranteed not to fail because of lack of disk space.

... but they're speaking from the filesystem's perspective. If the FS doesn't dirty and flush the actual blocks, a thin provisioned storage system won't know.
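
To make that concrete, here is a hedged sketch (not anything PostgreSQL does; the 4 kB block size and the helper name are made up): posix_fallocate() satisfies the filesystem's own accounting, but only writing and flushing the blocks forces a thin-provisioned backend to commit real storage.

    /* Sketch: reserve space, then force real allocation on thin-provisioned
     * storage by writing and flushing every block. Assumes len is a multiple
     * of the block size and ignores short writes for brevity. */
    #include <fcntl.h>
    #include <unistd.h>

    static int reserve_for_real(int fd, off_t offset, off_t len)
    {
        /* Reserves space as far as the filesystem's accounting goes... */
        int err = posix_fallocate(fd, offset, len);
        if (err != 0)
            return err;                  /* returns the error number, not -1 */

        /* ...but a SAN/LVM thin pool only backs blocks that were written. */
        char block[4096] = { 0 };
        for (off_t off = offset; off < offset + len; off += (off_t)sizeof(block))
            if (pwrite(fd, block, sizeof(block), off) < 0)
                return -1;

        return fsync(fd);                /* flush so the backing store sees it */
    }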

It's reasonable enough to throw up our hands in this case and say "your setup is crazy, you're breaking the rules, don't do that". The truth is they AREN'T breaking the rules, but we can disclaim support for such configurations anyway.

After all, we tell people not to use Linux's VM overcommit too. How's that working for you? I see it enabled on the great majority of systems I work with, and some people are very reluctant to turn it off because they don't want to have to add swap.

If someone has a 50TB SAN and wants to allow for unpredictable space use expansion between various volumes, and we say "you can't do that, go buy a 100TB SAN instead" ... that's not going to go down too well either. Often we can actually say "make sure the 5TB volume PostgreSQL is using is eagerly provisioned, and expand it at need using online resize if required. We don't care about the rest of the SAN.".

I guarantee you that when you create a 100GB EBS volume on AWS EC2, you don't get 100GB of storage preallocated. AWS are probably pretty good about not running out of backing store, though.

There are file systems optimised for thin provisioning, etc, too. But that's more commonly done by having them do things like zero deallocated space so the thin provisioning system knows it can return it to the free pool, and now things like DISCARD provide much of that signalling in a standard way.


From:Mark Kirkwood <mark(dot)kirkwood(at)catalyst(dot)net(dot)nz>
Date:2018-04-18 23:31:50

On 19/04/18 00:45, Craig Ringer wrote:

I guarantee you that when you create a 100GB EBS volume on AWS EC2, you don't get 100GB of storage preallocated. AWS are probably pretty good about not running out of backing store, though.

Some db folks (used to anyway) advise dd'ing to your freshly attached devices on AWS (for performance mainly IIRC), but that would help prevent some failure scenarios for any thin provisioned storage (but probably really annoy the admins thereof).


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-19 00:44:33

On 19 April 2018 at 07:31, Mark Kirkwood wrote:

On 19/04/18 00:45, Craig Ringer wrote:

I guarantee you that when you create a 100GB EBS volume on AWS EC2, you don't get 100GB of storage preallocated. AWS are probably pretty good about not running out of backing store, though.

Some db folks (used to anyway) advise dd'ing to your freshly attached devices on AWS (for performance mainly IIRC), but that would help prevent some failure scenarios for any thin provisioned storage (but probably really annoy the admins' thereof).

This still makes a lot of sense on AWS EBS, particularly when using a volume created from a non-empty snapshot. Performance of S3-snapshot based EBS volumes is spectacularly awful, since they're copy-on-read. Reading the whole volume helps a lot.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-20 20:49:08

On Wed, Apr 18, 2018 at 08:45:53PM +0800, Craig Ringer wrote:

On 18 April 2018 at 19:46, Bruce Momjian wrote:

So, if sync mode passes the write to NFS, and NFS pre-reserves write space, and throws an error on reservation failure, that means that NFS will not corrupt a cluster on out-of-space errors.

Yeah. I need to verify in a concrete test case.

Thanks.

The thing is that write() is allowed to be asynchronous anyway. Most file systems choose to implement eager reservation of space, but it's not mandated. AFAICS that's largely a historical accident to keep applications happy, because FSes used to allocate the space at write() time too, and when they moved to delayed allocations, apps tended to break too easily unless they at least reserved space. NFS would have to do a round-trip on write() to reserve space.

The Linux man pages (http://man7.org/linux/man-pages/man2/write.2.html) say:

" A successful return from write() does not make any guarantee that data has been committed to disk. On some filesystems, including NFS, it does not even guarantee that space has successfully been reserved for the data. In this case, some errors might be delayed until a future write(2), fsync(2), or even close(2). The only way to be sure is to call fsync(2) after you are done writing all your data. "

... and I'm inclined to believe it when it refuses to make guarantees. Especially lately.

Uh, even calling fsync after write isn't 100% safe since the kernel could have flushed the dirty pages to storage, and failed, and the fsync would later succeed. I realize newer kernels have that fixed for files open during that operation, but that is the minority of installs.
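
A sketch of the distinction Bruce is drawing (the function is hypothetical, and this assumes a kernel new enough to report writeback errors per open file description): the fd is opened before the writes and kept open through fsync(), so an intervening writeback failure has somewhere to be reported.

    /* Sketch: keep one fd open across the whole write -> fsync window so a
     * background writeback failure can be reported to this descriptor
     * instead of being consumed while no one is looking. */
    #include <fcntl.h>
    #include <unistd.h>

    static int checkpoint_block(const char *path, const char *buf,
                                size_t len, off_t off)
    {
        int fd = open(path, O_WRONLY);       /* opened before any writes... */
        if (fd < 0)
            return -1;

        if (pwrite(fd, buf, len, off) < 0) { /* ...pages dirtied via this fd... */
            close(fd);
            return -1;
        }

        /* ...and fsync() the same fd, never a freshly opened one. */
        int rc = fsync(fd);
        if (close(fd) < 0)
            rc = -1;
        return rc;
    }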

The idea is that when the SAN's actual physically allocate storage gets to 40TB it starts telling you to go buy another rack of storage so you don't run out. You don't have to resize volumes, resize file systems, etc. All the storage space admin is centralized on the SAN and storage team, and your sysadmins, DBAs and app devs are none the wiser. You buy storage when you need it, not when the DBA demands they need a 200% free space margin just in case. Whether or not you agree with this philosophy or think it's sensible is kind of moot, because it's an extremely widespread model, and servers you work on may well be backed by thin provisioned storage even if you don't know it.

Most FSes only touch the blocks on dirty writeback, or sometimes lazily as part of delayed allocation. So if your SAN is running out of space and there's 100MB free, each of your 100 FSes may have decremented its freelist by 2MB and be happily promising more space to apps on write() because, well, as far as they know they're only 50% full. When they all do dirty writeback and flush to storage, kaboom, there's nowhere to put some of the data.

I see what you are saying --- that the kernel is reserving the write space from its free space, but the free space doesn't all exist. I am not sure how we can tell people to make sure the file system free space is real.

You'd have to actually force writes to each page through to the backing storage to know for sure the space existed. Yes, the docs say

" After a successful call to posix_fallocate(), subsequent writes to bytes in the specified range are guaranteed not to fail because of lack of disk space. "

... but they're speaking from the filesystem's perspective. If the FS doesn't dirty and flush the actual blocks, a thin provisioned storage system won't know.

Frankly, in what cases will a write fail for lack of free space? It could be a new WAL file (not recycled), or pages added to the end of the table.

Is that it? It doesn't sound too terrible. If we can eliminate the corruption due to free space exhaustion, it would be a big step forward.

The next most common failure would be temporary storage failure or storage communication failure.

Permanent storage failure is "game over" so we don't need to worry about that.


From:Gasper Zejn <zejn(at)owca(dot)info>
Date:2018-04-21 19:21:39

Just for the record, I tried the test case with ZFS on Ubuntu 17.10 host with ZFS on Linux 0.6.5.11.

ZFS does not swallow the fsync error, but the system does not handle the error nicely: the test case program hangs on fsync, the load jumps up and there's a bunch of z_wr_iss and z_null_int kernel threads belonging to zfs, eating up the CPU.

Even then I managed to reboot the system, so it's not a complete and utter mess.

The test case adjustments are here: https://github.com/zejn/scrapcode/commit/e7612536c346d59a4b69bedfbcafbe8c1079063c

Kind regards,


On 29. 03. 2018 07:25, Craig Ringer wrote:

On 29 March 2018 at 13:06, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:

On Thu, Mar 29, 2018 at 6:00 PM, Justin Pryzby wrote:

The retries are the source of the problem; the first fsync() can return EIO, and also *clears the error*, causing a 2nd fsync (of the same data) to return success.

What I'm failing to grok here is how that error flag even matters, whether it's a single bit or a counter as described in that patch. If write back failed, *the page is still dirty*. So all future calls to fsync() need to try to flush it again, and (presumably) fail again (unless it happens to succeed this time around).

You'd think so. But it doesn't appear to work that way. You can see for yourself with the error device-mapper target mapped over part of a volume.

I wrote a test case here.

https://github.com/ringerc/scrapcode/blob/master/testcases/fsync-error-clear.c

I don't pretend the kernel behaviour is sane. And it's possible I've made an error in my analysis. But since I've observed this in the wild, and seen it in a test case, I strongly suspect that what I've described is just what's happening, brain-dead or no.

Presumably the kernel marks the page clean when it dispatches it to the I/O subsystem and doesn't dirty it again on I/O error? I haven't dug that deep on the kernel side. See the stackoverflow post for details on what I found in kernel code analysis.
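
The linked test case uses the device-mapper "error" target; in outline, the surprising sequence it demonstrates looks roughly like this (a sketch, assuming a device that starts failing writes after the pwrite()):

    /* Sketch of the surprise: after a failed background writeback, the first
     * fsync() reports EIO, but a second fsync() of the same (still-lost) data
     * reports success because the error state was cleared. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void demonstrate(int fd)
    {
        char buf[4096];
        memset(buf, 'x', sizeof(buf));
        if (pwrite(fd, buf, sizeof(buf), 0) < 0)  /* lands in the page cache */
            perror("pwrite");

        /* ... writeback to the failing device happens in the background ... */

        if (fsync(fd) < 0)
            printf("first fsync: %s\n", strerror(errno));   /* EIO, once */

        if (fsync(fd) == 0)
            printf("second fsync: success, but the data never hit disk\n");
    }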


From:Andres Freund <andres(at)anarazel(dot)de>
Date:2018-04-23 20:14:48

Hi,

On 2018-03-28 10:23:46 +0800, Craig Ringer wrote:

TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means "all writes since the last fsync have hit disk" but we assume it means "all writes since the last SUCCESSFUL fsync have hit disk".

But then we retried the checkpoint, which retried the fsync(). The retry succeeded, because the prior fsync() cleared the AS_EIO bad page flag.

Random other thing we should look at: Some filesystems (nfs yes, xfs ext4 no) flush writes at close(2). We check the close() return code but just log it... So close() counts as an fsync for such filesystems.

I'm at LSF/MM to discuss the future behaviour of Linux here, but that's how it is right now.


From:Bruce Momjian <bruce(at)momjian(dot)us>
Date:2018-04-24 00:09:23

On Mon, Apr 23, 2018 at 01:14:48PM -0700, Andres Freund wrote:

Hi,

On 2018-03-28 10:23:46 +0800, Craig Ringer wrote:

TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means "all writes since the last fsync have hit disk" but we assume it means "all writes since the last SUCCESSFUL fsync have hit disk".

But then we retried the checkpoint, which retried the fsync(). The retry succeeded, because the prior fsync() cleared the AS_EIO bad page flag.

Random other thing we should look at: Some filesystems (nfs yes, xfs ext4 no) flush writes at close(2). We check close() return code, just log it... So close() counts as an fsync for such filesystems().

Well, that's interesting. You might remember that NFS does not reserve space for writes like local file systems like ext4/xfs do. For that reason, we might be able to capture the out-of-space error on close and exit sooner for NFS.


From:Craig Ringer <craig(at)2ndquadrant(dot)com>
Date:2018-04-26 02:16:52

On 24 April 2018 at 04:14, Andres Freund wrote:

I'm LSF/MM to discuss future behaviour of linux here, but that's how it is right now.

Interim LWN.net coverage of that can be found here: https://lwn.net/Articles/752613/


From:Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Date:2018-04-27 01:18:55

On Tue, Apr 24, 2018 at 12:09 PM, Bruce Momjian wrote:

On Mon, Apr 23, 2018 at 01:14:48PM -0700, Andres Freund wrote:

Hi,

On 2018-03-28 10:23:46 +0800, Craig Ringer wrote:

TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means "all writes since the last fsync have hit disk" but we assume it means "all writes since the last SUCCESSFUL fsync have hit disk".

But then we retried the checkpoint, which retried the fsync(). The retry succeeded, because the prior fsync() cleared the AS_EIO bad page flag.

Random other thing we should look at: Some filesystems (nfs yes, xfs ext4 no) flush writes at close(2). We check close() return code, just log it... So close() counts as an fsync for such filesystems().

Well, that's interesting. You might remember that NFS does not reserve space for writes like local file systems like ext4/xfs do. For that reason, we might be able to capture the out-of-space error on close and exit sooner for NFS.

It seems like some implementations flush on close and therefore discover an ENOSPC problem at that point, unless they have NFSv4 (RFC 3050) "write delegation" with a promise from the server that a certain amount of space is available. It seems like you can't count on that in any way though, because it's the server that decides when to delegate and how much space to promise is preallocated, not the client. So in userspace you always need to be able to handle errors including ENOSPC returned by close(), and if you ignore that and you're using an operating system that immediately incinerates all evidence after telling you that (so that later fsync() doesn't fail), you're in trouble.

Some relevant code:

It looks like the bleeding edge of the NFS spec includes a new ALLOCATE operation that should be able to support posix_fallocate() (if we were to start using that for extending files):

https://tools.ietf.org/html/rfc7862#page-64

I'm not sure how reliable [posix_]fallocate is on NFS in general though, and it seems that there are fall-back implementations of posix_fallocate() that write zeros (or even just feign success?) which probably won't do anything useful here if not also flushed (that fallback strategy might only work on eager reservation filesystems that don't have direct fallocate support?) so there are several layers (libc, kernel, nfs client, nfs server) that'd need to be aligned for that to work, and it's not clear how a humble userspace program is supposed to know if they are.

I guess if you could find a way to amortise the cost of extending (like Oracle et al do by extending big container datafiles 10MB at a time or whatever), then simply writing zeros and flushing when doing that might work out OK, so you wouldn't need such a thing? (Unless of course it's a COW filesystem, but that's a different can of worms.)
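
Coming back to the first point above, in userspace that means treating both fsync() and the final close() as places where an error can appear; a minimal sketch (the helper name is made up, and retrying either call is not meaningful here):

    /* Sketch: on NFS-like filesystems, ENOSPC/EDQUOT may only surface when
     * dirty data is flushed, which can happen at fsync() or at close().
     * A nonzero return means the data may never have reached the server. */
    #include <errno.h>
    #include <unistd.h>

    static int finish_file(int fd)
    {
        int rc = 0;
        if (fsync(fd) < 0)      /* flush now so an error is seen here... */
            rc = -1;
        if (close(fd) < 0)      /* ...but close() can still report one */
            rc = -1;
        return rc;
    }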


This thread continues on the ext4 mailing list:


From:   "Joshua D. Drake" <jd@...mandprompt.com>
Subject: fsync() errors is unsafe and risks data loss
Date:   Tue, 10 Apr 2018 09:28:15 -0700

-ext4,

If this is not the appropriate list please point me in the right direction. I am a PostgreSQL contributor and we have come across a reliability problem with writes and fsync(). You can see the thread here:

https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@paquier.xyz

The tl;dr; in the first message doesn't quite describe the problem as we started to dig into it further.


From:   "Darrick J. Wong" <darrick.wong@...cle.com>
Date:   Tue, 10 Apr 2018 09:54:43 -0700

On Tue, Apr 10, 2018 at 09:28:15AM -0700, Joshua D. Drake wrote:

-ext4,

If this is not the appropriate list please point me in the right direction. I am a PostgreSQL contributor and we have come across a reliability problem with writes and fsync(). You can see the thread here:

https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@paquier.xyz

The tl;dr; in the first message doesn't quite describe the problem as we started to dig into it further.

You might try the XFS list (linux-xfs@...r.kernel.org) seeing as the initial complaint is against xfs behaviors...


From:   "Joshua D. Drake" <jd@...mandprompt.com>
Date:   Tue, 10 Apr 2018 09:58:21 -0700

On 04/10/2018 09:54 AM, Darrick J. Wong wrote:

On Tue, Apr 10, 2018 at 09:28:15AM -0700, Joshua D. Drake wrote:

-ext4,

If this is not the appropriate list please point me in the right direction. I am a PostgreSQL contributor and we have come across a reliability problem with writes and fsync(). You can see the thread here:

https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@paquier.xyz

The tl;dr; in the first message doesn't quite describe the problem as we started to dig into it further.

You might try the XFS list (linux-xfs@...r.kernel.org) seeing as the initial complaint is against xfs behaviors...

Later in the thread it becomes apparent that it applies to ext4 (NFS too) as well. I picked ext4 because I assumed it is the most populated of the lists since its the default filesystem for most distributions.


From:   "Theodore Y. Ts'o" <tytso@....edu>
Date:   Tue, 10 Apr 2018 14:43:56 -0400

Hi Joshua,

This isn't actually an ext4 issue, but a long-standing VFS/MM issue.

There are going to be multiple opinions about what the right thing to do is. I'll try to give as unbiased a description as possible, but certainly some of this is going to be filtered by my own biases no matter how careful I can be.

First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentially be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations in the interim since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.

Which is why after a while, one can get quite paranoid and assume that the only way you can guarantee data robustness is to store multiple copies and/or use erasure encoding, with some of the copies or shards written to geographically diverse data centers.

Secondly, I think it's fair to say that the vast majority of the companies who require data robustness, and are either willing to pay $$$ to an enterprise distro company like Red Hat, or command a large enough paying customer base that they can afford to dictate terms to an enterprise distro, or hire a consultant such as Christoph, or have their own staffed Linux kernel teams, have tended to use O_DIRECT. So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.

Next, the reason why fsync() has the behaviour that it does is that one of the most common cases of I/O storage errors in buffered use cases, certainly as seen by the community distros, is the user who pulls out a USB stick while it is in use. In that case, if there are dirtied pages in the page cache, the question is what can you do? Sooner or later the writes will time out, and if you leave the pages dirty, then it effectively becomes a permanent memory leak. You can't unmount the file system --- that requires writing out all of the pages such that the dirty bit is turned off. And if you don't clear the dirty bit on an I/O error, then they can never be cleaned. You can't even re-insert the USB stick; the re-inserted USB stick will get a new block device. Worse, when the USB stick was pulled, it will have suffered a power drop, and see above about what could happen after a power drop for non-power fail certified flash devices --- it goes double for the cheap sh*t USB sticks found in the checkout aisle of Micro Center.

So this is the explanation for why Linux handles I/O errors by clearing the dirty bit after reporting the error up to user space. And why there is no eagerness to solve the problem simply by "don't clear the dirty bit". For every one Postgres installation that might have a better recovery after an I/O error, there's probably a thousand clueless Fedora and Ubuntu users who will have a much worse user experience after a USB stick pull happens.

I can think of things that could be done --- for example, it could be switchable on a per-block device basis (or maybe a per-mount basis) whether or not the dirty bit gets cleared after the error is reported to userspace. And perhaps there could be a new unmount flag that causes all dirty pages to be wiped out, which could be used to recover after a permanent loss of the block device. But the question is who is going to invest the time to make these changes? If there is a company who is willing to pay to commission this work, it's almost certainly soluble. Or if a company which has a kernel team on staff is willing to direct an engineer to work on it, it certainly could be solved. But again, of the companies who have client code where we care about robustness and proper handling of failed disk drives, and which have a kernel team on staff, pretty much all of the ones I can think of (e.g., Oracle, Google, etc.) use O_DIRECT and they don't try to make buffered writes and error reporting via fsync(2) work well.

In general these companies want low-level control over buffer cache eviction algorithms, which drives them towards the design decision of effectively implementing the page cache in userspace, and using O_DIRECT reads/writes.

If you are aware of a company who is willing to pay to have a new kernel feature implemented to meet your needs, we might be able to refer you to a company or a consultant who might be able to do that work. Let me know off-line if that's the case...


From:   Andreas Dilger <adilger@...ger.ca>
Date:   Tue, 10 Apr 2018 13:44:48 -0600

On Apr 10, 2018, at 10:50 AM, Joshua D. Drake jd@...mandprompt.com wrote:

-ext4,

If this is not the appropriate list please point me in the right direction. I am a PostgreSQL contributor and we have come across a reliability problem with writes and fsync(). You can see the thread here:

https://www.postgresql.org/message-id/flat/20180401002038.GA2211%40paquier.xyz#20180401002038.GA2211@paquier.xyz

The tl;dr; in the first message doesn't quite describe the problem as we started to dig into it further.

Yes, this is a very long thread. The summary is Postgres is unhappy that fsync() on Linux (and also other OSes) returns an error once if there was a prior write() failure, instead of keeping dirty pages in memory forever and trying to rewrite them.

This behaviour has existed on Linux forever, and (for better or worse) is the only reasonable behaviour that the kernel can take. I've argued for the opposite behaviour at times, and some subsystems already do limited retries before finally giving up on a failed write, though there are also times when retrying at lower levels is pointless if a higher level of code can handle the failure (e.g. mirrored block devices, filesystem data mirroring, userspace data mirroring, or cross-node replication).

The confusion is whether fsync() is a "level" state (return error forever if there were pages that could not be written), or an "edge" state (return error only for any write failures since the previous fsync() call).

I think Anthony Iliopoulos was pretty clear in his multiple descriptions in that thread of why the current behaviour is needed (OOM of the whole system if dirty pages are kept around forever), but many others were stuck on "I can't believe this is happening??? This is totally unacceptable and every kernel needs to change to match my expectations!!!" without looking at the larger picture of what is practical to change and where the issue should best be fixed.

Regardless of why this is the case, the net is that PG needs to deal with all of the systems that currently exist that have this behaviour, even if some day in the future it may change (though that is unlikely). It seems ironic that "keep dirty pages in userspace until fsync() returns success" is totally unacceptable, but "keep dirty pages in the kernel" is fine. My (limited) understanding of databases was that they preferred to cache everything in userspace and use O_DIRECT to write to disk (which returns an error immediately if the write fails and does not double buffer data).


From:   Martin Steigerwald <martin@...htvoll.de>
Date:   Tue, 10 Apr 2018 21:47:21 +0200

Hi Theodore, Darrick, Joshua.

CC´d fsdevel as it does not appear to be Ext4 specific to me (and to you as well, Theodore).

Theodore Y. Ts'o - 10.04.18, 20:43:

This isn't actually an ext4 issue, but a long-standing VFS/MM issue. […] First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentialy be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations in the since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.

Guh. I was not aware of this. I knew consumer-grade SSDs often do not have power loss protection, but still thought they´d handle garbage collection in an atomic way. Sometimes I am tempted to sing an "all hardware is crap" song (starting with Meltdown/Spectre, then probably heading over to storage devices and so on… including firmware crap like Intel ME).

Next, the reason why fsync() has the behaviour that it does is one ofhe the most common cases of I/O storage errors in buffered use cases, certainly as seen by the community distros, is the user who pulls out USB stick while it is in use. In that case, if there are dirtied pages in the page cache, the question is what can you do? Sooner or later the writes will time out, and if you leave the pages dirty, then it effectively becomes a permanent memory leak. You can't unmount the file system --- that requires writing out all of the pages such that the dirty bit is turned off. And if you don't clear the dirty bit on an I/O error, then they can never be cleaned. You can't even re-insert the USB stick; the re-inserted USB stick will get a new block device. Worse, when the USB stick was pulled, it will have suffered a power drop, and see above about what could happen after a power drop for non-power fail certified flash devices --- it goes double for the cheap sh*t USB sticks found in the checkout aisle of Micro Center.

From the original PostgreSQL mailing list thread I did not get on how exactly FreeBSD differs in behavior, compared to Linux. I am aware of one operating system that from a user point of view handles this in almost the right way IMHO: AmigaOS.

When you removed a floppy disk from the drive while the OS was writing to it, it showed a "You MUST insert volume somename into drive somedrive:" and if you did, it just continued writing. (The part that did not work well was that with the original filesystem, if you did not insert it back, the whole disk was corrupted, usually to the point beyond repair, so the "MUST" was no joke.)

In my opinion from a user´s point of view this is the only sane way to handle the premature removal of removable media. I have read of a GSoC project to implement something like this for NetBSD but I did not check on the outcome of it. But in MS-DOS I think there was something similar, however MS-DOS is not a multitasking operating system as AmigaOS is.

Implementing something like this for Linux would be quite a feat, I think, because in addition to the implementation in the kernel, the desktop environment or whatever other userspace you use would need to handle it as well, so you´d have to adapt udev / udisks / probably Systemd. And probably this behavior needs to be restricted to anything that is really removable, and even then, in order to prevent memory exhaustion in case processes continue to write to a removed and not yet re-inserted USB harddisk, the kernel would need to halt processes which dirty I/O to this device. (I believe this is what AmigaOS did. It just blocked all subsequent I/O to the device until it was re-inserted. But then the I/O handling in that OS at that time is quite different from what Linux does.)

So this is the explanation for why Linux handles I/O errors by clearing the dirty bit after reporting the error up to user space. And why there is not eagerness to solve the problem simply by "don't clear the dirty bit". For every one Postgres installation that might have a better recover after an I/O error, there's probably a thousand clueless Fedora and Ubuntu users who will have a much worse user experience after a USB stick pull happens.

I was not aware that flash based media may be as crappy as you hint at.

From my tests with AmigaOS 4.something or AmigaOS 3.9 + 3rd Party Poseidon USB stack the above mechanism worked even with USB sticks. I however did not test this often and I did not check for data corruption after a test.


From:   Andres Freund <andres@...razel.de>
Date:   Tue, 10 Apr 2018 15:07:26 -0700

(Sorry if I screwed up the thread structure - I had to reconstruct the reply-to and CC list from the web archive as I've not found a way to properly download an mbox or such of old content. Was subscribed to fsdevel but not ext4 lists)

Hi,

2018-04-10 18:43:56 Ted wrote:

I'll try to give as unbiased a description as possible, but certainly some of this is going to be filtered by my own biases no matter how careful I can be.

Same ;)

2018-04-10 18:43:56 Ted wrote:

So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.

That's a bit of a cop out. It's not just databases that care. Even more basic tools like SCM, package managers and editors care whether they get proper responses back from fsync that imply things actually were synced.

2018-04-10 18:43:56 Ted wrote:

So this is the explanation for why Linux handles I/O errors by clearing the dirty bit after reporting the error up to user space. And why there is not eagerness to solve the problem simply by "don't clear the dirty bit". For every one Postgres installation that might have a better recover after an I/O error, there's probably a thousand clueless Fedora and Ubuntu users who will have a much worse user experience after a USB stick pull happens.

I don't think these necessarily are as contradictory goals as you paint them. At least in postgres' case we can deal with the fact that an fsync retry isn't going to fix the problem by reentering crash recovery or just shutting down - therefore we don't need to keep all the dirty buffers around. A per-inode or per-superblock bit that causes further fsyncs to fail would be entirely sufficient for that.

While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.

Both in postgres, and a lot of other applications, it's not at all guaranteed to consistently have one FD open for every file written. Therefore even the more recent per-fd errseq logic doesn't guarantee that the failure will ever be seen by an application diligently fsync()ing.

You'd not even need to have per inode information or such in the case that the block device goes away entirely. As the FS isn't generally unmounted in that case, you could trivially keep a per-mount (or superblock?) bit that says "I died" and set that instead of keeping per inode/whatever information.

2018-04-10 18:43:56 Ted wrote:

If you are aware of a company who is willing to pay to have a new kernel feature implemented to meet your needs, we might be able to refer you to a company or a consultant who might be able to do that work.

I find that a bit of a disappointing response. I think it's fair to say that for advanced features, but we're talking about the basic guarantee that fsync actually does something even remotely reasonable.

2018-04-10 19:44:48 Andreas wrote:

The confusion is whether fsync() is a "level" state (return error forever if there were pages that could not be written), or an "edge" state (return error only for any write failures since the previous fsync() call).

I don't think that's the full issue. We can deal with the fact that an fsync failure is edge-triggered if there's a guarantee that every process doing so would get it. The fact that one needs to have an FD open from before any failing writes occurred to get a failure, THAT'S the big issue.

Beyond postgres, it's a pretty common approach to do work on a lot of files without fsyncing, then iterate over the directory and fsync everything, and then assume you're safe. But unless I severely misunderstand something, that'd only be safe if you kept an FD for every file open, which isn't realistic for pretty obvious reasons.
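
Written out, that pattern looks roughly like the sketch below (using nftw(); whether the fsync() calls here actually observe writeback failures that happened while no descriptor was open is exactly the problem being discussed):

    /* Sketch of "write a tree, then fsync everything": each file is re-opened
     * just to fsync it, which is where the missed-error problem bites if
     * writeback already failed while no descriptor was open. */
    #define _XOPEN_SOURCE 500
    #include <fcntl.h>
    #include <ftw.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int fsync_one(const char *path, const struct stat *sb,
                         int type, struct FTW *ftwbuf)
    {
        (void)sb; (void)ftwbuf;
        if (type != FTW_F)
            return 0;                    /* only regular files */
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return 1;                    /* nonzero stops the walk */
        int rc = fsync(fd);
        if (rc < 0)
            perror(path);
        close(fd);
        return rc < 0 ? 1 : 0;
    }

    /* Returns 0 if every file was fsync'd without a reported error. */
    int fsync_tree(const char *root)
    {
        return nftw(root, fsync_one, 64, FTW_PHYS);
    }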

2018-04-10 18:43:56 Ted wrote:

I think Anthony Iliopoulos was pretty clear in his multiple descriptions in that thread of why the current behaviour is needed (OOM of the whole system if dirty pages are kept around forever), but many others were stuck on "I can't believe this is happening??? This is totally unacceptable and every kernel needs to change to match my expectations!!!" without looking at the larger picture of what is practical to change and where the issue should best be fixed.

Everyone can participate in discussions...


From:   Andreas Dilger <adilger@...ger.ca>
Date:   Wed, 11 Apr 2018 15:52:44 -0600

On Apr 10, 2018, at 4:07 PM, Andres Freund andres@...razel.de wrote:

2018-04-10 18:43:56 Ted wrote:

So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.

That's a bit of a cop out. It's not just databases that care. Even more basic tools like SCM, package managers and editors care whether they can proper responses back from fsync that imply things actually were synced.

Sure, but it is mostly PG that is doing (IMHO) crazy things like writing to thousands(?) of files, closing the file descriptors, then expecting fsync() on a newly-opened fd to return a historical error. If an editor tries to write a file, then calls fsync and gets an error, the user will enter a new pathname and retry the write. The package manager will assume the package installation failed, and uninstall the parts of the package that were already written.

There is no way the filesystem can handle the package manager failure case, and keeping the pages dirty and retrying indefinitely may never work (e.g. disk is dead or disconnected, is a sparse volume without any free space, etc). This (IMHO) implies that the higher layer (which knows more about what the write failure implies) needs to deal with this.

2018-04-10 18:43:56 Ted wrote:

So this is the explanation for why Linux handles I/O errors by clearing the dirty bit after reporting the error up to user space. And why there is not eagerness to solve the problem simply by "don't clear the dirty bit". For every one Postgres installation that might have a better recover after an I/O error, there's probably a thousand clueless Fedora and Ubuntu users who will have a much worse user experience after a USB stick pull happens.

I don't think these necessarily are as contradictory goals as you paint them. At least in postgres' case we can deal with the fact that an fsync retry isn't going to fix the problem by reentering crash recovery or just shutting down - therefore we don't need to keep all the dirty buffers around. A per-inode or per-superblock bit that causes further fsyncs to fail would be entirely sufficent for that.

While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.

I think there are two issues here - "fsync() on an fd that was just opened" and "persistent error state (without keeping dirty pages in memory)".

If there is background data writeback without an open file descriptor, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.

Consider if there was a per-inode "there was once an error writing this inode" flag. Then fsync() would return an error on the inode forever, since there is no way in POSIX to clear this state, since it would need to be kept in case some new fd is opened on the inode and does an fsync() and wants the error to be returned.

IMHO, the only alternative would be to keep the dirty pages in memory until they are written to disk. If that was not possible, what then? It would need a reboot to clear the dirty pages, or truncate the file (discarding all data)?

Both in postgres, and a lot of other applications, it's not at all guaranteed to consistently have one FD open for every file written. Therefore even the more recent per-fd errseq logic doesn't guarantee that the failure will ever be seen by an application diligently fsync()ing.

... only if the application closes all fds for the file before calling fsync. If any fd is kept open from the time of the failure, it will return the original error on fsync() (and then no longer return it).

It's not that you need to keep every fd open forever. You could put them into a shared pool, and re-use them if the file is "re-opened", and call fsync on each fd before it is closed (because the pool is getting too big or because you want to flush the data for that file, or shut down the DB). That wouldn't require a huge re-architecture of PG, just a small library to handle the shared fd pool.

That might even improve performance, because opening and closing files is itself not free, especially if you are working with remote filesystems.
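
A toy version of that suggestion, just to show the shape of it (hypothetical names, no locking, a fixed-size table, and no real eviction policy):

    /* Toy shared fd pool: keep descriptors open so a later fsync() on the
     * pooled fd can still observe writeback errors; fsync before closing. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define POOL_SIZE 64

    static struct { char path[256]; int fd; } pool[POOL_SIZE];

    int pool_open(const char *path)
    {
        for (int i = 0; i < POOL_SIZE; i++)
            if (pool[i].fd > 0 && strcmp(pool[i].path, path) == 0)
                return pool[i].fd;               /* reuse the existing fd */

        for (int i = 0; i < POOL_SIZE; i++) {
            if (pool[i].fd == 0) {               /* empty slot (toy sentinel) */
                int fd = open(path, O_RDWR | O_CREAT, 0644);
                if (fd < 0)
                    return -1;
                snprintf(pool[i].path, sizeof(pool[i].path), "%s", path);
                pool[i].fd = fd;
                return fd;
            }
        }
        return -1;    /* pool full; a real pool would fsync and evict a victim */
    }

    int pool_close(int slot)
    {
        int rc = fsync(pool[slot].fd);   /* surface any pending writeback error */
        if (close(pool[slot].fd) < 0)
            rc = -1;
        pool[slot].fd = 0;
        return rc;
    }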

You'd not even need to have per inode information or such in the case that the block device goes away entirely. As the FS isn't generally unmounted in that case, you could trivially keep a per-mount (or superblock?) bit that says "I died" and set that instead of keeping per inode/whatever information.

The filesystem will definitely return an error in this case, I don't think this needs any kind of changes:

    int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
    {
        if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
            return -EIO;

2018-04-10 18:43:56 Ted wrote:

If you are aware of a company who is willing to pay to have a new kernel feature implemented to meet your needs, we might be able to refer you to a company or a consultant who might be able to do that work.

I find it a bit dissapointing response. I think it's fair to say that for advanced features, but we're talking about the basic guarantee that fsync actually does something even remotely reasonable.

Linux (as PG) is run by people who develop it for their own needs, or are paid to develop it for the needs of others. Everyone already has too much work to do, so you need to find someone who has an interest in fixing this (IMHO very peculiar) use case. If PG developers want to add a tunable "keep dirty pages in RAM on IO failure", I don't think that it would be too hard for someone to do. It might be harder to convince some of the kernel maintainers to accept it, and I've been on the losing side of that battle more than once. However, like everything you don't pay for, you can't require someone else to do this for you. It wouldn't hurt to see if Jeff Layton, who wrote the errseq patches, would be interested to work on something like this.

That said, even if a fix was available for Linux tomorrow, it would be years before a majority of users would have it available on their system, that includes even the errseq mechanism that was landed a few months ago. That implies to me that you'd want something that fixes PG now so that it works around whatever (perceived) breakage exists in the Linux fsync() implementation. Since the thread indicates that non-Linux kernels have the same fsync() behaviour, it makes sense to do that even if the Linux fix was available.

2018-04-10 19:44:48 Andreas wrote:

The confusion is whether fsync() is a "level" state (return error forever if there were pages that could not be written), or an "edge" state (return error only for any write failures since the previous fsync() call).

I don't think that's the full issue. We can deal with the fact that an fsync failure is edge-triggered if there's a guarantee that every process doing so would get it. The fact that one needs to have an FD open from before any failing writes occurred to get a failure, THAT'S the big issue.

Beyond postgres, it's a pretty common approach to do work on a lot of files without fsyncing, then iterate over the directory fsync everything, and then assume you're safe. But unless I severaly misunderstand something that'd only be safe if you kept an FD for every file open, which isn't realistic for pretty obvious reasons.

I can't say how common or uncommon such a workload is, though PG is the only application that I've heard of doing it, and I've been working on filesystems for 20 years. I'm a bit surprised that anyone expects fsync() on a newly-opened fd to have any state from write() calls that predate the open. I can understand fsync() returning an error for any IO that happens within the context of that fsync(), but how far should it go back for reporting errors on that file? Forever? The only way to clear the error would be to reboot the system, since I'm not aware of any existing POSIX code to clear such an error


From:   Dave Chinner <david@...morbit.com>
Date:   Thu, 12 Apr 2018 10:09:16 +1000

On Wed, Apr 11, 2018 at 03:52:44PM -0600, Andreas Dilger wrote:

On Apr 10, 2018, at 4:07 PM, Andres Freund andres@...razel.de wrote:

2018-04-10 18:43:56 Ted wrote:

So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.

That's a bit of a cop out. It's not just databases that care. Even more basic tools like SCM, package managers and editors care whether they can proper responses back from fsync that imply things actually were synced.

Sure, but it is mostly PG that is doing (IMHO) crazy things like writing to thousands(?) of files, closing the file descriptors, then expecting fsync() on a newly-opened fd to return a historical error.

Yeah, this seems like a recipe for disaster, especially on cross-platform code where every OS platform behaves differently and almost never to expectation.

And speaking of "behaving differently to expectations", nobody has mentioned that close() can also return write errors. Hence if you do write - close - open - fsync the write error might get reported on close, not fsync. IOWs, the assumption that "async writeback errors will persist across close to open" is fundamentally broken to begin with. It's even documented as a silent data loss vector in the close(2) man page:

$ man 2 close
.....
   Dealing with error returns from close()

	  A careful programmer will check the return value of
	  close(), since it is quite possible that  errors  on  a
	  previous  write(2)  operation  are reported  only on the
	  final close() that releases the open file description.
	  Failing to check the return value when closing a file may
	  lead to silent loss of data.  This can especially be
	  observed with NFS and with disk quota.

Yeah, ensuring data integrity in the face of IO errors is a really hard problem. :/

To pound the broken record: there are many good reasons why Linux filesystem developers have said "you should use direct IO" to the PG devs each time we have this "the kernel doesn't do [complex things PG needs]" discussion.

In this case, robust IO error reporting is easy with DIO. It's one of the reasons most of the high performance database engines are either using or moving to non-blocking AIO+DIO (RWF_NOWAIT) and use O_DSYNC/RWF_DSYNC for integrity-critical IO dispatch. This is also being driven by the availability of high performance, high IOPS solid state storage where buffering in RAM to optimise IO patterns and throughput provides no real performance benefit.

Using the AIO+DIO infrastructure ensures errors are reported for the specific write that fails at failure time (i.e. in the aio completion event for the specific IO), yet high IO throughput can be maintained without the application needing its own threading infrastructure to prevent blocking.

This means the application doesn't have to guess where the write error occurred to retry/recover, have to handle async write errors on close(), have to use fsync() to gather write IO errors and then infer where the IO failure was, or require kernels on every supported platform to jump through hoops to try to do exactly the right thing in error conditions for everyone in all circumstances at all times....
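
A much-simplified synchronous version of that idea (just O_DIRECT | O_DSYNC with an aligned buffer; the engines described above layer non-blocking AIO on top of this, so the blocking form here understates the performance side of the argument):

    /* Sketch: with O_DIRECT | O_DSYNC each pwrite() goes to stable storage and
     * reports its own error, instead of errors surfacing later at fsync().
     * Buffer address, file offset, and length must all be aligned; 4096 bytes
     * is assumed here, real code would query the device. */
    #define _GNU_SOURCE              /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define ALIGNMENT 4096

    int write_block_direct(const char *path, off_t offset,
                           const void *data, size_t len)
    {
        if (len % ALIGNMENT || offset % ALIGNMENT)
            return -1;

        void *buf = NULL;
        if (posix_memalign(&buf, ALIGNMENT, len) != 0)
            return -1;
        memcpy(buf, data, len);

        int rc = -1;
        int fd = open(path, O_WRONLY | O_DIRECT | O_DSYNC);
        if (fd >= 0) {
            /* The error for THIS write is reported right here. */
            rc = (pwrite(fd, buf, len, offset) == (ssize_t)len) ? 0 : -1;
            if (close(fd) < 0)
                rc = -1;
        }
        free(buf);
        return rc;
    }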


From:   Andres Freund <andres@...razel.de>
Date:   Wed, 11 Apr 2018 19:17:52 -0700

On 2018-04-11 15:52:44 -0600, Andreas Dilger wrote:

On Apr 10, 2018, at 4:07 PM, Andres Freund andres@...razel.de wrote:

2018-04-10 18:43:56 Ted wrote:

So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.

That's a bit of a cop out. It's not just databases that care. Even more basic tools like SCM, package managers and editors care whether they can proper responses back from fsync that imply things actually were synced.

Sure, but it is mostly PG that is doing (IMHO) crazy things like writing to thousands(?) of files, closing the file descriptors, then expecting fsync() on a newly-opened fd to return a historical error.

It's not just postgres. dpkg (underlying apt, on Debian-derived distros), to take an example I just randomly guessed, does too:

  /* We want to guarantee the extracted files are on the disk, so that the
   * subsequent renames to the info database do not end up with old or zero
   * length files in case of a system crash. As neither dpkg-deb nor tar do
   * explicit fsync()s, we have to do them here.
   * XXX: This could be avoided by switching to an internal tar extractor. */
  dir_sync_contents(cidir);

(a bunch of other places too)

Especially on ext3 but also on newer filesystems it's performance-wise entirely infeasible to fsync() every single file individually - the performance becomes entirely atrocious if you do that.

I think there's some legitimate arguments that a database should use direct IO (more on that as a reply to David), but claiming that all sorts of random utilities need to use DIO with buffering etc is just insane.

If an editor tries to write a file, then calls fsync and gets an error, the user will enter a new pathname and retry the write. The package manager will assume the package installation failed, and uninstall the parts of the package that were already written.

Except that they won't notice that they got a failure, at least in the dpkg case. And happily continue installing corrupted data.

There is no way the filesystem can handle the package manager failure case, and keeping the pages dirty and retrying indefinitely may never work (e.g. disk is dead or disconnected, is a sparse volume without any free space, etc). This (IMHO) implies that the higher layer (which knows more about what the write failure implies) needs to deal with this.

Yea, I agree that'd not be sane. As far as I understand the dpkg code (all of 10min reading it), that'd also be unnecessary. It can abort the installation, but only if it detects the error. Which isn't happening.

While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.

I think there are two issues here - "fsync() on an fd that was just opened" and "persistent error state (without keeping dirty pages in memory)".

If there is background data writeback without an open file descriptor, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.

And that's horrible. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.

Or even more extreme, you untar/zip/git clone a directory. Then do a sync. And you don't know whether anything actually succeeded.

Consider if there was a per-inode "there was once an error writing this inode" flag. Then fsync() would return an error on the inode forever, since there is no way in POSIX to clear this state, since it would need to be kept in case some new fd is opened on the inode and does an fsync() and wants the error to be returned.

The data in the file also is corrupt. Having to unmount or delete the file to reset the fact that it can't safely be assumed to be on disk isn't insane.

Both in postgres, and a lot of other applications, it's not at all guaranteed to consistently have one FD open for every file written. Therefore even the more recent per-fd errseq logic doesn't guarantee that the failure will ever be seen by an application diligently fsync()ing.

... only if the application closes all fds for the file before calling fsync. If any fd is kept open from the time of the failure, it will return the original error on fsync() (and then no longer return it).

It's not that you need to keep every fd open forever. You could put them into a shared pool, and re-use them if the file is "re-opened", and call fsync on each fd before it is closed (because the pool is getting too big or because you want to flush the data for that file, or shut down the DB). That wouldn't require a huge re-architecture of PG, just a small library to handle the shared fd pool.
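
As a rough illustration (the multi-process objections that follow notwithstanding), a minimal sketch of such a shared fd pool might look like the following. All names here are hypothetical, and a real implementation would need locking, eviction when the pool fills up, and handling of open flags:

    /* Minimal sketch of a shared fd pool: hand back a long-lived fd when a
     * file is "re-opened", and fsync before any fd is closed so a writeback
     * error that happened while the fd was open is still reported to us. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define POOL_SIZE 128

    struct pooled_fd {
        char path[4096];
        int  fd;                    /* -1 if the slot is empty */
    };

    static struct pooled_fd pool[POOL_SIZE];

    void pool_init(void)
    {
        for (int i = 0; i < POOL_SIZE; i++)
            pool[i].fd = -1;
    }

    /* "Re-open" a file: return the pooled fd if we already have one,
     * otherwise open the file and remember the fd. */
    int pool_open(const char *path)
    {
        int free_slot = -1;
        for (int i = 0; i < POOL_SIZE; i++) {
            if (pool[i].fd >= 0 && strcmp(pool[i].path, path) == 0)
                return pool[i].fd;
            if (pool[i].fd < 0 && free_slot < 0)
                free_slot = i;
        }
        if (free_slot < 0)
            return -1;              /* pool full; a real pool would evict a slot here */
        int fd = open(path, O_WRONLY | O_CREAT, 0644);
        if (fd < 0)
            return -1;
        snprintf(pool[free_slot].path, sizeof(pool[free_slot].path), "%s", path);
        pool[free_slot].fd = fd;
        return fd;
    }

    /* fsync before the fd is closed, so the caller still sees any error. */
    int pool_close_slot(int slot)
    {
        int err = fsync(pool[slot].fd);
        if (close(pool[slot].fd) != 0)
            err = -1;
        pool[slot].fd = -1;
        return err;
    }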

Except that postgres uses multiple processes. And works on a lot of architectures. If we started to fsync all opened files on process exit our users would lynch us. We'd need a complicated scheme that sends file descriptors across sockets between processes, then deduplicates them on the receiving side, somehow figuring out which are the oldest file descriptors (handling clock drift safely).

Note that it'd be perfectly fine that we've "thrown away" the buffer contents if we'd get notified that the fsync failed. We could just do WAL replay, and restore the contents (just as we do after crashes and/or for replication).

That might even improve performance, because opening and closing files is itself not free, especially if you are working with remote filesystems.

There's already a per-process cache of open files.

You'd not even need to have per inode information or such in the case that the block device goes away entirely. As the FS isn't generally unmounted in that case, you could trivially keep a per-mount (or superblock?) bit that says "I died" and set that instead of keeping per inode/whatever information.

The filesystem will definitely return an error in this case, I don't think this needs any kind of changes:

  int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
  {
          if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
                  return -EIO;

Well, I'm making that argument because several people argued that throwing away buffer contents in this case is the only way to not cause OOMs, and that that's incompatible with reporting errors. It's clearly not...

2018-04-10 18:43:56 Ted wrote:

If you are aware of a company who is willing to pay to have a new kernel feature implemented to meet your needs, we might be able to refer you to a company or a consultant who might be able to do that work.

I find that a bit of a disappointing response. I think it's fair to say that for advanced features, but we're talking about the basic guarantee that fsync actually does something even remotely reasonable.

Linux (as PG) is run by people who develop it for their own needs, or are paid to develop it for the needs of others.

Sure.

Everyone already has too much work to do, so you need to find someone who has an interest in fixing this (IMHO very peculiar) use case. If PG developers want to add a tunable "keep dirty pages in RAM on IO failure", I don't think that it would be too hard for someone to do. It might be harder to convince some of the kernel maintainers to accept it, and I've been on the losing side of that battle more than once. However, like everything you don't pay for, you can't require someone else to do this for you. It wouldn't hurt to see if Jeff Layton, who wrote the errseq patches, would be interested to work on something like this.

I don't think this is that PG specific, as explained above.
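
To make the failure mode in this exchange concrete, here's a minimal C sketch (the function name is made up) of the write-then-fsync-from-a-fresh-fd pattern under discussion. On kernels without the newer errseq-based reporting, the fsync at the end can return success even though writeback of the data failed in the meantime:

    /* Sketch of the pattern being discussed: write via one fd, close it, and
     * fsync() a freshly opened fd later. */
    #include <fcntl.h>
    #include <unistd.h>

    int write_then_fsync_later(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len) {
            close(fd);
            return -1;
        }
        close(fd);              /* data may still be dirty in the page cache */

        /* ... time passes; background writeback may fail here, and the error
         * may be dropped because no fd is open on the file ... */

        fd = open(path, O_WRONLY);
        if (fd < 0)
            return -1;
        int err = fsync(fd);    /* can return 0 even though the data was lost */
        close(fd);
        return err;
    }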


From:   Andres Freund <andres@...razel.de>
Date:   Wed, 11 Apr 2018 19:32:21 -0700

Hi,

On 2018-04-12 10:09:16 +1000, Dave Chinner wrote:

To pound the broken record: there are many good reasons why Linux filesystem developers have said "you should use direct IO" to the PG devs each time we have this "the kernel doesn't do [complex things PG needs]" discussion.

I personally am on board with doing that. But you also gotta recognize that an efficient DIO usage is a metric ton of work, and you need a large amount of differing logic for different platforms. It's just not realistic to do so for every platform. Postgres is developed by a small number of people, isn't VC backed etc. The amount of resources we can throw at something is fairly limited. I'm hoping to work on adding linux DIO support to pg, but I'm sure as hell not going to be able to do the same on windows (solaris, hpux, aix, ...) etc.

And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea. Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).

In this case, robust IO error reporting is easy with DIO. It's one of the reasons most of the high performance database engines are either using or moving to non-blocking AIO+DIO (RWF_NOWAIT) and use O_DSYNC/RWF_DSYNC for integrity-critical IO dispatch. This is also being driven by the availability of high performance, high IOPS solid state storage where buffering in RAM to optimise IO patterns and throughput provides no real performance benefit.

Using the AIO+DIO infrastructure ensures errors are reported for the specific write that fails at failure time (i.e. in the aio completion event for the specific IO), yet high IO throughput can be maintained without the application needing it's own threading infrastructure to prevent blocking.

This means the application doesn't have to guess where the write error occurred to retry/recover, have to handle async write errors on close(), have to use fsync() to gather write IO errors and then infer where the IO failure was, or require kernels on every supported platform to jump through hoops to try to do exactly the right thing in error conditions for everyone in all circumstances at all times....

Most of that sounds like a good thing to do, but you got to recognize that that's a lot of linux specific code.


From:   Andres Freund <andres@...razel.de>
Date:   Wed, 11 Apr 2018 19:51:13 -0700

Hi,

On 2018-04-11 19:32:21 -0700, Andres Freund wrote:

And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea. Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).

And before somebody argues that that's a too small window to trigger the problem realistically: Restoring large databases happens pretty commonly (for new replicas, testcases, or actual fatal issues), takes time, and it's where a lot of storage is actually written to for the first time in a while, so it's far from unlikely to trigger bad block errors or such.


From:   Matthew Wilcox <willy@...radead.org>
Date:   Wed, 11 Apr 2018 20:02:48 -0700

On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:

While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.

I think there are two issues here - "fsync() on an fd that was just opened" and "persistent error state (without keeping dirty pages in memory)".

If there is background data writeback without an open file descriptor, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.

And that's horrible. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.

At the moment, when we open a file, we sample the current state of the writeback error and only report new errors. We could set it to zero instead, and report the most recent error as soon as anything happens which would report an error. That way err = close(open("file")); would report the most recent error.

That's not going to be persistent across the data structure for that inode being removed from memory; we'd need filesystem support for persisting that. But maybe it's "good enough" to only support it for recent files.

Jeff, what do you think?
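
To spell out what that proposal would mean for applications (this describes the proposed behaviour, not what current kernels do), a check could be as simple as the sketch below:

    /* Under the proposal above, opening and immediately closing a file would
     * surface the most recent writeback error for it. Sketch only; today
     * this close() will normally just return 0. */
    #include <fcntl.h>
    #include <unistd.h>

    int probe_writeback_error(const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;
        return close(fd);
    }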


From:   "Theodore Y. Ts'o" <tytso@....edu>
Date:   Thu, 12 Apr 2018 01:09:24 -0400

On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:

Most of that sounds like a good thing to do, but you got to recognize that that's a lot of linux specific code.

I know it's not what PG has chosen, but realistically all of the other major databases and userspace based storage systems have used DIO precisely because it's the way to avoid OS-specific behavior or require OS-specific code. DIO is simple, and pretty much the same everywhere.

In contrast, the exact details of how buffered I/O works can be quite different on different OS's. This is especially true if you take into account performance related details (e.g., the cleaning algorithm, how pages get chosen for eviction, etc.).

As I read the PG-hackers thread, I thought I saw acknowledgement that some of the behaviors you don't like with Linux also show up on other Unix or Unix-like systems?


From:   "Theodore Y. Ts'o" <tytso@....edu>
Date:   Thu, 12 Apr 2018 01:34:45 -0400

On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:

If there is background data writeback without an open file descriptor, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.

And that's horrible. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.

If there is no open file descriptor, and in many cases, no process (because it has already exited), it may be horrible, but what the h*ll else do you expect the OS to do?

The solution we use at Google is that we watch for I/O errors using a completely different process that is responsible for monitoring machine health. It used to scrape dmesg, but we now arrange to have I/O errors get sent via a netlink channel to the machine health monitoring daemon. If it detects errors on a particular hard drive, it tells the cluster file system to stop using that disk, and to reconstruct from erasure code all of the data chunks on that disk onto other disks in the cluster. We then run a series of disk diagnostics to make sure we find all of the bad sectors (very often, where there is one bad sector, there are several more waiting to be found), and then afterwards, put the disk back into service.

By making it be a separate health monitoring process, we can have HDD experts write much more sophisticated code that can ask the disk firmware for more information (e.g., SMART, the grown defect list), do much more careful scrubbing of the disk media, etc., before returning the disk back to service.

Everyone already has too much work to do, so you need to find someone who has an interest in fixing this (IMHO very peculiar) use case. If PG developers want to add a tunable "keep dirty pages in RAM on IO failure", I don't think that it would be too hard for someone to do. It might be harder to convince some of the kernel maintainers to accept it, and I've been on the losing side of that battle more than once. However, like everything you don't pay for, you can't require someone else to do this for you. It wouldn't hurt to see if Jeff Layton, who wrote the errseq patches, would be interested to work on something like this.

I don't think this is that PG specific, as explained above.

The reality is that recovering from disk errors is tricky business, and I very much doubt most userspace applications, including distro package managers, are going to want to engineer for trying to detect and recover from disk errors. If that were true, then Red Hat and/or SuSE have kernel engineers, and they would have implemented everything on your wish list. They haven't, and that should tell you something.

The other reality is that once a disk starts developing errors, in reality you will probably need to take the disk off-line, scrub it to find any other media errors, and there's a good chance you'll need to rewrite bad sectors (including some which are on top of file system metadata, so you probably will have to run fsck or reformat the whole file system). I certainly don't think it's realistic to assume adding lots of sophistication to each and every userspace program.

If you have tens or hundreds of thousands of disk drives, then you will need to do something automated, but I claim that you really don't want to smush all of that detailed exception handling and HDD repair technology into each database or cluster file system component. It really needs to be done in a separate health-monitor and machine-level management system.


From:   Dave Chinner <david@...morbit.com>
Date:   Thu, 12 Apr 2018 15:45:36 +1000

On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:

Hi,

On 2018-04-12 10:09:16 +1000, Dave Chinner wrote:

To pound the broken record: there are many good reasons why Linux filesystem developers have said "you should use direct IO" to the PG devs each time we have this "the kernel doesn't do [complex things PG needs]" discussion.

I personally am on board with doing that. But you also gotta recognize that an efficient DIO usage is a metric ton of work, and you need a large amount of differing logic for different platforms. It's just not realistic to do so for every platform. Postgres is developed by a small number of people, isn't VC backed etc. The amount of resources we can throw at something is fairly limited. I'm hoping to work on adding linux DIO support to pg, but I'm sure as hell not going to be able to do the same on windows (solaris, hpux, aix, ...) etc.

And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea.

Yes it is.

This is what syncfs() is for - making sure a large amount of data and metadata spread across many files and subdirectories in a single filesystem is pushed to stable storage in the most efficient manner possible.

Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).

No, just saying that fsyncing individual files and directories is about the most inefficient way you could possibly go about doing this.
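
For reference, syncfs() takes an fd on any file or directory in the target filesystem, so the usage Dave is describing looks roughly like the sketch below (Linux-specific; the helper name is made up, and the caveats about syncfs error reporting come up later in the thread):

    /* Flush all dirty data and metadata on the filesystem containing `path`
     * with a single call, instead of fsync()ing every file individually. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    int sync_filesystem_of(const char *path)
    {
        int fd = open(path, O_RDONLY | O_DIRECTORY);
        if (fd < 0)
            return -1;
        int err = syncfs(fd);
        close(fd);
        return err;
    }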


From:   Lukas Czerner <lczerner@...hat.com>
Date:   Thu, 12 Apr 2018 12:19:26 +0200

On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:

And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea. Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).

Does not seem like a problem to me, just checksum the thing if you really need to be extra safe. You should probably be doing it anyway if you backup / archive / timetravel / whatnot.


From:   Jeff Layton <jlayton@...hat.com>
Date:   Thu, 12 Apr 2018 07:09:14 -0400

On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:

On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:

While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.

I think there are two issues here - "fsync() on an fd that was just opened" and "persistent error state (without keeping dirty pages in memory)".

If there is background data writeback without an open file descriptor, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.

And that's horrible. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.

What are you expecting to happen in this case? Are you expecting a read error due to a writeback failure? Or are you just saying that we should be invalidating pages that failed to be written back, so that they can be re-read?

At the moment, when we open a file, we sample the current state of the writeback error and only report new errors. We could set it to zero instead, and report the most recent error as soon as anything happens which would report an error. That way err = close(open("file")); would report the most recent error.

That's not going to be persistent across the data structure for that inode being removed from memory; we'd need filesystem support for persisting that. But maybe it's "good enough" to only support it for recent files.

Jeff, what do you think?

I hate it :). We could do that, but....yecchhhh.

Reporting errors only in the case where the inode happened to stick around in the cache seems too unreliable for real-world usage, and might be problematic for some use cases. I'm also not sure it would really be helpful.

I think the crux of the matter here is not really about error reporting, per-se. I asked this at LSF last year, and got no real answer:

When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?

One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.

Maybe that's ok in the face of a writeback error though? IDK.


From:   Matthew Wilcox <willy@...radead.org>
Date:   Thu, 12 Apr 2018 04:19:48 -0700

On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:

On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:

At the moment, when we open a file, we sample the current state of the writeback error and only report new errors. We could set it to zero instead, and report the most recent error as soon as anything happens which would report an error. That way err = close(open("file")); would report the most recent error.

That's not going to be persistent across the data structure for that inode being removed from memory; we'd need filesystem support for persisting that. But maybe it's "good enough" to only support it for recent files.

Jeff, what do you think?

I hate it :). We could do that, but....yecchhhh.

Reporting errors only in the case where the inode happened to stick around in the cache seems too unreliable for real-world usage, and might be problematic for some use cases. I'm also not sure it would really be helpful.

Yeah, it's definitely half-arsed. We could make further changes to improve the situation, but they'd have wider impact. For example, we can tell if the error has been sampled by any existing fd, so we could bias our inode reaping to have inodes with unreported errors stick around in the cache for longer.

I think the crux of the matter here is not really about error reporting, per-se. I asked this at LSF last year, and got no real answer:

When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?

I suspect it isn't. If there's a transient error then we should reattempt the write. OTOH if the error is permanent then reattempting the write isn't going to do any good and it's just going to cause the drive to go through the whole error handling dance again. And what do we do if we're low on memory and need these pages back to avoid going OOM? There's a lot of options here, all of them bad in one situation or another.

One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.

Maybe that's ok in the face of a writeback error though? IDK.

I don't know either. It'd force the application to face up to the fact that the data is gone immediately rather than only finding it out after a reboot. Again though that might cause more problems than it solves. It's hard to know what the right thing to do is.


From:   Jeff Layton <jlayton@...hat.com>
Date:   Thu, 12 Apr 2018 07:24:12 -0400

On Thu, 2018-04-12 at 15:45 +1000, Dave Chinner wrote:

On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:

Hi,

On 2018-04-12 10:09:16 +1000, Dave Chinner wrote:

To pound the broken record: there are many good reasons why Linux filesystem developers have said "you should use direct IO" to the PG devs each time we have this "the kernel doesn't do [complex things PG needs]" discussion.

I personally am on board with doing that. But you also gotta recognize that an efficient DIO usage is a metric ton of work, and you need a large amount of differing logic for different platforms. It's just not realistic to do so for every platform. Postgres is developed by a small number of people, isn't VC backed etc. The amount of resources we can throw at something is fairly limited. I'm hoping to work on adding linux DIO support to pg, but I'm sure as hell not going to be able to do the same on windows (solaris, hpux, aix, ...) etc.

And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea.

Yes it is.

This is what syncfs() is for - making sure a large amount of data and metadata spread across many files and subdirectories in a single filesystem is pushed to stable storage in the most efficient manner possible.

Just note that the error return from syncfs is somewhat iffy. It doesn't necessarily return an error when one inode fails to be written back. I think it mainly returns errors when you get a metadata writeback error.

Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).

No, just saying that fsyncing individual files and directories is about the most inefficient way you could possibly go about doing this.

You can still use syncfs but what you'd probably have to do is call syncfs while you still hold all of the fd's open, and then fsync each one afterward to ensure that they all got written back properly. That should work as you'd expect.
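
A sketch of the pattern Jeff describes (hypothetical helper, error handling kept minimal): keep every written file's fd open, do one syncfs() for efficient bulk writeback, then fsync() each fd so per-file writeback errors are actually reported:

    #define _GNU_SOURCE
    #include <unistd.h>

    /* dirfd: an fd on the target filesystem; fds: all files written so far. */
    int bulk_sync(int dirfd, const int *fds, int nfds)
    {
        int err = syncfs(dirfd);            /* efficient bulk writeback */
        for (int i = 0; i < nfds; i++) {
            if (fsync(fds[i]) != 0)         /* surfaces this file's writeback error */
                err = -1;
        }
        return err;
    }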


From:   Dave Chinner <david@...morbit.com>
Date:   Thu, 12 Apr 2018 22:01:22 +1000

On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:

When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?

There isn't a right thing. Whatever we do will be wrong for someone.

One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.

Not to mention a POSIX IO ordering violation. Seeing stale data after a "successful" write is simply not allowed.

Maybe that's ok in the face of a writeback error though? IDK.

No matter what we do for async writeback error handling, it will be slightly different from filesystem to filesystem, not to mention OS to OS. There is no magic bullet here, so I'm not sure we should worry too much. There's direct IO for anyone who needs to know about the completion status of every single write IO....


From:   "Theodore Y. Ts'o" <tytso@....edu>
Date:   Thu, 12 Apr 2018 11:16:46 -0400

On Thu, Apr 12, 2018 at 10:01:22PM +1000, Dave Chinner wrote:

On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:

When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?

There isn't a right thing. Whatever we do will be wrong for someone.

That's the problem. The best that could be done (and it's not enough) would be to have a mode which does what the PG folks want (or what they think they want). It seems what they want is to not have an error result in the page being marked clean. When they discover the outcome (OOM-city and the inability to unmount a file system on a failed drive), then they will complain to us again, at which point we can tell them that what they really want is another variation on O_PONIES, and welcome to the real world and real life.

Which is why, even if they were to pay someone to implement what they want, I'm not sure we would want to accept it upstream --- or distros might consider it a support nightmare, and refuse to allow that mode to be enabled on enterprise distros. But at least, it will have been some PG-based company who will have implemented it, so they're not wasting other people's time or other people's resources...

We could try to get something like what Google is doing upstream, which is to have the I/O errors sent to userspace via a netlink channel (without changing anything else about how buffered writeback is handled in the face of errors). Then userspace applications could switch to Direct I/O like all of the other really serious userspace storage solutions I'm aware of, and then someone could try to write some kind of HDD health monitoring system that tries to do the right thing when a disk is discovered to have developed some media errors or something more serious (e.g., a head failure). That plus some kind of RAID solution is I think the only thing which is really realistic for a typical PG site.

It's certainly what I would do if I didn't decide to use a hosted cloud solution, such as Cloud SQL for Postgres, and let someone else solve the really hard problems of dealing with real-world HDD failures. :-)


From:   Jeff Layton <jlayton@...hat.com>
Date:   Thu, 12 Apr 2018 11:08:50 -0400

On Thu, 2018-04-12 at 22:01 +1000, Dave Chinner wrote:

On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:

When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?

There isn't a right thing. Whatever we do will be wrong for someone.

One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.

Not to mention a POSIX IO ordering violation. Seeing stale data after a "successful" write is simply not allowed.

I'm not so sure here, given that we're dealing with an error condition. Are we really obligated not to allow any changes to pages that we can't write back?

Given that the pages are clean after these failures, we aren't doing this even today:

Suppose we're unable to do writes but can do reads vs. the backing store. After a wb failure, the page has the dirty bit cleared. If it gets kicked out of the cache before the read occurs, it'll have to be faulted back in. Poof -- your write just disappeared.

That can even happen before you get the chance to call fsync, so even a write()+read()+fsync() is not guaranteed to be safe in this regard today, given sufficient memory pressure.

I think the current situation is fine from a "let's not OOM at all costs" standpoint, but not so good for application predictability. We should really consider ways to do better here.

Maybe that's ok in the face of a writeback error though? IDK.

No matter what we do for async writeback error handling, it will be slightly different from filesystem to filesystem, not to mention OS to OS. There is no magic bullet here, so I'm not sure we should worry too much. There's direct IO for anyone who needs to know about the completion status of every single write IO....

I think we have an opportunity here to come up with better defined and hopefully more useful behavior for buffered I/O in the face of writeback errors. The first step would be to hash out what we'd want it to look like.

Maybe we need a plenary session at LSF/MM?


From:   Andres Freund <andres@...razel.de>
Date:   Thu, 12 Apr 2018 12:46:27 -0700

Hi,

On 2018-04-12 12:19:26 +0200, Lukas Czerner wrote:

On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:

And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea. Or even just cp -r ing it, and then starting up a copy of the database. What you're saying is that none of that is doable in a safe way, unless you use special-case DIO using tooling for the whole operation (or at least tools that fsync carefully without ever closing a fd, which certainly isn't the case for cp et al).

Does not seem like a problem to me, just checksum the thing if you really need to be extra safe. You should probably be doing it anyway if you backup / archive / timetravel / whatnot.

That doesn't really help, unless you want to sync() and then re-read all the data to make sure it's the same. Rereading multi-TB backups just to know whether there was an error that the OS knew about isn't particularly fun. Without verifying after sync it's not going to improve the situation measurably, you're still only going to discover that $data isn't available when it's needed.

What you're saying here is that there's no way to use standard linux tools to manipulate files and know whether it failed, without filtering kernel logs for IO errors. Or am I missing something?


From:   Andres Freund <andres@...razel.de>
Date:   Thu, 12 Apr 2018 12:55:36 -0700

Hi,

On 2018-04-12 01:34:45 -0400, Theodore Y. Ts'o wrote:

The solution we use at Google is that we watch for I/O errors using a completely different process that is responsible for monitoring machine health. It used to scrape dmesg, but we now arrange to have I/O errors get sent via a netlink channel to the machine health monitoring daemon.

Any pointers to the underlying netlink mechanism? If we can force postgres to kill itself when such an error is detected (via a dedicated monitoring process), I'd personally be happy enough. It'd be nicer if we could associate that knowledge with particular filesystems etc (which'd possibly be hard through dm etc?), but this'd be much better than nothing.

The reality is that recovering from disk errors is tricky business, and I very much doubt most userspace applications, including distro package managers, are going to want to engineer for trying to detect and recover from disk errors. If that were true, then Red Hat and/or SuSE have kernel engineers, and they would have implemented everything on your wish list. They haven't, and that should tell you something.

The problem really isn't about recovering from disk errors. Knowing about them is the crucial part. We do not want to give clients back the information that an operation succeeded, when it actually didn't. There could be improvements above that, but as long as it's guaranteed that "we" get the error (rather than just some kernel log we don't have access to, which looks different due to config etc), it's ok. We can throw our hands up in the air and give up.

The other reality is that once a disk starts developing errors, in reality you will probably need to take the disk off-line, scrub it to find any other media errors, and there's a good chance you'll need to rewrite bad sectors (including some which are on top of file system metadata, so you probably will have to run fsck or reformat the whole file system). I certainly don't think it's realistic to assume adding lots of sophistication to each and every userspace program.

If you have tens or hundreds of thousands of disk drives, then you will need to do something automated, but I claim that you really don't want to smush all of that detailed exception handling and HDD repair technology into each database or cluster file system component. It really needs to be done in a separate health-monitor and machine-level management system.

Yea, agreed on all that. I don't think anybody actually involved in postgres wants to do anything like that. Seems far outside of postgres' remit.


From:   Andres Freund <andres@...razel.de>
Date:   Thu, 12 Apr 2018 13:13:22 -0700

Hi,

On 2018-04-12 11:16:46 -0400, Theodore Y. Ts'o wrote:

That's the problem. The best that could be done (and it's not enough) would be to have a mode which does what the PG folks want (or what they think they want). It seems what they want is to not have an error result in the page being marked clean. When they discover the outcome (OOM-city and the inability to unmount a file system on a failed drive), then they will complain to us again, at which point we can tell them that what they really want is another variation on O_PONIES, and welcome to the real world and real life.

I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient. I don't see that that'd realistically trigger OOM or the inability to unmount a filesystem. If the drive is entirely gone there's obviously no point in keeping per-file information around, so per-blockdev/fs information suffices entirely to return an error on fsync (which at least on ext4 appears to happen if the underlying blockdev is gone).

Have fun making up things we want, but I'm not sure it's particularly productive.

Which is why, even if they were to pay someone to implement what they want, I'm not sure we would want to accept it upstream --- or distros might consider it a support nightmare, and refuse to allow that mode to be enabled on enterprise distros. But at least, it will have been some PG-based company who will have implemented it, so they're not wasting other people's time or other people's resources...

Well, that's why I'm discussing here so we can figure out what's acceptable before considering wasting money and review cycles doing or paying somebody to do some crazy useless shit.

We could try to get something like what Google is doing upstream, which is to have the I/O errors sent to userspace via a netlink channel (without changing anything else about how buffered writeback is handled in the face of errors).

Ah, darn. After you'd mentioned that in an earlier mail I'd hoped that'd be upstream. And yes, that'd be perfect.

Then userspace applications could switch to Direct I/O like all of the other really serious userspace storage solutions I'm aware of, and then someone could try to write some kind of HDD health monitoring system that tries to do the right thing when a disk is discovered to have developed some media errors or something more serious (e.g., a head failure). That plus some kind of RAID solution is I think the only thing which is really realistic for a typical PG site.

As I said earlier, I think there's good reason to move to DIO for postgres. But to keep that performant is going to need some serious work.

But afaict such a solution wouldn't really depend on applications using DIO or not. Before finishing a checkpoint (logging it persistently and allowing older data to be thrown away), we could check if any errors have been reported and give up if there have been any. And after starting postgres on a directory restored from backup using $tool, we can fsync the directory recursively, check for such errors, and give up if there've been any.


From:   Andres Freund <andres@...razel.de>
Date:   Thu, 12 Apr 2018 13:24:57 -0700

On 2018-04-12 07:09:14 -0400, Jeff Layton wrote:

On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:

On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:

While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.

I think there are two issues here - "fsync() on an fd that was just opened" and "persistent error state (without keeping dirty pages in memory)".

If there is background data writeback without an open file descriptor, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.

And that's horrible. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.

What are you expecting to happen in this case? Are you expecting a read error due to a writeback failure? Or are you just saying that we should be invalidating pages that failed to be written back, so that they can be re-read?

Yes, I'd hope for a read error after a writeback failure. I think that's sane behaviour. But I don't really care that much.

At the very least some way to know that such a failure occurred from userland without having to parse the kernel log. As far as I understand, neither sync(2) (and thus sync(1)) nor syncfs(2) is guaranteed to report an error if it was encountered by writeback in the background.

If that's indeed true for syncfs(2), even if the fd has been opened before (which I can see how it could happen from an implementation POV, nothing would associate a random FD with failures on different files), it's really impossible to detect this stuff from userland without text parsing.

Even if it were just a per-fs /sys/$something file that'd return the current count of unreported errors in a filesystem-independent way, it'd be better than what we have right now.

1) figure out /sys/$whatnot $directory belongs to
2) oldcount=$(cat /sys/$whatnot/unreported_errors)
3) filesystem operations in $directory
4) sync;sync;
5) newcount=$(cat /sys/$whatnot/unreported_errors)
6) test "$oldcount" -eq "$newcount" || die-with-horrible-message

Isn't beautiful to script, but it's also not absolutely terrible.


From:   Matthew Wilcox <willy@...radead.org>
Date:   Thu, 12 Apr 2018 13:28:30 -0700

On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:

I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.

Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.

I don't see that that'd realistically trigger OOM or the inability to unmount a filesystem.

Ted's referring to the current state of affairs where the writeback error is held in the inode; if we can't evict the inode because it's holding the error indicator, that can send us OOM. If instead we transfer the error indicator to the superblock, then there's no problem.


From:   Andres Freund <andres@...razel.de>
Date:   Thu, 12 Apr 2018 14:11:45 -0700

On 2018-04-12 07:24:12 -0400, Jeff Layton wrote:

On Thu, 2018-04-12 at 15:45 +1000, Dave Chinner wrote:

On Wed, Apr 11, 2018 at 07:32:21PM -0700, Andres Freund wrote:

Hi,

On 2018-04-12 10:09:16 +1000, Dave Chinner wrote:

To pound the broken record: there are many good reasons why Linux filesystem developers have said "you should use direct IO" to the PG devs each time we have this "the kernel doesn't do [complex things PG needs]" discussion.

I personally am on board with doing that. But you also gotta recognize that an efficient DIO usage is a metric ton of work, and you need a large amount of differing logic for different platforms. It's just not realistic to do so for every platform. Postgres is developed by a small number of people, isn't VC backed etc. The amount of resources we can throw at something is fairly limited. I'm hoping to work on adding linux DIO support to pg, but I'm sure as hell not going to be able to do the same on windows (solaris, hpux, aix, ...) etc.

And there's cases where that just doesn't help at all. Being able to untar a database from backup / archive / timetravel / whatnot, and then fsyncing the directory tree to make sure it's actually safe, is really not an insane idea.

Yes it is.

This is what syncfs() is for - making sure a large amount of data and metadata spread across many files and subdirectories in a single filesystem is pushed to stable storage in the most efficient manner possible.

syncfs isn't standardized, it operates on an entire filesystem (thus writing out unnecessary stuff), and it has no meaningful documentation of its return codes. Yes, using syncfs() might be better performance-wise, but it doesn't seem like it actually solves anything, performance aside:

Just note that the error return from syncfs is somewhat iffy. It doesn't necessarily return an error when one inode fails to be written back. I think it mainly returns errors when you get a metadata writeback error.

You can still use syncfs but what you'd probably have to do is call syncfs while you still hold all of the fd's open, and then fsync each one afterward to ensure that they all got written back properly. That should work as you'd expect.

Which again doesn't allow one to use any non-bespoke tooling (like tar or whatnot). And it means you'll have to call syncfs() every few hundred files, because you'll obviously run into filehandle limitations.


From:   Jeff Layton <jlayton@...hat.com>
Date:   Thu, 12 Apr 2018 17:14:54 -0400

On Thu, 2018-04-12 at 13:28 -0700, Matthew Wilcox wrote:

On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:

I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.

Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.

Not a bad idea and shouldn't be too costly. mapping_set_error could flag the superblock one before or after the one in the mapping.

We'd need to define what happens if you interleave fsync and syncfs calls on the same inode though. How do we handle file->f_wb_err in that case? Would we need a second field in struct file to act as the per-sb error cursor?

I don't see that that'd realistically trigger OOM or the inability to unmount a filesystem.

Ted's referring to the current state of affairs where the writeback error is held in the inode; if we can't evict the inode because it's holding the error indicator, that can send us OOM. If instead we transfer the error indicator to the superblock, then there's no problem.


From:   "Theodore Y. Ts'o" <tytso@....edu>
Date:   Thu, 12 Apr 2018 17:21:44 -0400

On Thu, Apr 12, 2018 at 01:28:30PM -0700, Matthew Wilcox wrote:

On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:

I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.

Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.

When or how would the per-superblock wb_err flag get cleared?

Would all subsequent fsync() calls on that file system now return EIO? Or would only all subsequent syncfs() calls return EIO?

I don't see that that'd realistically trigger OOM or the inability to unmount a filesystem.

Ted's referring to the current state of affairs where the writeback error is held in the inode; if we can't evict the inode because it's holding the error indicator, that can send us OOM. If instead we transfer the error indicator to the superblock, then there's no problem.

Actually, I was referring to the pg-hackers original ask, which was that after an error, all of the dirty pages that couldn't be written out would stay dirty.

If it's only a single inode which is pinned in memory with the dirty flag, that's bad, but it's not as bad as pinning all of the memory pages for which there was a failed write. We would still need to invent some mechanism or define some semantic when it would be OK to clear the per-inode flag and let the memory associated with that pinned inode get released, though.


From:   Matthew Wilcox <willy@...radead.org>
Date:   Thu, 12 Apr 2018 14:24:32 -0700

On Thu, Apr 12, 2018 at 05:21:44PM -0400, Theodore Y. Ts'o wrote:

On Thu, Apr 12, 2018 at 01:28:30PM -0700, Matthew Wilcox wrote:

On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:

I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.

Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.

When or how would the per-superblock wb_err flag get cleared?

That's not how errseq works, Ted ;-)

Would all subsequent fsync() calls on that file system now return EIO? Or would only all subsequent syncfs() calls return EIO?

Only ones which occur after the last sampling get reported through this particular file descriptor.
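
For readers unfamiliar with errseq_t, here's a toy userspace model of the semantics being described. This is only an illustration of the idea, not the kernel's actual implementation, which packs the error code, a counter, and a "seen" flag into a single value:

    /* Each writeback error advances a sequence number on the mapping; each fd
     * remembers the last sequence it sampled. fsync() reports an error only
     * if the sequence has advanced since this fd last sampled it. */
    struct errseq_model { unsigned int seq; int last_err; };
    struct fd_model     { unsigned int sampled_seq; };

    /* Called when writeback fails. */
    void record_error(struct errseq_model *e, int err)
    {
        e->seq++;
        e->last_err = err;
    }

    /* Called from fsync(): returns the error once per fd, then 0 until a new
     * error is recorded. A freshly opened fd samples the current seq, so it
     * never sees errors that happened before it was opened. */
    int check_and_advance(struct errseq_model *e, struct fd_model *f)
    {
        if (f->sampled_seq == e->seq)
            return 0;
        f->sampled_seq = e->seq;
        return e->last_err;
    }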


From:   Jeff Layton <jlayton@...hat.com>
Date:   Thu, 12 Apr 2018 17:27:54 -0400

On Thu, 2018-04-12 at 13:24 -0700, Andres Freund wrote:

On 2018-04-12 07:09:14 -0400, Jeff Layton wrote:

On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:

On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:

While there's some differing opinions on the referenced postgres thread, the fundamental problem isn't so much that a retry won't fix the problem, it's that we might NEVER see the failure. If writeback happens in the background, encounters an error, undirties the buffer, we will happily carry on because we've never seen that. That's when we're majorly screwed.

I think there are two issues here - "fsync() on an fd that was just opened" and "persistent error state (without keeping dirty pages in memory)".

If there is background data writeback without an open file descriptor, there is no mechanism for the kernel to return an error to any application which may exist, or may not ever come back.

And that's horrible. If I cp a file, and writeback fails in the background, and I then cat that file before restarting, I should be able to see that that failed. Instead of returning something bogus.

What are you expecting to happen in this case? Are you expecting a read error due to a writeback failure? Or are you just saying that we should be invalidating pages that failed to be written back, so that they can be re-read?

Yes, I'd hope for a read error after a writeback failure. I think that's sane behaviour. But I don't really care that much.

I'll have to respectfully disagree. Why should I interpret an error on a read() syscall to mean that writeback failed? Note that the data is still potentially intact.

What might make sense, IMO, is to just invalidate the pages that failed to be written back. Then you could potentially do a read to fault them in again (i.e. sync the pagecache and the backing store) and possibly redirty them for another try.

Note that you can detect this situation by checking the return code from fsync. It should report the latest error once per file description.

At the very least some way to know that such a failure occurred from userland without having to parse the kernel log. As far as I understand, neither sync(2) (and thus sync(1)) nor syncfs(2) is guaranteed to report an error if it was encountered by writeback in the background.

If that's indeed true for syncfs(2), even if the fd has been opened before (which I can see how it could happen from an implementation POV, nothing would associate a random FD with failures on different files), it's really impossible to detect this stuff from userland without text parsing.

syncfs could use some work.

I'm warming to willy's idea to add a per-sb errseq_t. I think that might be a simple way to get better semantics here. Not sure how we want to handle the reporting end yet though...

We probably also need to consider how to better track metadata writeback errors (on e.g. ext2). We don't really do that properly quite yet either.

Even if it were just a per-fs /sys/$something file that'd return the current count of unreported errors in a filesystem-independent way, it'd be better than what we have right now.

1) figure out /sys/$whatnot $directory belongs to
2) oldcount=$(cat /sys/$whatnot/unreported_errors)
3) filesystem operations in $directory
4) sync;sync;
5) newcount=$(cat /sys/$whatnot/unreported_errors)
6) test "$oldcount" -eq "$newcount" || die-with-horrible-message

Isn't beautiful to script, but it's also not absolutely terrible.


From:   Matthew Wilcox <willy@...radead.org>
Date:   Thu, 12 Apr 2018 14:31:10 -0700

On Thu, Apr 12, 2018 at 05:14:54PM -0400, Jeff Layton wrote:

On Thu, 2018-04-12 at 13:28 -0700, Matthew Wilcox wrote:

On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:

I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.

Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.

Not a bad idea and shouldn't be too costly. mapping_set_error could flag the superblock one before or after the one in the mapping.

We'd need to define what happens if you interleave fsync and syncfs calls on the same inode though. How do we handle file->f_wb_err in that case? Would we need a second field in struct file to act as the per-sb error cursor?

Ooh. I hadn't thought that through. Bleh. I don't want to add a field to struct file for this uncommon case.

Maybe O_PATH could be used for this? It gets you a file descriptor on a particular filesystem, so syncfs() is defined, but it can't report a writeback error. So if you open something O_PATH, you can use the file's f_wb_err for the mapping's error cursor.


From:   Andres Freund <andres@...razel.de>
Date:   Thu, 12 Apr 2018 14:37:56 -0700

On 2018-04-12 17:21:44 -0400, Theodore Y. Ts'o wrote:

On Thu, Apr 12, 2018 at 01:28:30PM -0700, Matthew Wilcox wrote:

On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:

I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.

Ah; this was my suggestion to Jeff on IRC. That we add a per-superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.

When or how would the per-superblock wb_err flag get cleared?

I don't think unmount + resettable via /sys would be an insane approach. Requiring explicit action to acknowledge data loss isn't a crazy concept. But I think that's something reasonable minds could disagree with.

Would all subsequent fsync() calls on that file system now return EIO? Or would only all subsequent syncfs() calls return EIO?

If it were tied to syncfs, I wonder if there's a way to have some errseq type logic. Store a per superblock (or whatever equivalent thing) errseq value of errors. For each fd calling syncfs() report the error once, but then store the current value in a separate per-fd field. And if that's considered too weird, only report the errors to fds that have been opened from before the error occurred.

I can see writing a tool 'pg_run_and_sync /directo /ries -- command' which opens an fd for each of the filesystems the directories reside on, and calls syncfs() after. That'd allow using backup/restore tools at least semi-safely.

I don't see that that'd realistically trigger OOM or the inability to unmount a filesystem.

Ted's referring to the current state of affairs where the writeback error is held in the inode; if we can't evict the inode because it's holding the error indicator, that can send us OOM. If instead we transfer the error indicator to the superblock, then there's no problem.

Actually, I was referring to the pg-hackers original ask, which was that after an error, all of the dirty pages that couldn't be written out would stay dirty.

Well, it's an open list, everyone can argue. And initially people didn't know the OOM explanation, and then it takes some time to revise one's priors :). I think it's a design question that reasonable people can disagree upon (if "hot" removed devices are handled by throwing data away regardless, at least). But as it's clearly not something viable, we can move on to something that can solve the problem.

If it's only a single inode which is pinned in memory with the dirty flag, that's bad, but it's not as bad as pinning all of the memory pages for which there was a failed write. We would still need to invent some mechanism or define some semantic when it would be OK to clear the per-inode flag and let the memory associated with that pinned inode get released, though.

Yea, I agree that that's not obvious. One way would be to say that it's only automatically cleared when you unlink the file. A bit heavyhanded, but not too crazy.


From:   "Theodore Y. Ts'o" <tytso@....edu>
Date:   Thu, 12 Apr 2018 17:52:52 -0400

On Thu, Apr 12, 2018 at 12:55:36PM -0700, Andres Freund wrote:

Any pointers to the underlying netlink mechanism? If we can force postgres to kill itself when such an error is detected (via a dedicated monitoring process), I'd personally be happy enough. It'd be nicer if we could associate that knowledge with particular filesystems etc (which'd possibly be hard through dm etc?), but this'd be much better than nothing.

Yeah, sorry, it never got upstreamed. It's not really all that complicated, it was just that there were some other folks who wanted to do something similar, and there was a round of bike-shedding several years ago, and nothing ever went upstream. Part of the problem was that our original scheme sent up information about file system-level corruption reports --- e.g., those stemming from calls to ext4_error() --- and lots of people had different ideas about how to get all of the possible information up in some structured format. (Think something like uerf from Digital's OSF/1.)

We did something really simple/stupid. We just sent essentially an ascii text string out the netlink socket. That's because what we were doing before was essentially scraping the output of dmesg (e.g. /dev/kmsg).

That's actually probably the simplest thing to do, and it has the advantage that it will work even on ancient enterprise kernels that PG users are likely to want to use. So you will need to implement the dmesg text scraper anyway, and that's probably good enough for most use cases.
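For a sense of what that scraper amounts to, here's a rough userspace sketch that reads records from /dev/kmsg and flags anything that looks like a block I/O error. The substrings matched are illustrative only; as noted just below, the real messages vary from driver to driver, and reading /dev/kmsg may require privileges depending on the system's dmesg restrictions.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char rec[8192];
    int fd = open("/dev/kmsg", O_RDONLY | O_NONBLOCK);
    if (fd < 0) { perror("/dev/kmsg"); return 1; }

    for (;;) {
        ssize_t n = read(fd, rec, sizeof(rec) - 1);   /* one record per read */
        if (n < 0) {
            if (errno == EAGAIN)   /* caught up with the log; stop (or poll) */
                break;
            if (errno == EPIPE)    /* ring buffer overwrote records we missed */
                continue;
            perror("read");
            return 1;
        }
        rec[n] = '\0';
        if (strstr(rec, "I/O error"))  /* e.g. "Buffer I/O error on dev ..." */
            printf("possible lost write: %s", rec);
    }
    return 0;
}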

The problem really isn't about recovering from disk errors. Knowing about them is the crucial part. We do not want to give back clients the information that an operation succeeded, when it actually didn't. There could be improvements above that, but as long as it's guaranteed that "we" get the error (rather than just some kernel log we don't have access to, which looks different due to config etc), it's ok. We can throw our hands up in the air and give up.

Right, it's a little challenging because the actual regexp's you would need to use do vary from device driver to device driver. Fortunately nearly everything is a SCSI/SATA device these days, so there isn't that much variability.

Yea, agreed on all that. I don't think anybody actually involved in postgres wants to do anything like that. Seems far outside of postgres' remit.

Some people on the pg-hackers list were talking about wanting to retry the fsync() and hoping that would cause the write to somehow succeed. It's possible that might help, but it's not likely to be helpful in my experience.


From:   Andres Freund <andres@...razel.de>
Date:   Thu, 12 Apr 2018 14:53:19 -0700

On 2018-04-12 17:27:54 -0400, Jeff Layton wrote:

On Thu, 2018-04-12 at 13:24 -0700, Andres Freund wrote:

At the very least some way to know that such a failure occurred from userland without having to parse the kernel log. As far as I understand, neither sync(2) (and thus sync(1)) nor syncfs(2) is guaranteed to report an error if it was encountered by writeback in the background.

If that's indeed true for syncfs(2), even if the fd has been opened before (which I can see how it could happen from an implementation POV, nothing would associate a random FD with failures on different files), it's really impossible to detect this stuff from userland without text parsing.

syncfs could use some work.

It's really too bad that it doesn't have a flags argument.

We probably also need to consider how to better track metadata writeback errors (on e.g. ext2). We don't really do that properly quite yet either.

Even if it were just a per-fs /sys/$something file that'd return the current count of unreported errors in a filesystem-independent way, it'd be better than what we have right now.

1) figure out /sys/$whatnot $directory belongs to
2) oldcount=$(cat /sys/$whatnot/unreported_errors)
3) filesystem operations in $directory
4) sync;sync;
5) newcount=$(cat /sys/$whatnot/unreported_errors)
6) test "$oldcount" -eq "$newcount" || die-with-horrible-message

Isn't beautiful to script, but it's also not absolutely terrible.
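Here is roughly the same check as a small C program rather than shell. The counter path is the hypothetical one from the message above; the closest existing file, ext4's errors_count (mentioned next), only counts filesystem inconsistencies, not data I/O errors.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static long read_counter(const char *path)
{
    FILE *f = fopen(path, "r");
    long v;

    if (!f) { perror(path); exit(1); }
    if (fscanf(f, "%ld", &v) != 1) {
        fprintf(stderr, "%s: could not parse counter\n", path);
        exit(1);
    }
    fclose(f);
    return v;
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/sys/$whatnot/unreported_errors";
    long before = read_counter(path);

    /* ... filesystem operations we care about would go here ... */

    sync();                               /* push out dirty data and metadata */

    long after = read_counter(path);
    if (after != before) {
        fprintf(stderr, "errors during the run (%ld -> %ld)\n", before, after);
        return 1;                         /* die-with-horrible-message */
    }
    return 0;
}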

ext4 seems to have something roughly like that (/sys/fs/ext4/$dev/errors_count), and by my reading it already seems to be incremented from the necessary places. By my reading XFS doesn't seem to have something similar.

Wouldn't be bad to standardize...


From:   "Theodore Y. Ts'o" <tytso@....edu>
Date:   Thu, 12 Apr 2018 17:57:56 -0400

On Thu, Apr 12, 2018 at 02:53:19PM -0700, Andres Freund wrote:

Isn't beautiful to script, but it's also not absolutely terrible.

ext4 seems to have something roughly like that (/sys/fs/ext4/$dev/errors_count), and by my reading it already seems to be incremented from the necessary places.

This is only for file system inconsistencies noticed by the kernel. We don't bump that count for data block I/O errors.

The same idea could be used on a block device level. It would be pretty simple to maintain a counter for I/O errors, and when the last error was detected on a particular device. You could even break out and track read errors and write errors separately if that would be useful.

If you don't care what block was bad, but just that some I/O error had happened, a counter is definitely the simplest approach, and less hair to implement and use than something like a netlink channel or scraping dmesg....


From:   Andres Freund <andres@...razel.de>
Date:   Thu, 12 Apr 2018 15:03:59 -0700

Hi,

On 2018-04-12 17:52:52 -0400, Theodore Y. Ts'o wrote:

We did something really simple/stupid. We just sent essentially an ascii text string out the netlink socket. That's because what we were doing before was essentially scraping the output of dmesg (e.g. /dev/kmsg).

That's actually probably the simplest thing to do, and it has the advantage that it will work even on ancient enterprise kernels that PG users are likely to want to use. So you will need to implement the dmesg text scraper anyway, and that's probably good enough for most use cases.

The worst part of that is, as you mention below, needing to handle a lot of different error message formats. I guess it's reasonable enough if you control your hardware, but no such luck.

Aren't there quite realistic scenarios where one could miss kmsg style messages due to it being a ringbuffer?

Right, it's a little challenging because the actual regexp's you would need to use do vary from device driver to device driver. Fortunately nearly everything is a SCSI/SATA device these days, so there isn't that much variability.

There's also SAN / NAS type stuff - not all of that presents as a SCSI/SATA device, right?

Yea, agreed on all that. I don't think anybody actually involved in postgres wants to do anything like that. Seems far outside of postgres' remit.

Some people on the pg-hackers list were talking about wanting to retry the fsync() and hoping that would cause the write to somehow succeed. It's possible that might help, but it's not likely to be helpful in my experience.

Depends on the type of error and storage. ENOSPC, especially over NFS, has some reasonable chances of being cleared up. And for networked block storage it's also not impossible to think of scenarios where that'd work for EIO.

But I think besides hope of clearing up itself, it has the advantage that it trivially can give some feedback to the user. The user'll get back strerror(ENOSPC) with some decent SQL error code, which'll hopefully cause them to investigate (well, once monitoring detects high error rates). It's much nicer for the user to type COMMIT; get an appropriate error back etc, than if the database just commits suicide.


From:   Dave Chinner <david@...morbit.com>
Date:   Fri, 13 Apr 2018 08:44:04 +1000

On Thu, Apr 12, 2018 at 11:08:50AM -0400, Jeff Layton wrote:

On Thu, 2018-04-12 at 22:01 +1000, Dave Chinner wrote:

On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:

When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?

There isn't a right thing. Whatever we do will be wrong for someone.

One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.

Not to mention a POSIX IO ordering violation. Seeing stale data after a "successful" write is simply not allowed.

I'm not so sure here, given that we're dealing with an error condition. Are we really obligated not to allow any changes to pages that we can't write back?

Posix says this about write():

  After a write() to a regular file has successfully returned:

     Any successful read() from each byte position in the file that
     was modified by that write shall return the data specified by
     the write() for that position until such byte positions are
     again modified.

IOWs, even if there is a later error, we told the user the write was successful, and so according to POSIX we are not allowed to wind back the data to what it was before the write() occurred.

Given that the pages are clean after these failures, we aren't doing this even today:

Suppose we're unable to do writes but can do reads vs. the backing store. After a wb failure, the page has the dirty bit cleared. If it gets kicked out of the cache before the read occurs, it'll have to be faulted back in. Poof -- your write just disappeared.

Yes - I was pointing out what the specification we supposedly conform to says about this behaviour, not that our current behaviour conforms to the spec. Indeed, have you even noticed xfs_aops_discard_page() and its surrounding context on page writeback submission errors?

To save you looking, XFS will trash the page contents completely on a filesystem level ->writepage error. It doesn't mark them "clean", doesn't attempt to redirty and rewrite them - it clears the uptodate state and may invalidate it completely. IOWs, the data written "successfully" to the cached page is now gone. It will be re-read from disk on the next read() call, in direct violation of the above POSIX requirements.

This is my point: we've done that in XFS knowing that we violate POSIX specifications in this specific corner case - it's the lesser of many evils we have to choose between. Hence if we choose to encode that behaviour as the general writeback IO error handling algorithm, then it needs to be done with the knowledge it is a specification violation. Not to mention be documented as a POSIX violation in the various relevant man pages and that this is how all filesystems will behave on async writeback error.....


From:   Jeff Layton <jlayton@...hat.com>
Date:   Fri, 13 Apr 2018 08:56:38 -0400

On Thu, 2018-04-12 at 14:31 -0700, Matthew Wilcox wrote:

On Thu, Apr 12, 2018 at 05:14:54PM -0400, Jeff Layton wrote:

On Thu, 2018-04-12 at 13:28 -0700, Matthew Wilcox wrote:

On Thu, Apr 12, 2018 at 01:13:22PM -0700, Andres Freund wrote:

I think a per-file or even per-blockdev/fs error state that'd be returned by fsync() would be more than sufficient.

Ah; this was my suggestion to Jeff on IRC. That we add a per- superblock wb_err and then allow syncfs() to return it. So you'd open an fd on a directory (for example), and call syncfs() which would return -EIO or -ENOSPC if either of those conditions had occurred since you opened the fd.

Not a bad idea and shouldn't be too costly. mapping_set_error could flag the superblock one before or after the one in the mapping.

We'd need to define what happens if you interleave fsync and syncfs calls on the same inode though. How do we handle file->f_wb_err in that case? Would we need a second field in struct file to act as the per-sb error cursor?

Ooh. I hadn't thought that through. Bleh. I don't want to add a field to struct file for this uncommon case.

Maybe O_PATH could be used for this? It gets you a file descriptor on a particular filesystem, so syncfs() is defined, but it can't report a writeback error. So if you open something O_PATH, you can use the file's f_wb_err for the mapping's error cursor.

That might work.

It'd be a syscall behavioral change so we'd need to document that well. It's probably innocuous though -- I doubt we have a lot of callers in the field opening files with O_PATH and calling syncfs on them.


From:   Jeff Layton <jlayton@...hat.com>
Date:   Fri, 13 Apr 2018 09:18:56 -0400

On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote:

On Thu, Apr 12, 2018 at 11:08:50AM -0400, Jeff Layton wrote:

On Thu, 2018-04-12 at 22:01 +1000, Dave Chinner wrote:

On Thu, Apr 12, 2018 at 07:09:14AM -0400, Jeff Layton wrote:

When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?

There isn't a right thing. Whatever we do will be wrong for someone.

One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.

Not to mention a POSIX IO ordering violation. Seeing stale data after a "successful" write is simply not allowed.

I'm not so sure here, given that we're dealing with an error condition. Are we really obligated not to allow any changes to pages that we can't write back?

Posix says this about write():

After a write() to a regular file has successfully returned:

 Any successful read() from each byte position in the file that
 was modified by that write shall return the data specified by
 the write() for that position until such byte positions are
 again modified.

IOWs, even if there is a later error, we told the user the write was successful, and so according to POSIX we are not allowed to wind back the data to what it was before the write() occurred.

Given that the pages are clean after these failures, we aren't doing this even today:

Suppose we're unable to do writes but can do reads vs. the backing store. After a wb failure, the page has the dirty bit cleared. If it gets kicked out of the cache before the read occurs, it'll have to be faulted back in. Poof -- your write just disappeared.

Yes - I was pointing out what the specification we supposedly conform to says about this behaviour, not that our current behaviour conforms to the spec. Indeed, have you even noticed xfs_aops_discard_page() and its surrounding context on page writeback submission errors?

To save you looking, XFS will trash the page contents completely on a filesystem level ->writepage error. It doesn't mark them "clean", doesn't attempt to redirty and rewrite them - it clears the uptodate state and may invalidate it completely. IOWs, the data written "successfully" to the cached page is now gone. It will be re-read from disk on the next read() call, in direct violation of the above POSIX requirements.

This is my point: we've done that in XFS knowing that we violate POSIX specifications in this specific corner case - it's the lesser of many evils we have to choose between. Hence if we choose to encode that behaviour as the general writeback IO error handling algorithm, then it needs to be done with the knowledge it is a specification violation. Not to mention be documented as a POSIX violation in the various relevant man pages and that this is how all filesystems will behave on async writeback error.....

Got it, thanks.

Yes, I think we ought to probably do the same thing globally. It's nice to know that xfs has already been doing this. That makes me feel better about making this behavior the gold standard for Linux filesystems.

So to summarize, at this point in the discussion, I think we want to consider doing the following:

  • better reporting from syncfs (report an error when even one inode failed to be written back since last syncfs call). We'll probably implement this via a per-sb errseq_t in some fashion, though there are some implementation issues to work out.

  • invalidate or clear uptodate flag on pages that experience writeback errors, across filesystems. Encourage this as standard behavior for filesystems and maybe add helpers to make it easier to do this.

Did I miss anything? Would that be enough to help the Pg usecase?

I don't see us ever being able to reasonably support its current expectation that writeback errors will be seen on fd's that were opened after the error occurred. That's a really thorny problem from an object lifetime perspective.


From:   Andres Freund <andres@...razel.de>
Date:   Fri, 13 Apr 2018 06:25:35 -0700

Hi,

On 2018-04-13 09:18:56 -0400, Jeff Layton wrote:

Yes, I think we ought to probably do the same thing globally. It's nice to know that xfs has already been doing this. That makes me feel better about making this behavior the gold standard for Linux filesystems.

So to summarize, at this point in the discussion, I think we want to consider doing the following:

Did I miss anything? Would that be enough to help the Pg usecase?

I don't see us ever being able to reasonably support its current expectation that writeback errors will be seen on fd's that were opened after the error occurred. That's a really thorny problem from an object lifetime perspective.

It's not perfect, but I think the amount of hacky OS specific code should be acceptable. And it does allow for a wrapper tool that can be used around backup restores etc to syncfs all the necessary filesystems. Let me mull with others for a bit.


From:   Matthew Wilcox <willy@...radead.org>
Date:   Fri, 13 Apr 2018 07:02:32 -0700

On Fri, Apr 13, 2018 at 09:18:56AM -0400, Jeff Layton wrote:

On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote:

To save you looking, XFS will trash the page contents completely on a filesystem level ->writepage error. It doesn't mark them "clean", doesn't attempt to redirty and rewrite them - it clears the uptodate state and may invalidate it completely. IOWs, the data written "successfully" to the cached page is now gone. It will be re-read from disk on the next read() call, in direct violation of the above POSIX requirements.

This is my point: we've done that in XFS knowing that we violate POSIX specifications in this specific corner case - it's the lesser of many evils we have to choose between. Hence if we choose to encode that behaviour as the general writeback IO error handling algorithm, then it needs to be done with the knowledge it is a specification violation. Not to mention be documented as a POSIX violation in the various relevant man pages and that this is how all filesystems will behave on async writeback error.....

Got it, thanks.

Yes, I think we ought to probably do the same thing globally. It's nice to know that xfs has already been doing this. That makes me feel better about making this behavior the gold standard for Linux filesystems.

So to summarize, at this point in the discussion, I think we want to consider doing the following:

Did I miss anything? Would that be enough to help the Pg usecase?

I don't see us ever being able to reasonably support its current expectation that writeback errors will be seen on fd's that were opened after the error occurred. That's a really thorny problem from an object lifetime perspective.

I think we can do better than XFS is currently doing (but I agree that we should have the same behaviour across all Linux filesystems!)

  1. If we get an error while wbc->for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.
  2. Background writebacks should skip pages which are PageError.
  3. for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.

I think kupdate writes are the same as for_background writes. for_reclaim is tougher. I don't want to see us getting into OOM because we're hanging onto stale data, but we don't necessarily have an open fd to report the error on. I think I'm leaning towards behaving the same for for_reclaim as for_sync, but this is probably a subject on which reasonable people can disagree.

And this logic all needs to be in one place, although invoked from each filesystem.


From:   Matthew Wilcox <willy@...radead.org>
Date:   Fri, 13 Apr 2018 07:48:07 -0700

On Tue, Apr 10, 2018 at 03:07:26PM -0700, Andres Freund wrote:

I don't think that's the full issue. We can deal with the fact that an fsync failure is edge-triggered if there's a guarantee that every process doing so would get it. The fact that one needs to have an FD open from before any failing writes occurred to get a failure, THAT'S the big issue.

Beyond postgres, it's a pretty common approach to do work on a lot of files without fsyncing, then iterate over the directory, fsync everything, and then assume you're safe. But unless I severely misunderstand something that'd only be safe if you kept an FD for every file open, which isn't realistic for pretty obvious reasons.

While accepting that under memory pressure we can still evict the error indicators, we can do a better job than we do today. The current design of error reporting says that all errors which occurred before you opened the file descriptor are of no interest to you. I don't think that's necessarily true, and it's actually a change of behaviour from before the errseq work.

Consider Stupid Task A which calls open(), write(), close(), and Smart Task B which calls open(), write(), fsync(), close() operating on the same file. If A goes entirely before B and encounters an error, before errseq_t, B would see the error from A's write.

If A and B overlap, even a little bit, then B still gets to see A's error today. But if writeback happens for A's write before B opens the file then B will never see the error.
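Spelled out as code, the two tasks look something like this (a toy sketch, not from the thread). Whether B's fsync() reports the error caused by A's write comes down to timing: under errseq_t, B only sees it if the failed writeback happened after B's open().

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void stupid_task_a(const char *path)   /* open, write, close */
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    (void)write(fd, "A", 1);   /* may fail later, during writeback */
    close(fd);                 /* any such error is never observed by A */
}

static int smart_task_b(const char *path)     /* open, write, fsync, close */
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    (void)write(fd, "B", 1);
    if (fsync(fd) < 0) {       /* sees B's own errors; sees A's only if the
                                  failed writeback happened after this open() */
        perror("fsync");
        close(fd);
        return -1;
    }
    return close(fd);
}

int main(void)
{
    stupid_task_a("file");
    return smart_task_b("file") ? 1 : 0;
}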

B doesn't want to see historical errors that a previous invocation of B has already handled, but we know whether anyone has seen the error or not. So here's a patch which restores the historical behaviour of seeing old unhandled errors on a fresh file descriptor:

Signed-off-by: Matthew Wilcox mawilcox@...rosoft.com

diff --git a/lib/errseq.c b/lib/errseq.c
index df782418b333..093f1fba4ee0 100644
--- a/lib/errseq.c
+++ b/lib/errseq.c
@@ -119,19 +119,11 @@ EXPORT_SYMBOL(errseq_set);
 errseq_t errseq_sample(errseq_t *eseq)
 {
 	errseq_t old = READ_ONCE(*eseq);
-	errseq_t new = old;
 
-	/*
-	 * For the common case of no errors ever having been set, we can skip
-	 * marking the SEEN bit. Once an error has been set, the value will
-	 * never go back to zero.
-	 */
-	if (old != 0) {
-		new |= ERRSEQ_SEEN;
-		if (old != new)
-			cmpxchg(eseq, old, new);
-	}
-	return new;
+	/* If nobody has seen this error yet, then we can be the first. */
+	if (!(old & ERRSEQ_SEEN))
+		old = 0;
+	return old;
 }
 EXPORT_SYMBOL(errseq_sample);

From:   Dave Chinner <david@...morbit.com>
Date:   Sat, 14 Apr 2018 11:47:52 +1000

On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:

On Fri, Apr 13, 2018 at 09:18:56AM -0400, Jeff Layton wrote:

On Fri, 2018-04-13 at 08:44 +1000, Dave Chinner wrote:

To save you looking, XFS will trash the page contents completely on a filesystem level ->writepage error. It doesn't mark them "clean", doesn't attempt to redirty and rewrite them - it clears the uptodate state and may invalidate it completely. IOWs, the data written "successfully" to the cached page is now gone. It will be re-read from disk on the next read() call, in direct violation of the above POSIX requirements.

This is my point: we've done that in XFS knowing that we violate POSIX specifications in this specific corner case - it's the lesser of many evils we have to choose between. Hence if we choose to encode that behaviour as the general writeback IO error handling algorithm, then it needs to be done with the knowledge it is a specification violation. Not to mention be documented as a POSIX violation in the various relevant man pages and that this is how all filesystems will behave on async writeback error.....

Got it, thanks.

Yes, I think we ought to probably do the same thing globally. It's nice to know that xfs has already been doing this. That makes me feel better about making this behavior the gold standard for Linux filesystems.

So to summarize, at this point in the discussion, I think we want to consider doing the following:

  • better reporting from syncfs (report an error when even one inode failed to be written back since last syncfs call). We'll probably implement this via a per-sb errseq_t in some fashion, though there are some implementation issues to work out.

  • invalidate or clear uptodate flag on pages that experience writeback errors, across filesystems. Encourage this as standard behavior for filesystems and maybe add helpers to make it easier to do this.

Did I miss anything? Would that be enough to help the Pg usecase?

I don't see us ever being able to reasonably support its current expectation that writeback errors will be seen on fd's that were opened after the error occurred. That's a really thorny problem from an object lifetime perspective.

I think we can do better than XFS is currently doing (but I agree that we should have the same behaviour across all Linux filesystems!)

  1. If we get an error while wbc->for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.

So you're saying we should treat it as a transient error rather than a permanent error.

  2. Background writebacks should skip pages which are PageError.

That seems decidedly dodgy in the case where there is a transient error - it requires a user to specifically run sync to get the data to disk after the transient error has occurred. Say they don't notice the problem because it's fleeting and doesn't cause any obvious problems?

e.g. XFS gets to enospc, runs out of reserve pool blocks so can't allocate space to write back the page, then space is freed up a few seconds later and so the next write will work just fine.

This is a recipe for "I lost data that I wrote /days/ before the system crashed" bug reports.

  3. for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.

Which may well be unmount. Are we really going to wait until unmount to report fatal errors?

We used to do this with XFS metadata. We'd just keep trying to write metadata and keep the filesystem running (because it's consistent in memory and it might be a transient error) rather than shutting down the filesystem after a couple of retries. The result was that users wouldn't notice there were problems until unmount, and the most common symptom of that was "why is system shutdown hanging?".

We now don't hang at unmount by default:

$ cat /sys/fs/xfs/dm-0/error/fail_at_unmount 
1
$

And we treat different errors according to their seriousness. EIO and device ENOSPC we default to retry forever because they are often transient, but for ENODEV we fail and shutdown immediately (someone pulled the USB stick out). metadata failure behaviour is configured via changing fields in /sys/fs/xfs//error/metadata//...

We've planned to extend this failure configuration to data IO, too, but never quite got around to it yet. this is a clear example of "one size doesn't fit all" and I think we'll end up doing the same sort of error behaviour configuration in XFS for these cases. (i.e. /sys/fs/xfs//error/writeback//....)

And this logic all needs to be in one place, although invoked from each filesystem.

Perhaps so, but as there's no "one-size-fits-all" behaviour, I really want to extend the XFS error config infrastructure to control what the filesystem does on error here.


From:   Andres Freund <andres@...razel.de>
Date:   Fri, 13 Apr 2018 19:04:33 -0700

Hi,

On 2018-04-14 11:47:52 +1000, Dave Chinner wrote:

And we treat different errors according to their seriousness. EIO and device ENOSPC we default to retry forever because they are often transient, but for ENODEV we fail and shutdown immediately (someone pulled the USB stick out). metadata failure behaviour is configured via changing fields in /sys/fs/xfs//error/metadata//...

We've planned to extend this failure configuration to data IO, too, but never quite got around to it yet. this is a clear example of "one size doesn't fit all" and I think we'll end up doing the same sort of error behaviour configuration in XFS for these cases. (i.e. /sys/fs/xfs//error/writeback//....)

Have you considered adding an ext/fat/jfs errors=remount-ro/panic/continue style mount parameter?


From:   Matthew Wilcox <willy@...radead.org>
Date:   Fri, 13 Apr 2018 19:38:14 -0700

On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote:

On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:

  1. If we get an error while wbc->for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.

So you're saying we should treat it as a transient error rather than a permanent error.

Yes, I'm proposing leaving the data in memory in case the user wants to try writing it somewhere else.

  2. Background writebacks should skip pages which are PageError.

That seems decidedly dodgy in the case where there is a transient error - it requires a user to specifically run sync to get the data to disk after the transient error has occurred. Say they don't notice the problem because it's fleeting and doesn't cause any obvious problems?

That's fair. What I want to avoid is triggering the same error every 30 seconds (or whatever the periodic writeback threshold is set to).

e.g. XFS gets to enospc, runs out of reserve pool blocks so can't allocate space to write back the page, then space is freed up a few seconds later and so the next write will work just fine.

This is a recipe for "I lost data that I wrote /days/ before the system crashed" bug reports.

So ... exponential backoff on retries?

  3. for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.

Which may well be unmount. Are we really going to wait until unmount to report fatal errors?

Goodness, no. The errors would be immediately reportable using the wb_err mechanism, as soon as the first error was encountered.



From:   bfields@...ldses.org (J. Bruce Fields)
Date:   Wed, 18 Apr 2018 12:52:19 -0400

Theodore Y. Ts'o - 10.04.18, 20:43:

First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentially be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.

Pointers to documentation or papers or anything? The only google results I can find for "power fail certified" are your posts.

I've always been confused by SSD power-loss protection, as nobody seems completely clear whether it's a safety or a performance feature.


From:   bfields@...ldses.org (J. Bruce Fields)
Date:   Wed, 18 Apr 2018 14:09:03 -0400

On Wed, Apr 11, 2018 at 07:17:52PM -0700, Andres Freund wrote:

Hi,

On 2018-04-11 15:52:44 -0600, Andreas Dilger wrote:

On Apr 10, 2018, at 4:07 PM, Andres Freund andres@...razel.de wrote:

2018-04-10 18:43:56 Ted wrote:

So for better or for worse, there has not been as much investment in buffered I/O and data robustness in the face of exception handling of storage devices.

That's a bit of a cop out. It's not just databases that care. Even more basic tools like SCM, package managers and editors care whether they can get proper responses back from fsync that imply things actually were synced.

Sure, but it is mostly PG that is doing (IMHO) crazy things like writing to thousands(?) of files, closing the file descriptors, then expecting fsync() on a newly-opened fd to return a historical error.

It's not just postgres. dpkg (underlying apt, on debian derived distros) to take an example I just randomly guessed, does too:

  /* We want to guarantee the extracted files are on the disk, so that the
   * subsequent renames to the info database do not end up with old or zero
   * length files in case of a system crash. As neither dpkg-deb nor tar do
   * explicit fsync()s, we have to do them here.
   * XXX: This could be avoided by switching to an internal tar extractor. */
  dir_sync_contents(cidir);

(a bunch of other places too)

Especially on ext3 but also on newer filesystems it's performance-wise entirely infeasible to fsync() every single file individually - the performance becomes entirely atrocious if you do that.

Is that still true if you're able to use some kind of parallelism? (async io, or fsync from multiple processes?)
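As one sketch of what "some kind of parallelism" could look like, here's a toy program (compile with -pthread) that fsyncs each file passed on the command line from its own thread. Whether this actually helps is filesystem-dependent; on an ext3-style journal, where an fsync forces a full journal commit, it plausibly wouldn't buy much.

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *fsync_one(void *arg)
{
    const char *path = arg;
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror(path); return (void *)1; }
    if (fsync(fd) < 0) { perror(path); close(fd); return (void *)1; }
    close(fd);
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t tids[argc];
    int i, failed = 0;

    for (i = 1; i < argc; i++)
        pthread_create(&tids[i], NULL, fsync_one, argv[i]);
    for (i = 1; i < argc; i++) {
        void *ret;
        pthread_join(tids[i], &ret);
        if (ret)
            failed = 1;   /* at least one file couldn't be synced */
    }
    return failed;
}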


From:   Dave Chinner <david@...morbit.com>
Date:   Thu, 19 Apr 2018 09:59:50 +1000

On Fri, Apr 13, 2018 at 07:04:33PM -0700, Andres Freund wrote:

Hi,

On 2018-04-14 11:47:52 +1000, Dave Chinner wrote:

And we treat different errors according to their seriousness. EIO and device ENOSPC we default to retry forever because they are often transient, but for ENODEV we fail and shutdown immediately (someone pulled the USB stick out). metadata failure behaviour is configured via changing fields in /sys/fs/xfs//error/metadata//...

We've planned to extend this failure configuration to data IO, too, but never quite got around to it yet. this is a clear example of "one size doesn't fit all" and I think we'll end up doing the same sort of error behaviour configuration in XFS for these cases. (i.e. /sys/fs/xfs//error/writeback//....)

Have you considered adding an ext/fat/jfs errors=remount-ro/panic/continue style mount parameter?

That's for metadata writeback error behaviour, not data writeback IO errors.

We are definitely not planning to add mount options to configure IO error behaviors. Mount options are a horrible way to configure filesystem behaviour and we've already got other, fine-grained configuration infrastructure for configuring IO error behaviour. Which, as I just pointed out, was designed to be extended to data writeback and other operational error handling in the filesystem (e.g. dealing with ENOMEM in different ways).


From:   Dave Chinner <david@...morbit.com>
Date:   Thu, 19 Apr 2018 10:13:43 +1000

On Fri, Apr 13, 2018 at 07:38:14PM -0700, Matthew Wilcox wrote:

On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote:

On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:

  1. If we get an error while wbc->for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.

So you're saying we should treat it as a transient error rather than a permanent error.

Yes, I'm proposing leaving the data in memory in case the user wants to try writing it somewhere else.

And if it's getting IO errors because of USB stick pull? What then?

  2. Background writebacks should skip pages which are PageError.

That seems decidedly dodgy in the case where there is a transient error - it requires a user to specifically run sync to get the data to disk after the transient error has occurred. Say they don't notice the problem because it's fleeting and doesn't cause any obvious problems?

That's fair. What I want to avoid is triggering the same error every 30 seconds (or whatever the periodic writeback threshold is set to).

So if kernel ring buffer overflows and so users miss the first error report, they'll have no idea that the data writeback is still failing?

e.g. XFS gets to enospc, runs out of reserve pool blocks so can't allocate space to write back the page, then space is freed up a few seconds later and so the next write will work just fine.

This is a recipe for "I lost data that I wrote /days/ before the system crashed" bug reports.

So ... exponential backoff on retries?

Maybe, but I don't think that actually helps anything and adds yet more "when should we write this" complication to inode writeback....

  3. for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.

Which may well be unmount. Are we really going to wait until unmount to report fatal errors?

Goodness, no. The errors would be immediately reportable using the wb_err mechanism, as soon as the first error was encountered.

But if there are no open files when the error occurs, that error won't get reported to anyone. Which means the next time anyone accesses that inode from a user context could very well be unmount or a third party sync/syncfs()....


From:   Eric Sandeen <esandeen@...hat.com>
Date:   Wed, 18 Apr 2018 19:23:46 -0500

On 4/18/18 6:59 PM, Dave Chinner wrote:

On Fri, Apr 13, 2018 at 07:04:33PM -0700, Andres Freund wrote:

Hi,

On 2018-04-14 11:47:52 +1000, Dave Chinner wrote:

And we treat different errors according to their seriousness. EIO and device ENOSPC we default to retry forever because they are often transient, but for ENODEV we fail and shutdown immediately (someone pulled the USB stick out). metadata failure behaviour is configured via changing fields in /sys/fs/xfs//error/metadata//...

We've planned to extend this failure configuration to data IO, too, but never quite got around to it yet. this is a clear example of "one size doesn't fit all" and I think we'll end up doing the same sort of error behaviour configuration in XFS for these cases. (i.e. /sys/fs/xfs//error/writeback//....)

Have you considered adding an ext/fat/jfs errors=remount-ro/panic/continue style mount parameter?

That's for metadata writeback error behaviour, not data writeback IO errors.

/me points casually at data_err=abort & data_err=ignore in ext4...

       data_err=ignore
              Just print an error message if an error occurs in a file data buffer in ordered mode.

       data_err=abort
              Abort the journal if an error occurs in a file data buffer in ordered mode.

Just sayin'

We are definitely not planning to add mount options to configure IO error behaviors. Mount options are a horrible way to configure filesystem behaviour and we've already got other, fine-grained configuration infrastructure for configuring IO error behaviour. Which, as I just pointed out, was designed to be extended to data writeback and other operational error handling in the filesystem (e.g. dealing with ENOMEM in different ways).

I don't disagree, but there are already mount-option knobs in ext4, FWIW.


From:   Matthew Wilcox <willy@...radead.org>
Date:   Wed, 18 Apr 2018 17:40:37 -0700

On Thu, Apr 19, 2018 at 10:13:43AM +1000, Dave Chinner wrote:

On Fri, Apr 13, 2018 at 07:38:14PM -0700, Matthew Wilcox wrote:

On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote:

On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:

  1. If we get an error while wbc->for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.

So you're saying we should treat it as a transient error rather than a permanent error.

Yes, I'm proposing leaving the data in memory in case the user wants to try writing it somewhere else.

And if it's getting IO errors because of USB stick pull? What then?

I've been thinking about this. Ideally we want to pass some kind of notification all the way up to the desktop and tell the user to plug the damn stick back in. Then have the USB stick become the same blockdev that it used to be, and complete the writeback. We are so far from being able to do that right now that it's not even funny.

  2. Background writebacks should skip pages which are PageError.

That seems decidedly dodgy in the case where there is a transient error - it requires a user to specifically run sync to get the data to disk after the transient error has occurred. Say they don't notice the problem because it's fleeting and doesn't cause any obvious problems?

That's fair. What I want to avoid is triggering the same error every 30 seconds (or whatever the periodic writeback threshold is set to).

So if kernel ring buffer overflows and so users miss the first error report, they'll have no idea that the data writeback is still failing?

I wasn't thinking about kernel ringbuffer based reporting; I was thinking about errseq_t based reporting, so the application can tell the fsync failed and maybe does something application-level to recover like send the transactions across to another node in the cluster (or whatever this hypothetical application is).

  3. for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.

Which may well be unmount. Are we really going to wait until unmount to report fatal errors?

Goodness, no. The errors would be immediately reportable using the wb_err mechanism, as soon as the first error was encountered.

But if there are no open files when the error occurs, that error won't get reported to anyone. Which means the next time anyone accesses that inode from a user context could very well be unmount or a third party sync/syncfs()....

Right. But then that's on the application.


From:   "Theodore Y. Ts'o" <tytso@....edu>
Date:   Wed, 18 Apr 2018 21:08:19 -0400

On Wed, Apr 18, 2018 at 05:40:37PM -0700, Matthew Wilcox wrote:

I've been thinking about this. Ideally we want to pass some kind of notification all the way up to the desktop and tell the user to plug the damn stick back in. Then have the USB stick become the same blockdev that it used to be, and complete the writeback. We are so far from being able to do that right now that it's not even funny.

Maybe we shouldn't be trying to do any of this in the kernel, or at least as little as possible in the kernel? Perhaps it would be better to do most of this as a device mapper hack; I suspect we'll need userspace help to figure out whether the user has plugged the same USB stick in, or a different USB stick, anyway.



From:   Christoph Hellwig <hch@...radead.org>
Date:   Thu, 19 Apr 2018 01:39:04 -0700

On Wed, Apr 18, 2018 at 12:52:19PM -0400, J. Bruce Fields wrote:

Theodore Y. Ts'o - 10.04.18, 20:43:

First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentially be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.

Pointers to documentation or papers or anything? The only google results I can find for "power fail certified" are your posts.

I've always been confused by SSD power-loss protection, as nobody seems completely clear whether it's a safety or a performance feature.

Devices from reputable vendors should always be power fail safe, bugs notwithstanding. What power-loss protection in marketing slides usually means is that an SSD has a non-volatile write cache. That is once a write is ACKed data is persisted and no additional cache flush needs to be sent. This is a feature only available in expensive enterprise SSDs as the required capacitors are expensive. Cheaper consumer or boot drive SSDs have a volatile write cache, that is we need to do a separate cache flush to persist data (REQ_OP_FLUSH in Linux). But a reasonable implementation of those still won't corrupt previously written data, they will just lose the volatile write cache that hasn't been flushed. Occasional bugs, bad actors or other issues might still happen.
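The practical upshot for application code is the same either way: on a drive with a volatile write cache, data isn't durable until fsync() or fdatasync() has returned, because that's what makes the kernel send the cache flush down. A minimal sketch (the file name is illustrative and error handling is trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "must survive power loss\n";
    int fd = open("journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, msg, strlen(msg)) != (ssize_t)strlen(msg)) {
        perror("write");       /* data may exist only in the page cache */
        return 1;
    }
    if (fdatasync(fd) < 0) {   /* triggers the cache flush (REQ_OP_FLUSH)
                                  on devices with a volatile write cache */
        perror("fdatasync");
        return 1;
    }
    /* Only now is it reasonable to tell a client the data is durable. */
    return close(fd);
}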


From:   "J. Bruce Fields" <bfields@...ldses.org>
Date:   Thu, 19 Apr 2018 10:10:16 -0400

On Thu, Apr 19, 2018 at 01:39:04AM -0700, Christoph Hellwig wrote:

On Wed, Apr 18, 2018 at 12:52:19PM -0400, J. Bruce Fields wrote:

Theodore Y. Ts'o - 10.04.18, 20:43:

First of all, what storage devices will do when they hit an exception condition is quite non-deterministic. For example, the vast majority of SSD's are not power fail certified. What this means is that if they suffer a power drop while they are doing a GC, it is quite possible for data written six months ago to be lost as a result. The LBA could potentially be far, far away from any LBA's that were recently written, and there could have been multiple CACHE FLUSH operations since the LBA in question was last written six months ago. No matter; for a consumer-grade SSD, it's possible for that LBA to be trashed after an unexpected power drop.

Pointers to documentation or papers or anything? The only google results I can find for "power fail certified" are your posts.

I've always been confused by SSD power-loss protection, as nobody seems completely clear whether it's a safety or a performance feature.

Devices from reputable vendors should always be power fail safe, bugs notwithstanding. What power-loss protection in marketing slides usually means is that an SSD has a non-volatile write cache. That is once a write is ACKed data is persisted and no additional cache flush needs to be sent. This is a feature only available in expensive enterprise SSDs as the required capacitors are expensive. Cheaper consumer or boot drive SSDs have a volatile write cache, that is we need to do a separate cache flush to persist data (REQ_OP_FLUSH in Linux). But a reasonable implementation of those still won't corrupt previously written data, they will just lose the volatile write cache that hasn't been flushed. Occasional bugs, bad actors or other issues might still happen.

Thanks! That was my understanding too. But then the name is terrible. As is all the vendor documentation I can find:

https://insights.samsung.com/2016/03/22/power-loss-protection-how-ssds-are-protecting-data-integrity-white-paper/

"Power loss protection is a critical aspect of ensuring data integrity, especially in servers or data centers."

https://www.intel.com/content/.../ssd-320-series-power-loss-data-protection-brief.pdf

"Data safety features prepare for unexpected power-loss and protect system and user data."

Why do they all neglect to mention that their consumer drives are also perfectly capable of well-defined behavior after power loss, just at the expense of flush performance? It's ridiculously confusing.


From:   Matthew Wilcox <willy@...radead.org>
Date:   Thu, 19 Apr 2018 10:40:10 -0700

On Wed, Apr 18, 2018 at 09:08:19PM -0400, Theodore Y. Ts'o wrote:

On Wed, Apr 18, 2018 at 05:40:37PM -0700, Matthew Wilcox wrote:

I've been thinking about this. Ideally we want to pass some kind of notification all the way up to the desktop and tell the user to plug the damn stick back in. Then have the USB stick become the same blockdev that it used to be, and complete the writeback. We are so far from being able to do that right now that it's not even funny.

Maybe we shouldn't be trying to do any of this in the kernel, or at least as little as possible in the kernel? Perhaps it would be better to do most of this as a device mapper hack; I suspect we'll need userspace help to figure out whether the user has plugged the same USB stick in, or a different USB stick, anyway.

The device mapper target (dm-removable?) was my first idea too, but I kept thinking through use cases and I think we end up wanting this functionality in the block layer. Let's try a story.

Stephen the PFY goes into the data centre looking to hotswap a failed drive. Due to the eight pints of lager he had for lunch, he pulls out the root drive instead of the failed drive. The air raid siren warbles and he realises his mistake, shoving the drive back in.

CYOA:

Currently: All writes are lost, calamities ensue. The PFY is fired.

With dm-removable: Nobody thought to set up dm-removable on the root drive. Calamities still ensue, but now it's the BOFH's fault instead of the PFY's fault.

Built into the block layer: After a brief hiccup while we reattach the drive to its block_device, the writes resume and nobody loses their job.


From:   "Theodore Y. Ts'o" <tytso@....edu>
Date:   Thu, 19 Apr 2018 19:27:15 -0400

On Thu, Apr 19, 2018 at 10:40:10AM -0700, Matthew Wilcox wrote:

With dm-removable: Nobody thought to set up dm-removable on the root drive. Calamities still ensue, but now it's the BOFH's fault instead of the PFY's fault.

Built into the block layer: After a brief hiccup while we reattach the drive to its block_device, the writes resume and nobody loses their job.

What you're talking about is a deployment issue, though. Ultimately the distribution will set up dm-removable automatically if the user requests it, much like it sets up dm-crypt automatically for laptop users upon request.

My concern is that not all removable devices have a globally unique id number available in hardware so the kernel can tell whether or not it's the same device that has been plugged in. There are heuristics you could use -- for example, you could look at the file system uuid plus the last fsck time. But they tend to be very file system specific, and not things we would want to have in the kernel.


From:   Dave Chinner <david@...morbit.com>
Date:   Fri, 20 Apr 2018 09:28:59 +1000

On Wed, Apr 18, 2018 at 05:40:37PM -0700, Matthew Wilcox wrote:

On Thu, Apr 19, 2018 at 10:13:43AM +1000, Dave Chinner wrote:

On Fri, Apr 13, 2018 at 07:38:14PM -0700, Matthew Wilcox wrote:

On Sat, Apr 14, 2018 at 11:47:52AM +1000, Dave Chinner wrote:

On Fri, Apr 13, 2018 at 07:02:32AM -0700, Matthew Wilcox wrote:

  1. If we get an error while wbc->for_background is true, we should not clear uptodate on the page, rather SetPageError and SetPageDirty.

So you're saying we should treat it as a transient error rather than a permanent error.

Yes, I'm proposing leaving the data in memory in case the user wants to try writing it somewhere else.

And if it's getting IO errors because of USB stick pull? What then?

I've been thinking about this. Ideally we want to pass some kind of notification all the way up to the desktop and tell the user to plug the damn stick back in. Then have the USB stick become the same blockdev that it used to be, and complete the writeback. We are so far from being able to do that right now that it's not even funny.

nod

But in the meantime, device unplug (should give ENODEV, not EIO) is a fatal error and we need to toss away the data.

  2. Background writebacks should skip pages which are PageError.

That seems decidedly dodgy in the case where there is a transient error - it requires a user to specifically run sync to get the data to disk after the transient error has occurred. Say they don't notice the problem because it's fleeting and doesn't cause any obvious problems?

That's fair. What I want to avoid is triggering the same error every 30 seconds (or whatever the periodic writeback threshold is set to).

So if kernel ring buffer overflows and so users miss the first error report, they'll have no idea that the data writeback is still failing?

I wasn't thinking about kernel ringbuffer based reporting; I was thinking about errseq_t based reporting, so the application can tell the fsync failed and maybe does something application-level to recover like send the transactions across to another node in the cluster (or whatever this hypothetical application is).

But if it's still failing, then we should be still trying to report the error. i.e. if fsync fails and the page remains dirty, then the next attempt to write it is a new error and fsync should report that. IOWs, I think we should be returning errors at every occasion errors need to be reported if we have a persistent writeback failure...

  3. for_sync writebacks should attempt one last write. Maybe it'll succeed this time. If it does, just ClearPageError. If not, we have somebody to report this writeback error to, and ClearPageUptodate.

Which may well be unmount. Are we really going to wait until unmount to report fatal errors?

Goodness, no. The errors would be immediately reportable using the wb_err mechanism, as soon as the first error was encountered.

But if there are no open files when the error occurs, that error won't get reported to anyone. Which means the next time anyone accesses that inode from a user context could very well be unmount or a third party sync/syncfs()....

Right. But then that's on the application.

Which we know don't do the right thing. Seems like a lot of hoops to jump through given it still won't work if the application isn't changed to support linux specific error handling requirements...


From:   Jan Kara <jack@...e.cz>
Date:   Sat, 21 Apr 2018 18:59:54 +0200

On Fri 13-04-18 07:48:07, Matthew Wilcox wrote:

On Tue, Apr 10, 2018 at 03:07:26PM -0700, Andres Freund wrote:

I don't think that's the full issue. We can deal with the fact that an fsync failure is edge-triggered if there's a guarantee that every process doing so would get it. The fact that one needs to have an FD open from before any failing writes occurred to get a failure, THAT'S the big issue.

Beyond postgres, it's a pretty common approach to do work on a lot of files without fsyncing, then iterate over the directory, fsync everything, and then assume you're safe. But unless I severely misunderstand something that'd only be safe if you kept an FD for every file open, which isn't realistic for pretty obvious reasons.

While accepting that under memory pressure we can still evict the error indicators, we can do a better job than we do today. The current design of error reporting says that all errors which occurred before you opened the file descriptor are of no interest to you. I don't think that's necessarily true, and it's actually a change of behaviour from before the errseq work.

Consider Stupid Task A which calls open(), write(), close(), and Smart Task B which calls open(), write(), fsync(), close() operating on the same file. If A goes entirely before B and encounters an error, before errseq_t, B would see the error from A's write.

If A and B overlap, even a little bit, then B still gets to see A's error today. But if writeback happens for A's write before B opens the file then B will never see the error.

B doesn't want to see historical errors that a previous invocation of B has already handled, but we know whether anyone has seen the error or not. So here's a patch which restores the historical behaviour of seeing old unhandled errors on a fresh file descriptor:

Signed-off-by: Matthew Wilcox mawilcox@...rosoft.com

So I agree with going to the old semantics of reporting errors from before a file was open at least once to someone. As the PG case shows apps are indeed relying on the old behavior. As much as it is unreliable, it ends up doing the right thing for these apps in 99% of cases and we shouldn't break them (BTW IMO the changelog should contain a note that this fixes a regression of PostgreSQL, a reference to this thread and CC to stable). Anyway feel free to add:

Reviewed-by: Jan Kara jack@...e.cz

Oh, and to make myself clear I do think we need to find a better way of reporting IO errors. I consider this just an immediate band-aid to avoid userspace regressions.

diff --git a/lib/errseq.c b/lib/errseq.c
index df782418b333..093f1fba4ee0 100644
--- a/lib/errseq.c
+++ b/lib/errseq.c
@@ -119,19 +119,11 @@ EXPORT_SYMBOL(errseq_set);
 errseq_t errseq_sample(errseq_t *eseq)
 {
 	errseq_t old = READ_ONCE(*eseq);
-	errseq_t new = old;


From:   Jan Kara <jack@...e.cz>
Date:   Sat, 21 Apr 2018 20:14:29 +0200

On Thu 12-04-18 07:09:14, Jeff Layton wrote:

On Wed, 2018-04-11 at 20:02 -0700, Matthew Wilcox wrote:

At the moment, when we open a file, we sample the current state of the writeback error and only report new errors. We could set it to zero instead, and report the most recent error as soon as anything happens which would report an error. That way err = close(open("file")); would report the most recent error.

That's not going to be persistent across the data structure for that inode being removed from memory; we'd need filesystem support for persisting that. But maybe it's "good enough" to only support it for recent files.

Jeff, what do you think?

I hate it :). We could do that, but....yecchhhh.

Reporting errors only in the case where the inode happened to stick around in the cache seems too unreliable for real-world usage, and might be problematic for some use cases. I'm also not sure it would really be helpful.

So this is never going to be perfect but I think we could do good enough by:

1) Mark inodes that hit IO error.
2) If the inode gets evicted from memory we store the fact that we hit an error for this IO in a more space efficient data structure (sparse bitmap, radix tree, extent tree, whatever).
3) If the underlying device gets destroyed, we can just switch the whole SB to an error state and forget per inode info.
4) If there's too much of per-inode error info (probably per-fs configurable limit in terms of number of inodes), we would yell in the kernel log, switch the whole fs to the error state and forget per inode info.

This way there won't be silent loss of IO errors. Memory usage would be reasonably limited. It could happen the whole fs would switch to error state "prematurely" but if that's a problem for the machine, admin could tune the limit for number of inodes to keep IO errors for...
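As a toy model of that bookkeeping (ordinary userspace C rather than kernel code; the structure, names, and the limit are all made up for illustration), the logic would look something like this:

#include <stdbool.h>
#include <stdio.h>

#define MAX_TRACKED_INODES 4   /* stand-in for the per-fs configurable limit */

struct fs_error_state {
    unsigned long bad_inodes[MAX_TRACKED_INODES];  /* stand-in for a sparse
                                                      bitmap / radix tree */
    int nbad;
    bool whole_fs_errored;
};

/* Steps 1/2/4 above: remember errored inodes; escalate if there are too many. */
static void record_writeback_error(struct fs_error_state *fs, unsigned long ino)
{
    if (fs->whole_fs_errored)
        return;
    if (fs->nbad == MAX_TRACKED_INODES) {
        fprintf(stderr, "too many errored inodes; marking the whole fs\n");
        fs->whole_fs_errored = true;
        fs->nbad = 0;          /* forget the per-inode info */
        return;
    }
    fs->bad_inodes[fs->nbad++] = ino;
}

/* What a later fsync()/syncfs() would consult. */
static bool has_lost_writes(const struct fs_error_state *fs, unsigned long ino)
{
    if (fs->whole_fs_errored)
        return true;
    for (int i = 0; i < fs->nbad; i++)
        if (fs->bad_inodes[i] == ino)
            return true;
    return false;
}

int main(void)
{
    struct fs_error_state fs = { 0 };
    record_writeback_error(&fs, 12);
    record_writeback_error(&fs, 13);
    printf("inode 12: %d, inode 99: %d\n",
           has_lost_writes(&fs, 12), has_lost_writes(&fs, 99));
    return 0;
}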

I think the crux of the matter here is not really about error reporting, per-se.

I think this is related but a different question.

I asked this at LSF last year, and got no real answer:

When there is a writeback error, what should be done with the dirty page(s)? Right now, we usually just mark them clean and carry on. Is that the right thing to do?

One possibility would be to invalidate the range that failed to be written (or the whole file) and force the pages to be faulted in again on the next access. It could be surprising for some applications to not see the results of their writes on a subsequent read after such an event.

Maybe that's ok in the face of a writeback error though? IDK.

I can see the admin wanting to rather kill the machine with OOM than having to deal with data loss due to IO errors (e.g. if he has HA server fail over set up). Or retry for some time before dropping the dirty data. Or do what we do now (possibly with invalidating pages as you say). As Dave said elsewhere there's not one strategy that's going to please everybody. So it might be beneficial to have this configurable like XFS has it for metadata.

OTOH if I look at the problem from application developer POV, most apps will just declare game over at the face of IO errors (if they take care to check for them at all). And the sophisticated apps that will try some kind of error recovery have to be prepared that the data is just gone (as depending on what exactly the kernel does is rather fragile) so I'm not sure how much practical value the configurable behavior on writeback errors would bring.


Wed, 28 Mar 2018 00:00:00 +0000


Computer latency: 1977-2017

I've had this nagging feeling that the computers I use today feel slower than the computers I used as a kid. As a rule, I don’t trust this kind of feeling because human perception has been shown to be unreliable in empirical studies, so I carried around a high-speed camera and measured the response latency of devices I’ve run into in the past few months. Here are the results:

computer                  | latency (ms) | year | clock   | # transistors
apple 2e                  | 30           | 1983 | 1 MHz   | 3.5k
ti 99/4a                  | 40           | 1981 | 3 MHz   | 8k
custom haswell-e 165Hz    | 50           | 2014 | 3.5 GHz | 2G
commodore pet 4016        | 60           | 1977 | 1 MHz   | 3.5k
sgi indy                  | 60           | 1993 | .1 GHz  | 1.2M
custom haswell-e 120Hz    | 60           | 2014 | 3.5 GHz | 2G
thinkpad 13 chromeos      | 70           | 2017 | 2.3 GHz | 1G
imac g4 os 9              | 70           | 2002 | .8 GHz  | 11M
custom haswell-e 60Hz     | 80           | 2014 | 3.5 GHz | 2G
mac color classic         | 90           | 1993 | 16 MHz  | 273k
powerspec g405 linux 60Hz | 90           | 2017 | 4.2 GHz | 2G
macbook pro 2014          | 100          | 2014 | 2.6 GHz | 700M
thinkpad 13 linux chroot  | 100          | 2017 | 2.3 GHz | 1G
lenovo x1 carbon 4g linux | 110          | 2016 | 2.6 GHz | 1G
imac g4 os x              | 120          | 2002 | .8 GHz  | 11M
custom haswell-e 24Hz     | 140          | 2014 | 3.5 GHz | 2G
lenovo x1 carbon 4g win   | 150          | 2016 | 2.6 GHz | 1G
next cube                 | 150          | 1988 | 25 MHz  | 1.2M
powerspec g405 linux      | 170          | 2017 | 4.2 GHz | 2G
packet around the world   | 190          |      |         |
powerspec g405 win        | 200          | 2017 | 4.2 GHz | 2G
symbolics 3620            | 300          | 1986 | 5 MHz   | 390k
These are tests of the latency between a keypress and the display of a character in a terminal (see appendix for more details). The results are sorted from quickest to slowest. In the latency column, the background goes from green to yellow to red to black (i.e., gets darker) as devices get slower. No devices are green. When multiple OSes were tested on the same machine, the OS is in bold. When multiple refresh rates were tested on the same machine, the refresh rate is in italics.

In the year column, the background gets darker and purple-er as devices get older. If older devices were slower, we’d see the year column get darker as we read down the chart.

The next two columns show the clock speed and number of transistors in the processor. Smaller numbers are darker and blue-er. As above, if slower clocked and smaller chips correlated with longer latency, the columns would get darker as we go down the table, but, if anything, it seems to be the other way around.

For reference, the latency of a packet going around the world through fiber from NYC back to NYC via Tokyo and London is inserted in the table.

If we look at overall results, the fastest machines are ancient. Newer machines are all over the place. Fancy gaming rigs with unusually high refresh-rate displays are almost competitive with machines from the late 70s and early 80s, but “normal” modern computers can’t compete with thirty to forty year old machines.

We can also look at mobile devices. In this case, we’ll look at scroll latency in the browser:

device                  | latency (ms) | year
ipad pro 10.5" pencil   | 30           | 2017
ipad pro 10.5"          | 70           | 2017
iphone 4s               | 70           | 2011
iphone 6s               | 70           | 2015
iphone 3gs              | 70           | 2009
iphone x                | 80           | 2017
iphone 8                | 80           | 2017
iphone 7                | 80           | 2016
iphone 6                | 80           | 2014
gameboy color           | 80           | 1998
iphone 5                | 90           | 2012
blackberry q10          | 100          | 2013
huawei honor 8          | 110          | 2016
google pixel 2 xl       | 110          | 2017
galaxy s7               | 120          | 2016
galaxy note 3           | 120          | 2016
moto x                  | 120          | 2013
nexus 5x                | 120          | 2015
oneplus 3t              | 130          | 2016
blackberry key one      | 130          | 2017
moto e (2g)             | 140          | 2015
moto g4 play            | 140          | 2017
moto g4 plus            | 140          | 2016
google pixel            | 140          | 2016
samsung galaxy avant    | 150          | 2014
asus zenfone3 max       | 150          | 2016
sony xperia z5 compact  | 150          | 2015
htc one m4              | 160          | 2013
galaxy s4 mini          | 170          | 2013
lg k4                   | 180          | 2016
packet around the world | 190          |
htc rezound             | 240          | 2011
palm pilot 1000         | 490          | 1996
kindle oasis 2          | 570          | 2017
kindle paperwhite 3     | 630          | 2015
kindle 4                | 860          | 2011

As above, the results are sorted by latency and color-coded from green to yellow to red to black as devices get slower. Also as above, the year gets purple-er (and darker) as the device gets older.

If we exclude the game boy color, which is a different class of device than the rest, all of the quickest devices are Apple phones or tablets. The next quickest device is the blackberry q10. Although we don’t have enough data to really tell why the blackberry q10 is unusually quick for a non-Apple device, one plausible guess is that it’s helped by having actual buttons, which are easier to implement with low latency than a touchscreen. The other two devices with actual buttons are the gameboy color and the kindle 4.

After the iphones and non-kindle button devices, we have a variety of Android devices of various ages. At the bottom, we have the ancient palm pilot 1000 followed by the kindles. The palm is hamstrung by a touchscreen and display created in an era with much slower touchscreen technology, and the kindles use e-ink displays, which are much slower than the displays used on modern phones, so it's not surprising to see those devices at the bottom.

Why is the apple 2e so fast?

Compared to a modern computer that’s not the latest ipad pro, the apple 2 has significant advantages on both the input and the output, and it also has an advantage between the input and the output for all but the most carefully written code since the apple 2 doesn’t have to deal with context switches, buffers involved in handoffs between different processes, etc.

On the input, if we look at modern keyboards, it’s common to see them scan their inputs at 100 Hz to 200 Hz (e.g., the ergodox claims to scan at 167 Hz). By comparison, the apple 2e effectively scans at 556 Hz. See appendix for details.

If we look at the other end of the pipeline, the display, we can also find latency bloat there. I have a display that advertises 1 ms switching on the box, but if we look at how long it takes for the display to actually show a character from when you can first see the trace of it on the screen until the character is solid, it can easily be 10 ms. You can even see this effect with some high-refresh-rate displays that are sold on their allegedly good latency.

At 144 Hz, each frame takes 7 ms. A change to the screen will have 0 ms to 7 ms of extra latency as it waits for the next frame boundary before getting rendered (on average, we expect half of the maximum latency, or 3.5 ms). On top of that, even though my display at home advertises a 1 ms switching time, it actually appears to take 10 ms to fully change color once the display has started changing color. When we add up the latency from waiting for the next frame to the latency of an actual color change, we get an expected latency of 7/2 + 10 = 13.5 ms.

With the old CRT in the apple 2e, we’d expect half of a 60 Hz refresh (16.7 ms / 2) plus a negligible delay, or 8.3 ms. That’s hard to beat today: a state of the art “gaming monitor” can get the total display latency down into the same range, but in terms of marketshare, very few people have such displays, and even displays that are advertised as being fast aren’t always actually fast.
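
To make the arithmetic above concrete, here's a minimal Python sketch of the display-latency model used in this section: an average wait of half a refresh period for the next frame boundary, plus the panel's measured transition time. The 144 Hz / 10 ms and 60 Hz CRT figures are the ones from the text; the function name is just for illustration.

    def expected_display_latency_ms(refresh_hz, transition_ms):
        # Average wait for the next frame boundary (half a frame) plus the time
        # for the display to finish changing color.
        frame_ms = 1000.0 / refresh_hz
        return frame_ms / 2 + transition_ms

    print(expected_display_latency_ms(144, 10))  # ~13.5 ms for a 144 Hz LCD with a 10 ms transition
    print(expected_display_latency_ms(60, 0))    # ~8.3 ms for a 60 Hz CRT with negligible transition time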

iOS rendering pipeline

If we look at what’s happening between the input and the output, the differences between a modern system and an apple 2e are too many to describe without writing an entire book. To get a sense of the situation in modern machines, here’s former iOS/UIKit engineer Andy Matuschak’s high-level sketch of what happens on iOS, which he says should be presented with the disclaimer that “this is my out of date memory of out of date information”:

Andy says “the actual amount of work happening here is typically quite small. A few ms of CPU time. Key overhead comes from:”

By comparison, on the Apple 2e, there basically aren’t handoffs, locks, or process boundaries. Some very simple code runs and writes the result to the display memory, which causes the display to get updated on the next scan.

Refresh rate vs. latency

One thing that’s curious about the computer results is the impact of refresh rate. We get a 90 ms improvement from going from 24 Hz to 165 Hz. At 24 Hz each frame takes 41.67 ms and at 165 Hz each frame takes 6.061 ms. As we saw above, if there weren’t any buffering, we’d expect the average latency added by frame refreshes to be 20.8ms in the former case and 3.03 ms in the latter case (because we’d expect to arrive at a uniform random point in the frame and have to wait between 0ms and the full frame time), which is a difference of about 18ms. But the difference is actually 90 ms, implying we have latency equivalent to (90 - 18) / (41.67 - 6.061) = 2 buffered frames.

If we plot the results from the other refresh rates on the same machine (not shown), we can see that they’re roughly in line with a “best fit” curve that we get if we assume that, for that machine running powershell, we get 2.5 frames worth of latency regardless of refresh rate. This lets us estimate what the latency would be if we equipped this low latency gaming machine with an infinity Hz display -- we’d expect latency to be 140 - 2.5 * 41.67 = 36 ms, almost as fast as quick but standard machines from the 70s and 80s.
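
Here's a rough sketch of the buffered-frame estimate above, under the assumption that latency on this machine is a fixed cost plus some number of buffered frames times the frame time. The 140 ms (24 Hz) and 50 ms (165 Hz) measurements and the 2.5-frame best fit are from the text.

    def frame_ms(refresh_hz):
        return 1000.0 / refresh_hz

    lat_24, lat_165 = 140.0, 50.0  # measured latencies at 24 Hz and 165 Hz

    # Part of the difference is just waiting for the next frame (half a frame on average)
    refresh_wait_delta = (frame_ms(24) - frame_ms(165)) / 2

    # The rest of the difference, in units of frames, suggests buffered frames
    buffered_frames = (lat_24 - lat_165 - refresh_wait_delta) / (frame_ms(24) - frame_ms(165))
    print(buffered_frames)  # ~2

    # Extrapolate to an "infinity Hz" display using the ~2.5 frame best fit
    print(lat_24 - 2.5 * frame_ms(24))  # ~36 ms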

Complexity

Almost every computer and mobile device that people buy today is slower than common models of computers from the 70s and 80s. Low-latency gaming desktops and the ipad pro can get into the same range as quick machines from thirty to forty years ago, but most off-the-shelf devices aren’t even close.

If we had to pick one root cause of latency bloat, we might say that it’s because of “complexity”. Of course, we all know that complexity is bad. If you’ve been to a non-academic non-enterprise tech conference in the past decade, there’s a good chance that there was at least one talk on how complexity is the root of all evil and we should aspire to reduce complexity.

Unfortunately, it's a lot harder to remove complexity than to give a talk saying that we should remove complexity. A lot of the complexity buys us something, either directly or indirectly. When we looked at the input of a fancy modern keyboard vs. the apple 2 keyboard, we saw that using a relatively powerful and expensive general purpose processor to handle keyboard inputs can be slower than dedicated logic for the keyboard, which would both be simpler and cheaper. However, using the processor gives people the ability to easily customize the keyboard, and also pushes the problem of “programming” the keyboard from hardware into software, which reduces the cost of making the keyboard. The more expensive chip increases the manufacturing cost, but considering how much of the cost of these small-batch artisanal keyboards is the design cost, it seems like a net win to trade manufacturing cost for ease of programming.

We see this kind of tradeoff in every part of the pipeline. One of the biggest examples of this is the OS you might run on a modern desktop vs. the loop that’s running on the apple 2. Modern OSes let programmers write generic code that can deal with having other programs simultaneously running on the same machine, and do so with pretty reasonable general performance, but we pay a huge complexity cost for this and the handoffs involved in making this easy result in a significant latency penalty.

A lot of the complexity might be called accidental complexity, but most of that accidental complexity is there because it’s so convenient. At every level from the hardware architecture to the syscall interface to the I/O framework we use, we take on complexity, much of which could be eliminated if we could sit down and re-write all of the systems and their interfaces today, but it’s too inconvenient to re-invent the universe to reduce complexity and we get benefits from economies of scale, so we live with what we have.

For those reasons and more, in practice, the solution to poor performance caused by “excess” complexity is often to add more complexity. In particular, the gains we’ve seen that get us back to the quickness of the quickest machines from thirty to forty years ago have come not from listening to exhortations to reduce complexity, but from piling on more complexity.

The ipad pro is a feat of modern engineering; the engineering that went into increasing the refresh rate on both the input and the output as well as making sure the software pipeline doesn’t have unnecessary buffering is complex! The design and manufacture of high-refresh-rate displays that can push system latency down is also non-trivially complex in ways that aren’t necessary for bog standard 60 Hz displays.

This is actually a common theme when working on latency reduction. A common trick to reduce latency is to add a cache, but adding a cache to a system makes it more complex. For systems that generate new data and can't tolerate a cache, the solutions are often even more complex. An example of this might be large scale RoCE deployments. These can push remote data access latency from the millisecond range down to the microsecond range, which enables new classes of applications. However, this has come at a large cost in complexity. Early large-scale RoCE deployments easily took tens of person years of effort to get right and also came with a tremendous operational burden.

Conclusion

It’s a bit absurd that a modern gaming machine running at 4,000x the speed of an apple 2, with a CPU that has 500,000x as many transistors (with a GPU that has 2,000,000x as many transistors) can maybe manage the same latency as an apple 2 in very carefully coded applications if we have a monitor with nearly 3x the refresh rate. It’s perhaps even more absurd that the default configuration of the powerspec g405, which had the fastest single-threaded performance you could get until October 2017, had more latency from keyboard-to-screen (approximately 3 feet, maybe 10 feet of actual cabling) than sending a packet around the world (16187 mi from NYC to Tokyo to London back to NYC, more due to the cost of running the shortest possible length of fiber).

On the bright side, we're arguably emerging from the latency dark ages and it's now possible to assemble a computer or buy a tablet with latency that's in the same range as you could get off-the-shelf in the 70s and 80s. This reminds me a bit of the screen resolution & density dark ages, where CRTs from the 90s offered better resolution and higher pixel density than affordable non-laptop LCDs until relatively recently. 4k displays have now become normal and affordable 8k displays are on the horizon, blowing past anything we saw on consumer CRTs. I don't know that we'll see the same kind of improvement with respect to latency, but one can hope. There are individual developers improving the experience for people who use certain, very carefully coded, applications, but it's not clear what force could cause a significant improvement in the default experience most users see.

Appendix: why measure latency?

Latency matters! For very simple tasks, people can perceive latencies down to 2 ms or less. Moreover, increasing latency is not only noticeable to users, it causes users to execute simple tasks less accurately. If you want a visual demonstration of what latency looks like and you don’t have a super-fast old computer lying around, check out this MSR demo on touchscreen latency.

The most commonly cited document on response time is the nielsen group article on response times, which claims that latencies below 100 ms feel equivalent and are perceived as instantaneous. One easy way to see that this is false is to go into your terminal and try sleep 0; echo "pong" vs. sleep 0.1; echo "test" (or for that matter, try playing an old game that doesn't have latency compensation, like quake 1, with 100 ms ping, or even 30 ms ping, or try typing in a terminal with 30 ms ping). For more info on this and other latency fallacies, see this document on common misconceptions about latency.

Throughput also matters, but this is widely understood and measured. If you go to pretty much any mainstream review or benchmarking site, you can find a wide variety of throughput measurements, so there’s less value in writing up additional throughput measurements.

Appendix: apple 2 keyboard

The apple 2e, instead of using a programmed microcontroller to read the keyboard, uses a much simpler custom chip designed for reading keyboard input, the AY 3600. If we look at the AY 3600 datasheet, we can see that the scan time is (90 * 1/f) and the debounce time is listed as strobe_delay. These quantities are determined by some capacitors and a resistor, which appear to be 47pf, 100k ohms, and 0.022uf for the Apple 2e. Plugging these numbers into the AY3600 datasheet, we can see that f = 50 kHz, giving us a 1.8 ms scan delay and a 6.8 ms debounce delay (assuming the values are accurate -- capacitors can degrade over time, so we should expect the real delays to be shorter on our old Apple 2e), giving us less than 8.6 ms for the internal keyboard logic.

Comparing to a keyboard with a 167 Hz scan rate that scans two extra times to debounce, the equivalent figure is 3 * 6 ms = 18 ms. With a 100Hz scan rate, that becomes 3 * 10 ms = 30 ms. 18 ms to 30 ms of keyboard scan plus debounce latency is in line with what we saw when we did some preliminary keyboard latency measurements.
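
Here's a minimal sketch of the keyboard comparison above, using the figures from the text: a 1.8 ms scan plus a 6.8 ms debounce for the AY 3600, versus three scan periods (one to detect the key, two more to debounce) for a microcontroller keyboard.

    def apple2e_keyboard_ms(scan_ms=1.8, debounce_ms=6.8):
        # AY 3600 scan delay plus debounce delay, derived from the RC values above
        return scan_ms + debounce_ms

    def modern_keyboard_ms(scan_hz, debounce_scans=2):
        # One scan to detect the keypress plus extra scans to debounce it
        return (1 + debounce_scans) * 1000.0 / scan_hz

    print(apple2e_keyboard_ms())    # ~8.6 ms
    print(modern_keyboard_ms(167))  # ~18 ms
    print(modern_keyboard_ms(100))  # ~30 ms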

For reference, the ergodox uses a 16 MHz microcontroller with ~80k transistors and the apple 2e CPU is a 1 MHz chip with 3.5k transistors.

Appendix: why should android phones have higher latency than old apple phones?

As we've seen, raw processing power doesn't help much with many of the causes of latency in the pipeline, like handoffs between different processes, so an android phone with a 10x more powerful processor than an ancient iphone isn't guaranteed to be quicker to respond, even if it can render javascript-heavy pages faster.

If you talk to people who work on non-Apple mobile CPUs, you'll find that they run benchmarks like dhrystone (a synthetic benchmark that was irrelevant even when it was created, in 1984) and SPEC2006 (an updated version of a workstation benchmark that was relevant in the 90s and perhaps even as late as the early 2000s if you care about workstation workloads, which are completely different from mobile workloads). This is a common problem: the vendor who makes the component has an intermediate target that's only weakly correlated with the actual user experience. I've heard that there are people working on the pixel phones who care about end-to-end latency, but it's difficult to get good latency when you have to use components that are optimized for things like dhrystone and SPEC2006.

If you talk to people at Apple, you'll find that they're quite cagey, but that they've been targeting the end-to-end user experience for quite a long time and that they can do "full stack" optimizations that are difficult for android vendors to pull off. They're not literally impossible, but making a change to a chip that has to be threaded up through the OS is something you're very unlikely to see unless google is doing the optimization, and google hasn't really been serious about the end-to-end experience until recently.

Having relatively poor performance in aspects that aren't measured is a common theme and one we saw when we looked at terminal latency. Prior to examining terminal latency, public benchmarks were all throughput oriented and the terminals that prioritized performance worked on increasing throughput, even though increasing terminal throughput isn't really useful. After those terminal latency benchmarks, some terminal authors looked into their latency and found places they could trim down buffering and remove latency. You get what you measure.

Appendix: experimental setup

Most measurements were taken with the 240fps camera (4.167 ms resolution) in the iPhone SE. Devices with response times below 40 ms were re-measured with a 1000fps camera (1 ms resolution), the Sony RX100 V in PAL mode. Results in the tables are the results of multiple runs and are rounded to the nearest 10 ms to avoid the impression of false precision. For desktop results, latency is measured from when the key started moving until the screen finished updating. Note that this is different from most key-to-screen-update measurements you can find online, which typically use a setup that effectively removes much or all of the keyboard latency, which, as an end-to-end measurement, is only realistic if you have a psychic link to your computer (this isn't to say the measurements aren't useful -- if, as a programmer, you want a reproducible benchmark, it's nice to reduce measurement noise from sources that are beyond your control, but that's not relevant to end users). People often advocate measuring from the key bottoming out or from the tactile feel of the switch. Other than for measurement convenience, there appears to be no reason to do either of these, but people often claim that's when the user expects the keyboard to "really" work. These points are independent of when the switch actually fires: both the distance between the key bottoming out and activation as well as the distance between feeling feedback and activation are arbitrary and can be tuned. See this post on keyboard latency measurements for more info on keyboard fallacies.

Another significant difference is that measurements were done with settings as close to the default OS settings as possible since approximately 0% of users will futz around with display settings to reduce buffering, disable the compositor, etc. Waiting until the screen has finished updating is also different from what most end-to-end measurements do -- most consider the update "done" when any movement has been detected on the screen. Waiting until the screen is finished changing is analogous to webpagetest's "visually complete" time.

Computer results were taken using the “default” terminal for the system (e.g., powershell on windows, lxterminal on lubuntu), which could easily cause 20 ms to 30 ms difference between a fast terminal and a slow terminal. Between measuring time in a terminal and measuring the full end-to-end time, measurements in this article should be slower than measurements in other, similar, articles (which tend to measure time to first change in games).

The powerspec g405 baseline result is using integrated graphics (the machine doesn't come with a graphics card) and the 60 Hz result is with a cheap video card. The baseline result was at 30 Hz because the integrated graphics only supports hdmi output and the display it was attached to only runs at 30 Hz over hdmi.

Mobile results were done by using the default browser, browsing to https://danluu.com, and measuring the latency from finger movement until the screen first updates to indicate that scrolling has occurred. In the cases where this didn't make sense (kindles, gameboy color, etc.), some action that makes sense for the platform was taken (changing pages on the kindle, pressing the joypad on the gameboy color in a game, etc.). Unlike with the desktop/laptop measurements, the end-time for the measurement was the first visual change, to avoid including many frames of scrolling. To make the measurement easy, the measurement was taken with a finger on the touchscreen and the timer was started when the finger started moving (to avoid having to determine when the finger first contacted the screen).

In the case of “ties”, results are ordered by the unrounded latency as a tiebreaker, but this shouldn’t be considered significant. Differences of 10 ms should probably also not be considered significant.

The custom haswell-e was tested with gsync on and there was no observable difference. The year for that box is somewhat arbitrary, since the CPU is from 2014, but the display is newer (I believe you couldn't get a 165 Hz display until 2015).

The number of transistors for some modern machines is a rough estimate because exact numbers aren’t public. Feel free to ping me if you have a better estimate!

The color scales for latency and year are linear and the color scales for clock speed and number of transistors are log scale.

All Linux results were done with a pre-KPTI kernel. It's possible that KPTI will impact user perceivable latency.

Measurements were done as cleanly as possible (without other things running on the machine/device when possible, with a device that was nearly full on battery for devices with batteries). Latencies when other software is running on the device or when devices are low on battery might be much higher.

If you want a reference to compare the kindle against, a moderately quick page turn in a physical book appears to be about 200 ms.

This is a work in progress. I expect to get benchmarks from a lot more old computers the next time I visit Seattle. If you know of old computers I can test in the NYC area (that have their original displays or something like them), let me know! If you have a device you’d like to donate for testing, feel free to mail it to

Dan Luu
Recurse Center
455 Broadway, 2nd Floor
New York, NY 10013

Thanks to RC, David Albert, Bert Muthalaly, Christian Ternus, Kate Murphy, Ikhwan Lee, Peter Bhat Harkins, Leah Hanson, Alicia Thilani Singham Goodwin, Amy Huang, Dan Bentley, Jacquin Mininger, Rob, Susan Steinman, Raph Levien, Max McCrea, Peter Town, Jon Cinque, Anonymous, and Jonathan Dahan for donating devices to test and thanks to Leah Hanson, Andy Matuschak, Milosz Danczak, amos (@fasterthanlime), @emitter_coupled, Josh Jordan, mrob, and David Albert for comments/corrections/discussion.

Sun, 24 Dec 2017 00:00:00 +0000


How good are decisions?

A statement I commonly hear in tech-utopian circles is that some seeming inefficiency can’t actually be inefficient because the market is efficient and inefficiencies will quickly be eliminated. A contentious example of this is the claim that companies can’t be discriminating because the market is too competitive to tolerate discrimination. A less contentious example is that when you see a big company doing something that seems bizarrely inefficient, maybe it’s not inefficient and you just lack the information necessary to understand why the decision was efficient.

Unfortunately, arguments like this are difficult to settle because, even in retrospect, it’s usually not possible to get enough information to determine the precise “value” of a decision. Even in cases where the decision led to an unambiguous success or failure, there are so many factors that led to the result that it’s difficult to figure out precisely why something happened.

One nice thing about sports is that they often have detailed play-by-play data and well-defined win criteria which lets us tell, on average, what the expected value of a decision is. In this post, we'll look at the cost of bad decision making in one sport and then briefly discuss why decision quality in sports might be the same as or better than decision quality in other fields.

Just to have a concrete example, we’re going to look at baseball, but you could do the same kind of analysis for football, hockey, basketball, etc., and my understanding is that you’d get a roughly similar result in all of those cases.

We’re going to model baseball as a state machine, both because that makes it easy to understand the expected value of particular decisions and because this lets us talk about the value of decisions without having to go over most of the rules of baseball.

We can treat each baseball game as an independent event. In each game, two teams play against each other and the team that scores more runs (points) wins. Each game is split into 9 “innings” and in each inning each team will get one set of chances on offense. In each inning, each team will play until it gets 3 “outs”. Any given play may or may not result in an out.

One chunk of state in our state machine is the number of outs and the inning. The other chunks of state we're going to track are who's "on base" and which player is "at bat". Each team defines some order of batters for its active players and after each player bats once this repeats in a loop until the team collects 3 outs and the inning is over. The state of who is at bat is saved between innings. Just for example, you might see batters 1-5 bat in the first inning, 6-9 and then 1 again in the second inning, 2- … etc.

When a player is at bat, the player may advance to a base and players who are on base may also advance, depending on what happens. When a player advances 4 bases (that is, through 1B, 2B, 3B, to what would be 4B except that it isn’t called that) a run is scored and the player is removed from the base. As mentioned above, various events may cause a player to be out, in which case they also stop being on base.

An example state from our state machine is:

{1B, 3B; 2 outs}

This says that there’s a player on 1B, a player on 3B, there are two outs. Note that this is independent of the score, who’s actually playing, and the inning.

Another state is:

{--; 0 outs}

With a model like this, if we want to determine the expected value of the above state, we just need to look up the total number of runs across all innings played in a season divided by the number of innings to find the expected number of runs from the state above (ignoring the 9th inning because a quirk of baseball rules distorts statistics from the 9th inning). If we do this, we find that, from the above state, a team will score .555 runs in expectation.

We can then compute the expected number of runs for all of the other states:

bases    | 0 outs | 1 out | 2 outs
--       | .555   | .297  | .117
1B       | .953   | .573  | .251
2B       | 1.189  | .725  | .344
3B       | 1.482  | .983  | .387
1B,2B    | 1.573  | .971  | .466
1B,3B    | 1.904  | 1.243 | .538
2B,3B    | 2.052  | 1.467 | .634
1B,2B,3B | 2.417  | 1.650 | .815

In this table, each entry is the expected number of runs from the remainder of the inning from some particular state. Each column shows the number of outs and each row shows the state of the bases. The color coding scheme is: the starting state (.555 runs) has a white background. States with higher run expectation are more blue and states with lower run expectation are more red.

This table and the other stats in this post come from The Book by Tango et al., which mostly discussed baseball between 1999 and 2002. See the appendix if you're curious about how things change if we use a more detailed model.

The state we’re tracking for an inning here is who’s on base and the number of outs. Innings start with nobody on base and no outs.

As above, we see that we start the inning with .555 runs in expectation. If a play puts someone on 1B without getting an out, we now have .953 runs in expectation, i.e., putting someone on first without an out is worth .953 - .555 = .398 runs.

This immediately gives us the value of some decisions, e.g., trying to “steal” 2B with no outs and someone on first. If we look at cases where the batter’s state doesn’t change, a successful steal moves us to the {2B, 0 outs} state, i.e., it gives us 1.189 - .953 = .236 runs. A failed steal moves us to the {--, 1 out} state, i.e., it gives us .953 - .297 = -.656 runs. To break even, we need to succeed .656 / .236 = 2.78x more often than we fail, i.e., we need a .735 success rate to break even. If we want to compute the average value of a stolen base, we can compute the weighted sum over all states, but for now, let’s just say that it’s possible to do so and that you need something like a .735 success rate for stolen bases to make sense.

We can then look at the stolen base success rate of teams to see that, in any given season, maybe 5-10 teams are doing better than breakeven, leaving 20-25 teams at breakeven or below (mostly below). If we look at a bad but not historically bad stolen-base team of that era, they might have a .6 success rate. It wouldn’t be unusual for a team from that era to make between 100 and 200 attempts. Just so we can compute an approximation, if we assume they were all attempts from the {1B, 0 outs} state, the average run value per attempt would be .4 * (-.656) + .6 * .236 = -0.12 runs per attempt. Another first-order approximation is that a delta of 10 runs is worth 1 win, so at 100 attempts we have -1.2 wins and at 200 attempts we have -2.4 wins.

If we run the math across actual states instead of using the first order approximation, we see that the average failed steal is worth -.467 runs and the average successful steal is worth .175 runs. In that case, a steal attempt with a .6 success rate is worth .4 * (-.467) + .6 * .175 = -0.082 runs. With this new approximation, our estimate for the approximate cost in wins of stealing "as normal" vs. having a "no stealing" rule for a team that steals badly and often is .82 to 1.64 wins per season. Note that this underestimates the cost of stealing since getting into position to steal increases the odds of a successful "pickoff", which we haven't accounted for. From our state-machine standpoint, a pickoff is almost equivalent to a failed steal, but the analysis necessary to compute the difference in pickoff probability is beyond the scope of this post.
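
Here's a sketch of the steal arithmetic, using the run expectancies from the table for the break-even rate and the average per-attempt values quoted above; the 10-runs-per-win conversion is the same first-order approximation used in the text.

    RE = {("1B", 0): 0.953, ("2B", 0): 1.189, ("--", 1): 0.297}  # from the table above

    gain = RE[("2B", 0)] - RE[("1B", 0)]  # +0.236 runs for a successful steal of 2B
    loss = RE[("--", 1)] - RE[("1B", 0)]  # -0.656 runs for a failed steal

    # Break-even success rate p solves p * gain + (1 - p) * loss = 0
    print(-loss / (gain - loss))  # ~0.735

    # Average values across all states, from the text
    def runs_per_attempt(success_rate, gain=0.175, loss=-0.467):
        return success_rate * gain + (1 - success_rate) * loss

    attempts = 150  # somewhere in the 100-200 range mentioned above
    runs = attempts * runs_per_attempt(0.6)
    print(runs, runs / 10)  # ~-12.3 runs, ~-1.2 wins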

We can also do this for other plays coaches can cause (or prevent). For the “intentional walk”, we see that an intentional walk appears to be worth .102 runs for the opposing team. In 2002, a team that issued “a lot” of intentional walks might have issued 50, resulting in 50 * .102 runs for the opposing team, giving a loss of roughly 5 runs or .5 wins.

If we optimistically assume a “sac bunt” never fails, the cost of a sac bunt is .027 runs per attempt. If we look at the league where pitchers don’t bat, a team that was heavy on sac bunts might’ve done 49 sac bunts (we do this to avoid “pitcher” bunts, which add complexity to the approximation), costing a total of 49 * .027 = 1.32 runs or .132 wins.
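
The same back-of-the-envelope arithmetic works for intentional walks and sac bunts, using the per-event run values above and the rough 10-runs-per-win conversion:

    RUNS_PER_WIN = 10.0  # first-order approximation used throughout this post

    def cost_in_wins(events, runs_per_event):
        return events * runs_per_event / RUNS_PER_WIN

    print(cost_in_wins(50, 0.102))  # ~0.5 wins given away via ~50 intentional walks
    print(cost_in_wins(49, 0.027))  # ~0.13 wins given away via ~49 sac bunts (assuming bunts never fail)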

Another decision that’s made by a coach is setting the batting order. Players bat (take a turn) in order, 1-9, mod 9. That is, when the 10th “player” is up, we actually go back around and the 1st player bats. At some point the game ends, so not everyone on the team ends up with the same number of “at bats”.

There’s a just-so story that justifies putting the fastest player first, someone with a high “batting average” second, someone pretty good third, your best batter fourth, etc. This story, or something like it, has been standard for over 100 years.

I’m not going to walk through the math for computing a better batting order because I don’t think there’s a short, easy to describe, approximation. It turns out that if we compute the difference between an “optimal” order and a “typical” order justified by the story in the previous paragraph, using an optimal order appears to be worth between 1 and 2 wins per season.

These approximations all leave out important information. In three out of the four cases, we assumed an average player at all times and didn't look at who was at bat. The information above actually takes this into account to some extent, but not fully. How exactly this differs from a better approximation is a long story and probably too much detail for a post that's using baseball to talk about decisions outside of baseball, so let's just say that we have a pretty decent but not amazing approximation that says that a coach who makes bad decisions following conventional wisdom that are in the normal range of bad decisions during a baseball season might cost their team something like 1 + 1.2 + .5 + .132 = 2.83 wins on these four decisions alone vs. a decision rule that says "never do these actions that, on average, have negative value". If we compare to a better decision rule such as "do these actions when they have positive value and not when they have negative value" or a manager that generally makes good decisions, let's conservatively estimate that's maybe worth 3 wins.

We’ve looked at four decisions (sac bunt, steal, intentional walk, and batting order). But there are a lot of other decisions! Let’s arbitrarily say that if we look at all decisions and not just these four decisions, having a better heuristic for all decisions might be worth 4 or 5 wins per season.

What does 4 or 5 wins per season really mean? One way to look at it is that baseball teams play 162 games, so an "average" team wins 81 games. If we look at the seasons covered, the number of wins that teams that made the playoffs had was {103, 94, 103, 99, 101, 97, 98, 95, 95, 91, 116, 102, 88, 93, 93, 92, 95, 97, 95, 94, 87, 91, 91, 95, 103, 100, 97, 97, 98, 95, 97, 94}. Because of the structure of the system, we can't name a single number for a season and say that N wins are necessary to make the playoffs and that teams with fewer than N wins won't make the playoffs, but we can say that 95 wins gives a team decent odds of making the playoffs. 95 - 81 = 14. 5 wins is more than a third of the difference between an average team and a team that makes the playoffs. This is a huge deal both in terms of prestige and also direct economic value.

If we want to look at it at the margin instead of on average, the smallest delta in wins between teams that made the playoffs and teams that didn’t in each league was {1, 7, 8, 1, 6, 2, 6, 3}. For teams that are on the edge, a delta of 5 wins wouldn’t always be the difference between a successful season (making playoffs) and an unsuccessful season (not making playoffs), but there are teams within a 5 win delta of making the playoffs in most seasons. If we were actually running a baseball team, we’d want to use a much more fine-grained model, but as a first approximation we can say that in-game decisions are a significant factor in team performance and that, using some kind of computation, we can determine the expected cost of non-optimal decisions.

Another way to look at what 5 wins is worth is to look at what it costs to get a player who's not a pitcher that's 5 wins above average (WAA) (we look at non-pitchers because non-pitchers tend to play in every game and pitchers tend to play in parts of some games, making a comparison between pitchers and non-pitchers more complicated). Of the 8 non-pitcher positions (we look at non-pitcher positions because it makes comparisons simpler), there are 30 teams, so we have 240 team-position pairs. In 2002, of these 240 team-position pairs, there are two that were >= 5 WAA, Texas-SS (Alex Rodriguez, paid $22m) and SF-LF (Barry Bonds, paid $15m). If we look at the other seasons in the range of dates we're looking at, there are either 2 or 3 team-position pairs where a team is able to get >= 5 WAA in a season. These aren't stable across seasons because player performance is volatile, so it's not as easy as finding someone great and paying them $15m. For example, in 2002, there were 7 non-pitchers paid $14m or more and only two of them were worth 5 WAA or more. For reference, the average total team payroll (teams have 26 players each) in 2002 was $67m, with a minimum of $34m and a max of $126m. At the time a $1m salary for a manager would've been considered generous, making a 5 WAA manager an incredible deal.

5 WAA assumes typical decision making lining up with events in a bad, but not worst-case way. A more typical case might be that a manager costs a team 3 wins. In that case, in 2002, there were 25 team-position pairs out of 240 where a single player could make up for the loss caused by managing according to conventional wisdom. Players who provide that much value and who aren't locked up in artificially cheap deals with particular teams due to the mechanics of player transfers are still much more expensive than managers.

If we look at how teams have adopted data analysis in order to improve both in-game decision making and team-composition decisions, it’s been a slow, multi-decade, process. Moneyball describes part of the shift from using intuition and observation to select players to incorporating statistics into the process. Stats nerds were talking about how you could do this at least since 1971 and no team really took it seriously until the 90s and the ideas didn’t really become mainstream until the mid 2000s, after a bestseller had been published.

If we examine how much teams have improved at the in-game decisions we looked at here, the process has been even slower. It’s still true today that statistics-driven decisions aren’t mainstream. Things are getting better, and if we look at the aggregate cost of the non-optimal decisions mentioned here, the aggregate cost has been getting lower over the past couple decades as intuition-driven decisions slowly converge to more closely match what stats nerds have been saying for decades. For example, if we look at the total number of sac bunts recorded across all teams from 1999 until now, we see:

year      | 1999 | 2000 | 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017
sac bunts | 1604 | 1628 | 1607 | 1633 | 1626 | 1731 | 1620 | 1651 | 1540 | 1526 | 1635 | 1544 | 1667 | 1479 | 1383 | 1343 | 1200 | 1025 | 925

Despite decades of statistical evidence that sac bunts are overused, we didn't really see a decline across all teams until 2012 or so. Why this is varies on a team-by-team and case-by-case basis, but the fundamental story that's been repeated over and over again both for statistically-driven team composition and statistically driven in-game decisions is that the people who have the power to make decisions often stick to conventional wisdom instead of using "radical" statistically-driven ideas. There are a number of reasons as to why this happens. One high-level reason is that the change we're talking about was a cultural change and cultural change is slow. It doesn't surprise people when it takes a generation for scientific consensus to shift, so why should baseball be any different?

One specific lower-level reason “obviously” non-optimal decisions can persist for so long is that there’s a lot of noise in team results. You sometimes see a manager make some radical decisions (not necessarily statistics-driven), followed by some poor results, causing management to fire the manager. There’s so much volatility that you can’t really judge players or managers based on small samples, but this doesn’t stop people from doing so. The combination of volatility and skepticism of radical ideas heavily disincentivizes going against conventional wisdom.

Among the many consequences of this noise is the fact that the winner of the "world series" (the baseball championship) is heavily determined by randomness. Whether or not a team makes the playoffs is determined over 162 games, which isn't enough to remove all randomness, but is enough that the result isn't mostly determined by randomness. This isn't true of the playoffs, which are too short for the outcome to be primarily determined by the difference in the quality of teams. Once a team wins the world series, people come up with all kinds of just-so stories to justify why the team should've won, but if we look across all games, we can see that the stories are just stories. This is, perhaps, not so different from listening to people tell you why their startup was successful.

There are metrics we can use that are better predictors of future wins and losses (i.e., are less volatile than wins and losses), but, until recently, convincing people that those metrics were meaningful was also a radical idea.

If we think about the general case, what’s happening is that decisions have probabilistic payoffs. There’s very high variance in actual outcomes (wins and losses), so it’s possible to make good decisions and not see the direct effect of them for a long time. Even if there are metrics that give us a better idea of what the “true” value of a decision is, if you’re operating in an environment where your management doesn’t believe in those metrics, you’re going to have a hard time keeping your job (or getting a job in the first place) if you want to do something radical whose value is only demonstrated by some obscure-sounding metric unless they take a chance on you for a year or two. There have been some major phase changes in what metrics are accepted, but they’ve taken decades.

If we look at business or engineering decisions, the situation is much messier. If we look at product or infrastructure success as a “win”, there seems to be much more noise in whether or not a team gets a “win”. Moreover, unlike in baseball, the sort of play-by-play or even game data that would let someone analyze “wins” and “losses” to determine the underlying cause isn’t recorded, so it’s impossible to determine the true value of decisions. And even if the data were available, there are so many more factors that determine whether or not something is a “win” that it’s not clear if we’d be able to determine the expected value of decisions even if we had the data.

We’ve seen that in a field where one can sit down and determine the expected value of decisions, it can take decades for this kind of analysis to influence some important decisions. If we look at fields where it’s more difficult to determine the true value of decisions, how long should we expect it to take for “good” decision making to surface? It seems like it would be a while, perhaps forever, unless there’s something about the structure of baseball and other sports that makes it particularly difficult to remove a poor decision maker and insert a better decision maker.

One might argue that baseball is different because there are a fixed number of teams and it’s quite unusual for a new team to enter the market, but if you look at things like public clouds, operating systems, search engines, car manufacturers, etc., the situation doesn’t look that different. If anything, it appears to be much cheaper to take over a baseball team and replace management (you sometimes see baseball teams sell for roughly a billion dollars) and there are more baseball teams than there are competitive products in the markets we just discussed, at least in the U.S. One might also argue that, if you look at the structure of baseball teams, it’s clear that positions are typically not handed out based on decision-making merit and that other factors tend to dominate, but this doesn’t seem obviously more true in baseball than in engineering fields.

This isn’t to say that we expect obviously bad decisions everywhere. You might get that idea if you hung out on baseball stats nerd forums before Moneyball was published (and for quite some time after), but if you looked at formula 1 (F1) around the same time, you’d see teams employing PhDs who are experts in economics and game theory to make sure they were making reasonable decisions. This doesn’t mean that F1 teams always make perfect decisions, but they at least avoided making decisions that interested amateurs could identify as inefficient for decades. There are some fields where competition is cutthroat and you have to do rigorous analysis to survive and there are some fields where competition is more sedate. In living memory, there was a time when training for sports was considered ungentlemanly and someone who trained with anything resembling modern training techniques would’ve had a huge advantage. Over the past decade or so, we’re seeing the same kind of shift but for statistical techniques in baseball instead of training in various sports.

If we want to look at the quality of decision making, it's too simplistic to say that we expect a firm to make good decisions because they're exposed to markets and there's economic value in making good decisions and people within the firm will probably be rewarded greatly if they make good decisions. You can't even tell if this is happening by asking people if they're making rigorous, data-driven, decisions. If you'd asked people in baseball whether they were using data in their decisions, they would've said yes throughout the 70s and 80s. Baseball has long been known as a sport where people track all kinds of numbers and then use those numbers. It's just that people didn't backtest their predictions, let alone backtest their predictions with holdouts.

The paradigm shift of using data effectively to drive decisions has been hitting different fields at different rates over the past few decades, both inside and outside of sports. Why this change happened in F1 before it happened in baseball is due to a combination of the difference in incentive structure in F1 teams vs. baseball teams and the difference in institutional culture. We may take a look at this in a future post, but this turns out to be a fairly complicated issue that requires a lot more background. We’ll try to explore the necessary background in future posts.

Appendix: non-idealities in our baseball analysis

In order to make this a short blog post and not a book, there are a lot of simplifications in the approximation we discussed. One major simplification is the idea that all runs are equivalent. This is close enough to true to be a decent approximation. But there are situations where the approximation isn't very good, such as when it's the 9th inning and the game is tied. In that case, a decision that increases the probability of scoring 1 run but decreases the probability of scoring multiple runs is actually the right choice.

This is often given as a justification for a relatively late-game sac bunt. But if we look at the probability of a successful sac bunt, we see that it goes down in later innings. We didn't talk about how the defense is set up, but defenses can set up in ways that reduce the probability of a successful sac bunt but increase the probability of success of non-bunts and vice versa. Before the last inning, this actually makes the sac bunt worse late in the game, not better! If we take all of that into account in the last inning of a tie game, the probability that a sac bunt is a good idea then depends on something else we haven't discussed, the batter at the plate.

In our simplified model, we computed the expected value in runs across all batters. But at any given time, a particular player is batting. A successful sac bunt advances runners and increases the number of outs by one. The alternative is to let the batter “swing away”, which will result in some random outcome. The better the batter, the higher the probability of an outcome that’s better than the outcome of a sac bunt. To determine the optimal decision, we not only need to know how good the current batter is but how good the subsequent batters are. One common justification for the sac bunt is that pitchers are terrible hitters and they’re not bad at sac bunting because they have so much practice doing it (because they’re terrible hitters), but it turns out that pitchers are also below average sac bunters and that the argument that we should expect pitchers to sac because they’re bad hitters doesn’t hold up if we look at the data in detail.

Another reason to sac bunt (or bunt in general) is that the tendency to sometimes do this induces changes in defense which make non-bunt plays work better.

A full computation should also take into account the number of balls and strikes a current batter has, which is a piece of state we haven’t discussed at all as well as the speed of the batter and the players on base as well as the particular stadium the game is being played in and the opposing pitcher as well as the quality of their defense. All of this can be done, even on a laptop -- this is all “small data” as far as computers are concerned, but walking through the analysis even for one particular decision would be substantially longer than everything in this post combined including this disclaimer. It’s perhaps a little surprising that taking all of these non-idealities into account doesn’t overturn the general result, but it turns out that it doesn’t (it finds that there are many situations in which sac bunts have positive expected value, but that sac bunts were still heavily overused for decades).

There’s a similar situation for intentional walks, where the non-idealities in our analysis appear to support issuing intentional walks. In particular, the two main conventional justifications for an intentional walk are

  1. By walking the current batter, we can set up a “force” or a “double play” (increase the probability of getting one out or two outs in one play). If the game is tied in the last inning, putting another player on base has little downside and has the upside of increasing the probability of allowing zero runs and continuing the tie.
  2. By walking the current batter, we can get to the next, worse batter.

An example situation where people apply the justification in (1) is in the {1B, 3B; 2 out} state. The team that’s on defense will lose if the player at 3B advances one base. The reasoning goes, walking a player and changing the state to {1B, 2B, 3B; 2 out} won’t increase the probability that the player at 3B will score and end the game if the current batter “puts the ball into play”, and putting another player on base increases the probability that the defense will be able to get an out.

The hole in this reasoning is that the batter won't necessarily put the ball into play. After the state is {1B, 2B, 3B; 2 out}, the pitcher may issue an unintentional walk, causing each runner to advance and losing the game. It turns out that being in this state doesn't affect the probability of an unintentional walk very much. The pitcher tries very hard to avoid a walk but, at the same time, the batter tries very hard to induce a walk!

On (2), the two situations where the justification tend to be applied are when the current player at bat is good or great, or the current player is batting just before the pitcher. Let’s look at these two separately.

Barry Bonds’s seasons from 2001, 2002, and 2004 were some of the statistically best seasons of all time and are as extreme a case as one can find in modern baseball. If we run our same analysis and account for the quality of the players batting after Bonds, we find that it’s sometimes the correct decision for the opposing team to intentionally walk Bonds, but it was still the case that most situations do not warrant an intentional walk and that Bonds was often intentionally walked in a situation that didn’t warrant an intentional walk. In the case of a batter who is not having one of the statistically best seasons on record in modern baseball, intentional walks are even less good.

In the case of the pitcher batting, doing the same kind of analysis as above also reveals that there are situations where an intentional walk is appropriate (not-late game, {1B, 2B; 2 out}, when the pitcher is not a significantly above average batter for a pitcher). Even though it's not always the wrong decision to issue an intentional walk, the intentional walk is still grossly overused.

One might argue that the fact that our simple analysis has all of these non-idealities that could have invalidated the analysis is a sign that decision making in baseball wasn't so bad after all, but I don't think that holds. A first-order approximation that someone could do in an hour or two finds that decision making seems quite bad, on average. If a team was interested in looking at data, that ought to lead them into doing a more detailed analysis that takes into account the conventional-wisdom based critiques of the obvious one-hour analysis. It appears that this wasn't done, at least not for decades.

The problem is that before people started running the data, all we had to go by were stories. Someone would say "with 2 outs, you should walk the batter before the pitcher [in some situations] to get to the pitcher and get the guaranteed out". Someone else might respond "we obviously shouldn't do that late game because the pitcher will get subbed out for a pinch hitter and early game, we shouldn't do it because even if it works and we get the easy out, it sets the other team up to lead off the next inning with their #1 hitter instead of an easy out". Which of these stories is the right story turns out to be an empirical question. The thing that I find most unfortunate is that, after people started running the numbers and the argument became one of stories vs. data, people persisted in sticking with the story-based argument for decades. We see the same thing in business and engineering, but it's arguably more excusable there because decisions in those areas tend to be harder to quantify. Even if you can reduce something to a simple engineering equation, someone can always argue that the engineering decision isn't what really matters and this other business concern that's hard to quantify is the most important thing.

Appendix: possession

Something I find interesting is that statistical analysis in football, baseball, and basketball has found that teams have overwhelmingly undervalued possessions for decades. Baseball doesn't have the concept of possession per se, but if you look at being on offense as "having possession" and getting 3 outs as "losing possession", it's quite similar.

In football, we see that maintaining possession is such a big deal that it is usually an error to punt on 4th down, but this hasn't stopped teams from punting by default basically forever. And in basketball, players who shoot a lot with a low shooting percentage were (and arguably still are) overrated.

I don't think this is fundamental -- that possessions are as valuable as they are comes out of the rules of each game. It's arbitrary. I still find it interesting, though.

Appendix: other analysis of management decisions

Bloom et al., Does management matter? Evidence from India looks at the impact of management interventions on productivity.

Other work by Bloom.

DellaVigna et al., Uniform pricing in US retail chains allegedly finds a significant amount of money left on the table by retail chains (seven percent of profits) and explores why that might happen and what the impacts are.

The upside of work like this vs. sports work is that it attempts to quantify the impact of things outside of a contrived game. The downside is that the studies are on things that are quite messy and it's hard to tell what the study actually means. Just for example, if you look at studies on innovation, economists often use patents as a proxy for innovation and then come to some conclusion based on some variable vs. number of patents. But if you're familiar with engineering patents, you'll know that number of patents is an incredibly poor proxy for innovation. In the hardware world, IBM is known for cranking out a very large number of useless patents (both in the sense of useless for innovation and also in the narrow sense of being useless as a counter-attack in patent lawsuits) and there are some companies that get much more mileage out of filing many fewer patents.

AFAICT, our options here are to know a lot about decisions in a context that's arguably completely irrelevant, or to have ambiguous information and probably know very little about a context that seems relevant to the real world. I'd love to hear about more studies in either camp (or even better, studies that don't have either problem).

Thanks to Leah Hanson, David Turner, Milosz Dan, Andrew Nichols, Justin Blank, @hoverbikes, Kate Murphy, Ben Kuhn, Patrick Collison, and an anonymous commenter for comments/corrections/discussion.

Tue, 21 Nov 2017 00:00:00 +0000


How out of date are Android devices?

It's common knowledge that Android devices tend to be more out of date than iOS devices, but what does this actually mean? Let’s look at android marketshare data to see how old devices in the wild are. The x axis of the plot below is date, and the y axis is Android marketshare. The share of all devices sums to 100% (with some artifacts because the public data Google provides is low precision).

Color indicates age:

If we look at the graph, we see a number of reverse-S shaped contours; between each pair of contours, devices get older as we go from left to right. Each contour corresponds to the release of a new android version and the associated devices running that android version. As time passes, devices on that version get older. When a device is upgraded, it’s effectively moved from one contour to a new contour and the color changes to a less outdated color.

Marketshare of outdated android devices is increasing

There are three major ways in which this graph understates the number of outdated devices:

First, we’re using API version data for this and don’t have access to the marketshare of point releases and minor updates, so we assume that all devices on the same API version are up to date until the moment a new API version is released, but many (and perhaps most) devices won’t receive updates within an API version.

Second, this graph shows marketshare, but the number of Android devices has dramatically increased over time. For example, if we look at the 80%-ile most outdated devices (i.e., draw a line 20% up from the bottom), the 80%-ile device today is a few months more outdated than it was in 2014. The huge growth of Android means that there are many, many more outdated devices now than there were in 2014.

Third, this data comes from scraping Google Play Store marketshare info. That data shows marketshare of devices that have visited the Play Store in the last 7 days. In general, it seems reasonable to believe that devices that visit the Play Store are more up to date than devices that don’t, so we should expect an unknown amount of bias in this data that causes the graph to show that devices are newer than they actually are. This seems plausible both for devices that are used as conventional mobile devices as well as for mobile devices that have replaced things like traditionally embedded devices, PoS boxes, etc.

If we're looking at this from a security standpoint, some devices will receive security updates without updating their major version, which would make the data look more outdated than devices actually are. However, when researchers have used more fine-grained data to see which devices are taking updates, they found that this was not a large effect.

One thing we can see from that graph is that, as time goes on, the world accumulates a larger fraction of old devices over time. This makes sense and we could have figured this out without looking at the data. After all, back at the beginning of 2010, Android phones couldn’t be much more than a year old, and now it’s possible to have Android devices that are nearly a decade old.

Something that wouldn’t have been obvious without looking at the data is that the uptake of new versions seems to be slowing down -- we can see this by looking at the last few contour lines at the top right of the graph, corresponding to the most recent Android releases. These lines have a shallower slope than the contour lines for previous releases. Unfortunately, with this data alone, we can’t tell why the slope is shallower. Some possible reasons might be:

Without more data, it’s impossible to tell how much each of these is contributing to the problem. BTW, let me know if you know of a reasonable source for the active number of Android devices going back to 2010! I’d love to produce a companion graph of the total number of outdated devices.

But even with the data we have, we can take a guess at how many outdated devices are in use. In May 2017, Google announced that there are over two billion active Android devices. If we look at the latest stats (the far right edge), we can see that nearly half of these devices are two years out of date. At this point, we should expect that there are more than one billion devices that are two years out of date! Given Android's update model, we should expect approximately 0% of those devices to ever get updated to a modern version of Android.

Percentiles

Since there’s a lot going on in the graph, we might be able to see something if we look at some subparts of the graph. If we look at a single horizontal line across the graph, that corresponds to the device age at a certain percentile:

Over time, the Nth percentile out of date device is getting more out of date

In this graph, the date is on the x axis and the age in months is on the y axis. Each line corresponds to a different percentile (higher percentile is older), which corresponds to a horizontal slice of the top graph at that percentile.

Each individual line seems to have two large phases (with some other stuff, too). There’s one phase where devices for that percentile get older as quickly as time is passing, followed by a phase where, on average, devices only get slightly older. In the second phase, devices sometimes get younger as new releases push younger versions into a certain percentile, but this doesn’t happen often enough to counteract the general aging of devices. Taken as a whole, this graph indicates that, if current trends continue, we should expect to see proportionally more old Android devices as time goes on, which is exactly what we’d expect from the first, busier, graph.

Dates

Another way to look at the graph is to look at a vertical slice instead of a horizontal slice. In that case, each slice corresponds to looking at the ages of devices at one particular date:

In this plot, the x axis indicates the age percentile and the y axis indicates the raw age in months. Each line is one particular date, with older dates being lighter / yellower and newer dates being darker / greener.

As with the other views of the same data, we can see that Android devices appear to be getting more out of date as time goes on. This graph would be too busy to read if we plotted data for all of the dates that are available, but we can see it as an animation:

iOS

For reference, iOS 11 was released two months ago and it now has just under 50% iOS marketshare despite November’s numbers coming before the release of the iPhone X (this is compared to < 1% marketshare for the latest Android version, which was released in August). It’s overwhelmingly likely that, by the start of next year, iOS 11 will have more than 50% marketshare and there’s an outside chance that it will have 75% marketshare, i.e., it’s likely that the corresponding plot for iOS would have the 50%-ile (red) line in the second plot at age = 0 and it’s not implausible that the 75%-ile (orange) line would sometimes dip down to 0. As is the case with Android, there are some older devices that stubbornly refuse to update; iOS 9.3, released a bit over two years ago, sits at just a bit above 5% marketshare. This means that, in the iOS version of the plot, it’s plausible that we’d see the corresponding 99%-ile (green) line in the second plot at a bit over two years (half of what we see for the Android plot).

Windows XP

People sometimes compare Android to Windows XP because there are a large number of both in the wild and, in both cases, most devices will not get security updates. However, this is tremendously unfair to Windows XP, which was released on 10/2001 and got security updates until 4/2014, twelve and a half years later. Additionally, Microsoft has released at least one security update after the official support period (there was an update in 5/2017 in response to the WannaCry ransomware). It's unfortunate that Microsoft decided to end support for XP while there are still so many XP boxes in the wild, but supporting an old OS for over twelve years and then issuing an emergency security patch more than fifteen years after release puts Microsoft into a completely different league than Google and Apple when it comes to device support.

Another difference between Android and Windows is that Android's scale is unprecedented in the desktop world. There were roughly 200 million PCs sold in 2017. Samsung alone has been selling that many mobile devices per year since 2008. Of course, those weren't Android devices in 2008, but Android's dominance in the non-iOS mobile space means that, overall, those have mostly been Android devices. Today, we still see nearly 50 year old PDP-11 devices in use. There are few enough PDPs around that running into one is a cute, quaint surprise (0.6 million PDP-11s were sold). Desktop boxes age out of service more quickly than PDPs, and mobile devices age out of service even more quickly, but the sheer difference in the number of devices caused by the ubiquity of modern computing means that we're going to see many more XP-era PCs in use 50 years after the release of XP, and it's plausible we'll see even more mobile devices around 50 years from now. Many of these ancient PDP, VAX, DOS, etc. boxes are basically safe because they're run in non-networked configurations, but it looks like the same thing is not going to be true for many of the old XP and Android boxes that are going to stay in service for decades.

Conclusion

We’ve seen that Android devices appear to be getting more out of date over time. This makes it difficult for developers to target “new” Android API features, where new means anything introduced in the past few years. It also means that there are a lot of Android devices out there that are behind in terms of security. This is true both in absolute terms and also relative to iOS.

Until recently, Android was directly tied to the hardware it ran on, making it very painful to keep old devices up to date because that required a custom Android build with phone-specific (or at least SoC-specific) work. Google claims that this problem is fixed in the latest Android version (8.0, Oreo). People who remember Google's "Android update alliance" announcement in 2011 may be a bit skeptical of the more recent announcement. In 2011, Google and U.S. carriers announced that they'd keep devices up to date for 18 months, which mostly didn't happen. However, even if the current announcement isn't smoke and mirrors and the latest version of Android solves the update problem, we've seen that it takes years for Android releases to get adopted and we've also seen that the last few Android releases have had significantly slower uptake than previous releases. Additionally, even though this is supposed to make updates easier, it looks like Android is still likely to stay behind iOS in terms of updates for a while. Google has promised that its latest phone (Pixel 2, 10/2017) will get updates for three years. That seems like a step in the right direction, but as we’ve seen from the graphs above, extending support by a year isn’t nearly enough to keep most Android devices up to date. By contrast, if you have an iPhone, the latest version of iOS (released 9/2017) works on devices back to the iPhone 5S (released 9/2013).

If we look at the newest Android release (8.0, 8/2017), it looks like you’re quite lucky if you have a two year old device that will get the latest update. The oldest “Google” phone supported is the Nexus 6P (9/2015), giving it just under two years of support.

If you look back at devices that were released around the same time as the iPhone 5S, the situation looks even worse. Back then, I got a free Moto X for working at Google; the Moto X was about as close to an official Google phone as you could get at the time (this was back when Google owned Moto). The Moto X was released on 8/2013 (a month before the iPhone 5S) and the latest version of Android it supports is 5.1, which was released on 2/2015, a little more than a year and a half later. For an Android phone of its era, the Moto X was supported for an unusually long time. It's a good sign that things look worse as you look further back in time, but at the rate things are improving, it will be years before there's a decently supported Android device released and then years beyond those years before that Android version is in widespread use. It's possible that Fuchsia will fix this, but Fuchsia is also many years away from widespread use.

In a future post, we'll look at Android response latency, which is also quite interesting. It’s much more variable between phones than iOS response latency is between different models of iPhone.

The main thing I’m missing from my analysis of phone latency is older phones. If you have an old phone I haven’t tested and want to donate it for testing, you can mail it to:

Dan Luu
Recurse Center
455 Broadway, 2nd Floor
New York, NY 10013

Thanks to Leah Hanson, Kate Murphy, Daniel Thomas, Marek Majkowski, @zofrex, @Aissn, Chris Palmer, JonLuca De Caro, and an anonymous person for comments/corrections/related discussion.

Also, thanks to Victorien Villard for making the data these graphs were based on available!

Sun, 12 Nov 2017 00:00:00 +0000


UI backwards compatibility

About once a month, an app that I regularly use will change its UI in a way that breaks muscle memory, basically tricking the user into doing things they don’t want.

Zulip

In recent memory, Zulip (a slack competitor) changed its newline behavior so that ctrl + enter sends a message instead of inserting a new line. After this change, I sent a number of half-baked messages and it seemed like some other people did too.

Around the time they made that change, they made another change such that a series of clicks that would cause you to send a private message to someone would instead cause you to send a private message to the alphabetically first person who was online. Most people didn’t notice that this was a change, but when I mentioned that this had happened to me a few times in the past couple weeks, multiple people immediately said that the exact same thing happened to them. Some people also mentioned that the behavior of navigation shortcut keys was changed in a way that could cause people to broadcast a message instead of sending a private message. In both cases, some people blamed themselves and didn’t know why they’d just started making mistakes that caused them to send messages to the wrong place.

Doors

A while back, I was at Black Seed Bagel, which has a door that looks 75% like a “push” door from both sides when it’s actually a push door from the outside and a pull door from the inside. An additional clue that makes it seem even more like a "push" door from the inside is that most businesses have outward opening doors (this is required for exit doors in the U.S. when the room occupancy is above 50, and many businesses in smaller spaces voluntarily follow the same convention). During the course of an hour long conversation, I saw a lot of people go in and out, and my guess is that ten people failed on their first attempt to use the door while exiting. When people were travelling in pairs or groups, the person in front would often say something like “I’m dumb. We just used this door a minute ago”. But the people were not, in fact, acting dumb. If anything is dumb, it’s designing doors such that users have to memorize which doors act like “normal” doors and which doors have their cues reversed.

If you’re interested in the physical world, The Design of Everyday Things gives many real-world examples where users are subtly nudged into doing the wrong thing. It also discusses general principles in a way that lets you see the underlying ideas and avoid the same issues when designing software.

Facebook

Last week, FB changed its interface so that my normal sequence of clicks to hide a story saves the story instead of hiding it. Saving is pretty much the opposite of hiding! It’s the opposite both from the perspective of the user and also as a ranking signal to the feed ranker. The really “great” thing about a change like this is that it A/B tests incredibly well if you measure new feature “engagement” by number of clicks because many users will accidentally save a story when they meant to hide it. Earlier this year, twitter did something similar by swapping the location of “moments” and “notifications”.

Even if the people making the change didn’t create the tricky interface in order to juice their engagement numbers, this kind of change is still problematic because it poisons analytics data. While it’s technically possible to build a model to separate out accidental clicks vs. purposeful clicks, that’s quite rare (I don’t know of any A/B tests where people have done that) and even in cases where it’s clear that users are going to accidentally trigger an action, I still see devs and PMs justify a feature because of how great it looks on naive statistics like DAU/MAU.

API backwards compatibility

When it comes to software APIs, there’s a school of thought that says that you should never break backwards compatibility for some classes of widely used software. A well-known example is Linus Torvalds:

People should basically always feel like they can update their kernel and simply not have to worry about it.

I refuse to introduce "you can only update the kernel if you also update that other program" kind of limitations. If the kernel used to work for you, the rule is that it continues to work for you. … I have seen, and can point to, lots of projects that go "We need to break that use case in order to make progress" or "you relied on undocumented behavior, it sucks to be you" or "there's a better way to do what you want to do, and you have to change to that new better way", and I simply don't think that's acceptable outside of very early alpha releases that have experimental users that know what they signed up for. The kernel hasn't been in that situation for the last two decades. ... We do API breakage inside the kernel all the time. We will fix internal problems by saying "you now need to do XYZ", but then it's about internal kernel API's, and the people who do that then also obviously have to fix up all the in-kernel users of that API. Nobody can say "I now broke the API you used, and now you need to fix it up". Whoever broke something gets to fix it too. ... And we simply do not break user space.

Raymond Chen quoting Colen:

Look at the scenario from the customer’s standpoint. You bought programs X, Y and Z. You then upgraded to Windows XP. Your computer now crashes randomly, and program Z doesn’t work at all. You’re going to tell your friends, "Don’t upgrade to Windows XP. It crashes randomly, and it’s not compatible with program Z." Are you going to debug your system to determine that program X is causing the crashes, and that program Z doesn’t work because it is using undocumented window messages? Of course not. You’re going to return the Windows XP box for a refund. (You bought programs X, Y, and Z some months ago. The 30-day return policy no longer applies to them. The only thing you can return is Windows XP.)

While this school of thought is a minority, it’s a vocal minority with a lot of influence. It’s much rarer to hear this kind of case made for UI backwards compatibility. You might argue that this is fine -- people are forced to upgrade nowadays, so it doesn’t matter if stuff breaks. But even if users can’t escape, it’s still a bad user experience.

The counterargument to this school of thought is that maintaining compatibility creates technical debt. It’s true! Just for example, Linux is full of slightly to moderately wonky APIs due to the “do not break user space” dictum. One example is int recvmmsg(int sockfd, struct mmsghdr *msgvec, unsigned int vlen, unsigned int flags, struct timespec *timeout); . You might expect the timeout to fire if you don’t receive a packet, but the manpage reads:

The timeout argument points to a struct timespec (see clock_gettime(2)) defining a timeout (seconds plus nanoseconds) for the receive operation (but see BUGS!).

The BUGS section reads:

The timeout argument does not work as intended. The timeout is checked only after the receipt of each datagram, so that if up to vlen-1 datagrams are received before the timeout expires, but then no further datagrams are received, the call will block forever.

This is arguably not even the worst mis-feature of recvmmsg, which returns an ssize_t into a field of size int.
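
To make the pitfall concrete, here's a minimal sketch of a recvmmsg call (not taken from the kernel or from any project discussed here; the socket setup is assumed to happen elsewhere) that looks like it has a one-second timeout but can block forever if no datagrams arrive:

/* Hedged sketch: a recvmmsg call with what looks like a 1-second timeout.
   Socket creation/binding is assumed to happen elsewhere. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <time.h>

int drain_packets(int sockfd) {
    struct mmsghdr msgs[8];
    struct iovec iovs[8];
    static char bufs[8][1500];
    memset(msgs, 0, sizeof msgs);
    for (int i = 0; i < 8; i++) {
        iovs[i].iov_base = bufs[i];
        iovs[i].iov_len = sizeof bufs[i];
        msgs[i].msg_hdr.msg_iov = &iovs[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    struct timespec timeout = { .tv_sec = 1, .tv_nsec = 0 };
    /* The timeout is only checked after a datagram is received, so if the
       socket gets no traffic at all, this call blocks forever. */
    int n = recvmmsg(sockfd, msgs, 8, 0, &timeout);
    if (n == -1)
        perror("recvmmsg");
    return n;
}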

If you have a policy like “we simply do not break user space”, this sort of technical debt sticks around forever. But it seems to me that it’s not a coincidence that the most widely used desktop, laptop, and server operating systems in the world bend over backwards to maintain backwards compatibility.

The case for UI backwards compatibility is arguably stronger than the case for API backwards compatibility because breaking API changes can be mechanically fixed and, with the proper environment, all callers can be fixed at the same time as the API changes. There's no equivalent way to reach into people's brains and change user habits, so a breaking UI change inevitably results in pain for some users.

The case for UI backwards compatibility is arguably weaker than the case for API backwards compatibility because API backwards compatibility has a lower cost -- if some API is problematic, you can make a new API and then document the old API as something that shouldn’t be used (you’ll see lots of these if you look at Linux syscalls). This doesn’t really work with GUIs since UI elements compete with each other for a small amount of screen real-estate. An argument that I think is underrated is that changing UIs isn’t as great as most companies seem to think -- very dated looking UIs that haven’t been refreshed to keep up with trends can be successful (e.g., plentyoffish and craigslist). Companies can even become wildly successful without any significant UI updates, let alone UI redesigns -- a large fraction of linkedin’s rocketship growth happened in a period where the UI was basically frozen. I’m told that freezing the UI wasn’t a deliberate design decision; instead, it was a side effect of severe technical debt, and that the UI was unfrozen the moment a re-write allowed people to confidently change the UI. Linkedin has managed to add a lot of dark patterns since they unfroze their front-end, but the previous UI seemed to work just fine in terms of growth.

Despite the success of a number of UIs which aren’t always updated to track the latest trends, at most companies, it’s basically impossible to make the case that UIs shouldn’t be arbitrarily changed without adding functionality, let alone make the case that UIs shouldn’t push out old functionality with new functionality.

UI deprecation

A case that might be easier to make is that shortcuts and shortcut-like UI elements can be deprecated before removal, similar to the way evolving APIs will add deprecation warnings before making breaking changes. Instead of regularly changing UIs so that users’ muscle memory is used against them and causes users to do the opposite of what they want, UIs can be changed so that doing the previously trained set of actions causes nothing to happen. For example, FB could have moved “hide post” down and inserted a no-op item in the old location, and then after people had gotten used to not clicking in the old “hide post” location for “hide post”, they could have then put “save post” in the old location for “hide post”.

Zulip could’ve done something similar and caused the series of actions that used to let you send a private message to the person you want cause no message to be sent instead of sending a private message to the alphabetically first person on the online list.

These solutions aren’t ideal because the user still has to retrain their muscle memory on the new thing, but it’s still a lot better than the current situation, where many UIs regularly introduce arbitrary-seeming changes that sow confusion and chaos.

In some cases (e.g., the no-op menu item), this presents a pretty strange interface to new users. Users don’t expect to see a menu item that does nothing with an arrow that says to click elsewhere on the menu instead. This can be fixed by only rolling out deprecation “warnings” to users who regularly use the old shortcut or shortcut-like path. If there are multiple changes being deprecated, this results in a combinatorial explosion of possibilities, but if you're regularly deprecating multiple independent items, that's pretty extreme and users are probably going to be confused regardless of how it's handled. Given how much effort it takes to avoid user-hostile changes and the dominance of the “move fast and break things” mindset, the case for adding this kind of complexity just to avoid giving users a bad experience probably won’t hold at most companies, but this at least seems plausible in principle.

Breaking existing user workflows arguably doesn’t matter for an app like FB, which is relatively sticky as a result of its dominance in its area, but most applications are more like Zulip than FB. Back when Zulip and Slack were both young, Zulip messages couldn’t be edited or deleted. This was on purpose -- messages were immutable and everyone I know who suggested allowing edits was shot down because mutable messages didn’t fit into the immutable model. Back then, if there was a UI change or bug that caused users to accidentally send a public message instead of a private message, that was basically permanent. I saw people accidentally send public messages often enough that I got into the habit of moving private message conversations to another medium. That didn’t bother me too much since I’m used to quirky software, but I know people who tried Zulip back then and, to this day, still refuse to use Zulip due to UI issues they hit back then. That’s a bit of an extreme case, but the general idea that users will tend to avoid apps that repeatedly cause them pain isn’t much of a stretch.

In studies on user retention, it appears to be the case that an additional 500ms of page-load latency negatively impacts retention. If that's the case, it seems like switching the UI around so that the user has to spend 5s undoing an action, or broadcasts a private message publicly in a way that can't be undone, should have a noticeable impact on retention, although I don't know of any public studies that look at this.

Conclusion

If I worked on UI, I might have some suggestions or a call to action. But as an outsider, I’m wary of making actual suggestions -- programmers seem especially prone to coming into an area they’re not familiar with and telling experts how they should solve their problems. While this occasionally works, the most likely outcome is that the outsider either re-invents something that’s been known for decades or completely misses the most important parts of the problem.

It sure would be nice if shortcuts didn’t break so often that I spend as much time consciously stopping myself from using shortcuts as I do actually using the app. But there are probably reasons this is difficult to test/enforce. The huge number of platforms that need to be covered for robust UI testing makes testing hard even without adding this extra kind of test. And, even when we’re talking about functional correctness problems, “move fast and break things” is much trendier than “try to break relatively few things”. Since UI “correctness” often has even lower priority than functional correctness, it’s not clear how someone could successfully make a case for spending more effort on it.

On the other hand, despite all these disclaimers, Google sometimes does the exact things described in this post. Chrome recently removed backspace to go backwards; if you hit backspace, you get a note telling you to use alt+left instead. When maps moved some items around a while back, they put in no-op placeholders that pointed people to the new location. This doesn't mean that Google always does this well -- on April fools' day of 2016, gmail replaced send and archive with send and attach a gif that's offensive in some contexts -- but these examples indicate that maintaining backwards compatibility through significant changes isn't just a hypothetical idea; it can and has been done.

Thanks to Leah Hanson, Allie Jones, Randall Koutnik, Kevin Lynagh, David Turner, Christian Ternus, Ted Unangst, Michael Bryc, Tony Finch, Stephen Tigner, Steven McCarthy, Julia Evans, @BaudDev, and an anonymous person who has a moral objection to public acknowledgements for comments/corrections/discussion.

If you're curious why "anon" is against acknowledgements, it's because they first saw these in Paul Graham's writing, whose acknowledgements are sort of a who's who of SV. anon's belief is that these sorts of list serve as a kind of signalling. I won't claim that's wrong, but I get a lot of help with my writing both from people reading drafts and also from the occasional helpful public internet comment and I think it's important to make it clear that this isn't a one-person effort to combat what Bunnie Huang calls "the idol effect".

In a future post, we'll look at empirical work on how line length affects readability. I've read every study I could find, but I might be missing some. If you know of a good study you think I should include, please let me know.

Thu, 09 Nov 2017 00:00:00 +0000


Filesystem error handling

We’re going to reproduce some results from papers on filesystem robustness that were written up roughly a decade ago: the Prabhakaran et al. SOSP 05 paper, which injected errors below the filesystem, and the Gunawi et al. FAST 08 paper, which looked at how often filesystems failed to check return codes of functions that can return errors.

Prabhakaran et al. injected errors at the block device level (just underneath the filesystem) and found that ext3, reiserfs, ntfs, and jfs mostly handled read errors reasonably but ext3, ntfs, and jfs mostly ignored write errors. While the paper is interesting, someone installing Linux on a system today is much more likely to use ext4 than any of the now-dated filesystems tested by Prabhakaran et al. We’ll try to reproduce some of the basic results from the paper on more modern filesystems like ext4 and btrfs, some legacy filesystems like exfat, ext3, and jfs, as well as on overlayfs.

Gunawi et al. found that errors weren’t checked most of the time. After we look at error injection on modern filesystems, we’ll look at how much (or little) filesystems have improved their error handling code.

Error injection

A cartoon view of a file read might be: pread syscall -> OS generic filesystem code -> filesystem specific code -> block device code -> device driver -> device controller -> disk. Once the disk gets the request, it sends the data back up: disk -> device controller -> device driver -> block device code -> filesystem specific code -> OS generic filesystem code -> pread. We’re going to look at error injection at the block device level, right below the file system.

Let’s look at what happened when we injected errors in 2017 vs. what Prabhakaran et al. found in 2005.

                2005                    2017, file              2017, mmap
            read   write  silent    read   write  silent    read   write  silent
btrfs       -      -      -         prop   prop   prop      prop   prop   prop
exfat       -      -      -         prop   prop   ignore    prop   prop   ignore
ext3        prop   ignore ignore    prop   prop   ignore    prop   prop   ignore
ext4        -      -      -         prop   prop   ignore    prop   prop   ignore
fat         -      -      -         prop   prop   ignore    prop   prop   ignore
jfs         prop   ignore ignore    prop   ignore ignore    prop   prop   ignore
reiserfs    prop   prop   ignore    -      -      -         -      -      -
xfs         -      -      -         prop   prop   ignore    prop   prop   ignore

Each row shows results for one filesystem; the left group of columns shows results from the 2005 paper and the right two groups show our 2017 results for file and mmap operations. read and write indicate reading and writing data, respectively, where the block device returns an error indicating that the operation failed. silent indicates a read failure (incorrect data) where the block device didn’t indicate an error. This could happen if there’s disk corruption, a transient read failure, or a transient write failure that silently caused bad data to be written. file indicates that the operation was done on a file opened with open and mmap indicates that the test was done on a file mapped with mmap. ignore indicates that the error was ignored, prop indicates that the error was propagated and that the pread or pwrite syscall returned an error code, and fix indicates that the error was corrected. No errors were corrected in this table. Entries marked "-" indicate configurations that weren’t tested.

From the table, we can see that, in 2005, ext3 and jfs ignored write errors even when the block device indicated that the write failed and that things have improved, and that any filesystem you’re likely to use will correctly tell you that a write failed. jfs hasn’t improved, but jfs is now rarely used outside of legacy installations.

No tested filesystem other than btrfs handled silent failures correctly. The other filesystems tested neither duplicate nor checksum data, making it impossible for them to detect silent failures. zfs would probably also handle silent failures correctly but wasn’t tested. apfs, despite post-dating btrfs and zfs, made the explicit decision to not checksum data and silently fail on silent block device errors. We’ll discuss this more later.

In all cases tested where errors were propagated, file reads and writes returned EIO from pread or pwrite, respectively; mmap reads and writes caused the process to receive a SIGBUS signal.
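
As a rough illustration of those two failure modes from userspace (this is a sketch, not the actual test program from our repo; the path below is a placeholder), a file read sees an error code while an mmap'd read sees a signal:

/* Sketch: how a block-level error surfaces to userspace. A pread fails
   with EIO; a read through an mmap'd page delivers SIGBUS instead. */
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static void on_sigbus(int sig) {
    (void)sig;
    /* only async-signal-safe calls in a signal handler */
    write(2, "SIGBUS: I/O error on mapped page\n", 33);
    _exit(1);
}

int main(void) {
    signal(SIGBUS, on_sigbus);

    int fd = open("/mnt/fserror_test/test.txt", O_RDONLY); /* placeholder path */
    if (fd == -1) { perror("open"); return 1; }

    char buf[4096];
    if (pread(fd, buf, sizeof buf, 0) == -1 && errno == EIO)
        fprintf(stderr, "pread: block device error propagated as EIO\n");

    char *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }
    char c = p[0]; /* a bad block here is reported via SIGBUS, not a return code */
    (void)c;
    return 0;
}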

The 2017 tests above used an 8k file where the first block that contained file data either returned an error at the block device level or was corrupted, depending on the test. The table below tests the same thing, but with a 445 byte file instead of an 8k file. The choice of 445 was arbitrary.

                2005                    2017, file              2017, mmap
            read   write  silent    read   write  silent    read   write  silent
btrfs       -      -      -         fix    fix    fix       fix    fix    fix
exfat       -      -      -         prop   prop   ignore    prop   prop   ignore
ext3        prop   ignore ignore    prop   prop   ignore    prop   prop   ignore
ext4        -      -      -         prop   prop   ignore    prop   prop   ignore
fat         -      -      -         prop   prop   ignore    prop   prop   ignore
jfs         prop   ignore ignore    prop   ignore ignore    prop   prop   ignore
reiserfs    prop   prop   ignore    -      -      -         -      -      -
xfs         -      -      -         prop   prop   ignore    prop   prop   ignore

In the small file test table, all the results are the same, except for btrfs, which returns correct data in every case tested. What’s happening here is that the filesystem was created on a rotational disk and, by default, btrfs duplicates filesystem metadata on rotational disks (it can be configured to do so on SSDs, but that’s not the default). Since the file was tiny, btrfs packed the file into the metadata and the file was duplicated along with the metadata, allowing the filesystem to fix the error when one block either returned bad data or reported a failure.

Overlay

Overlayfs allows one file system to be “overlaid” on another. As explained in the initial commit, one use case might be to put an (upper) read-write directory tree on top of a (lower) read-only directory tree, where all modifications go to the upper, writable layer.

Although not listed in the tables, we also tested every filesystem other than fat as the lower filesystem with overlayfs (ext4 was the upper filesystem for all tests). Every filesystem tested showed the same results when used as the bottom layer in overlayfs as when used alone. fat wasn’t tested because mounting fat resulted in a filesystem not supported error.

Error correction

btrfs doesn’t, by default, duplicate metadata on SSDs because the developers believe that redundancy wouldn’t provide protection against errors on SSD (which is the same reason apfs doesn’t have redundancy). SSDs do a kind of write coalescing, which is likely to cause writes which happen consecutively to fall into the same block. If that block has a total failure, the redundant copies would all be lost, so redundancy doesn’t provide as much protection against failure as it would on a rotational drive.

I’m not sure that this means that redundancy wouldn’t help -- Individual flash cells degrade with operation and lose charge as they age. SSDs have built-in wear-leveling and error-correction that’s designed to reduce the probability that a block returns bad data, but over time, some blocks will develop so many errors that the error-correction won’t be able to fix the error and the block will return bad data. In that case, a read should return some bad bits along with mostly good bits. AFAICT, the publicly available data on SSD error rates seems to line up with this view.

Error detection

Relatedly, it appears that apfs doesn’t checksum data because “[apfs] engineers contend that Apple devices basically don’t return bogus data”. Publicly available studies on SSD reliability have not found that there’s a model that doesn’t sometimes return bad data. It’s a common conception that SSDs are less likely to return bad data than rotational disks, but when Google studied this across their drives, they found:

The annual replacement rates of hard disk drives have previously been reported to be 2-9% [19,20], which is high compared to the 4-10% of flash drives we see being replaced in a 4 year period. However, flash drives are less attractive when it comes to their error rates. More than 20% of flash drives develop uncorrectable errors in a four year period, 30-80% develop bad blocks and 2-7% of them develop bad chips. In comparison, previous work [1] on HDDs reports that only 3.5% of disks in a large population developed bad sectors in a 32 months period – a low number when taking into account that the number of sectors on a hard disk is orders of magnitudes larger than the number of either blocks or chips on a solid state drive, and that sectors are smaller than blocks, so a failure is less severe.

While there is one sense in which SSDs are more reliable than rotational disks, there’s also a sense in which they appear to be less reliable. It’s not impossible that Apple uses some kind of custom firmware on its drive that devotes more bits to error correction than you can get in publicly available disks, but even if that’s the case, you might plug a non-apple drive into your apple computer and want some kind of protection against data corruption.

Internal error handling

Now that we’ve reproduced some tests from Prabhakaran et al., we’re going to move on to Gunawi et al.. Since the paper is fairly involved, we’re just going to look at one small part of the paper, the part where they examined three function calls, filemap_fdatawait, filemap_fdatawrite, and sync_blockdev to see how often errors weren’t checked for these functions.

Their justification for looking at these function is given as:

As discussed in Section 3.1, a function could return more than one error code at the same time, and checking only one of them suffices. However, if we know that a certain function only returns a single error code and yet the caller does not save the return value properly, then we would know that such call is really a flaw. To find real flaws in the file system code, we examined three important functions that we know only return single error codes: sync_blockdev, filemap_fdatawrite, and filemap_fdatawait. A file system that does not check the returned error codes from these functions would obviously let failures go unnoticed in the upper layers.

Ignoring errors from these functions appears to have fairly serious consequences. The documentation for filemap_fdatawait says:

filemap_fdatawait — wait for all under-writeback pages to complete ... Walk the list of under-writeback pages of the given address space and wait for all of them. Check error status of the address space and return it. Since the error status of the address space is cleared by this function, callers are responsible for checking the return value and handling and/or reporting the error.

The comment next to the code for sync_blockdev reads:

Write out and wait upon all the dirty data associated with a block device via its mapping. Does not take the superblock lock.

In both of these cases, it appears that ignoring the error code could mean that data fails to get written to disk without the writer ever being notified that the data wasn’t actually written.
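
As a sketch of what the difference looks like (simplified kernel-style C, not copied from any real filesystem, and only meaningful inside a kernel build), checking vs. dropping the return value is roughly:

/* Simplified kernel-style sketch: filemap_fdatawrite starts writeback,
   filemap_fdatawait waits for it and returns any I/O error, clearing the
   address space's error state in the process. */
#include <linux/fs.h>

int flush_checked(struct address_space *mapping)
{
    int err = filemap_fdatawrite(mapping);   /* kick off writeback */
    int err2 = filemap_fdatawait(mapping);   /* wait; returns write errors */
    if (!err)
        err = err2;
    return err;                              /* caller can see the failure */
}

void flush_unchecked(struct address_space *mapping)
{
    filemap_fdatawrite(mapping);
    filemap_fdatawait(mapping);              /* error status cleared and lost:
                                                the writer never finds out */
}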

Let’s look at how often calls to these functions didn’t completely ignore the error code:

function             2008      '08 %    2017      '17 %
filemap_fdatawait    7 / 29    24       12 / 17   71
filemap_fdatawrite   17 / 47   36       13 / 22   59
sync_blockdev        6 / 21    29       7 / 23    30
This table is for all code in linux under fs. Each row shows data for calls of one function. For each year, the leftmost cell shows the number of calls that do something with the return value over the total number of calls. The cell to the right shows the percentage of calls that do something with the return value. “Do something” is used very loosely here -- branching on the return value and then failing to handle the error in either branch, returning the return value and having the caller fail to handle the return value, as well as saving the return value and then ignoring it are all considered doing something for the purposes of this table.

For example, Gunawi et al. noted that cifs/transport.c had:

int SendReceive () {
    int rc;
    rc = cifs_sign_smb(); // error code stored in rc here ...
    ...
    rc = smb_send();      // ... but overwritten before ever being checked
}

Although cifs_sign_smb returned an error code, it was never checked before being overwritten by smb_send, which counted as being used for our purposes even though the error wasn’t handled.

Overall, the table appears to show that many more errors are handled now than were handled in 2008 when Gunawi et al. did their analysis, but it’s hard to say what this means from looking at the raw numbers because it might be ok for some errors not to be handled and different lines of code are executed with different probabilities.

Conclusion

Filesystem error handling seems to have improved. Reporting an error on a pwrite if the block device reports an error is perhaps the most basic error propagation a robust filesystem should do; few filesystems reported that error correctly in 2005. Today, most filesystems will correctly report an error when the simplest possible error condition that doesn’t involve the entire drive being dead occurs if there are no complicating factors.

Most filesystems don’t have checksums for data and leave error detection and correction up to userspace software. When I talk to server-side devs at big companies, their answer is usually something like “who cares? All of our file accesses go through a library that checksums things anyway and redundancy across machines and datacenters takes care of failures, so we only need error detection and not correction”. While that’s true for developers at certain big companies, there’s a lot of software out there that isn’t written robustly and just assumes that filesystems and disks don’t have errors.

This was a joint project with Wesley Aptekar-Cassels; the vast majority of the work for the project was done while pair programming at RC. We also got a lot of help from Kate Murphy. Both Wesley (w.aptekar@gmail.com) and Kate (hello@kate.io) are looking for work. They’re great and I highly recommend talking to them if you’re hiring!

Appendix: error handling in C

A fair amount of effort has been applied to get error handling right. But C makes it very easy to get things wrong, even when you apply a fair amount of effort and even apply extra tooling. One example of this in the code is the submit_one_bio function. If you look at the definition, you can see that it’s annotated with __must_check, which will cause a compiler warning when the result is ignored. But if you look at calls of submit_one_bio, you’ll see that its callers aren’t annotated and can ignore errors. If you dig around enough you’ll find one path of error propagation that looks like:

submit_one_bio
submit_extent_page
__extent_writepage
extent_write_full_page
write_cache_pages
generic_writepages
do_writepages
__filemap_fdatawrite_range
__filemap_fdatawrite
filemap_fdatawrite

Nine levels removed from submit_one_bio, we see our old friend, filemap_fdatawrite, which we know often doesn’t get checked for errors.
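
A small userspace sketch (made-up function names, not kernel code) shows why the annotation doesn't save us here: warn_unused_result, which __must_check expands to, only warns at the direct call site, and once an intermediate function captures and returns the value, nothing forces callers further up the chain to look at it:

/* Sketch with made-up names: the warning only triggers when a direct
   caller discards the return value of the annotated function. */
#include <stdio.h>

#define __must_check __attribute__((warn_unused_result))

static __must_check int submit_io(void)   /* stands in for submit_one_bio */
{
    return -5;  /* pretend the device reported an I/O error (-EIO) */
}

static int write_page(void)               /* middle of the call chain */
{
    return submit_io();  /* no warning: the value is "used" by returning it */
}

int main(void)
{
    write_page();  /* error silently dropped, and no warning here, because
                      write_page itself isn't annotated __must_check */
    puts("done");
    return 0;
}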

There's a very old debate over how to prevent things like this from accidentally happening. One school of thought, which I'll call the Uncle Bob (UB) school believes that we can't fix these kinds of issues with tools or processes and simply need to be better programmers in order to avoid bugs. You'll often hear people of the UB school say things like, "you can't get rid of all bugs with better tools (or processes)". In his famous and well-regarded talk, Simple Made Easy, Rich Hickey says

What's true of every bug found in the field?

[Audience reply: Someone wrote it?] [Audience reply: It got written.]

It got written. Yes. What's a more interesting fact about it? It passed the type checker.

[Audience laughter]

What else did it do?

[Audience reply: (Indiscernible)]

It passed all the tests. Okay. So now what do you do? Right? I think we're in this world I'd like to call guardrail programming. Right? It's really sad. We're like: I can make change because I have tests. Who does that? Who drives their car around banging against the guardrail saying, "Whoa! I'm glad I've got these guardrails because I'd never make it to the show on time."

[Audience laughter]

If you watch the talk, Rich uses "simplicity" the way Uncle Bob uses "discipline". The way these statements are used, they're roughly equivalent to Ken Thompson saying "Bugs are bugs. You write code with bugs because you do". The UB school throws tools and processes under the bus, saying that it's unsafe to rely solely on tools or processes.

Rich's rhetorical trick is brilliant -- I've heard that line quoted tens of times since the talk to argue against tests or tools or types. But, like guardrails, most tools and processes aren't about eliminating all bugs, they're about reducing the severity or probability of bugs. If we look at this particular function call, we can see that a static analysis tool failed to find this bug. Does that mean that we should give up on static analysis tools? A static analysis tool could look for all calls of submit_one_bio and show you the cases where the error is propagated up N levels only to be dropped. Gunawi et al. did exactly that and found a lot of bugs. A person basically can't do the same thing without tooling. They could try, but people are lucky if they get 95% accuracy when manually digging through things like this. The sheer volume of code guarantees that a human doing this by hand would make mistakes.

Even better than a static analysis tool would be a language that makes it harder to accidentally forget about checking for an error. One of the issues here is that it's sometimes valid to drop an error. There are a number of places where there's no interface that allows an error to get propagated out of the filesystem, making it correct to drop the error, modulo changing the interface. In the current situation, as an outsider reading the code, if you look at a bunch of calls that drop errors, it's very hard to say, for all of them, which of those is a bug and which of those is correct. If the default is that we have a kind of guardrail that says "this error must be checked", people can still incorrectly ignore errors, but you at least get an annotation that the omission was on purpose. For example, if you're forced to specifically write code that indicates that you're ignoring an error, then, in code that's intended to be robust, like filesystem code, code that drops an error on purpose is relatively likely to be accompanied by a comment explaining why the error was dropped.
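
As a sketch of what that kind of guardrail could look like even in C (the macro names here are made up, not from the kernel), forcing the ignore to be written down makes the intent greppable and gives a natural place for an explanatory comment:

/* Sketch with made-up macro names: an error can still be dropped, but only
   by saying so explicitly, which leaves a searchable marker in the code. */
#include <stdio.h>

#define must_use __attribute__((warn_unused_result))
/* assigning to a local counts as "using" the result, silencing the warning */
#define ignore_error(expr) do { int ignored_ = (expr); (void)ignored_; } while (0)

static must_use int flush_metadata(void)
{
    return -1;  /* pretend the flush failed */
}

int main(void)
{
    /* flush_metadata();  <- would produce a warn_unused_result warning */

    /* There's no interface to report this error to the caller here, so we
       drop it on purpose -- and the drop is visible and explained. */
    ignore_error(flush_metadata());
    return 0;
}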

Appendix: why wasn't this done earlier?

After all, it would be nice if we knew if modern filesystems could do basic tasks correctly. Filesystem developers probably know this stuff, but since I don't follow LKML, I had no idea whether or not things had improved since 2005 until we ran the experiment.

The papers we looked at here came out of Andrea and Remzi Arpaci-Dusseau's research lab. Remzi has a talk where he mentioned that grad students don't want to reproduce and update old work. That's entirely reasonable, given the incentives they face. And I don't mean to pick on academia here -- this work came out of academia, not industry. It's possible this kind of work simply wouldn't have happened if not for the academic incentive system.

In general, it seems to be quite difficult to fund work on correctness. There are a fair number of papers on new ways to find bugs, but there's relatively little work on applying existing techniques to existing code. In academia, that seems to be hard to get a good publication out of; in the open source world, that seems to be less interesting to people than writing new code. That's also entirely reasonable -- people should work on what they want, and even if they enjoy working on correctness, that's probably not a great career decision in general. I was at the RC career fair the other night and my badge said I was interested in testing. The first person who chatted me up opened with "do you work in QA?". Back when I worked in hardware, that wouldn't have been a red flag, but in software, "QA" is code for a low-skill, tedious, and poorly paid job. Much of industry considers testing and QA to be an afterthought. As a result, open source projects that companies rely on are often woefully underfunded. Google funds some great work (like afl-fuzz), but that's the exception and not the rule, even within Google, and most companies don't fund any open source work. The work in this post was done by a few people who are intentionally temporarily unemployed, which isn't really a scalable model.

Occasionally, you'll see someone spend a lot of effort on improving correctness, but that's usually done as a massive amount of free labor. Kyle Kingsbury might be the canonical example of this -- my understanding is that he worked on the Jepsen distributed systems testing tool on nights and weekends for years before turning that into a consulting business. It's great that he did that -- he showed that almost every open source distributed system had serious data loss or corruption bugs. I think that's great, but stories about heroic effort like that always worry me because heroism doesn't scale. If Kyle hadn't come along, would most of the bugs that he and his tool found still plague open source distributed systems today? That's a scary thought.

If I knew how to fund more work on correctness, I'd try to convince you that we should switch to this new model, but I don't know of a funding model that works. I've set up a patreon (donation account), but it would be quite extraordinary if that was sufficient to actually fund a significant amount of work. If you look at how much programmers make off of donations, if I made two orders of magnitude less than I could if I took a job in industry, that would already put me in the top 1% of programmers on patreon. If I made one order of magnitude less than I'd make in industry, that would be extraordinary. Off the top of my head, the only programmers who make more than that off of patreon either make something with much broader appeal (like games) or are Evan You, who makes one of the most widely used front-end libraries in existence. And if I actually made as much as I can make in industry, I suspect that would make me the highest grossing programmer on patreon, even though, by industry standards, my compensation hasn't been anything special.

If I had to guess, I'd say that part of the reason it's hard to fund this kind of work is that consumers don't incentivize companies to fund this sort of work. If you look at "big" tech companies, two of them are substantially more serious about correctness than their competitors. This results in many fewer horror stories about lost emails and documents as well as lost entire accounts. If you look at the impact on consumers, it might be something like the difference between 1% of people seeing lost/corrupt emails vs. 0.001%. I think that's pretty significant if you multiply that cost across all consumers, but the vast majority of consumers aren't going to make decisions based on that kind of difference. If you look at an area where correctness problems are much more apparent, like databases or backups, you'll find that even the worst solutions have defenders who will pop into any discussions and say "works for me". A backup solution that works 90% of the time is quite bad, but if you have one that works 90% of the time, it will still have staunch defenders who drop into discussions to say things like "I've restored from backup three times and it's never failed! You must be making stuff up!". I don't blame companies for rationally responding to consumers, but I do think that the result is unfortunate for consumers.

Just as an aside, one of the great wonders of doing open work for free is that the more free work you do, the more people complain that you didn't do enough free work. As David MacIver has said, doing open source work is like doing normal paid work, except that you get paid in complaints instead of cash. It's basically guaranteed that the most common comment on this post, for all time, will be that we didn't test someone's pet filesystem because we're btrfs shills or just plain lazy, even though we include a link to a repo that lets anyone add tests as they please. Pretty much every time I've done any kind of free experimental work, people who obviously haven't read the experimental setup or the source code complain that the experiment couldn't possibly be right because of [thing that isn't true that anyone could see by looking at the setup] and that it's absolutely inexcusable that I didn't run the experiment on the exact pet thing they wanted to see. Having played video games competitively in the distant past, I'm used to much more intense internet trash talk, but in general, this incentive system seems to be backwards.

Appendix: experimental setup

For the error injection setup, a high-level view of the experimental setup is that dmsetup was used to simulate bad blocks on the disk.

A list of the commands run looks something like:

# copy a prebuilt filesystem image and decompress it
cp images/btrfs.img.gz /tmp/tmpeas9efr6.gz
gunzip -f /tmp/tmpeas9efr6.gz
# attach the image to a free loop device
losetup -f
losetup /dev/loop19 /tmp/tmpeas9efr6
# get the device size in 512-byte sectors, used to build the dmsetup table
blockdev --getsize /dev/loop19
# dmsetup table: pass reads/writes through to the loop device, except for
# one sector in the middle of the device, which returns an error
#        0 74078 linear /dev/loop19 0
#        74078 1 error
#        74079 160296 linear /dev/loop19 74079
dmsetup create fserror_test_1508727591.4736078
# mount the error-injecting device, put an overlay filesystem on top of it,
# and run the test program against a file on the overlay
mount /dev/mapper/fserror_test_1508727591.4736078 /mnt/fserror_test_1508727591.4736078/
mount -t overlay -o lowerdir=/mnt/fserror_test_1508727591.4736078/,upperdir=/tmp/tmp4qpgdn7f,workdir=/tmp/tmp0jn83rlr overlay /tmp/tmpeuot7zgu/
./mmap_read /tmp/tmpeuot7zgu/test.txt
# tear everything down
umount /tmp/tmpeuot7zgu/
rm -rf /tmp/tmp4qpgdn7f
rm -rf /tmp/tmp0jn83rlr
umount /mnt/fserror_test_1508727591.4736078/
dmsetup remove fserror_test_1508727591.4736078
losetup -d /dev/loop19
rm /tmp/tmpeas9efr6

See this github repo for the exact set of commands run to execute tests.

Note that all of these tests were done on linux, so fat means the linux fat implementation, not the windows fat implementation. zfs and reiserfs weren’t tested because they couldn’t be trivially tested in the exact same way that we tested other filesystems (one of us spent an hour or two trying to get zfs to work, but its configuration interface is inconsistent with all of the filesystems tested; reiserfs appears to have a consistent interface but testing it requires doing extra work for a filesystem that appears to be dead). ext3 support is now provided by the ext4 code, so what ext3 means now is different from what it meant in 2005.

All tests were run on both ubuntu 17.04, 4.10.0-37, as well as on arch, 4.12.8-2. We got the same results on both machines. All filesystems were configured with default settings. For btrfs, this meant duplicated metadata without duplicated data and, as far as we know, the settings wouldn't have made a difference for other filesystems.

The second part of this doesn’t have much experimental setup to speak of. The setup was to grep the linux source code for the relevant functions.

Thanks to Leah Hanson, David Wragg, Ben Kuhn, Wesley Aptekar-Cassels, Joel Borggrén-Franck, and Dan Puttick for comments/corrections on this post.

Mon, 23 Oct 2017 00:00:00 +0000


Keyboard latency

If you look at "gaming" keyboards, a lot of them sell for $100 or more on the promise that they’re fast, and the ad copy is full of claims about speed and response time.

Despite all of these claims, I can only find one person who’s publicly benchmarked keyboard latency and they only tested two keyboards. In general, my belief is that if someone makes performance claims without benchmarks, the claims probably aren’t true, just like how code that isn’t tested (or otherwise verified) should be assumed broken.

The situation with gaming keyboards reminds me a lot of talking to car salesmen:

Salesman: this car is super safe! It has 12 airbags!
Me: that’s nice, but how does it fare in crash tests?
Salesman: 12 airbags!

Sure, gaming keyboards have 1000Hz polling, but so what?

Two obvious questions are:

  1. Does keyboard latency matter?
  2. Are gaming keyboards actually quicker than other keyboards?

Does keyboard latency matter?

A year ago, if you’d asked me if I was going to build a custom setup to measure keyboard latency, I would have said that’s silly, and yet here I am, measuring keyboard latency with a logic analyzer.

It all started because I had this feeling that some old computers feel much more responsive than modern machines. For example, an iMac G4 running macOS 9 or an Apple 2 both feel quicker than my 4.2 GHz Kaby Lake system. I never trust feelings like this because there’s decades of research showing that users often have feelings that are the literal opposite of reality, so I got a high-speed camera and started measuring actual keypress-to-screen-update latency as well as mouse-move-to-screen-update latency. It turns out the machines that feel quick are actually quick, much quicker than my modern computer -- computers from the 70s and 80s commonly have keypress-to-screen-update latencies in the 30ms to 50ms range out of the box, whereas modern computers are often in the 100ms to 200ms range when you press a key in a terminal. It’s possible to get down to the 50ms range in well optimized games with a fancy gaming setup, and there’s one extraordinary consumer device that can easily get below 50ms, but the default experience is much slower. Modern computers have much better throughput, but their latency isn’t so great.

Anyway, at the time I did these measurements, my 4.2 GHz kaby lake had the fastest single-threaded performance of any machine you could buy but had worse latency than a quick machine from the 70s (roughly 6x worse than an Apple 2), which seems a bit curious. To figure out where the latency comes from, I started measuring keyboard latency because that’s the first part of the pipeline. My plan was to look at the end-to-end pipeline and start at the beginning, ruling out keyboard latency as a real source of latency. But it turns out keyboard latency is significant! I was surprised to find that the median keyboard I tested has more latency than the entire end-to-end pipeline of the Apple 2. If this doesn’t immediately strike you as absurd, consider that an Apple 2 has 3500 transistors running at 1MHz and an Atmel employee estimates that the core used in a number of high-end keyboards today has 80k transistors running at 16MHz. That's 20x the transistors running at 16x the clock speed -- keyboards are often more powerful than entire computers from the 70s and 80s! And yet, the median keyboard today adds as much latency as the entire end-to-end pipeline of a fast machine from the 70s.

Let’s look at the measured keypress-to-USB latency on some keyboards:

keyboard               latency (ms)   connection   gaming
apple magic (usb)      15             USB FS
hhkb lite 2            20             USB FS
MS natural 4000        20             USB
das 3                  25             USB
logitech k120          30             USB
unicomp model M        30             USB FS
pok3r vortex           30             USB FS
filco majestouch       30             USB
dell OEM               30             USB
powerspec OEM          30             USB
kinesis freestyle 2    30             USB FS
chinfai silicone       35             USB FS
razer ornata chroma    35             USB FS       Yes
olkb planck rev 4      40             USB FS
ergodox                40             USB FS
MS comfort 5000        40             wireless
easterntimes i500      50             USB FS       Yes
kinesis advantage      50             USB FS
genius luxemate i200   55             USB
topre type heaven      55             USB FS
logitech k360          60             "unifying"

The latency measurements are the time from when the key starts moving to the time when the USB packet associated with the key makes it out onto the USB bus. Numbers are rounded to the nearest 5 ms in order to avoid giving a false sense of precision. The easterntimes i500 is also sold as the tomoko MMC023.

The connection column indicates the connection used. USB FS stands for the usb full speed protocol, which allows up to 1000Hz polling, a feature commonly advertised by high-end keyboards. USB is the usb low speed protocol, which is the protocol most keyboards use. The ‘gaming’ column indicates whether or not the keyboard is branded as a gaming keyboard. wireless indicates some kind of keyboard-specific dongle and unifying is logitech's wireless device standard.

We can see that, even with the limited set of keyboards tested, there can be as much as a 45ms difference in latency between keyboards. Moreover, a modern computer with one of the slower keyboards attached can’t possibly be as responsive as a quick machine from the 70s or 80s because the keyboard alone is slower than the entire response pipeline of some older computers.

That establishes the fact that modern keyboards contribute to the latency bloat we’ve seen over the past forty years. The other half of the question is, does the latency added by a modern keyboard actually make a difference to users? From looking at the table, we can see that among the keyboards tested, we can get up to a 40ms difference in average latency. Is 40ms of latency noticeable? Let’s take a look at the empirical research on how much latency users notice.

There’s a fair amount of empirical evidence on this and we can see that, for very simple tasks, people can perceive latencies down to 2ms or less. Moreover, increasing latency is not only noticeable to users, it causes users to execute simple tasks less accurately. If you want a visual demonstration of what latency looks like and you don’t have a super-fast old computer lying around, check out this MSR demo on touchscreen latency.

Are gaming keyboards faster than other keyboards?

I’d really like to test more keyboards before making a strong claim, but from the preliminary tests here, it appears that gaming keyboards aren’t generally faster than non-gaming keyboards.

Gaming keyboards often claim to have features that reduce latency, like connecting over USB FS and using 1000Hz polling. The USB low speed spec states that the minimum time between packets is 10ms, or 100 Hz. However, it’s common to see USB devices round this down to the nearest power of two and run at 8ms, or 125Hz. With 8ms polling, the average latency added from having to wait until the next polling interval is 4ms. With 1ms polling, the average latency from USB polling is 0.5ms, giving us a 3.5ms delta. While that might be a significant contribution to latency for a quick keyboard like the Apple magic keyboard, it’s clear that other factors dominate keyboard latency for most keyboards and that the gaming keyboards tested here are so slow that shaving off 3.5ms won’t save them.

Conclusion

Most keyboards add enough latency to make the user experience noticeably worse, and keyboards that advertise speed aren’t necessarily faster. The two gaming keyboards we measured weren’t faster than non-gaming keyboards, and the fastest keyboard measured was a minimalist keyboard from Apple that’s marketed more on design than speed.

Previously, we've seen that terminals can add significant latency, up to 100ms in mildly pessimistic conditions if you choose the "right" terminal. In a future post, we'll look at the entire end-to-end pipeline to see other places latency has crept in and we'll also look at how some modern devices keep latency down.

Appendix: where is the latency coming from?

A major source of latency is key travel time. It’s not a coincidence that the quickest keyboard measured also has the shortest key travel distance by a large margin. The video setup I’m using to measure end-to-end latency is a 240 fps camera, which means that frames are 4ms apart. When videoing "normal" keypresses and typing, it takes 4-8 frames for a key to become fully depressed. Most switches will start firing before the key is fully depressed, but the key travel time is still significant and can easily add 10ms of delay (or more, depending on the switch mechanism). Contrast this to the Apple "magic" keyboard measured, where the key travel is so short that it can’t be captured with a 240 fps camera, indicating that the key travel time is < 4ms.

Note that, unlike the other measurement I was able to find online, this measurement was from the start of the keypress instead of the switch activation. This is because, as a human, you don't activate the switch, you press the key. A measurement that starts from switch activation time misses this large component of latency. If, for example, you're playing a game and you switch from moving forward to moving backwards when you see something happen, you have to pay the cost of the key movement, which is different for different keyboards. A common response to this is that "real" gamers will preload keys so that they don't have to pay the key travel cost, but if you go around with a high speed camera and look at how people actually use their keyboards, the fraction of keypresses that are significantly preloaded is basically zero even when you look at gamers. It's possible you'd see something different if you look at high-level competitive gamers, but even then, just for example, people who use a standard wasd or esdf layout will typically not preload a key when going from back to forward. Also, the idea that it's fine that keys have a bunch of useless travel because you can pre-depress the key before really pressing the key is just absurd. That's like saying latency on modern computers is fine because some people build gaming boxes that, when run with unusually well optimized software, get 50ms response time. Normal, non-hardcore-gaming users simply aren't going to do this. Since that's the vast majority of the market, even if all "serious" gamers did this, that would still be a rounding error.

The other large sources of latency are scanning the keyboard matrix and debouncing. Neither of these delays is inherent -- keyboards use a matrix that has to be scanned instead of having a wire per key because it saves a few bucks, and most keyboards scan the matrix at such a slow rate that it induces human-noticeable delays because that saves a few bucks, but a manufacturer willing to spend a bit more on manufacturing a keyboard could make the delay from that far below the threshold of human perception. See below for debouncing delay.

Although we didn't discuss throughput in this post, when I measure my typing speed, I find that I can type faster with the low-travel Apple keyboard than with any of the other keyboards. There's no way to do a blinded experiment for this, but Gary Bernhardt and others have also observed the same thing. Some people claim that key travel doesn't matter for typing speed because they use the minimum amount of travel necessary and that this therefore can't matter, but as with the above claims on keypresses, if you walk around with a high speed camera and observe what actually happens when people type, it's very hard to find someone who actually does this.

Appendix: counter-arguments to common arguments that latency doesn’t matter

Before writing this up, I read what I could find about latency and it was hard to find non-specialist articles or comment sections that didn’t have at least one of the arguments listed below:

Computers and devices are fast

The most common response to questions about latency is that input latency is basically zero, or so close to zero that it’s a rounding error. For example, two of the top comments on this slashdot post asking about keyboard latency are that keyboards are so fast that keyboard speed doesn’t matter. One person even says

There is not a single modern keyboard that has 50ms latency. You (humans) have that sort of latency.

As far as response times, all you need to do is increase the poll time on the USB stack

As we’ve seen, some devices do have latencies in the 50ms range. This quote as well as other comments in the thread illustrate another common fallacy -- that input devices are limited by the speed of the USB polling. While that’s technically possible, most devices are nowhere near being fast enough to be limited by USB polling latency.

Unfortunately, most online explanations of input latency assume that the USB bus is the limiting factor.

Humans can’t notice 100ms or 200ms latency

Here’s a “cognitive neuroscientist who studies visual perception and cognition” who refers to the fact that human reaction time is roughly 200ms, and then throws in a bunch more scientific mumbo jumbo to say that no one could really notice latencies below 100ms. This is a little unusual in that the commenter claims some kind of special authority and uses a lot of terminology, but it’s common to hear people claim that you can’t notice 50ms or 100ms of latency because human reaction time is 200ms. This doesn’t actually make sense because these are independent quantities. This line of argument is like saying that you wouldn’t notice a flight being delayed by an hour because the duration of the flight is six hours.

Another problem with this line of reasoning is that the full pipeline from keypress to screen update is quite long and if you say that it’s always fine to add 10ms here and 10ms there, you end up with a much larger amount of bloat through the entire pipeline, which is how we got where we are today, where you can buy a system with a CPU that gives you the fastest single-threaded performance money can buy and still get 6x the latency of a machine from the 70s.

It doesn’t matter because the game loop runs at 60 Hz

This is fundamentally the same fallacy as above. If you have a delay that’s half the duration of a clock period, there’s a 50% chance the delay will push the event into the next processing step. That’s better than a 100% chance, but it’s not clear to me why people think that you’d need a delay as long as the clock period for the delay to matter. And for reference, the 45ms delta between the slowest and fastest keyboard measured here corresponds to 2.7 frames at 60fps.

Keyboards can’t possibly respond more quickly than 5ms/10ms/20ms due to debouncing

Even without going through contortions to optimize the switch mechanism, if you’re willing to put hysteresis into the system, there’s no reason that the keyboard can’t assume a keypress (or release) is happening the moment it sees an edge. This is commonly done for other types of systems and AFAICT there’s no reason keyboards couldn’t do the same thing (and perhaps some do). The debounce time might limit the repeat rate of the key, but there’s no inherent reason that it has to affect the latency. And if we're looking at the repeat rate, imagine we have a 5ms limit on the rate of change of the key state due to introducing hysteresis. That gives us one full keypress cycle (press and release) every 10ms, or 100 keypresses per second per key, which is well beyond the capacity of any human. You might argue that this introduces a kind of imprecision, which might matter in some applications (music, rhythm games), but that's limited by the switch mechanism. Using a debouncing mechanism with hysteresis doesn't make us any worse off than we were before.
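To make that concrete, here's a minimal sketch (hypothetical Python pseudocode, not taken from any real keyboard firmware) of edge-triggered debouncing with a hold-off window: the key event is reported the moment the first edge is seen, and further transitions are ignored for the debounce window, so debouncing limits the repeat rate without adding latency.

DEBOUNCE_MS = 5  # hypothetical hold-off window

class DebouncedKey:
    def __init__(self):
        self.state = 0                      # last reported state: 0 = up, 1 = down
        self.last_change_ms = -DEBOUNCE_MS  # time of the last reported change

    def sample(self, raw, now_ms):
        # Called on every matrix scan with the raw switch reading (0 or 1).
        if raw != self.state and now_ms - self.last_change_ms >= DEBOUNCE_MS:
            # First edge after the hold-off window: report it immediately,
            # then ignore any bouncing for the next DEBOUNCE_MS.
            self.state = raw
            self.last_change_ms = now_ms
            return "press" if raw else "release"
        return None                         # no change, or a bounce we ignore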

An additional problem with debounce delay is that most keyboard manufacturers seem to have confounded scan rate and debounce delay. It's common to see keyboards with scan rates in the 100 Hz to 200 Hz range. This is justified by statements like "there's no point in scanning faster because the debounce delay is 5ms", which combines two fallacies mentioned above. If you pull out the schematics for the Apple 2e, you can see that the scan rate is roughly 50 kHz. Its debounce time is roughly 6ms, which corresponds to a frequency of 167 Hz. Why scan so quickly? The fast scan allows the keyboard controller to start the clock on the debounce time almost immediately (after at most 20 microseconds), as opposed to a modern keyboard that scans at 167 Hz, which might not start the clock on debouncing for 6ms, or after 300x as much time.

Apologies for not explaining terminology here, but I think that anyone making this objection should understand the explanation :-).

Appendix: experimental setup

The USB measurement setup was a USB cable that was cut open so that a logic analyzer could watch the data lines. Cutting open the cable damages the signal integrity and I found that, with a very long cable, some keyboards that weakly drive the data lines didn't drive them strongly enough to get a good signal with the cheap logic analyzer I used.

The start-of-input was measured by pressing two keys at once -- one key on the keyboard and a button that was also connected to the logic analyzer. This introduces some jitter as the two buttons won’t be pressed at exactly the same time. To calibrate the setup, we used two identical buttons connected to the logic analyzer. The median jitter was < 1ms and the 90%-ile jitter was roughly 5ms. This is enough that tail latency measurements for quick keyboards aren’t really possible with this setup, but average latency measurements like the ones done here seem like they should be ok. The input jitter could probably be reduced to a negligible level by building a device to both trigger the logic analyzer and press a key on the keyboard under test at the same time. Average latency measurements would also get better with such a setup (because it would be easier to run a large number of measurements).

If you want to know the exact setup, an E-switch LL1105AF065Q switch was used. Power and ground were supplied by an arduino board. There’s no particular reason to use this setup. In fact, it’s a bit absurd to use an entire arduino to provide power, but this was done with spare parts that happened to be lying around in RC's lab, with the exception of the switches. There weren’t two identical copies of any switch, so we bought a few switches so we could do calibration measurements with two identical switches. The exact type of switch isn’t important here; any low-resistance switch would do.

Tests were done by pressing the z key, looking for byte 29 on the USB bus, and marking the end of the first packet containing the appropriate information. But, as above, any key would do.

I don't actually trust this setup and I'd like to build a completely automated setup before testing more keyboards. While the measurements are in line with the one other keyboard measurement I could find online, this setup has an inherent imprecision that's probably in the 1ms to 10ms range. While averaging across multiple measurements reduces that imprecision, since the measurements are done by a human, it's not guaranteed and perhaps not even likely that the errors are independent and will average out.

I’m looking for more phones and machines to measure and would love to run a quick benchmark with a high-speed camera if you have a machine or device that’s not on this list! If you’re not in the area and want to donate a device for me to test, feel free to mail the device to

Dan Luu
Recurse Center
455 Broadway, 2nd Floor
New York, NY 10013

This project was done with help from Wesley Aptekar-Cassels, Leah Hanson, and Kate Murphy. BTW, Wesley is looking for work. In addition to knowing “normal” programming stuff, he’s also familiar with robotics, controls, electronics, and general “low-level” programming.

Thanks to RC, Ahmad Jarara, Raph Levien, Peter Bhat Harkins, Brennan Chesley, Dan Bentley, Kate Murphy, Christian Ternus, Sophie Haskins, and Dan Puttick, for letting us use their keyboards for testing.

Thanks to Leah Hanson, Mark Feeney, and Zach Allaun for comments/corrections/discussion on this post.

Mon, 16 Oct 2017 00:00:00 +0000


Branch prediction

This is a pseudo-transcript for a talk on branch prediction given at Two Sigma on 8/22/2017 to kick off "localhost", a talk series organized by RC.

How many of you use branches in your code? Could you please raise your hand if you use if statements or pattern matching?

Most of the audience raises their hands

I won’t ask you to raise your hands for this next part, but my guess is that if I asked, how many of you feel like you have a good understanding of what your CPU does when it executes a branch and what the performance implications are, and how many of you feel like you could understand a modern paper on branch prediction, fewer people would raise their hands.

The purpose of this talk is to explain how and why CPUs do “branch prediction” and then explain enough about classic branch prediction algorithms that you could read a modern paper on branch prediction and basically know what’s going on.

Before we talk about branch prediction, let’s talk about why CPUs do branch prediction. To do that, we’ll need to know a bit about how CPUs work.

For the purposes of this talk, you can think of your computer as a CPU plus some memory. The instructions live in memory and the CPU executes a sequence of instructions from memory, where instructions are things like “add two numbers” or “move a chunk of data from memory to the processor”. Normally, after executing one instruction, the CPU will execute the instruction that’s at the next sequential address. However, there are instructions called “branches” that let you change the address the next instruction comes from.

Here’s an abstract diagram of a CPU executing some instructions. The x-axis is time and the y-axis distinguishes different instructions.

Instructions executing sequentially

Here, we execute instruction A, followed by instruction B, followed by instruction C, followed by instruction D.

One way you might design a CPU is to have the CPU do all of the work for one instruction, then move on to the next instruction, do all of the work for the next instruction, and so on. There’s nothing wrong with this; a lot of older CPUs did this, and some modern very low-cost CPUs still do this. But if you want to make a faster CPU, you might make a CPU that works like an assembly line. That is, you break the CPU up into two parts, so that half the CPU can do the “front half” of the work for an instruction while half the CPU works on the “back half” of the work for an instruction, like an assembly line. This is typically called a pipelined CPU.

Instructions with overlapping execution

If you do this, the execution might look something like the above. After the first half of instruction A is complete, the CPU can work on the second half of instruction A while the first half of instruction B runs. And when the second half of A finishes, the CPU can start on both the second half of B and the first half of C. In this diagram, you can see that the pipelined CPU can execute twice as many instructions per unit time as the unpipelined CPU above.

There’s no reason that a CPU can only be broken up into two parts. We could break the CPU into three parts, and get a 3x speedup, or four parts and get a 4x speedup. This isn’t strictly true, and we generally get less than a 3x speedup for a three-stage pipeline or 4x speedup for a 4-stage pipeline because there’s overhead in breaking the CPU up into more parts and having a deeper pipeline.

One source of overhead is how branches are handled. One of the first things the CPU has to do for an instruction is to get the instruction; to do that, it has to know where the instruction is. For example, consider the following code:

if (x == 0) {
  // Do stuff
} else {
  // Do other stuff (things)
}
// Whatever happens later

This might turn into assembly that looks something like

branch_if_not_equal x, 0, else_label
// Do stuff
goto end_label
else_label:
// Do things
end_label:
// whatever happens later

In this example, we compare x to 0. If x is not equal to 0, we branch to else_label and execute the code in the else block. If x is equal to 0, we fall through, execute the code in the if block, and then jump to end_label in order to avoid executing the code in the else block.

The particular sequence of instructions that’s problematic for pipelining is

branch_if_not_equal x, 0, else_label
???

The CPU doesn’t know if this is going to be

branch_if_not_equal x, 0, else_label
// Do stuff

or

branch_if_not_equal x, 0, else_label
// Do things

until the branch has finished (or nearly finished) executing. Since one of the first things the CPU needs to do for an instruction is to get the instruction from memory, and we don’t know which instruction ??? is going to be, we can’t even start on ??? until the previous instruction is nearly finished.

Earlier, when we said that we’d get a 3x speedup for a 3-stage pipeline or a 20x speedup for a 20-stage pipeline, that assumed that you could start a new instruction every cycle, but in this case the two instructions are nearly serialized.

non-overlapping execution due to branch stall

One way around this problem is to use branch prediction. When a branch shows up, the CPU will guess if the branch was taken or not taken.

speculating about a branch result

In this case, the CPU predicts that the branch won’t be taken and starts executing the first half of stuff while it’s executing the second half of the branch. If the prediction is correct, the CPU will execute the second half of stuff and can start another instruction while it’s executing the second half of stuff, like we saw in the first pipeline diagram.

overlapped execution after a correct prediction

If the prediction is wrong, when the branch finishes executing, the CPU will throw away the result from the first half of stuff and start executing the correct instructions instead of the wrong instructions. Since we would’ve stalled the processor and not executed any instructions if we didn’t have branch prediction, we’re no worse off than we would’ve been had we not made a prediction (at least at the level of detail we’re looking at).

aborted prediction

What’s the performance impact of doing this? To make an estimate, we’ll need a performance model and a workload. For the purposes of this talk, our cartoon model of a CPU will be a pipelined CPU where non-branches take an average of one cycle, unpredicted or mispredicted branches take 20 cycles, and correctly predicted branches take one cycle.

If we look at the most commonly used benchmark of “workstation” integer workloads, SPECint, the composition is maybe 20% branches and 80% other operations. Without branch prediction, we then expect the “average” instruction to take non_branch_pct * 1 + branch_pct * 20 = 0.8 * 1 + 0.2 * 20 = 0.8 + 4 = 4.8 cycles. With perfect, 100% accurate, branch prediction, we’d expect the average instruction to take 0.8 * 1 + 0.2 * 1 = 1 cycle, a 4.8x speedup! Another way to look at it is that if we have a pipeline with a 20-cycle branch misprediction penalty, we have nearly a 5x overhead from our ideal pipelining speedup just from branches alone.
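If you want to play with these numbers, here's a small sketch of the cartoon cost model (a hypothetical helper, not something from the talk) that reproduces the figures used throughout the rest of this post:

def avg_cycles_per_instruction(accuracy, branch_fraction=0.2, penalty=20):
    # Non-branches and correctly predicted branches cost 1 cycle;
    # unpredicted or mispredicted branches cost `penalty` cycles.
    non_branch = 1 - branch_fraction
    predicted = branch_fraction * accuracy
    mispredicted = branch_fraction * (1 - accuracy)
    return (non_branch + predicted) * 1 + mispredicted * penalty

print(round(avg_cycles_per_instruction(0.0), 2))   # 4.8, no prediction
print(round(avg_cycles_per_instruction(0.7), 2))   # 2.14, predict taken
print(round(avg_cycles_per_instruction(0.9), 2))   # 1.38, two-bit
print(round(avg_cycles_per_instruction(1.0), 2))   # 1.0, perfect prediction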

Let’s see what we can do about this. We’ll start with the most naive things someone might do and work our way up to something better.

Predict taken

Instead of predicting randomly, we could look at all branches in the execution of all programs. If we do this, we’ll see that taken and not-taken branches aren’t exactly balanced -- there are substantially more taken branches than not-taken branches. One reason for this is that loop branches are often taken.

If we predict that every branch is taken, we might get 70% accuracy, which means we’ll pay the misprediction cost for 30% of branches, making the cost of an average instruction (0.8 + 0.7 * 0.2) * 1 + 0.3 * 0.2 * 20 = 0.94 + 1.2 = 2.14 cycles. If we compare always predicting taken to no prediction and perfect prediction, always predicting taken gets a large fraction of the benefit of perfect prediction despite being a very simple algorithm.

2.14 cycles per instruction

Backwards taken forwards not taken (BTFNT)

Predicting branches as taken works well for loops, but not so great for all branches. If we look at whether or not branches are taken based on whether or not the branch is forward (skips over code) or backwards (goes back to previous code), we can see that backwards branches are taken more often than forward branches, so we could try a predictor which predicts that backward branches are taken and forward branches aren’t taken (BTFNT). If we implement this scheme in hardware, compiler writers will conspire with us to arrange code such that branches the compiler thinks will be taken will be backwards branches and branches the compiler thinks won’t be taken will be forward branches.

If we do this, we might get something like 80% prediction accuracy, making our cost function (0.8 + 0.8 * 0.2) * 1 + 0.2 * 0.2 * 20 = 0.96 + 0.8 = 1.76 cycles per instruction.

1.76 cycles per instruction

Used by

One-bit

So far, we’ve looked at schemes that don’t store any state, i.e., schemes where the prediction ignores the program’s execution history. These are called static branch prediction schemes in the literature. These schemes have the advantage of being simple but they have the disadvantage of being bad at predicting branches whose behavior changes over time. If you want an example of a branch whose behavior changes over time, you might imagine some code like

if (flag) {
  // things
}

Over the course of the program, we might have one phase of the program where the flag is set and the branch is taken and another phase of the program where flag isn’t set and the branch isn’t taken. There’s no way for a static scheme to make good predictions for a branch like that, so let’s consider dynamic branch prediction schemes, where the prediction can change based on the program history.

One of the simplest things we might do is to make a prediction based on the last result of the branch, i.e., we predict taken if the branch was taken last time and we predict not taken if the branch wasn’t taken last time.

Since having one bit for every possible branch is too many bits to feasibly store, we’ll keep a table of some number of branches we’ve seen and their last results. For this talk, let’s store not taken as 0 and taken as 1.

prediction table with 1-bit entries indexed by low bits of branch address

In this case, just to make things fit on a diagram, we have a 64-entry table, which means that we can index into the table with 6 bits, so we index into the table with the low 6 bits of the branch address. After we execute a branch, we update the entry in the prediction table (highlighted below) and the next time the branch is executed, we index into the same entry and use the updated value for the prediction.

indexed entry changes on update

It’s possible that we’ll observe aliasing, where two branches at different addresses map to the same table entry. This isn’t ideal, but there’s a tradeoff between table speed & cost vs. size that effectively limits the size of the table.
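As a sketch (hypothetical code matching the toy sizes in the diagram, not any real CPU's predictor), the whole scheme is a 64-entry table indexed by the low 6 bits of the branch address:

TABLE_BITS = 6
table = [0] * (1 << TABLE_BITS)   # 64 one-bit entries: 0 = not taken, 1 = taken

def predict(branch_address):
    # Predict whatever this entry saw last time.
    return table[branch_address & ((1 << TABLE_BITS) - 1)] == 1

def update(branch_address, taken):
    # Record the actual outcome of the branch.
    table[branch_address & ((1 << TABLE_BITS) - 1)] = 1 if taken else 0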

If we use a one-bit scheme, we might get 85% accuracy, a cost of (0.8 + 0.85 * 0.2) * 1 + 0.15 * 0.2 * 20 = 0.97 + 0.6 = 1.57 cycles per instruction.

1.57 cycles per instruction

Used by

Two-bit

A one-bit scheme works fine for patterns like TTTTTTTT… or NNNNNNN… but will have a misprediction for a stream of branches that’s mostly taken but has one branch that’s not taken, ...TTTNTTT... This can be fixed by adding a second bit for each address and implementing a saturating counter. Let’s arbitrarily say that we count down when a branch is not taken and count up when it’s taken. If we look at the binary values, we’ll then end up with:

00: predict Not
01: predict Not
10: predict Taken
11: predict Taken

The “saturating” part of saturating counter means that if we count down from 00, instead of underflowing, we stay at 00, and similar for counting up from 11 staying at 11. This scheme is identical to the one-bit scheme, except that each entry in the prediction table is two bits instead of one bit.

same as 1-bit, except that the table has 2 bits

Compared to a one-bit scheme, a two-bit scheme can have half as many entries at the same size/cost (if we only consider the cost of storage and ignore the cost of the logic for the saturating counter), but even so, for most reasonable table sizes a two-bit scheme provides better accuracy.
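In code, the only change from the one-bit sketch above is what's stored in each entry (again, just an illustrative sketch): each entry is a saturating counter in the range 0 to 3, the top bit of the counter is the prediction, and updates saturate instead of wrapping.

def predict_2bit(counter):
    return counter >= 2                # 00/01 predict not taken, 10/11 predict taken

def update_2bit(counter, taken):
    if taken:
        return min(counter + 1, 3)     # count up, saturating at 11
    return max(counter - 1, 0)         # count down, saturating at 00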

Despite being simple, this works quite well, and we might expect to see something like 90% accuracy for a two bit predictor, which gives us a cost of 1.38 cycles per instruction.

1.38 cycles per instruction

One natural thing to do would be to generalize the scheme to an n-bit saturating counter, but it turns out that adding more bits has a relatively small effect on accuracy. We haven’t really discussed the cost of the branch predictor, but going from 2 bits to 3 bits per branch increases the table size by 1.5x for little gain, which makes it not worth the cost in most cases. The simplest and most common things that we won’t predict well with a two-bit scheme are patterns like NTNTNTNTNT... or NNTNNTNNT…, but going to n-bits won’t let us predict those patterns well either!

Used by

Two-level adaptive, global (1991)

If we think about code like

for (int i = 0; i < 3; ++i) {
  // code here.
  }

That code will produce a pattern of branches like TTTNTTTNTTTN....

If we know the last three executions of the branch, we should be able to predict the next execution of the branch:

TTT:N
TTN:T
TNT:T
NTT:T

The previous schemes we’ve considered use the branch address to index into a table that tells us if the branch is, according to recent history, more likely to be taken or not taken. That tells us which direction the branch is biased towards, but it can’t tell us that we’re in the middle of a repetitive pattern. To fix that, we’ll store the history of the most recent branches as well as a table of predictions.

Use global branch history and branch address to index into prediction table

In this example, we concatenate 4 bits of branch history together with 2 bits of branch address to index into the prediction table. As before, the prediction comes from a 2-bit saturating counter. We don’t want to only use the branch history to index into our prediction table since, if we did that, any two branches with the same history would alias to the same table entry. In a real predictor, we’d probably have a larger table and use more bits of branch address, but in order to fit the table on a slide, we have an index that’s only 6 bits long.
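Here's a sketch of that indexing scheme with the toy sizes from the diagram (4 bits of global history, 2 bits of branch address, two-bit counters); as with the earlier sketches, this is illustrative code, not any real implementation:

HISTORY_BITS = 4
ADDRESS_BITS = 2

global_history = 0                                   # last 4 branch outcomes, as bits
table = [0] * (1 << (HISTORY_BITS + ADDRESS_BITS))   # 64 two-bit saturating counters

def index(branch_address):
    # Concatenate 2 bits of branch address with 4 bits of global history.
    addr_bits = branch_address & ((1 << ADDRESS_BITS) - 1)
    return (addr_bits << HISTORY_BITS) | global_history

def predict(branch_address):
    return table[index(branch_address)] >= 2

def update(branch_address, taken):
    global global_history
    i = index(branch_address)                        # index before updating the history
    table[i] = min(table[i] + 1, 3) if taken else max(table[i] - 1, 0)
    # Shift the new outcome into the global history.
    global_history = ((global_history << 1) | int(taken)) & ((1 << HISTORY_BITS) - 1)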

Below, we’ll see what gets updated when we execute a branch.

Update changes index because index uses bits from branch history

The bolded parts are the parts that were updated. In this diagram, we shift new bits of branch history in from right to left, updating the branch history. Because the branch history is updated, the low bits of the index into the prediction table are updated, so the next time we take the same branch again, we’ll use a different entry in the table to make the prediction, unlike in previous schemes where the index is fixed by the branch address. The old entry’s value is updated so that the next time we take the same branch again with the same branch history, we’ll have the updated prediction.

Since the history in this scheme is global, this will correctly predict patterns like NTNTNTNT… in inner loops, but may not always make correct predictions for higher-level branches because the history is global and will be contaminated with information from other branches. However, the tradeoff here is that keeping a global history is cheaper than keeping a table of local histories. Additionally, using a global history lets us correctly predict correlated branches. For example, we might have something like:

if x > 0:
  x -= 1
if y > 0:
  y -= 1
if x * y > 0:
  foo()

If either the first branch or the next branch isn’t taken, then the third branch definitely will not be taken.

With this scheme, we might get 93% accuracy, giving us 1.27 cycles per instruction.

1.27 cycles per instruction

Used by

Two-level adaptive, local (1992)

As mentioned above, an issue with the global history scheme is that the branch history for local branches that could be predicted cleanly gets contaminated by other branches.

One way to get good local predictions is to keep separate branch histories for separate branches.

keep a table of per-branch histories instead of a global history

Instead of keeping a single global history, we keep a table of local histories, indexed by the branch address. This scheme is identical to the global scheme we just looked at, except that we keep multiple branch histories. One way to think about this is that having global history is a special case of local history, where the number of histories we keep track of is 1.

With this scheme, we might get something like 94% accuracy, which gives us a cost of 1.23 cycles per instruction.

1.23 cycles per instruction

Used by

gshare

One tradeoff a global two-level scheme has to make is that, for a prediction table of a fixed size, bits must be dedicated to either the branch history or the branch address. We’d like to give more bits to the branch history because that allows correlations across greater “distance” as well as tracking more complicated patterns and we’d like to give more bits to the branch address to avoid interference between unrelated branches.

We can try to get the best of both worlds by hashing both the branch history and the branch address instead of concatenating them. One of the simplest reasonable things one might do, and the first proposed mechanism, was to xor them together. This two-level adaptive scheme, where we xor the bits together, is called gshare.

hash branch address and branch history instead of appending
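The only change from the concatenating sketch above is the index computation; instead of concatenating history and address bits, we xor them so both can use the full width of the index (hypothetical code again):

INDEX_BITS = 6

def gshare_index(branch_address, global_history):
    # xor the branch address with the global history, then mask down to
    # the table index width.
    mask = (1 << INDEX_BITS) - 1
    return (branch_address ^ global_history) & mask

Everything else (the two-bit counters and the history update) stays the same as in the concatenating scheme.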

With this scheme, we might see something like 94% accuracy. That’s the accuracy we got from the local scheme we just looked at, but gshare avoids having to keep a large table of local histories; getting the same accuracy while having to track less state is a significant improvement.

Used by

agree (1997)

One reason for branch mispredictions is interference between different branches that alias to the same location. There are many ways to reduce interference between branches that alias to the same predictor table entry. In fact, the reason this talk only covers schemes invented in the 90s is that such a wide variety of interference-reducing schemes were proposed that there are too many to cover in half an hour.

We’ll look at one scheme which might give you an idea of what an interference-reducing scheme could look like, the “agree” predictor. When two branch-history pairs collide, the predictions either match or they don’t. If they match, we’ll call that neutral interference and if they don’t, we’ll call that negative interference. The idea is that most branches tend to be strongly biased (that is, if we use two-bit entries in the predictor table, we expect that, without interference, most entries will be 00 or 11 most of the time, not 01 or 10). For each branch in the program, we’ll store one bit, which we call the “bias”. The table of predictions will, instead of storing the absolute branch predictions, store whether or not the prediction matches the bias.

predict whether or not a branch agrees with its bias as opposed to whether or not it's taken

If we look at how this works, the predictor is identical to a gshare predictor, except that we make the changes mentioned above -- the prediction is agree/disagree instead of taken/not-taken and we have a bias bit that’s indexed by the branch address, which gives us something to agree or disagree with. In the original paper, they propose using the first thing you see as the bias and other people have proposed using profile-guided optimization (basically running the program and feeding the data back to the compiler) to determine the bias.

Note that, when we execute a branch and then later come back around to the same branch, we’ll use the same bias bit because the bias is indexed by the branch address, but we’ll use a different predictor table entry because that’s indexed by both the branch address and the branch history.

updating uses the same bias but a different meta-prediction table entry

If it seems weird that this would do anything, let’s look at a concrete example. Say we have two branches, branch A which is taken with 90% probability and branch B which is taken with 10% probability. If those two branches alias and we assume the probabilities that each branch is taken are independent, the probability that they disagree and negatively interfere is P(A taken) * P(B not taken) + P(A not taken) * P(B taken) = (0.9 * 0.9) + (0.1 * 0.1) = 0.82.

If we use the agree scheme (taking A's bias to be taken and B's bias to be not taken), we can re-do the calculation above, but the probability that the two branches disagree and negatively interfere is P(A agree) * P(B disagree) + P(A disagree) * P(B agree) = P(A taken) * P(B taken) + P(A not taken) * P(B not taken) = (0.9 * 0.1) + (0.1 * 0.9) = 0.18. Another way to look at it is, to have destructive interference, one of the branches must disagree with its bias. By definition, if we’ve correctly determined the bias, this cannot be likely to happen.

With this scheme, we might get something like 95% accuracy, giving us 1.19 cycles per instruction.

1.19 cycles per instruction

Used by

Hybrid (1993)

As we’ve seen, local predictors can predict some kinds of branches well (e.g., inner loops) and global predictors can predict some kinds of branches well (e.g., some correlated branches). One way to try to get the best of both worlds is to have both predictors, then have a meta predictor that predicts if the local or the global predictor should be used. A simple way to do this is to have the meta-predictor use the same scheme as the two-bit predictor above, except that instead of predicting taken or not taken, it predicts local predictor or global predictor.

predict which of two predictors is correct instead of predicting if the branch is taken
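Here's one way the meta-predictor might look in code (a hypothetical sketch; the rule of only training the chooser when the two predictors disagree is a common choice, but the description above doesn't pin down the training rule):

meta = [0] * 64   # two-bit counters: 0/1 = use local predictor, 2/3 = use global

def choose(branch_address, local_prediction, global_prediction):
    counter = meta[branch_address & 63]
    return global_prediction if counter >= 2 else local_prediction

def update_chooser(branch_address, taken, local_prediction, global_prediction):
    # Only train the chooser when the two predictors disagree, nudging it
    # toward whichever predictor was correct.
    if local_prediction == global_prediction:
        return
    i = branch_address & 63
    if global_prediction == taken:
        meta[i] = min(meta[i] + 1, 3)
    else:
        meta[i] = max(meta[i] - 1, 0)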

Just as there are many possible interference-reducing schemes, of which the agree predictor above is one, there are many possible hybrid schemes. We could use any two predictors, not just a local and global predictor, and we could even use more than two predictors.

If we use a local and global predictor, we might get something like 96% accuracy, giving us 1.15 cycles per instruction.

1.15 cycles per instruction

Used by

Not covered

There are a lot of things we didn’t cover in this talk! As you might expect, the set of material that we didn’t cover is much larger than what we did cover. I’ll briefly describe a few things we didn’t cover, with references, so you can look them up if you’re interested in learning more.

One major thing we didn’t talk about is how to predict the branch target. Note that this needs to be done even for some unconditional branches (that is, branches that don’t need directional prediction because they’re always taken), since (some) unconditional branches have unknown branch targets.

Branch target prediction is expensive enough that some early CPUs had a branch prediction policy of “always predict not taken” because a branch target isn’t necessary when you predict the branch won’t be taken! Always predicting not taken has poor accuracy, but it’s still better than making no prediction at all.

Among the interference reducing predictors we didn’t discuss are bi-mode, gskew, and YAGS. Very briefly, bi-mode is somewhat like agree in that it tries to separate out branches based on direction, but the mechanism used in bi-mode is that we keep multiple predictor tables and a third predictor based on the branch address is used to predict which predictor table gets used for the particular combination of branch and branch history. Bi-mode appears to be more successful than agree in that it's seen wider use. With gskew, we keep at least three predictor tables and use a different hash to index into each table. The idea is that, even if two branches alias, those two branches will only alias in one of the tables, so we can use a vote and the result from the other two tables will override the potentially bad result from the aliasing table. I don't know how to describe YAGS very briefly :-).

Because we didn't talk about speed (as in latency), a prediction strategy we didn't talk about is to have a small/fast predictor that can be overridden by a slower and more accurate predictor when the slower predictor computes its result.

Some modern CPUs have completely different branch predictors; AMD Zen (2017) and AMD Bulldozer (2011) chips appear to use perceptron based branch predictors. Perceptrons are single-layer neural nets.

It’s been argued that Intel Haswell (2013) uses a variant of a TAGE predictor. TAGE stands for TAgged GEometric history length predictor. If we look at the predictors we’ve covered and look at actual executions of programs to see which branches we’re not predicting correctly, one major class of branches are branches that need a lot of history -- a significant number of branches need tens or hundreds of bits of history and some even need more than a thousand bits of branch history. If we have a single predictor or even a hybrid predictor that combines a few different predictors, it’s counterproductive to keep a thousand bits of history because that will make predictions worse for the branches which need a relatively small amount of history (especially relative to the cost), which is most branches. One of the ideas in the TAGE predictor is that, by keeping a geometric series of history lengths, each branch can use the appropriate history. That explains the GE. The TA part is that branches are tagged, which is a mechanism we don’t discuss that the predictor uses to track which branches should use which set of history.

Modern CPUs often have specialized predictors, e.g., a loop predictor can accurately predict loop branches in cases where a generalized branch predictor couldn’t reasonably store enough history to make perfect predictions for every iteration of the loop.

We didn’t talk at all about the tradeoff between using up more space and getting better predictions. Not only does changing the size of the table change the performance of a predictor, it also changes which predictors are better relative to each other.

We also didn’t talk at all about how different workloads affect different branch predictors. Predictor performance varies not only based on table size but also based on which particular program is run.

We’ve also talked about branch misprediction cost as if it’s a fixed thing, but it is not, and for that matter, the cost of non-branch instructions also varies widely between different workloads.

I tried to avoid introducing non-self-explanatory terminology when possible, so if you read the literature, terminology will be somewhat different.

Conclusion

We’ve looked at a variety of classic branch predictors and very briefly discussed a couple of newer predictors. Some of the classic predictors we discussed are still used in CPUs today, and if this were an hour long talk instead of a half-hour long talk, we could have discussed state-of-the-art predictors. I think that a lot of people have an idea that CPUs are mysterious and hard to understand, but I think that CPUs are actually easier to understand than software. I might be biased because I used to work on CPUs, but I think that this is not a result of my bias but something fundamental.

If you think about the complexity of software, the main limiting factor on complexity is your imagination. If you can imagine something in enough detail that you can write it down, you can make it. Of course there are cases where that’s not the limiting factor and there’s something more practical (e.g., the performance of large scale applications), but I think that most of us spend most of our time writing software where the limiting factor is the ability to create and manage complexity.

Hardware is quite different from this in that there are forces that push back against complexity. Every chunk of hardware you implement costs money, so you want to implement as little hardware as possible. Additionally, performance matters for most hardware (whether that’s absolute performance or performance per dollar or per watt or per other cost), and adding complexity makes hardware slower, which limits performance. Today, you can buy an off-the-shelf CPU for $300 which can be overclocked to 5 GHz. At 5 GHz, one clock cycle is one-fifth of a nanosecond. For reference, light travels roughly one foot in one nanosecond. Another limiting factor is that people get pretty upset when CPUs don’t work perfectly all of the time. Although CPUs do have bugs, the rate of bugs is much lower than in almost all software, i.e., the standard to which they’re verified/tested is much higher. Adding complexity makes things harder to test and verify. Because CPUs are held to a higher correctness standard than most software, adding complexity creates a much higher test/verification burden on CPUs, which makes adding a similar amount of complexity much more expensive in hardware than in software, even ignoring the other factors we discussed.

A side effect of these factors that push back against chip complexity is that, for any particular “high-level” general purpose CPU feature, it is generally conceptually simple enough that it can be described in a half-hour or hour-long talk. CPUs are simpler than many programmers think! BTW, I say “high-level” to rule out things like how transistors work and circuit design, which can require a fair amount of low-level (physics or solid-state) background to understand.

CPU internals series

Thanks to Leah Hanson, Hari Angepat, and Nick Bergson-Shilcock for reviewing practice versions of the talk and to Fred Clausen Jr for finding a typo in this post. Apologies for the somewhat slapdash state of this post -- I wrote it quickly so that people who attended the talk could refer to the “transcript” soon afterwards and look up references, but this means that there are probably more than the usual number of errors and that the organization isn’t as nice as it would be for a normal blog post. In particular, things that were explained using a series of animations in the talk are not explained in the same level of detail and on skimming this, I notice that there’s less explanation of what sorts of branches each predictor doesn’t handle well, and hence less motivation for each predictor. I may try to go back and add more motivation, but I’m unlikely to restructure the post completely and generate a new set of graphics that better convey concepts when there are a couple of still graphics next to text. Thanks to Julien Vivenot, Ralph Corderoy, Vaibhav Sagar, Mindy Preston, and Uri Shaked for catching typos in this hastily written post.

Wed, 23 Aug 2017 00:00:00 +0000


Sattolo's algorithm

I recently had a problem where part of the solution was to do a series of pointer accesses that would walk around a chunk of memory in pseudo-random order. Sattolo's algorithm provides a solution to this because it produces a permutation of a list with exactly one cycle, which guarantees that we will reach every element of the list even though we're traversing it in random order.

However, the explanations of why the algorithm worked that I could find online either used some kind of mathematical machinery (Stirling numbers, assuming familiarity with cycle notation, etc.), or used logic that was hard for me to follow. I find that this is common for explanations of concepts that could, but don't have to, use a lot of mathematical machinery. I don't think there's anything wrong with using existing mathematical methods per se -- it's a nice mental shortcut if you're familiar with the concepts, but I think it's unfortunate that it's hard to find a relatively simple explanation that doesn't require any background. When I was looking for a simple explanation, I also found a lot of people who were using Sattolo's algorithm in places where it wasn't appropriate and also people who didn't know that Sattolo's algorithm is what they were looking for, so here's an attempt at an explanation of why the algorithm works that doesn't assume an undergraduate combinatorics background.

Before we look at Sattolo's algorithm, let's look at Fisher-Yates, which is an in-place algorithm that produces a random permutation of an array/vector, where every possible permutation occurs with uniform probability.

We'll look at the code for Fisher-Yates and then how to prove that the algorithm produces the intended result.

import random

def shuffle(a):
    n = len(a)
    for i in range(n - 1):  # i from 0 to n-2, inclusive.
        j = random.randrange(i, n)  # j from i to n-1, inclusive.
        a[i], a[j] = a[j], a[i]  # swap a[i] and a[j].

shuffle takes an array and produces a permutation of the array, i.e., it shuffles the array. We can think of this loop as placing each element of the array, a, in turn, from a[0] to a[n-2]. On some iteration, i, we choose one of n-i elements to swap with and swap element i with some random element. The last element in the array, a[n-1], is skipped because it would always be swapped with itself. One way to see that this produces every possible permutation with uniform probability is to write down the probability that each element will end up in any particular location1. Another way to do it is to observe two facts about this algorithm:

  1. Every output that Fisher-Yates produces is produced with uniform probability
  2. Fisher-Yates produces as many outputs as there are permutations (and each output is a permutation)

(1) For each random choice we make in the algorithm, if we make a different choice, we get a different output. For example, if we look at the resultant a[0], the only way to place the element that was originally in a[k] (for some k) in the resultant a[0] is to swap a[0] with a[k] in iteration 0. If we choose a different element to swap with, we'll end up with a different resultant a[0]. Once we place a[0] and look at the resultant a[1], the same thing is true of a[1] and so on for each a[i]. Additionally, each choice reduces the range by the same amount -- there's a kind of symmetry, in that although we place a[0] first, we could have placed any other element first; every choice has the same effect. This is vaguely analogous to the reason that you can pick an integer uniformly at random by picking digits uniformly at random, one at a time.

(2) How many different outputs does Fisher-Yates produce? On the first iteration, we fix one of n possible choices for a[0], then given that choice, we fix one of n-1 choices for a[1], then one of n-2 for a[2], and so on, so there are n * (n-1) * (n-2) * ... * 2 * 1 = n! possible different outputs.

This is exactly the same number of possible permutations of n elements, by pretty much the same reasoning. If we want to count the number of possible permutations of n elements, we first pick one of n possible elements for the first position, n-1 for the second position, and so on resulting in n! possible permutations.

Since Fisher-Yates only produces unique permutations and there are exactly as many outputs as there are permutations, Fisher-Yates produces every possible permutation. Since Fisher-Yates produces each output with uniform probability, it produces all possible permutations with uniform probability.
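For a small n, you can check both facts by brute force: enumerate every possible sequence of random choices, apply the corresponding swaps, and verify that the outputs are all distinct and that there are n! of them. A minimal sketch:

import itertools

def fisher_yates_outputs(n):
    # enumerate every possible sequence of choices (one j per iteration) and
    # apply the corresponding swaps to the list 0 1 2 ... n-1
    outputs = []
    choice_ranges = [range(i, n) for i in range(n - 1)]
    for choices in itertools.product(*choice_ranges):
        a = list(range(n))
        for i, j in zip(range(n - 1), choices):
            a[i], a[j] = a[j], a[i]
        outputs.append(tuple(a))
    return outputs

outs = fisher_yates_outputs(4)
assert len(outs) == 24        # 4! = 24 sequences of choices
assert len(set(outs)) == 24   # each sequence yields a distinct permutation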

Now, let's look at Sattolo's algorithm, which is almost identical to Fisher-Yates and also produces a shuffled version of the input, but produces something quite different:

def sattolo(a):
    n = len(a)
    for i in range(n - 1):
        j = random.randrange(i+1, n)  # i+1 instead of i
        a[i], a[j] = a[j], a[i]

Instead of picking an element at random to swap with, like we did in Fisher-Yates, we pick an element at random that is not the element being placed, i.e., we do not allow an element to be swapped with itself. One side effect of this is that no element ends up where it originally started.

Before we talk about why this produces the intended result, let's make sure we're on the same page regarding terminology. One way to look at an array is to view it as a description of a graph where the index indicates the node and the value indicates where the edge points to. For example, if we have the list 0 2 3 1, this can be thought of as a directed graph from its indices to its values, which is a graph with the following edges:

0 -> 0
1 -> 2
2 -> 3
3 -> 1

Node 0 points to itself (because the value at index 0 is 0), node 1 points to node 2 (because the value at index 1 is 2), and so on. If we traverse this graph, we see that there are two cycles. 0 -> 0 -> 0 ... and 1 -> 2 -> 3 -> 1....

Let's say we swap the element in position 0 with some other element. It could be any element, but let's say that we swap it with the element in position 2. Then we'll have the list 3 2 0 1, which can be thought of as the following graph:

0 -> 3
1 -> 2
2 -> 0
3 -> 1

If we traverse this graph, we see the cycle 0 -> 3 -> 1 -> 2 -> 0.... This is an example of a permutation with exactly one cycle.

If we swap two elements that belong to different cycles, we'll merge the two cycles into a single cycle. One way to see this is that when we swap two elements in the list, we're essentially picking up the arrow-heads pointing to each element and swapping where they point (rather than the arrow-tails, which stay put). Tracing the result of this is like tracing a figure-8. Just for example, if we swap 0 with an arbitrary element of the other cycle, say element 2, we'll end up with 3 2 0 1, whose only cycle is 0 -> 3 -> 1 -> 2 -> 0.... Note that this operation is reversible -- if we do the same swap again, we end up with two cycles again. In general, if we swap two elements from the same cycle, we break the cycle into two separate cycles.
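If you want to play with this, a small helper that counts cycles by following a[i] from each unvisited index makes the examples above easy to check -- 0 2 3 1 has two cycles, 3 2 0 1 has one, and swapping two elements from different cycles merges them:

def count_cycles(a):
    # treat a as a graph where node i points to a[i]; count the cycles
    seen = [False] * len(a)
    cycles = 0
    for i in range(len(a)):
        if not seen[i]:
            cycles += 1
            j = i
            while not seen[j]:
                seen[j] = True
                j = a[j]
    return cycles

assert count_cycles([0, 2, 3, 1]) == 2   # 0 -> 0 and 1 -> 2 -> 3 -> 1
assert count_cycles([3, 2, 0, 1]) == 1   # 0 -> 3 -> 1 -> 2 -> 0

a = [0, 2, 3, 1]
a[0], a[2] = a[2], a[0]                  # swap two elements from different cycles
assert a == [3, 2, 0, 1] and count_cycles(a) == 1   # the two cycles merged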

If we feed a list consisting of 0 1 2 ... n-1 to Sattolo's algorithm we'll get a permutation with exactly one cycle. Furthermore, we have the same probability of generating any permutation that has exactly one cycle. Let's look at why Sattolo's algorithm generates exactly one cycle. Afterwards, we'll figure out why it produces all possible cycles with uniform probability.

For Sattolo's algorithm, let's say we start with the list 0 1 2 3 ... n-1, i.e., a list with n cycles of length 1. On each iteration, we do one swap. If we swap elements from two separate cycles, we'll merge the two cycles, reducing the number of cycles by 1. We'll then do n-1 iterations, reducing the number of cycles from n to n - (n-1) = 1.

Now let's see why it's safe to assume we always swap elements from different cycles. In each iteration of the algorithm, we swap some element with index > i with the element at index i and then increment i. Since i gets incremented, the element that gets placed into index i can never be swapped again, i.e., each swap puts one of the two elements that was swapped into its final position, i.e., for each swap, we take two elements that were potentially swappable and render one of them unswappable.

When we start, we have n cycles of length 1, each with 1 element that's swappable. When we swap the initial element with some random element, we'll take one of the swappable elements and render it unswappable, creating a cycle of length 2 with 1 swappable element and leaving us with n-2 other cycles, each with 1 swappable element.

The key invariant that's maintained is that each cycle has exactly 1 swappable element. The invariant holds in the beginning when we have n cycles of length 1. And as long as this is true, every time we merge two cycles of any length, we'll take the swappable element from one cycle and swap it with the swappable element from the other cycle, rendering one of the two elements unswappable and creating a longer cycle that still only has one swappable element, maintaining the invariant.

Since we cannot swap two elements from the same cycle, we merge two cycles with every swap, reducing the number of cycles by 1 with each iteration until we've run n-1 iterations and have exactly one cycle remaining.

To see that we generate each cycle with equal probability, note that there's only one way to produce each output, i.e., changing any particular random choice results in a different output. In the first iteration, we randomly choose one of n-1 placements, then n-2, then n-3, and so on, so we produce any particular single-cycle permutation with probability 1 / ((n-1) * (n-2) * (n-3) * ... * 2 * 1) = 1 / (n-1)!. If we can show that there are (n-1)! permutations with exactly one cycle, then we'll know that we generate every permutation with exactly one cycle with uniform probability.

Let's say we have an arbitrary list of length n that has exactly one cycle and we add a single element. There are n ways to extend it to a cycle of length n+1 because there are n places where we could insert the new element and keep the cycle, which means that the number of single-cycle permutations of length n+1, cycles(n+1), is n * cycles(n).

For example, say we have a cycle that produces the path 0 -> 1 -> 2 -> 0 ... and we want to add a new element, 3. We can substitute -> 3 -> for any -> and get a cycle of length 4 instead of length 3.

In the base case, there's one cycle of length 2, the permutation 1 0 (the other permutation of length two, 0 1, has two cycles of length one instead of having a cycle of length 2), so we know that cycles(2) = 1. If we apply the recurrence above, we get that cycles(n) = (n-1)!, which is exactly the number of different permutations that Sattolo's algorithm generates, which means that we generate all possible permutations with one cycle. Since we know that we generate each cycle with uniform probability, we now know that we generate all possible one-cycle permutations with uniform probability.
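As with Fisher-Yates, both of these claims are easy to check by brute force for a small n: enumerate every possible sequence of choices Sattolo's algorithm could make and verify that there are (n-1)! distinct outputs and that each one is a single cycle. A minimal sketch:

import itertools

def has_one_cycle(a):
    # follow 0 -> a[0] -> a[a[0]] -> ...; a single cycle visits all len(a) elements
    j, steps = a[0], 1
    while j != 0:
        j, steps = a[j], steps + 1
    return steps == len(a)

def sattolo_outputs(n):
    outputs = []
    choice_ranges = [range(i + 1, n) for i in range(n - 1)]
    for choices in itertools.product(*choice_ranges):
        a = list(range(n))
        for i, j in zip(range(n - 1), choices):
            a[i], a[j] = a[j], a[i]
        outputs.append(tuple(a))
    return outputs

outs = sattolo_outputs(4)
assert len(outs) == len(set(outs)) == 6      # (n-1)! = 3! = 6 distinct outputs
assert all(has_one_cycle(o) for o in outs)   # every output is a single cycle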

An alternate way to see that there are (n-1)! permutations with exactly one cycle, is that we rotate each cycle around so that 0 is at the start and write it down as 0 -> i -> j -> k -> .... The number of these is the same as the number of permutations of elements to the right of the 0 ->, which is (n-1)!.

Conclusion

We've looked at two algorithms that are identical, except for a two character change. These algorithms produce quite different results -- one algorithm produces a random permutation and the other produces a random permutation with exactly one cycle. I think these algorithms are neat because they're so simple: just a for loop with a swap inside.

In practice, you probably don't "need" to know how these algorithms work because the standard library for most modern languages will have some way of producing a random shuffle. And if you have a function that will give you a shuffle, you can produce a permutation with exactly one cycle if you don't mind a non-in-place algorithm that takes an extra pass. I'll leave that as an exercise for the reader, but if you want a hint, one way to do it parallels the "alternate" way to see that there are (n-1)! permutations with exactly one cycle.
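If you want to work the exercise out yourself, skip this; otherwise, here's a minimal sketch of one approach along the lines of that hint: shuffle the elements other than 0, treat 0 followed by that shuffled list as the order in which the cycle visits elements, and write down the array that encodes that cycle.

import random

def random_one_cycle(n):
    # visit order for the cycle: 0 first, then the other elements in random order
    rest = list(range(1, n))
    random.shuffle(rest)
    order = [0] + rest
    # build the array so that each element points to the next element in the visit order
    a = [0] * n
    for i in range(n):
        a[order[i]] = order[(i + 1) % n]
    return a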

Although I said that you probably don't need to know this stuff, you do actually need to know it if you're going to implement a custom shuffling algorithm! That may sound obvious, but there's a long history of people implementing incorrect shuffling algorithms. This was common in games and on online gambling sites in the 90s and even the early 2000s and you still see the occasional mis-implemented shuffle, e.g., when Microsoft implemented a bogus shuffle and failed to properly randomize a browser choice poll. At the time, the top Google hit for javascript random array sort was the incorrect algorithm that Microsoft ended up using. That site has been fixed, but you can still find incorrect tutorials floating around online.

Appendix: generating a random derangement

A permutation where no element ends up in its original position is called a derangement. Sattolo's algorithm generates derangements, but it only generates derangements with exactly one cycle, and there are derangements with more than one cycle (e.g., 3 2 1 0), so it can't possibly generate random derangements with uniform probability.

One way to generate random derangements is to generate random shuffles using Fisher-Yates and then retry until we get a derangement:

def is_derangement(a):
    # a derangement leaves no element in its original position
    return all(x != i for i, x in enumerate(a))

def derangement(n):
    assert n != 1, "can't have a derangement of length 1"
    a = list(range(n))
    while not is_derangement(a):
        shuffle(a)
    return a

This algorithm is simple, and is overwhelmingly likely to eventually return a derangement (for n != 1), but it's not immediately obvious how long we should expect this to run before it returns a result. Maybe we'll get a derangement on the first try and run shuffle once, or maybe it will take 100 tries and we'll have to do 100 shuffles before getting a derangement.

To figure this out, we'll want to know the probability that a random permutation (shuffle) is a derangement. To get that, we'll want to know, given a list of length n, how many permutations there are and how many derangements there are.

Since we're deep in the appendix, I'll assume that you know that the number of permutations of n elements is n!, know what binomial coefficients are, and are comfortable with Taylor series.

To count the number of derangements, we can start with the number of permutations, n!, and subtract off the permutations where an element remains in its starting position, (n choose 1) * (n - 1)!. That isn't quite right because it double subtracts permutations where two elements remain in their starting positions, so we'll have to add back (n choose 2) * (n - 2)!. That still isn't quite right because we've now overcounted permutations where three elements remain in their starting positions, so we'll have to subtract those, and so on and so forth, resulting in ∑ (−1)ᵏ (n choose k)(n−k)!. If we expand this out, divide by n!, and cancel things out, we get ∑ (−1)ᵏ (1 / k!). If we look at the limit as the number of elements goes to infinity, this is just the Taylor series for e^x where x = -1, i.e., 1/e. In other words, in the limit, we expect the fraction of permutations that are derangements to be 1/e, so we expect to have to do e times as many shuffles to generate a derangement as we do to generate a random permutation. Like many alternating series, this series converges quickly; it gets within 7 significant figures of 1/e by k = 10.
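If you want to check this numerically, here's a quick sketch that computes the truncated series and compares it to 1/e, and also estimates the derangement fraction by sampling random shuffles:

import math
import random

def truncated_series(k_max):
    # partial sum of (-1)^k / k! -- the derangement fraction for large n
    return sum((-1) ** k / math.factorial(k) for k in range(k_max + 1))

def sampled_derangement_fraction(n, trials=100000):
    # fraction of uniform random shuffles of n elements that are derangements
    hits = 0
    for _ in range(trials):
        a = list(range(n))
        random.shuffle(a)
        if all(x != i for i, x in enumerate(a)):
            hits += 1
    return hits / trials

print(truncated_series(10), 1 / math.e)   # agree to about 7 significant figures
print(sampled_derangement_fraction(20))   # close to 1/e ~= 0.368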

One silly thing about our algorithm is that, if we place the first element in the first location, we already know that we don't have a derangement, but we continue placing elements until we've created an entire permutation. If we reject illegal placements, we can do even better than a factor of e overhead. It's also possible to come up with a non-rejection based algorithm, but I really enjoy the naive rejection based algorithm because I find it delightful when basic randomized algorithms that consist of "keep trying again" work well.

Appendix: wikipedia's explanation of Sattolo's algorithm

I wrote this explanation because I found the explanation in Wikipedia relatively hard to follow, but if you find the explanation above difficult to understand, maybe you'll prefer wikipedia's version:

The fact that Sattolo's algorithm always produces a cycle of length n can be shown by induction. Assume by induction that after the initial iteration of the loop, the remaining iterations permute the first n - 1 elements according to a cycle of length n - 1 (those remaining iterations are just Sattolo's algorithm applied to those first n - 1 elements). This means that tracing the initial element to its new position p, then the element originally at position p to its new position, and so forth, one only gets back to the initial position after having visited all other positions. Suppose the initial iteration swapped the final element with the one at (non-final) position k, and that the subsequent permutation of first n - 1 elements then moved it to position l; we compare the permutation π of all n elements with that remaining permutation σ of the first n - 1 elements. Tracing successive positions as just mentioned, there is no difference between σ and π until arriving at position k. But then, under π the element originally at position k is moved to the final position rather than to position l, and the element originally at the final position is moved to position l. From there on, the sequence of positions for π again follows the sequence for σ, and all positions will have been visited before getting back to the initial position, as required.

As for the equal probability of the permutations, it suffices to observe that the modified algorithm involves (n-1)! distinct possible sequences of random numbers produced, each of which clearly produces a different permutation, and each of which occurs--assuming the random number source is unbiased--with equal probability. The (n-1)! different permutations so produced precisely exhaust the set of cycles of length n: each such cycle has a unique cycle notation with the value n in the final position, which allows for (n-1)! permutations of the remaining values to fill the other positions of the cycle notation

Thanks to Mathieu Guay-Paquet, Leah Hanson, Rudi Chen, Kamal Marhubi, Michael Robert Arntzenius, Heath Borders, Shreevatsa R, and David Turner for comments/corrections/discussion.


  1. a[0] is placed on the first iteration of the loop. Assuming randrange generates integers with uniform probability in the appropriate range, the original a[0] has 1/n probability of being swapped with any element (including itself), so the resultant a[0] has a 1/n chance of being any element from the original a, which is what we want.

a[1] is placed on the second iteration of the loop. At this point, a[0] is some element from the array before it was mutated. Let's call the unmutated array original. a[0] is original[k], for some k, and for any particular value of k, a[0] contains original[k] with probability 1/n. We then swap a[1] with some element from the index range [1, n-1].

    If we want to figure out the probability that a[1] is some particular element from original, we might think of this as follows: a[0] is original[k_0] for some k_0. a[1] then becomes original[k_1] for some k_1 where k_1 != k_0. Since k_0 was chosen uniformly at random, if we integrate over all k_0, k_1 is also uniformly random.

Another way to look at this is that it's arbitrary that we place a[0] and choose k_0 before we place a[1] and choose k_1. We could just as easily have placed a[1] and chosen k_1 first, so, over all possible choices, the choice of k_0 cannot bias the choice of k_1.


Wed, 09 Aug 2017 00:00:00 +0000


Terminal latency

There’s a great MSR demo from 2012 that shows the effect of latency on the experience of using a tablet. If you don’t want to watch the three minute video, they basically created a device which could simulate arbitrary latencies down to a fraction of a millisecond. At 100ms (1/10th of a second), which is typical of consumer tablets, the experience is terrible. At 10ms (1/100th of a second), the latency is noticeable, but the experience is ok, and at < 1ms the experience is great, as good as pen and paper. If you want to see a mini version of this for yourself, you can try a random Android tablet with a stylus vs. the current generation iPad Pro with the Apple stylus. The Apple device has well above 10ms end-to-end latency, but the difference is still quite dramatic -- it’s enough that I’ll actually use the new iPad Pro to take notes or draw diagrams, whereas I find Android tablets unbearable as a pen-and-paper replacement.

You can also see something similar if you try VR headsets with different latencies. 20ms feels fine, 50ms feels laggy, and 150ms feels unbearable.

Curiously, I rarely hear complaints about keyboard and mouse input being slow. One reason might be that keyboard and mouse input are quick and that inputs are reflected nearly instantaneously, but I don’t think that’s true. People often tell me that’s true, but I think it’s just the opposite. The idea that computers respond quickly to input, so quickly that humans can’t notice the latency, is the most common performance-related fallacy I hear from professional programmers.

When people measure actual end-to-end latency for games on normal computer setups, they usually find latencies in the 100ms range.

If we look at Robert Menzel’s breakdown of the end-to-end pipeline for a game, it’s not hard to see why we expect to see 100+ ms of latency:

Note that this assumes a gaming mouse and a pretty decent LCD; it’s common to see substantially slower latency for the mouse and for pixel switching.

It’s possible to tune things to get into the 40ms range, but the vast majority of users don’t do that kind of tuning, and even if they do, that’s still quite far from the 10ms to 20ms range, where tablets and VR start to feel really “right”.

Keypress-to-display measurements are mostly done in games because gamers care more about latency than most people, but I don’t think that most applications are all that different from games in terms of latency. While games often do much more work per frame than “typical” applications, they’re also much better optimized than “typical” applications. Menzel budgets 33ms to the game, half for game logic and half for rendering. How much time do non-game applications take? Pavel Fatin measured this for text editors and found latencies ranging from a few milliseconds to hundreds of milliseconds. He did this with an app he wrote, which uses java.awt.Robot to generate keypresses and take screen captures, and which we can use to measure the latency of other applications.

Personally, I’d like to see the latency of different terminals and shells for a couple of reasons. First, I spend most of my time in a terminal and usually do editing in a terminal, so the latency I see is at least the latency of the terminal. Second, the most common terminal benchmark I see cited (by at least two orders of magnitude) is the rate at which a terminal can display output, often measured by running cat on a large file. This is pretty much as useless a benchmark as I can think of. I can’t recall the last task I did which was limited by the speed at which I can cat a file to stdout on my terminal (well, unless I’m using eshell in emacs), nor can I think of any task for which that sub-measurement is useful. The closest thing that I care about is the speed at which I can ^C a command when I’ve accidentally output too much to stdout, but as we’ll see when we look at actual measurements, a terminal’s ability to absorb a lot of input to stdout is only weakly related to its responsiveness to ^C. The speed at which I can scroll up or down an entire page sounds related, but in actual measurements the two are not highly correlated (e.g., emacs-eshell is quick at scrolling but extremely slow at sinking stdout). Another thing I care about is latency, but knowing that a particular terminal has high stdout throughput tells me little to nothing about its latency.

Let’s look at some different terminals to see if any terminals add enough latency that we’d expect the difference to be noticeable. If we measure the latency from keypress to internal screen capture on my laptop, we see the following latencies for different terminals

Plot of terminal tail latency: idle (left) and loaded (right)

These graphs show the distribution of latencies for each terminal. The y-axis has the latency in milliseconds. The x-axis is the percentile (e.g., 50 represents the 50th-percentile keypress, i.e., the median keypress). Measurements are on macOS unless otherwise stated. The graph on the left is when the machine is idle, and the graph on the right is under load. If we just look at median latencies, some setups don’t look too bad -- terminal.app and emacs-eshell are at roughly 5ms unloaded, small enough that many people wouldn’t notice. But most terminals (st, alacritty, hyper, and iterm2) are in the range where you might expect people to notice the additional latency even when the machine is idle. If we look at the tail when the machine is idle, say the 99.9%-ile latency, every terminal gets into the range where the additional latency ought to be perceptible, according to studies on user interaction. For reference, the internally generated keypress to GPU memory trip for some terminals is slower than the time it takes to send a packet from Boston to Seattle and back, about 70ms.

All measurements were done with input only happening on one terminal at a time, with full battery and running off of A/C power. The loaded measurements were done while compiling Rust (as before, with full battery and running off of A/C power, and in order to make the measurements reproducible, each measurement started 15s after a clean build of Rust after downloading all dependencies, with enough time between runs to avoid thermal throttling interference across runs).

If we look at median loaded latencies, other than emacs-term, most terminals don’t do much worse than at idle. But as we look at tail measurements, like 90%-ile or 99.9%-ile measurements, every terminal gets much slower. Switching between macOS and Linux makes some difference, but the difference is different for different terminals.

These measurements aren't anywhere near the worst case (if we run off of battery when the battery is low, and wait 10 minutes into the compile in order to exacerbate thermal throttling, it’s easy to see latencies that are multiple hundreds of ms) but even so, every terminal has tail latency that should be observable. Also, recall that this is only a fraction of the total end-to-end latency.

Why don’t people complain about keyboard-to-display latency the way they complain about stylus-to-display latency or VR latency? My theory is that, for both VR and tablets, people have a lot of experience with a much lower latency application. For tablets, the “application” is pen-and-paper, and for VR, the “application” is turning your head without a VR headset on. But input-to-display latency is so bad for every application that most people just expect terrible latency.

An alternate theory might be that keyboard and mouse input are fundamentally different from tablet input in a way that makes latency less noticeable. Even without any data, I’d find that implausible because, when I access a remote terminal in a way that adds tens of milliseconds of extra latency, I find typing to be noticeably laggy. And it turns out that when extra latency is A/B tested, people can and do notice latency in the range we’re discussing here.

Just so we can compare the most commonly used benchmark (throughput of stdout) to latency, let’s measure how quickly different terminals can sink input on stdout:

terminal            stdout  idle50  load50  idle99.9  load99.9  mem    ^C
                    (MB/s)  (ms)    (ms)    (ms)      (ms)      (MB)
alacritty           39      31      28      36        56        18     ok
terminal.app        20      6       13      25        30        45     ok
st                  14      25      27      63        111       2      ok
alacritty tmux      14
terminal.app tmux   13
iterm2              11      44      45      60        81        24     ok
hyper               11      32      31      49        53        178    fail
emacs-eshell        0.05    5       13      17        32        30     fail
emacs-term          0.03    13      30      28        49        30     ok

The relationship between the rate that a terminal can sink stdout and its latency is non-obvious. For that matter, the relationship between the rate at which a terminal can sink stdout and how fast it looks is non-obvious. During this test, terminal.app looked very slow. The text that scrolls by jumps a lot, as if the screen is rarely updating. Also, hyper and emacs-term both had problems with this test. Emacs-term can’t really keep up with the output and it takes a few seconds for the display to finish updating after the test is complete (the status bar that shows how many lines have been output appears to be up to date, so it finishes incrementing before the test finishes). Hyper falls further behind and pretty much doesn’t update the screen after flickering a couple of times. The Hyper Helper process gets pegged at 100% CPU for about two minutes and the terminal is totally unresponsive for that entire time.

Alacritty was tested with tmux because alacritty doesn’t support scrolling back up, and the docs indicate that you should use tmux if you want to be able to scroll up. Just to have another reference, terminal.app was also tested with tmux. For most terminals, tmux doesn’t appear to reduce stdout speed, but alacritty and terminal.app are fast enough that they’re actually limited by the speed of tmux.

Emacs-eshell is technically not a terminal, but I also tested eshell because it can be used as a terminal alternative for some use cases. Emacs, with both eshell and term, is actually slow enough that I care about the speed at which it can sink stdout. When I’ve used eshell or term in the past, I find that I sometimes have to wait for a few thousand lines of text to scroll by if I run a command with verbose logging to stdout or stderr. Since that happens very rarely, it’s not really a big deal to me unless it’s so slow that I end up waiting half a second or a second when it happens, and no other terminal is slow enough for that to matter.

Conversely, I type individual characters often enough that I’ll notice tail latency. Say I type at 120wpm and that results in 600 characters per minute, or 10 characters per second of input. Then I’d expect to see the 99.9% tail (1 in 1000) every 100 seconds!
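That arithmetic is easy to sanity check (using the standard five-characters-per-word convention for wpm, an assumption not spelled out above):

wpm = 120
cps = wpm * 5 / 60      # 600 characters per minute = 10 characters per second
print(1000 / cps)       # 100 seconds between 99.9%-ile (1 in 1000) keypresses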

Anyway, the cat “benchmark” that I care about more is whether or not I can ^C a process when I’ve accidentally run a command that outputs millions of lines to the screen instead of thousands of lines. For that benchmark, every terminal is fine except for hyper and emacs-eshell, both of which hung for at least ten minutes (I killed each process after ten minutes, rather than waiting for the terminal to catch up).

Memory usage at startup is also included in the table for reference because that's the other measurement I see people benchmark terminals with. While I think that it's a bit absurd that terminals can use 40MB at startup, even the three year old hand-me-down laptop I'm using has 16GB of RAM, so squeezing that 40MB down to 2MB doesn't have any appreciable effect on user experience. Heck, even the $300 chromebook we recently got has 16GB of RAM.

Conclusion

Most terminals have enough latency that the user experience could be improved if the terminals concentrated more on latency and less on other features or other aspects of performance. However, when I search for terminal benchmarks, I find that terminal authors, if they benchmark anything, benchmark the speed of sinking stdout or memory usage at startup. This is unfortunate because most “low performance” terminals can already sink stdout many orders of magnitude faster than humans can keep up with, so further optimizing stdout throughput has a relatively small impact on actual user experience for most users. Likewise for reducing memory usage when an idle terminal uses 0.01% of the memory on my old and now quite low-end laptop.

If you work on a terminal, perhaps consider relatively more latency and interactivity (e.g., responsiveness to ^C) optimization and relatively less throughput and idle memory usage optimization.

Update: In response to this post, the author of alacritty explains where alacritty's latency comes from and describes how alacritty could reduce its latency.

Appendix: negative results

Tmux and latency: I tried tmux and various terminals and found that the differences were within the range of measurement noise.

Shells and latency: I tried a number of shells and found that, even in the quickest terminal, the difference between shells was within the range of measurement noise. Powershell was somewhat problematic to test with the setup I was using because it doesn’t handle colors correctly (the first character typed shows up with the color specified by the terminal, but other characters are yellow regardless of setting, which appears to be an open issue), which confused the image recognition setup I used. Powershell also doesn’t consistently put the cursor where it should be -- it jumps around randomly within a line, which also confused the image recognition setup I used. However, despite its other problems, powershell had comparable performance to other shells.

Shells and stdout throughput: As above, the speed difference between different shells was within the range of measurement noise.

Single-line vs. multiline text and throughput: Although some text editors bog down with extremely long lines, throughput was similar when I shoved a large file into a terminal whether the file was all one line or was line broken every 80 characters.

Head of line blocking / coordinated omission: I ran these tests with input at a rate of 10.3 characters per second. It turns out this doesn't matter much at input rates that humans are capable of -- the latencies are quite similar to doing input once every 10.3 seconds. It's possible to overwhelm a terminal, and hyper is the first to start falling over at high input rates, but the speed necessary to make the tail latency worse is beyond the rate at which any human I know of can type.

Appendix: experimental setup

All tests were done on a dual core 2.6GHz 13” Mid-2014 MacBook Pro. The machine has 16GB of RAM and a 2560x1600 screen. The OS X version was 10.12.5. Some tests were done in Linux (Lubuntu 16.04) to get a comparison between macOS and Linux. 10k keypresses were used for each latency measurement.

Latency measurements were done with the . key and throughput was done with default base32 output, which is all plain ASCII text. George King notes that different kinds of text can change output speed:

I’ve noticed that Terminal.app slows dramatically when outputting non-latin unicode ranges. I’m aware of three things that might cause this: having to load different font pages, and having to parse code points outside of the BMP, and wide characters.

The first probably boils down to a very complicated mix of lazy loading of font glyphs, font fallback calculations, and caching of the glyph pages or however that works.

The second is a bit speculative, but I would bet that Terminal.app uses Cocoa’s UTF16-based NSString, which almost certainly hits a slow path when code points are above the BMP due to surrogate pairs.

Terminals were fullscreened before running tests. This affects test results, and resizing the terminal windows can and does significantly change performance (e.g., it’s possible to get hyper to be slower than iterm2 by changing the window size while holding everything else constant). st on macOS was running as an X client under XQuartz. To see if XQuartz is inherently slow, I tried runes, another "native" Linux terminal that uses XQuartz; runes had much better tail latency than st and iterm2.

The “idle” latency tests were done on a freshly rebooted machine. All terminals were running, but input was only fed to one terminal at a time.

The “loaded” latency tests were done with rust compiling in the background, 15s after the compilation started.

Terminal bandwidth tests were done by creating a large, pseudo-random, text file with

timeout 64 sh -c 'cat /dev/urandom | base32 > junk.txt'

and then running

timeout 8 sh -c 'cat junk.txt | tee junk.term_name'

Terminator and urxvt weren’t tested because they weren’t completely trivial to install on mac and I didn’t want to futz around to make them work. Terminator was easy to build from source, but it hung on startup and didn’t get to a shell prompt. Urxvt installed through brew, but one of its dependencies (also installed through brew) was the wrong version, which prevented it from starting.

Thanks to Kamal Marhubi, Leah Hanson, Wesley Aptekar-Cassels, David Albert, Vaibhav Sagar, Indradhanush Gupta, Rudi Chen, Laura Lindzey, Ahmad Jarara, George King, Tim Dierks, Nikith Naide, Veit Heller, and Nick Bergson-Shilcock for comments/corrections/discussion.

Tue, 18 Jul 2017 00:00:00 +0000


Keyboard v. mouse

Which is faster, keyboard or mouse? A large number of programmers believe that the keyboard is faster for all (programming-related) tasks. However, there are a few widely cited webpages that claim that “studies” show that using the mouse is faster than using the keyboard for everything and that people who think that using the keyboard is faster are just deluding themselves. This might sound extreme, but, just for example, one page says that the author has “never seen [the keyboard] outperform the mouse”.

But it can’t be the case that the mouse is faster for everything -- almost no one is faster at clicking on an on-screen keyboard with a mouse than typing at a physical keyboard. Conversely, there are tasks for which mice are much better suited than keyboards (e.g., aiming in FPS games). For someone without an agenda, the question shouldn’t be, which is faster at all tasks, but which tasks are faster with a keyboard, which are faster with a mouse, and which are faster when both are used?

You might ask if any of this matters. It depends! One of the best programmers I know is a hunt-and-peck typist, so it's clearly possible to be a great programmer without having particularly quick input speed. But I'm in the middle of an easy data munging task where I'm limited by the speed at which I can type in a large amount of boring code. If I were quicker, this task would be quicker, and there are tasks that I don't do that I might do. I can type at > 100 wpm, which isn't bad, but I can talk at > 400 wpm and I can think much faster than I can talk. I'm often rate limited even when talking; typing is much worse and the half-a-second here and one-second there I spend on navigation certainly doesn't help. When I first got started in tech, I had a mundane test/verification/QA role where my primary job was to triage test failures. Even before I started automating tasks, I could triage nearly twice as many bugs per day as other folks in the same role because I took being efficient at basic navigation tasks seriously. Nowadays, my jobs aren't 90% rote anymore, but my guess is that about a third of the time I spend in front of a computer is spent on mindless tasks that are rate-limited by my input and navigation speed. If I could get faster at those mundane tasks and have to spend less time on them and more time doing things that are fun, that would be great.

Anyway, to start, let’s look at the cited studies to see where the mouse is really faster. Most references on the web, when followed all the way back, point to AskTog, a site by Bruce Tognazzini, who describes himself as a “recognized leader in human/computer interaction design”.

The most cited AskTog page on the topic claims that they've spent $50M on R&D and done all kinds of studies; the page claims that, among other things, the $50M in R&D showed “Test subjects consistently report that keyboarding is faster than mousing” and “The stopwatch consistently proves mousing is faster than keyboarding.” The claim is that this both proves that the mouse is faster than the keyboard, and explains why programmers think the keyboard is faster than the mouse even though it’s slower. However, the result is unreproducible because “Tog” not only doesn’t cite the details of the experiments, Tog doesn’t even describe the experiments and just makes a blanket claim.

The second widely cited AskTog page is in response to a response to the previous page, and it simply repeats that the first page showed that keyboard shortcuts are slower. While there’s a lot of sarcasm, like “Perhaps we have all been misled these years. Perhaps the independent studies that show over and over again that Macintosh users are more productive, can learn quicker, buy more software packages, etc., etc., etc., are somehow all flawed. Perhaps....” no actual results are cited, as before. There is, however, a pseudo-scientific explanation of why the mouse is faster than the keyboard:

Command Keys Aren’t Faster. As you know from my August column, it takes just as long to decide upon a command key as it does to access the mouse. The difference is that the command-key decision is a high-level cognitive function of which there is no long-term memory generated. Therefore, subjectively, keys seem faster when in fact they usually take just as long to use.

Since mouse acquisition is a low-level cognitive function, the user need not abandon cognitive process on the primary task during the acquisition period. Therefore, the mouse acquirer achieves greater productivity.

One question this raises is, why should typing on the keyboard be any different from using command keys? There certainly are people who aren’t fluent at touch typing who have to think about which key they’re going to press when they type. Those people are very slow typists, perhaps even slower than someone who’s quick at using the mouse to type via an on screen keyboard. But there are also people who are fluent with the keyboard and can type without consciously thinking about which keys they’re going to press. The implicit claim here is that it’s not possible to be fluent with command keys in the same way it’s possible to be fluent with the keyboard for typing. It’s possible that’s true, but I find the claim to be highly implausible, both in principle, and from having observed people who certainly seem to be fluent with command keys, and the claim has no supporting evidence.

The third widely cited AskTog page cites a single experiment, where the author typed a paragraph and then had to replace every “e” with a “|”, either using cursor keys or the mouse. The author found that the average time for using cursor keys was 99.43 seconds and the average time for the mouse was 50.22 seconds. No information about the length of the paragraph or the number of “e”s was given. The third page was in response to a user who cited specific editing examples where they found that they were faster with a keyboard than with a mouse.

My experience with benchmarking is that the vast majority of microbenchmarks have wrong or misleading results because they’re difficult to set up properly, and even when set up properly, understanding how the microbenchmark results relate to real-world results requires a deep understanding of the domain. As a result, I’m deeply skeptical of broad claims that come from microbenchmarks unless the author has a demonstrated, deep, understanding of benchmarking their particular domain, and even then I’ll ask why they believe their result generalizes. The opinion that microbenchmarks are very difficult to interpret properly is widely shared among people who understand benchmarking.

The e -> | replacement task described is not only a microbenchmark, it's a bizarrely artificial microbenchmark.

Based on the times given in the result, the task was either for very naive users, or disallowed any kind of search and replace functionality. This particular AskTog column is in response to a programmer who mentioned editing tasks, so the microbenchmark is meaningless unless that programmer is trapped in an experiment where they’re not allowed to use their editor’s basic functionality. Moreover, the replacement task itself is unrealistic -- how often do people replace e with |?

I timed this task with the bizarre no-search-and-replace restriction removed and got the following results:

The first result was from using a keyboard shortcut. The second result is something I might do if I were in someone else’s emacs setup, which has different keyboard shortcuts mapped; emacs lets you run a command by hitting “M-x” and typing the entire name of the command. That’s much slower than using a keyboard shortcut directly, but still faster than using the mouse (at least for me, here). Does this mean that keyboards are great and mice are terrible? No, the result is nearly totally meaningless because I spend almost none of my time doing single-character search-and-replace, making the speed of single-character search-and-replace irrelevant.

Also, since I’m used to using the keyboard, the mouse speed here is probably unusually slow. That’s doubly true here because my normal editor setup (emacs -nw) doesn’t allow for mouse usage, so I ended up using an unfamiliar editor, TextEdit, for the mouse test. I did each task once in order to avoid “practicing” the exact task, which could unrealistically make the keyboard-shortcut version nearly instantaneous because it’s easy to hit a practiced sequence of keys very quickly. However, this meant that I was using an unfamiliar mouse in an unfamiliar set of menus for the mouse. Furthermore, like many people who’ve played FPS games in the distant past, I’m used to having “mouse acceleration” turned off, but the Mac has this on by default and I didn’t go through the rigmarole necessary to disable mouse acceleration. Additionally, the recording program I used (QuickTime) made the entire machine laggy, which probably affects mousing speed more than keyboard speed, and the menu setup for the program I happened to use forced me to navigate through two levels of menus.

That being said, despite not being used to the mouse, if I want to find a microbenchmark where I’m faster with the mouse than with the keyboard, that’s easy: let me try selecting a block of text that’s on the screen but not near my cursor:

I tend to do selection of blocks in emacs by searching for something at the start of the block, setting a mark, and then searching for something at the end of the block. I typically type three characters to make sure that I get a unique chunk of text (and I’ll type more if it’s text where I don’t think three characters will cut it). This makes the selection task somewhat slower than the replacement task because the replacement task used single characters and this task used multiple characters.

The mouse is so much better suited for selecting a block of text that even with an unfamiliar mouse setup where I end up having to make a correction instead of being able to do the selection in one motion, the mouse is over twice as fast. But, if I wanted select something that was off screen and the selection was so large that it wouldn’t fit on one screen, the keyboard time wouldn’t change and the mouse time would get much slower, making the keyboard faster.

In addition to doing the measurements, I also (informally) polled people to ask if they thought the keyboard or the mouse would be faster for specific tasks. Both search-and-replace and select-text are tasks where the result was obvious to most people. But not all tasks are obvious; scrolling was one where people didn’t have strong opinions one way or another. Let’s look at scrolling, which is a task both the keyboard and the mouse are well suited for. To have something concrete, let’s look at scrolling down 4 pages:

There’s some difference, and I suspect that if I repeated the experiment enough times I could get a statistically significant result, but the difference is small enough that it isn’t of practical significance.

Contra Tog’s result, which was that everyone believes the keyboard is faster even though the mouse is faster, I find that people are pretty good at estimating which device is faster for which tasks and also at estimating when both devices will give a similar result. One possible reason is that I’m polling programmers, and in particular, programmers at RC, who are probably a different population than whoever Tog might’ve studied in his studies. He was in a group that was looking at how to design the UI for a general purpose computer in the 80s, when it would have actually been unreasonable to focus on studying people like the ones I polled, many of whom grew up using computers and then chose a career where they use computers all day. The equivalent population would’ve had to start using computers in the 60s or even earlier, but even if they had, input devices were quite different (the ball mouse wasn’t invented until 1972, and it certainly wasn’t in wide use the moment it was invented). There’s nothing wrong with studying populations who aren’t relatively expert at using computer input devices, but there is something wrong with generalizing those results to people who are relatively expert.

Unlike claims by either keyboard or mouse advocates, when I do experiments myself, the results are mixed. Some tasks are substantially faster if I use the keyboard and some are substantially faster if I use the mouse. Moreover, most of the results are easily predictable (when the results are similar, the prediction is that it would be hard to predict). If we look at the most widely cited, authoritative, results on the web, we find that they make very strong claims that the mouse is much faster than the keyboard but back up the claim with nothing but a single, bogus, experiment. It’s possible that some of the vaunted $50M in R&D went into valid experiments, but those experiments, if they exist, aren’t cited.

I spent some time reviewing the literature on the subject, but couldn’t find anything conclusive. Rather than do a point-by-point summary of each study (like I did here for another controversial topic), I’ll mention the high-level issues that make the studies irrelevant to me. All studies I could find had at least one of the issues listed below; if you have a link to a study that isn’t irrelevant for one of the following reasons, I’d love to hear about it!

  1. Age of study: it’s unclear how a study on interacting with computers from the mid-80s transfers to how people interact with computers today. Even ignoring differences in editing programs, there are large differences in the interface. Mice are more precise and a decent modern optical mouse can be moved as fast as a human can move it without the tracking becoming erratic, something that isn’t true of any mouse I’ve tried from the 80s and was only true of high quality mice from the 90s when the balls were recently cleaned and the mouse was on a decent quality mousepad. Keyboards haven’t improved as much, but even so, I can type substantially faster on a modern, low-travel keyboard than on any keyboard I’ve tried from the 80s.
  2. Narrow microbenchmarking: not all of these are as irrelevant as the e -> | without search and replace task, but even in the case of tasks that aren’t obviously irrelevant, it’s not clear what the impact of the result is on actual work I might do.
  3. Not keyboard vs. mouse: a tiny fraction of published studies are on keyboard vs. mouse interaction. When a study is on device interaction, it’s often about some new kind of device or a new interaction model.
  4. Vague description: a lot of studies will say something like they found a 7.8% improvement, with results being significant with p < 0.005, without providing enough information to tell if the results are actually significant or merely statistically significant (recall that the practically insignificant scrolling result was a 0.08s difference, which could also be reported as a 16.3% improvement).
  5. Unskilled users: in one, typical, paper, they note that it can take users as long as two seconds to move the mouse from one side of the screen to a scrollbar on the other side of the screen. While there’s something to be said for doing studies on unskilled users in order to figure out what sorts of interfaces are easiest for users who have the hardest time, a study on users who take 2 seconds to get their mouse onto the scrollbar doesn’t appear to be relevant to my user experience. When I timed this for myself, it took 0.21s to get to the scrollbar from the other side of the screen and scroll a short distance, despite using an unfamiliar mouse with different sensitivity than I’m used to and running a recording program which made mousing more difficult than usual.
  6. Seemingly unreasonable results: some studies claim to show large improvements in overall productivity when switching from one type of device to another (e.g., a 20% total productivity gain from switching types of mice).

Conclusion

It’s entirely possible that the mysterious studies Tog’s org spent $50M on prove that the mouse is faster than the keyboard for all tasks other than raw text input, but there doesn’t appear to be enough information to tell what the actual studies were. There are many public studies on user input, but I couldn’t find any that are relevant to whether or not I should use the mouse more or less at the margin.

When I look at various tasks myself, the results are mixed, and they’re mixed in the way that most programmers I polled predicted. This result is so boring that it would barely be worth mentioning if not for the large groups of people who believe that either the keyboard is always faster than the mouse or vice versa.

Please let me know if there are relevant studies on this topic that I should read! I’m not familiar with the relevant fields, so it’s possible that I’m searching with the wrong keywords and reading the wrong papers.

Appendix: note to self

I didn't realize that scrolling was so fast relative to searching (not explicitly mentioned in the blog post, but 1/2 of the text selection task). I tend to use search to scroll to things that are offscreen, but it appears that I should consider scrolling instead when I don't want to drop my cursor in a specific position.

Thanks to Leah Hanson, Quentin Pradet, Alex Wilson, and Gaxun for comments/corrections on this post and to Annie Cherkaev, Chris Ball, Stefan Lesser, and David Isaac Lee for related discussion.

Tue, 13 Jun 2017 00:00:00 +0000


Options v. cash

I often talk to startups that claim that their compensation package has a higher expected value than the equivalent package at a place like Facebook, Google, Twitter, or Snapchat. One thing I don’t understand about this claim is, if the claim is true, why shouldn’t the startup go to an investor, sell their options for what they claim their options to be worth, and then pay me in cash? The non-obvious value of options combined with their volatility is a barrier for recruiting.

Additionally, given my risk function and the risk function of VCs, this appears to be a better deal for everyone. Like most people, extra income gives me diminishing utility, but VCs have an arguably nearly linear utility in income. Moreover, even if VCs shared my risk function, because VCs hold a diversified portfolio of investments, the same options would be worth more to them than they are to me because they can diversify away downside risk much more effectively than I can. If these startups are making a true claim about the value of their options, there should be a trade here that makes all parties better off.

In a classic series of essays written a decade ago, seemingly aimed at convincing people to either found or join startups, Paul Graham stated, “Risk and reward are always proportionate.” This assertion is used to back the claim that people can make more money, in expectation, by joining startups and taking risky equity packages than they can by taking jobs that pay cash or cash plus public equity. However, the premise -- that risk and reward are always proportionate -- isn’t true in the general case. Only assets whose risk cannot be diversified away carry a risk premium (on average). Since VCs can and do diversify risk away, there’s no reason to believe that an individual employee who “invests” in startup options by working at a startup is getting a deal because of the risk involved. And by the way, when you look at historical returns, VC funds don’t appear to outperform other investment classes even though they get to buy a kind of startup equity that has less downside risk than the options you get as a normal employee.

So how come startups can’t or won’t take on more investment and pay their employees in cash? Let’s start by looking at some cynical reasons, followed by some less cynical reasons.

Cynical reasons

One possible answer, perhaps the simplest possible answer, is that options aren’t worth what startups claim they’re worth and startups prefer options because their lack of value is less obvious than it would be with cash. A simplistic argument that this might be the case is, if you look at the amount investors pay for a fraction of an early-stage or mid-stage startup and look at the extra cash the company would have been able to raise if they gave their employee option pool to investors, it usually isn’t enough to pay employees competitive compensation packages. Given that VCs don’t, on average, have outsized returns, this seems to imply that employee options aren’t worth as much as startups often claim. Compensation is much cheaper if you can convince people to take an arbitrary number of lottery tickets in a lottery of unknown value instead of cash.

Some common ways that employee options are misrepresented are:

Strike price as value

A company that gives you 1M options with a strike price of $10 might claim that those are “worth” $10M. However, if the share price stays at $10 for the lifetime of the option, the options will end up being worth $0 because an option with a $10 strike price is an option to buy the stock at $10, which is not the same as a grant of actual shares worth $10 a piece.
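One way to keep this straight is to compute the payoff directly: an option is worth roughly the share price minus the strike price, floored at zero, times the number of options (ignoring taxes and the time value of holding the option). A sketch with the numbers above:

def option_payoff(num_options, strike, share_price):
    # value of exercising and immediately selling, ignoring taxes and fees
    return num_options * max(share_price - strike, 0)

print(option_payoff(1_000_000, 10, 10))   # 0: the "worth $10M" grant pays nothing at $10/share
print(option_payoff(1_000_000, 10, 15))   # 5,000,000: only the gain above the strike counts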

Public valuation as value

Let’s say a company raised $300M by selling 30% of the company, giving the company an implied valuation of $1B. The most common misrepresentation I see is that the company will claim that because they’re giving an option for, say, 0.1% of the company, your option is worth $1B * 0.001 = $1M. A related, common, misrepresentation is that the company raised money last year and has increased in value since then, e.g., the company has since doubled in value, so your option is worth $2M. Even if you assume the strike price was $0 and go with the last valuation at which the company raised money, the implied value of your option isn’t $1M because investors buy a different class of stock than you get as an employee.

There are a lot of differences between the preferred stock that VCs get and the common stock that employees get; let’s look at a couple of concrete scenarios.

Let’s say those investors that paid $300M for 30% of the company have a straight (1x) liquidation preference, and the company sells for $500M. The 1x liquidation preference means that the investors will get 1x of their investment back before lowly common stock holders get anything, so the investors will get $300M for their 30% of the company. The other 70% of equity will split $200M: your 0.1% common stock option with a $0 strike price is worth $285k (instead of the $500k you might expect it to be worth if you multiply $500M by 0.001).

The preferred stock VCs get usually has at least a 1x liquidation preference. Let’s say the investors had a 2x liquidation preference in the above scenario. They would get 2x their investment back before the common stockholders split the rest of the company. Since 2 * $300M is greater than $500M, the investors would get everything and the remaining equity holders would get $0.
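To make the two scenarios above concrete, here's a minimal sketch that computes the common-stock payout under a liquidation preference (ignoring participation, option pool mechanics, taxes, and the many other terms that show up in real term sheets):

def common_payout(sale_price, invested, preference_multiple, common_fraction):
    # investors take preference_multiple * invested off the top (capped at the sale price);
    # common stockholders split what's left in proportion to their share of the common pool
    to_investors = min(preference_multiple * invested, sale_price)
    return (sale_price - to_investors) * common_fraction

stake = 0.001 / 0.7   # a 0.1% grant is 0.1% / 70% of the common pool
print(common_payout(500e6, 300e6, 1, stake))   # ~285,714: the $285k scenario above
print(common_payout(500e6, 300e6, 2, stake))   # 0: the 2x preference eats the whole sale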

Another difference between your common stock and preferred stock is that preferred stock sometimes comes with an anti-dilution clause, which you have no chance of getting as a normal engineering hire. Let’s look at an actual example of dilution at a real company. Mayhar got 0.4% of a company when it was valued at $5M. By the time the company was worth $1B, Mayhar’s share of the company was diluted by 8x, which made his share of the company worth less than $500k (minus the cost of exercising his options) instead of $4M (minus the cost of exercising his options).
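The dilution arithmetic here is simple enough to check directly:

ownership = 0.004        # 0.4% when the company was valued at $5M
diluted = ownership / 8  # 8x dilution by the time the company was worth $1B
print(diluted * 1e9)     # 500,000 -- the naive $4M, divided by 8, before exercise costs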

This story has a few additional complications which illustrate other reasons options are often worth less than they seem. Mayhar couldn’t afford to exercise his options (by paying the strike price times the number of shares he had an option for) when he joined, which is common for people who take startup jobs out of college who don’t come from wealthy families. When he left four years later, he could afford to pay the cost of exercising the options, but due to a quirk of U.S. tax law, he either couldn’t afford the tax bill or didn’t want to pay that cost for what was still a lottery ticket -- when you exercise your options, you’re effectively taxed on the difference between the current valuation and the strike price. Even if the company has a successful IPO for 10x as much in a few years, you’re still liable for the tax bill the year you exercise (and if the company stays private indefinitely or fails, you get nothing but a future tax deduction). Because, like most options, Mayhar’s option has a 90-day exercise window, he didn’t get anything from his options.

While that’s more than the average amount of dilution, there are much worse cases, for example, cases where investors and senior management basically get to keep their equity and everyone else gets diluted to the point where their equity is worthless.

Those are just a few of the many ways in which the differences between preferred and common stock can cause the value of options to be wildly different from a value naively calculated from a public valuation. I often see both companies and employees use public preferred stock valuations as a benchmark in order to precisely value common stock options, but this isn’t possible, even in principle, without access to a company’s cap table (which shows how much of the company different investors own) as well as access to the specific details of each investment. Even if you can get that (which you usually can’t), determining the appropriate numbers to plug into a model that will give you the expected value is non-trivial because it requires answering questions like “what’s the probability that, in an acquisition, upper management will collude with investors to keep everything and leave the employees with nothing?”

Black-Scholes valuation as value

Because of the issues listed above, people will sometimes try to use a model to estimate the value of options. Black-Scholes is commonly used because well known and has an easy to use closed form solution, it’s the most commonly used model. Unfortunately, most of the major assumptions for Black-Scholes are false for startup options, making the relationship between the output between Black-Scholes and the actual value of your options non-obvious.

Options are often free to the company

A large fraction of options get returned to the employee option pool when employees leave, either voluntarily or involuntarily. I haven’t been able to find comprehensive numbers on this, but anecdotally, I hear that more than 50% of options end up getting taken back from employees and returned to the general pool. Dan McKinley points out an (unvetted) analysis that shows that only 5% of employee grants are exercised. Even with a conservative estimate, a 50% discount on options granted sounds pretty good. A 20x discount sounds amazing, and would explain why companies like options so much.

Present value of a future sum of money

When someone says that a startup’s compensation package is worth as much as Facebook’s, they often mean that the total value paid out over N years is similar. But a fixed nominal amount of money is worth more the sooner you get it because you can (at a minimum) invest it in a low-risk asset, like Treasury bonds, and get some return on the money.

That’s an abstract argument you’ll hear in an econ 101 class, but in practice, if you live somewhere with a relatively high cost of living, like SF or NYC, there’s an even greater value to getting paid sooner rather than later because it lets you live in a relatively nice place (however you define nice) without having to cram into a space with more roommates than would be considered reasonable elsewhere in the U.S. Many startups from the last two generations seem to be putting off their IPOs; for folks in those companies with contracts that prevent them from selling options on a secondary market, that could easily mean that the majority of their potential wealth is locked up for the first decade of their working life. Even if the startup’s compensation package is worth more when adjusting for inflation and interest, it’s not clear if that’s a great choice for most people who aren’t already moderately well off.

Non-cynical reasons

We’ve looked at some cynical reasons companies might want to offer options instead of cash, namely that they can claim that their options are worth more than they’re actually worth. Now, let’s look at some non-cynical reasons companies might want to give out stock options.

From an employee standpoint, one non-cynical reason might have been stock option backdating, at least until that loophole was mostly closed. Up until late early 2000s, many companies backdated the date of options grants. Let’s look at this example, explained by Jessie M. Fried

Options covering 1.2 million shares were given to Reyes. The reported grant date was October 1, 2001, when the firm's stock was trading at around $13 per share, the lowest closing price for the year. A week later, the stock was trading at $20 per share, and a month later the stock closed at almost $26 per share.

Brocade disclosed this grant to investors in its 2002 proxy statement in a table titled "Option Grants in the Last Fiscal Year, prepared in the format specified by SEC rules. Among other things, the table describes the details of this and other grants to executives, including the number of shares covered by the option grants, the exercise price, and the options' expiration date. The information in this table is used by analysts, including those assembling Standard & Poor's well-known ExecuComp database, to calculate the Black Scholes value for each option grant on the date of grant. In calculating the value, the analysts assumed, based on the firm's representations about its procedure for setting exercise prices, that the options were granted at-the-money. The calculated value was then widely used by shareholders, researchers, and the media to estimate the CEO's total pay. The Black Scholes value calculated for Reyes' 1.2 million stock option grant, which analysts assumed was at-the-money, was $13.2 million.

However, the SEC has concluded that the option grant to Reyes was backdated, and the market price on the actual date of grant may have been around $26 per share. Let us assume that the stock was in fact trading at $26 per share when the options were actually granted. Thus, if Brocade had adhered to its policy of giving only at-the-money options, it should have given Reyes options with a strike price of $26 per share. Instead, it gave Reyes options with a strike price of $13 per share, so that the options were $13 in the money. And it reported the grant as if it had given Reyes at-the-money options when the stock price was $13 per share.

Had Brocade given Reyes at-the-money options at a strike price of $26 per share, the Black Scholes value of the option grant would have been approximately $26 million. But because the options were $13 million in the money, they were even more valuable. According to one estimate, they were worth $28 million. Thus, if analysts had been told that Reyes received options with a strike price of $13 when the stock was trading for $26, they would have reported their value as $28 million rather than $13.2 million. In short, backdating this particular option grant, in the scenario just described, would have enabled Brocade to give Reyes $2 million more in options (Black Scholes value) while reporting an amount that was $15 million less.

While stock options backdating isn’t (easily) possible anymore, there might be other loopholes or consequences of tax law that make options a better deal than cash. I could only think of one reason off the top of my head, so I spent a couple weeks asking folks (including multiple founders) for their non-cynical reasons why startups might prefer options to an equivalent amount of cash.

Tax benefit of ISOs

In the U.S., Incentive stock options (ISOs) have the property that, if held for one year after the exercise date and two years after the grant date, the owner of the option pays long-term capital gains tax instead of ordinary income tax on the difference between the exercise price and the strike price.

This isn’t quite as good as it sounds because the difference between the exercise price and the strike price is subject to the Alternative Minimum Tax (AMT). I don’t find this personally relevant since I prefer to sell employer stock as quickly as possible in order to be as diversified as possible, but if you’re interested in figuring out how the AMT affects your tax bill when you exercise ISOs, see this explanation for more details.

Tax benefit of QSBS

There’s a certain class of stock that is exempt from federal capital gains tax and state tax in many states (though not in CA). This is interesting, but it seems like people rarely take advantage of this when eligible, and many startups aren’t eligible.

Tax benefit of other options

The IRS says:

Most nonstatutory options don't have a readily determinable fair market value. For nonstatutory options without a readily determinable fair market value, there's no taxable event when the option is granted but you must include in income the fair market value of the stock received on exercise, less the amount paid, when you exercise the option. You have taxable income or deductible loss when you sell the stock you received by exercising the option. You generally treat this amount as a capital gain or loss.

Valuations are bogus

One quirk of stock options is that, to qualify as ISOs, the strike price must be at least the fair market value. That’s easy to determine for public companies, but the fair market value of a share in a private company is somewhat arbitrary. For ISOs, my reading of the requirement is that companies must make “an attempt, made in good faith” to determine the fair market value. For other types of options, there’s other regulation which which determines the definition of fair market value. Either way, startups usually go to an outside firm between 1 and N times a year to get an estimate of the fair market value for their common stock. This results in at least two possible gaps between a hypothetical “real” valuation and the fair market value for options purposes.

First, the valuation is updated relatively infrequently. A common pitch I’ve heard is that the company hasn’t had its valuation updated for ages, and the company is worth twice as much now, so you’re basically getting a 2x discount.

Second, the firms doing the valuations are poorly incentivized to produce “correct” valuations. The firms are paid by startups, which gain something when the legal valuation is as low as possible.

I don’t really believe that these things make options amazing, because I hear these exact things from startups and founders, which means that their offers take these into account and are priced accordingly. However, if there’s a large gap between the legal valuation and the “true” valuation and this allows companies to effectively give out higher compensation, the way stock option backdating did, I could see how this would tilt companies towards favoring options.

Control

Even if employees got the same class of stock that VCs get, founders would retain less control if they transferred the equity from employees to VCs because employee-owned equity is spread between a relatively large number of people.

Retention

This answer was commonly given to me as a non-cynical reason. The idea is that, if you offer employees options and have a clause that prevents them from selling options on a secondary market, many employees won’t be able to leave without walking away from the majority of their compensation. Personally, this strikes me as a cynical reason, but that’s not how everyone sees it. For example, Andreessen Horowitz managing partner Scott Kupor recently proposed a scheme under which employees would lose their options under all circumstances if they leave before a liquidity event, supposedly in order to help employees.

Whether or not you view employers being able to lock in employees for indeterminate lengths of time as good or bad, options lock-in appears to be a poor retention mechanism -- companies that pay cash seem to have better retention. Just for example, Netflix pays salaries that are comparable to the total compensation in the senior band at places like Google and, anecdotally, they seem to have less attrition than trendy Bay Area startups. In fact, even though Netflix makes a lot of noise about showing people the door if they’re not a good fit, they don’t appear to have a higher involuntary attrition rate than trendy Bay Area startups -- they just seem more honest about it, something which they can do because their recruiting pitch doesn’t involve you walking away with below-market compensation if you leave. If you think this comparison is unfair because Netflix hasn’t been a startup in recent memory, you can compare to finance startups, e.g. Headlands, which was founded in the same era as Uber, Airbnb, and Stripe. They (and some other finance startups) pay out hefty sums of cash and this does not appear to result in higher attrition than similarly aged startups which give out illiquid option grants.

In the cases where this results in the employee staying longer than they otherwise would, options lock-in is often a bad deal for all parties involved. The situation is obviously bad for employees and, on average, companies don’t want unhappy people who are just waiting for a vesting cliff or liquidity event.

Incentive alignment

Another commonly stated reason is that, if you give people options, they’ll work harder because they’ll do well when the company does well. This was the reason that was given most vehemently (“you shouldn’t trust someone who’s only interested in a paycheck”, etc.)

However, as far as I can tell, paying people in options almost totally decouples job performance and compensation. If you look at companies that have made a lot of people rich, like Microsoft, Google, Apple, and Facebook, almost none of the employees who became rich had an instrumental role in the company’s success. Google and Microsoft each made thousands of people rich, but the vast majority of those folks just happened to be in the right place at the right time and could have just as easily taken a different job where they didn't get rich. Conversely, the vast majority of startup option packages end up being worth little to nothing, but nearly none of the employees whose options end up being worthless were instrumental in causing their options to become worthless.

If options are a large fraction of compensation, choosing a company that’s going to be successful is much more important than working hard. For reference, Microsoft is estimated to have created roughly 10^3 millionaires by 1992 (adjusted for inflation, that's $1.75M). The stock then went up by more than 20x. Microsoft was legendary for making people who didn't particularly do much rich; all told, it's been estimated that they made 10^4 people rich by the late 90s. The vast majority of those people were no different from people in similar roles at Microsoft's competitors. They just happened to pick a winning lottery ticket. This is the opposite of what founders claim they get out of giving options. As above, companies that pay cash, like Netflix, don’t seem to have a problem with employee productivity.

By the way, a large fraction of the people who were made rich by working at Microsoft joined after their IPO, which was in 1986. The same is true of Google, and while Facebook is too young for us to have a good idea what the long-term post-IPO story is, the folks who joined a year or two after the IPO (5 years ago, in 2012) have done quite well for themselves. People who joined pre-IPO have done better, but as mentioned above, most people have diminishing returns to individual wealth. The same power-law-like distribution that makes VC work also means that it's entirely plausible that Microsoft alone made more post-IPO people rich from 1986-1999 than all pre-IPO tech companies combined during that period. Something similar is plausibly true for Google from 2004 until FB's IPO in 2012, even including the people who got rich from FB's IPO as people who were made rich by a pre-IPO company, and you can do a similar calculation for Apple.

VC firms vs. the market

There are several potential counter-arguments to the statement that VC returns (and therefore startup equity) don’t beat the market.

One argument is, when people say that, they typically mean that after VCs take their fees, returns to VC funds don’t beat the market. As an employee who gets startup options, you don’t (directly) pay VC fees, which means you can beat the market by keeping the VC fees for yourself.

Another argument is that, some investors (like YC) seem to consistently do pretty well. If you join a startup that’s funded by a savvy investors, you too can do pretty well. For this to make sense, you have to realize that the company is worth more than “expected” while the company doesn’t have the same realization because you need the company to give you an option package without properly accounting for its value. For you to have that expectation and get a good deal, this requires the founders to not only not be overconfident in the company’s probability of success, but actually requires that the founders are underconfident. While this isn’t impossible, the majority of startup offers I hear about have the opposite problem.

Conclusion

There are a number of factors that can make options more or less valuable than they seem. From an employee standpoint, the factors that make options more valuable than they seem can cause equity to be worth tens of percent more than a naive calculation. The factors that make options less valuable than they seem do so in ways that mostly aren’t easy to quantify.

Whether or not the factors that make options relatively more valuable dominate or the factors that make options relatively less valuable dominate is an empirical question. My intuition is that the factors that make options relatively less valuable are stronger, but that’s just a guess. A way to get an idea about this from public data would be to go through through successful startup S-1 filing. Since this post is already ~5k words, I’ll leave that for another post, but I’ll note that in my preliminary skim of a handful of 99%-ile exits (> $1B), the median employee seems to do worse than someone who’s on the standard Facebook/Google/Amazon career trajectory.

From a company standpoint, there are a couple factors that allow companies to retain more leverage/control by giving relatively more options to employees and relatively less equity to investors.

All of this sounds fine for founders and investors, but I don’t see what’s in it for employees. If you have additional reasons that I’m missing, I’d love to hear them.

_If you liked this post, you may also like this other post on the tradeoff between working at a big company and working at a startup.

Appendix: caveats

Many startups don’t claim that their offers are financially competitive. As time goes on, I hear less “If you wanted to get rich, how would you do it? I think your best bet would be to start or join a startup. That's been a reliable way to get rich for hundreds of years.” (that’s an actual Paul Graham quote) and more “we’re not financially competitive with Facebook, but…”. I’ve heard from multiple founders that joining as an early employee is an incredibly bad deal when you compare early-employee equity and workload vs. founder equity and workload.

Some startups are giving out offers that are actually competitive with large company offers. Something I’ve seen from startups that are trying to give out compelling offers is that, for “senior” folks, they’re willing to pay substantially higher salaries than public companies because it’s understood that options aren’t great for employees because of their timeline, risk profile, and expected value.

There’s a huge amount of variation in offers, much of which is effectively random. I know of cases where an individual got a more lucrative offer from a startup (that doesn’t tend to give particular strong offers) than from Google, and if you ask around you’ll hear about a lot of cases like that. It’s not always true that startup offers are lower than Google/Facebook/Amazon offers, even at startups that don’t pay competitively (on average).

Anything in this post that’s related to taxes is U.S. specific. For example, I’m told that in Canada, “you can defer the payment of taxes when exercising options whose strike price is way below fair market valuation until disposition, as long as the company is Canadian-controlled and operated in Canada”.

You might object that the same line of reasoning we looked at for options can be applied to RSUs, even RSUs for public companies. That’s true, although the largest downsides of startup options are mitigated or non-existent, cash still has significant advantages to employees over RSUs. Unfortunately, the only non-finance company I know of that uses this to their advantage in recruiting is Netflix; please let me know if you can think of other tech companies that use the same compensation model.

Some startups have a sliding scale that lets you choose different amounts of option/salary compensation. I haven't seen an offer that will let you put the slider to 100% cash and 0% options (or 100% options and 0% cash), but someone out there will probably be willing to give you an all-cash offer.

In the current environment, looking at public exits may bias the data towards less sucessful companies. The most sucessful startups from the last couple generations of startups that haven't exited by acquisition have so far chosen not to IPO. It's possible that, once all the data are in, the average returns to joining a startup will look quite different (although I doubt the median return will change much).

BTW, I don't have anything against taking a startup offer, even if it's low. When I graduated from college, I took the lowest offer I had, and my partner recently took the lowest offer she got (nearly a 2x difference over the highest offer). There are plenty of reasons you might want to take an offer that isn't the best possible financial offer. However, I think you should know what you're getting into and not take an offer that you think is financially great when it's merely mediocre or even bad.

Appendix: non-counterarguments

The most common objection I’ve heard to this is that most startups don’t have enough money to pay equivalent cash and couldn’t raise that much money by selling off what would “normally” be their employee option pool. Maybe so, but that’s not a counter-argument -- it’s an argument that the most startups don’t have options that are valuable enough to be exchanged for the equivalent sum of money, i.e., that the options simply aren’t as valuable as claimed. This argument can be phrased in a variety of ways (e.g., paying salary instead of options increases burn rate, reduces runway, makes the startup default dead, etc.), but arguments of this form are fundamentally equivalent to admitting that startup options aren’t worth much because they wouldn't hold up if the options were worth enough that a typical compensation package was worth as much as a typical "senior" offer at Google or Facebook.

If you don't buy this, imagine a startup with a typical valuation that's at a stage where they're giving out 0.1% equity in options to new hires. Now imagine that some irrational bystander is willing to make a deal where they take 0.1% of the company for $1B. Is it worth it to take the money and pay people out of the $1B cash pool instead of paying people with 0.1% slices of the option pool? Your answer should be yes, unless you believe that the ratio between the value of cash on hand and equity is nearly infinite. Absolute statements like "options are preferred to cash because paying cash increases burn rate, making the startup default dead" at any valuation are equivalent to stating that the correct ratio is infinity. That's clearly nonsensical; there's some correct ratio, and we might disagree over what the correct ratio is, but for typical startups it should not be the case that the correct ratio is infinite. Since this was such a common objection, if you have this objection, my question to you is, why don't you argue that startups should pay even less cash and even more options? Is the argument that the current ratio is exactly optimal, and if so, why? Also, why does the ratio vary so much between different companies at the same stage which have raised roughly the same amount of money? Are all of those companies giving out optimal deals?

The second most common objection is that startup options are actually worth a lot, if you pick the right startup and use a proper model to value the options. Perhaps, but if that’s true, why couldn’t they have raised a bit more money by giving away more equity to VCs at its true value, and then pay cash?

Another common objection is something like "I know lots of people who've made $1m from startups". Me too, but I also know lots of people who've made much more than that working at public companies. This post is about the relative value of compensation packages, not the absolute value.

Acknowledgements

Thanks to Leah Hanson, Ben Kuhn, Tim Abbott, David Turner, Nick Bergson-Shilcock, Peter Fraenkel, Joe Ardent, Chris Ball, Anton Dubrau, Sean Talts, Danielle Sucher, Dan McKinley, Bert Muthalaly, Dan Puttick, Indradhanush Gupta, and Gaxun for comments and corrections.

Wed, 07 Jun 2017 00:00:00 +0000


Web bloat

A couple years ago, I took a road trip from Wisconsin to Washington and mostly stayed in rural hotels on the way. I expected the internet in rural areas too sparse to have cable internet to be slow, but I was still surprised that a large fraction of the web was inaccessible. Some blogs with lightweight styling were readable, as were pages by academics who hadn’t updated the styling on their website since 1995. But very few commercial websites were usable (other than Google). When I measured my connection, I found that the bandwidth was roughly comparable to what I got with a 56k modem in the 90s. The latency and packetloss were significantly worse than the average day on dialup: latency varied between 500ms and 1000ms and packetloss varied between 1% and 10%. Those numbers are comparable to what I’d see on dialup on a bad day.

Despite my connection being only a bit worse than it was in the 90s, the vast majority of the web wouldn’t load. Why shouldn’t the web work with dialup or a dialup-like connection? It would be one thing if I tried to watch youtube and read pinterest. It’s hard to serve videos and images without bandwidth. But my online interests are quite boring from a media standpoint. Pretty much everything I consume online is plain text, even if it happens to be styled with images and fancy javascript. In fact, I recently tried using w3m (a terminal-based web browser that, by default, doesn’t support css, javascript, or even images) for a week and it turns out there are only two websites I regularly visit that don’t really work in w3m (twitter and zulip, both fundamentally text based sites, at least as I use them)1.

More recently, I was reminded of how poorly the web works for people on slow connections when I tried to read a joelonsoftware post while using a flaky mobile connection. The HTML loaded but either one of the five CSS requests or one of the thirteen javascript requests timed out, leaving me with a broken page. Instead of seeing the article, I saw three entire pages of sidebar, menu, and ads before getting to the title because the page required some kind of layout modification to display reasonably. Pages are often designed so that they're hard or impossible to read if some dependency fails to load. On a slow connection, it's quite common for at least one depedency to fail. After refreshing the page twice, the page loaded as it was supposed to and I was able to read the blog post, a fairly compelling post on eliminating dependencies.

Complaining that people don’t care about performance like they used to and that we’re letting bloat slow things down for no good reason is “old man yells at cloud” territory; I probably sound like that dude who complains that his word processor, which used to take 1MB of RAM, takes 1GB of RAM. Sure, that could be trimmed down, but there’s a real cost to spending time doing optimization and even a $300 laptop comes with 2GB of RAM, so why bother? But it’s not quite the same situation -- it’s not just nerds like me who care about web performance. When Microsoft looked at actual measured connection speeds, they found that half of Americans don't have broadband speed. Heck, AOL had 2 million dial-up subscribers in 2015, just AOL alone. Outside of the U.S., there are even more people with slow connections. I recently chatted with Ben Kuhn, who spends a fair amount of time in Africa, about his internet connection:

I've seen ping latencies as bad as ~45 sec and packet loss as bad as 50% on a mobile hotspot in the evenings from Jijiga, Ethiopia. (I'm here now and currently I have 150ms ping with no packet loss but it's 10am). There are some periods of the day where it ~never gets better than 10 sec and ~10% loss. The internet has gotten a lot better in the past ~year; it used to be that bad all the time except in the early mornings.

Speedtest.net reports 2.6 mbps download, 0.6 mbps upload. I realized I probably shouldn't run a speed test on my mobile data because bandwidth is really expensive.

Our server in Ethiopia is has a fiber uplink, but it frequently goes down and we fall back to a 16kbps satellite connection, though I think normal people would just stop using the Internet in that case.

If you think browsing on a 56k connection is bad, try a 16k connection from Ethiopia!

Everything we’ve seen so far is anecdotal. Let’s load some websites that programmers might frequent with a variety of simulated connections to get data on page load times. webpagetest lets us see how long it takes a web site to load (and why it takes that long) from locations all over the world. It even lets us simulate different kinds of connections as well as load sites on a variety of mobile devices. The times listed in the table below are the time until the page is “visually complete”; as measured by webpagetest, that’s the time until the above-the-fold content stops changing.

URL Size C Load time in seconds
MB FIOS Cable LTE 3G 2G Dial Bad 😱
0 http://bellard.org 0.01 5 0.40 0.59 0.60 1.2 2.9 1.8 9.5 7.6
1 http://danluu.com 0.02 2 0.20 0.20 0.40 0.80 2.7 1.6 6.4 7.6
2 news.ycombinator.com 0.03 1 0.30 0.49 0.69 1.6 5.5 5.0 14 27
3 danluu.com 0.03 2 0.20 0.40 0.49 1.1 3.6 3.5 9.3 15
4 http://jvns.ca 0.14 7 0.49 0.69 1.2 2.9 10 19 29 108
5 jvns.ca 0.15 4 0.50 0.80 1.2 3.3 11 21 31 97
6 fgiesen.wordpress.com 0.37 12 1.0 1.1 1.4 5.0 16 66 68 FAIL
7 google.com 0.59 6 0.80 1.8 1.4 6.8 19 94 96 236
8 joelonsoftware.com 0.72 19 1.3 1.7 1.9 9.7 28 140 FAIL FAIL
9 bing.com 1.3 12 1.4 2.9 3.3 11 43 134 FAIL FAIL
10 reddit.com 1.3 26 7.5 6.9 7.0 20 58 179 210 FAIL
11 signalvnoise.com 2.1 7 2.0 3.5 3.7 16 47 173 218 FAIL
12 amazon.com 4.4 47 6.6 13 8.4 36 65 265 300 FAIL
13 steve-yegge.blogspot.com 9.7 19 2.2 3.6 3.3 12 36 206 188 FAIL
14 blog.codinghorror.com 23 24 6.5 15 9.5 83 235 FAIL FAIL FAIL

Each row is a website. For sites that support both plain HTTP as well as HTTPS, both were tested; URLs are HTTPS except where explicitly specified as HTTP. The first two columns show the amount of data transferred over the wire in MB (which includes headers, handshaking, compression, etc.) and the number of TCP connections made. The rest of the columns show the time in seconds to load the page on a variety of connections from fiber (FIOS) to less good connections. “Bad” has the bandwidth of dialup, but with 1000ms ping and 10% packetloss, which is roughly what I saw when using the internet in small rural hotels. “😱” simulates a 16kbps satellite connection from Jijiga, Ethiopia. Rows are sorted by the measured amount of data transferred.

The timeout for tests was 6 minutes; anything slower than that is listed as FAIL. Pages that failed to load are also listed as FAIL. A few things that jump out from the table are:

  1. A large fraction of the web is unusable on a bad connection. Even on a good (0% packetloss, no ping spike) dialup connection, some sites won’t load.
  2. Some sites will use a lot of data!

The web on bad connections

As commercial websites go, Google is basically as good as it gets for people on a slow connection. On dialup, the 50%-ile page load time is a minute and a half. But at least it loads -- when I was on a slow, shared, satellite connection in rural Montana, virtually no commercial websites would load at all. I could view websites that only had static content via Google cache, but the live site had no hope of loading.

Some sites will use a lot of data

Although only two really big sites were tested here, there are plenty of sites that will use 10MB or 20MB of data. If you’re reading this from the U.S., maybe you don’t care, but if you’re browsing from Mauritania, Madagascar, or Vanuatu, loading codinghorror once will cost you more than 10% of the daily per capita GNI.

Page weight matters

Despite the best efforts of Maciej, the meme that page weight doesn’t matter keeps getting spread around. AFAICT, the top HN link of all time on web page optimization is to an article titled “Ludicrously Fast Page Loads - A Guide for Full-Stack Devs”. At the bottom of the page, the author links to another one of his posts, titled “Page Weight Doesn’t Matter”.

Usually, the boogeyman that gets pointed at is bandwidth: users in low-bandwidth areas (3G, developing world) are getting shafted. But the math doesn’t quite work out. Akamai puts the global connection speed average at 3.9 megabits per second.

The “ludicrously fast” guide fails to display properly on dialup or slow mobile connections because the images time out. On reddit, it also fails under load: "Ironically, that page took so long to load that I closed the window.", "a lot of … gifs that do nothing but make your viewing experience worse", "I didn't even make it to the gifs; the header loaded then it just hung.", etc.

The flaw in the “page weight doesn’t matter because average speed is fast” is that if you average the connection of someone in my apartment building (which is wired for 1Gbps internet) and someone on 56k dialup, you get an average speed of 500 Mbps. That doesn’t mean the person on dialup is actually going to be able to load a 5MB website. The average speed of 3.9 Mbps comes from a 2014 Akamai report, but it’s just an average. If you look at Akamai’s 2016 report, you can find entire countries where more than 90% of IP addresses are slower than that!

Yes, there are a lot of factors besides page weight that matter, and yes it's possible to create a contrived page that's very small but loads slowly, as well as a huge page that loads ok because all of the weight isn't blocking, but total page weight is still pretty decently correlated with load time.

Since its publication, the "ludicrously fast" guide was updated with some javascript that only loads images if you scroll down far enough. That makes it look a lot better on webpagetest if you're looking at the page size number (if webpagetest isn't being scripted to scroll), but it's a worse user experience for people on slow connections who want to read the page. If you're going to read the entire page anyway, the weight increases, and you can no longer preload images by loading the site. Instead, if you're reading, you have to stop for a few minutes at every section to wait for the images from that section to load. And that's if you're lucky and the javascript for loading images didn't fail to load.

The average user fallacy

Just like many people develop with an average connection speed in mind, many people have a fixed view of who a user is. Maybe they think there are customers with a lot of money with fast connections and customers who won't spend money on slow connections. That is, very roughly speaking, perhaps true on average, but sites don't operate on average, they operate in particular domains. Jamie Brandon writes the following about his experience with Airbnb:

I spent three hours last night trying to book a room on airbnb through an overloaded wifi and presumably a satellite connection. OAuth seems to be particularly bad over poor connections. Facebook's OAuth wouldn't load at all and Google's sent me round a 'pick an account' -> 'please reenter you password' -> 'pick an account' loop several times. It took so many attempts to log in that I triggered some 2fa nonsense on airbnb that also didn't work (the confirmation link from the email led to a page that said 'please log in to view this page') and eventually I was just told to send an email to account.disabled@airbnb.com, who haven't replied.

It's particularly galling that airbnb doesn't test this stuff, because traveling is pretty much the whole point of the site so they can't even claim that there's no money in servicing people with poor connections.

What about tail latency?

My original plan for this was post was to show 50%-ile, 90%-ile, 99%-ile, etc., tail load times. But the 50%-ile results are so bad that I don’t know if there’s any point to showing the other results. If you were to look at the 90%-ile results, you’d see that most pages fail to load on dialup and the “Bad” and “😱” connections are hopeless for almost all sites.

HTTP vs HTTPs

URL Size C Load time in seconds
kB FIOS Cable LTE 3G 2G Dial Bad 😱
1 http://danluu.com 21.1 2 0.20 0.20 0.40 0.80 2.7 1.6 6.4 7.6
3 https://danluu.com 29.3 2 0.20 0.40 0.49 1.1 3.6 3.5 9.3 15

You can see that for a very small site that doesn’t load many blocking resources, HTTPS is noticeably slower than HTTP, especially on slow connections. Practically speaking, this doesn’t matter today because virtually no sites are that small, but if you design a web site as if people with slow connections actually matter, this is noticeable.

How to make pages usable on slow co