Dan Luu

Randomized trial on gender in Overwatch

A recurring discussion in Overwatch (as well as other online games) is whether or not women are treated differently from men. If you do a quick search, you can find hundreds of discussions about this, some of which have well over a thousand comments. These discussions tend to go the same way and involve the same debate every time, with the same points being made on both sides. Just for example, these three threads on reddit, which spun out of a single post, have a total of 10.4k comments. On one side, you have people saying "sure, women get trash talked, but I'm a dude and I get trash talked, everyone gets trash talked, there's no difference", "I've never seen this, it can't be real", etc., and on the other side you have people saying things like "when I play with my boyfriend, I get accused of being carried by him all the time but the reverse never happens", "people regularly tell me I should play mercy[, a character that's a female healer]", and so on and so forth. In less time than has been spent on a single large discussion, we could just run the experiment, so here it is.

This is the result of playing 339 games in the two main game modes, quick play (QP) and competitive (comp), where roughly half the games were played with a masculine name (where the username was a generic term for a man) and half were played with a feminine name (where the username was a woman's name). I recorded all of the comments made in each of the games and then classified the comments by type. Classes of comments were "sexual/gendered comments", "being told how to play", "insults", and "compliments".

In each game that's included, I decided to include the game (or not) in the experiment before the character selection screen loaded. In games that were included, I used the same character selection algorithm, I wouldn't mute anyone for spamming chat or being a jerk, I didn't speak on voice chat (although I had it enabled), I never sent friend requests, and I was playing outside of a group in order to get matched with 5 random players. When playing normally, I might choose a character I don't know how to use well and I'll mute people who pollute chat with bad comments. There are a lot of games that weren't included in the experiment because I wasn't in a mood to listen to someone rage at their team for fifteen minutes and the procedure I used involved pre-committing to not muting people who do that.

Sexual or sexually charged comments

I thought I'd see more sexual comments when using the feminine name as opposed to the masculine name, but that turned out to not be the case. There was some mention of sex, genitals, etc., in both cases and the rate wasn't obviously different and was actually higher in the masculine condition.

In the masculine condition, zero games featured comments directed specifically at me; in the feminine condition, two games (out of 184) featured comments directed at me. Most comments were either directed at other players or were general comments to team or game chat.

Examples of typical undirected comments that would occur in either condition include "my girlfriend keeps sexting me how do I get her to stop?", "going in balls deep", "what a surprise. *strokes dick* [during the post-game highlight]", and "support your local boobies".

The two games that featured sexual comments directed at me had the following comments:

During games not included in the experiment (I generally didn't pay attention to which username I was on when not in the experiment), I also got comments like "send nudes". Anecdotally, there appears to be a difference in the rate of these kinds of comments directed at the player, but the rate observed in the experiment is so low that uncertainty intervals around any estimates of the true rate will be similar in both conditions unless we use a strong prior.

The fact that this difference couldn't be observed in 339 games was surprising to me, although it's not inconsistent with McDaniel's thesis, a survey of women who play video games. 339 games probably sounds like a small number to serious gamers, but the only other randomized experiment I know of on this topic (besides this experiment) is Kasumovic et al., which notes that "[w]e stopped at 163 [games] as this is a substantial time effort".

All of the analysis uses the number of games in which a type of comment occurred, not the tone of comments, to avoid having to code comments by tone, which could inject bias into the process. Sentiment analysis models, even state-of-the-art ones, often return nonsensical results, so this basically has to be done by hand, at least today. With much more data, some kind of sentiment analysis, done with liberal spot checking and re-training of the model, could work, but the total number of comments is so small in this case that it would amount to coding each comment by hand.

Coding comments manually in an unbiased fashion can also be done with a level of blinding, but doing that would probably require getting more people involved (since I see and hear comments while I'm playing) and relying on unpaid or poorly paid labor.

Being told how to play

The most striking, easy-to-quantify difference was the rate at which I played games in which people told me how I should play. Since it's unclear how much confidence we should have in the difference if we just look at the raw rates, we'll use a simple statistical model to get the uncertainty interval around the estimates. Since I'm not sure what my belief about this should be, the model uses an uninformative prior, so the estimate is close to the actual rate. Anyway, here are the uncertainty intervals a simple model puts on the percent of games where at least one person told me I was playing wrong, that I should change how I'm playing, or that I should switch characters:

Cond     Est  P25  P75
F comp    19   13   25
M comp     6    2   10
F QP       4    3    6
M QP       1    0    2

The experimental conditions in this table are masculine vs. feminine name (M/F) and competitive mode vs quick play (comp/QP). The numbers are percents. Est is the estimate, P25 is the 25%-ile estimate, and P75 is the 75%-ile estimate. Competitive mode and using a feminine name are both correlated with being told how to play. See this post by Andrew Gelman for why you might want to look at the 50% interval instead of the 95% interval.
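
If you want a quick sanity check on these numbers without fitting the full model from the appendix, you can get a rough 50% interval for a single condition directly from the raw counts with a flat Beta(1, 1) prior. This is only a sketch: it ignores the pooling across conditions that the full model does, so the numbers are close to, but not identical to, the F comp row above. The counts (7 games with this kind of comment out of 35) come from the data block in the appendix.

xplain <- 7
games <- 35
round(100 * qbeta(c(0.25, 0.5, 0.75), xplain + 1, games - xplain + 1))
# roughly 17, 21, and 26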

For people not familiar with overwatch, in competitive mode, you're explicitly told what your ELO-like rating is and you get a badge that reflects your rating. In quick play, you have a rating that's tracked, but it's never directly surfaced to the user and you don't get a badge.

It's generally believed that people are more on edge during competitive play and are more likely to lash out (and, for example, tell you how you should play). The data is consistent with this common belief.

Per above, I didn't want to code the tone of messages to avoid bias, so this table only indicates the rate at which people told me I was playing incorrectly or asked that I switch to a different character. The qualitative difference in experience is understated by this table. For example, the one time someone asked me to switch characters in the masculine condition, the request was a single polite sentence ("hey, we're dying too quickly, could we switch [from the standard one primary healer / one off healer setup] to double primary healer or switch our tank to [a tank that can block more damage]?"). When using the feminine name, a typical case would involve 1-4 people calling me human garbage for most of the game and consoling themselves with the idea that the entire reason our team is losing is that I won't change characters.

The simple model we're using indicates that there's probably a difference between both competitive and QP and playing with a masculine vs. a feminine name. However, most published results are pretty bogus, so let's look at reasons this result might be bogus and then you can decide for yourself.

Threats to validity

The biggest issue is that this wasn't a pre-registered trial. I'm obviously not going to go and officially register a trial like this, but I also didn't informally "register" this by having this comparison in mind when I started the experiment. A problem with non-pre-registered trials is that there are a lot of degrees of freedom, both in terms of what we could look at, and in terms of the methodology we used to look at things, so it's unclear if the result is "real" or an artifact of fishing for something that looks interesting. A standard example of this is that, if you look for 100 possible effects, you're likely to find 1 that appears to be statistically significant with p = 0.01.

There are standard techniques to correct for this problem (e.g., Bonferroni correction), but I don't find these convincing because they usually don't capture all of the degrees of freedom that go into a statistical model. An example is that it's common to take a variable and discretize it into a few buckets. There are many ways to do this and you generally won't see papers talk about the impact of this or correct for this in any way, although changing how these buckets are arranged can drastically change the results of a study. Another common knob people can use to manipulate results is curve fitting to an inappropriate curve (often a 2nd or 3rd degree polynomial when a scatterplot shows that's clearly incorrect). Another way to handle this would be to use a more complex model, but I wanted to keep this as simple as possible.

If I really wanted to be convinced of this, I'd want to, at a minimum, re-run this experiment with this exact comparison in mind. As a result, this experiment would need to be replicated to provide more than a preliminary result that is, at best, weak evidence.

One other large class of problem with randomized controlled trials (RCTs) is that, despite randomization, the two arms of the experiment might be different in some way that wasn't randomized. Since Overwatch doesn't allow you to keep changing your name, this experiment was done with two different accounts and these accounts had different ratings in competitive mode. On average, the masculine account had a higher rating due to starting with a higher rating, which meant that I was playing against stronger players and having worse games on the masculine account. In the long run, this will even out, but since most games in this experiment were in QP, this didn't have time to even out in comp. As a result, I had a higher win rate as well as just generally much better games with the feminine account in comp.

With no other information, we might expect that people who are playing worse get told how to play more frequently and people who are playing better should get told how to play less frequently, which would mean that the table above understates the actual difference.

However Kasumovic et al., in a gender-based randomized trial of Halo 3, found that players who were playing poorly were more negative towards women, especially women who were playing well (there's enough statistical manipulation of the data that a statement this concise can only be roughly correct, see study for details). If that result holds, it's possible that I would've gotten fewer people telling me that I'm human garbage and need to switch characters if I was average instead of dominating most of my games in the feminine condition.

If that result generalizes to OW, that would explain something which I thought was odd, which was that a lot of demands to switch and general vitriol came during my best performances with the feminine account. A typical example of this would be a game where we have a 2-2-2 team composition (2 players playing each of the three roles in the game) where my counterpart in the same role ran into the enemy team and died at the beginning of the fight in almost every engagement. I happened to be having a good day and dominated the other team (37-2 in a ten minute comp game, while focusing on protecting our team's healers) while only dying twice, once on purpose as a sacrifice and a second time after a stupid blunder. Immediately after I died, someone asked me to switch roles so they could take over for me, but at no point did anyone ask the other player in my role to switch despite their total uselessness all game (for OW players, this was a Rein who immediately charged into the middle of the enemy team at every opportunity, from a range where our team could not possibly support them; this was Hanamura 2CP, where it's very easy for Rein to set up situations where their team cannot help them). This kind of performance was typical of games where my team jumped on me for playing incorrectly. This isn't to say I didn't have bad games; I had plenty of bad games, but a disproportionate number of the most toxic experiences came when I was having a great game.

I tracked how well I did in games, but this sample doesn't have enough ranty games to do a meaningful statistical analysis of my performance vs. probability of getting thrown under the bus.

Games at different ratings are probably also generally different environments and get different comments, but it's not clear if there are more negative comments at 2000 than 2500 or vice versa. There are a lot of online debates about this; for any rating level other than the very lowest or the very highest ratings, you can find a lot of people who say that the rating band they're in has the highest volume of toxic comments.

Other differences

Here are some things that happened while playing with the feminine name that didn't happen with the masculine name during this experiment or in any game outside of this experiment:

The rate of all these was low enough that I'd have to play many more games to observe something without a huge uncertainty interval.

I didn't accept any friend requests from people I had no interaction with. Anecdotally, some people report that people will send sexual comments or berate them after an unsolicited friend request. It's possible that the effect shown in the table would be larger if I accepted these friend requests, and it couldn't be smaller.

I didn't attempt to classify comments as flirty or not because, unlike the kinds of comments I did classify, this is often somewhat subtle and you could make a good case that any particular comment is or isn't flirting. Without responding (which I didn't do), many of these kinds of comments are ambiguous.

Another difference was in the tone of the compliments. The rate of games where I was complimented wasn't too different, but compliments under the masculine condition tended to be short and factual (e.g., someone from the other team saying "no answer for [name of character I was playing]" after a dominant game) and compliments under the feminine condition tended to be more effusive and multiple people would sometimes chime in about how great I was.

Non differences

The rate of compliments and the rate of insults in games that didn't include explanations of how I'm playing wrong or how I need to switch characters were similar in both conditions.

Other factors

Some other factors that would be interesting to look at would be time of day, server, playing solo or in a group, specific character choice, being more or less communicative, etc., but it would take a lot more data to be able to get good estimates when adding more variables. Blizzard should have the data necessary to do analyses like this in aggregate, but they're notoriously private with their data, so someone at Blizzard would have to do the work and then publish it publicly, and they're not really in the habit of doing that kind of thing. If you work at Blizzard and are interested in letting a third party do some analysis on an anonymized data set, let me know and I'd be happy to dig in.

Experimental minutiae

Under both conditions, I avoided ever using voice chat and would call things out in text chat when time permitted. Also under both conditions, I mostly filled in with whatever character class the team needed most, although I'd sometimes pick DPS (in general, DPS is heavily oversubscribed, so you'll rarely end up playing DPS unless you pick it even when it's unnecessary).

For quickplay, backfill games weren't counted (backfill games are games where you join after the game started to fill in for a player who left; comp doesn't allow backfills). 6% of QP games were backfills.

These games are from before the "endorsements" patch; most games were played around May 2018. All games were played in "solo q" (with 5 random teammates). In order to avoid correlations between games depending on how long playing sessions were, I quit between games and waited for enough time (since you're otherwise likely to end up in a game with some or many of the same players as before).

The model used probability of a comment happening in a game to avoid the problem that Kasumovic et al. ran into, where a person who's ranting can skew the total number of comments. Kasumovic et al. addressed this by removing outliers, but I really don't like manually reaching in and removing data to adjust results. This could also be addressed by using a more sophisticated model, but a more sophisticated model means more knobs which means more ways for bias to sneak in. Using the number of players who made comments instead would be one way to mitigate this problem, but I think this still isn't ideal because these aren't independent -- when one player starts being negative, this greatly increases the odds that another player in that game will be negative, but just using the number of players makes four games with one negative person the same as one game with four negative people. This can also be accounted for with a slightly more sophisticated model, but that also involves adding more knobs to the model.

Appendix: comments / advice to overwatch players

A common complaint, perhaps the most common complaint, by people below 2000 SR (roughly 30%-ile) or perhaps 1500 SR (roughly 10%-ile), is that they're in "ELO hell" and are kept down because their teammates are too bad. Based on my experience, I find this to be extremely unlikely.

People often split skill up into "mechanics" and "gamesense". My mechanics are pretty much as bad as it's possible to get. The last game I played seriously was a 90s video game that's basically online asteroids and the last game before that I put any time into was the original SNES super mario kart. As you'd expect from someone who hasn't put significant time into a post-90s video game or any kind of FPS game, my aim and dodging are both atrocious. On top of that, I'm an old dude with slow reflexes and I was able to get to 2500 SR (roughly 60%-ile) by avoiding a few basic fallacies and blunders despite having approximately zero mechanical skill. If you're also an old dude with basically no FPS experience, you can do the same thing; if you have good reflexes or enough FPS experience to actually aim or dodge, you basically can't be worse mechanically than I am and you can do much better by avoiding a few basic mistakes.

The most common fallacy I see repeated is that you have to play DPS to move out of bronze or gold. The evidence people give for this is that, when a GM streamer plays flex, tank, or healer, they sometimes lose in bronze. I guess the idea is that, because the only way to ensure a 99.9% win rate in bronze is to be a GM level DPS player and play DPS, the best way to maintain a 55% or a 60% win rate is to play DPS, but this doesn't follow.

Healers and tanks are both very powerful in low ranks. Because low ranks feature both poor coordination and relatively poor aim (players with good coordination or aim tend to move up quickly), time-to-kill is very slow compared to higher ranks. As a result, an off healer can tilt the result of a 1v1 (and sometimes even a 2v1) matchup and a primary healer can often determine the result of a 2v1 matchup. Because coordination is poor, most matchups end up being 2v1 or 1v1. The flip side of the lack of coordination is that you'll almost never get help from teammates. It's common to see an enemy player walk into the middle of my team, attack someone, and then walk out while literally no one else notices. If the person being attacked is you, the other healer typically won't notice and will continue healing someone at full health, and none of the classic "peel" characters will help or even notice what's happening. That means it's on you to pay attention to your surroundings and watch flank routes to avoid getting murdered.

If you can avoid getting murdered constantly and actually try to heal (as opposed to many healers at low ranks, who will try to kill people or stick to a single character and continue healing them all the time even if they're at full health), you'll outheal a primary healer half the time when playing an off healer and, as a primary healer, you'll usually be able to get 10k-12k healing per 10 min compared to 6k to 8k for most people in Silver (sometimes less if they're playing DPS Moira). That's like having an extra half a healer on your team, which basically makes the game 6.5v6 instead of 6v6. You can still lose a 6.5v6 game, and you'll lose plenty of games, but if you're consistently healing 50% more than a normal healer at your rank, you'll tend to move up even if you get a lot of major things wrong (heal order, healing when that only feeds the other team, etc.).

A corollary to having to watch out for yourself 95% of the time when playing a healer is that, as a character who can peel, you can actually watch out for your teammates and put your team at a significant advantage in 95% of games. As Zarya or Hog, if you just boringly play towards the front of your team, you can basically always save at least one teammate from death in a team fight, and you can often do this 2 or 3 times. Meanwhile, your counterpart on the other team is walking around looking for 1v1 matchups. If they find a good one, they'll probably kill someone, and if they don't (if they run into someone with a mobility skill or a counter like brig or reaper), they won't. Even in the case where they kill someone and you don't do a lot, you still provide as much value as them and, on average, you'll provide more value. A similar thing is true of many DPS characters, although it depends on the character (e.g., McCree is effective as a peeler, at least at the low ranks that I've played in). If you play a non-sniper DPS that isn't suited for peeling, you can find a DPS on your team who's looking for 1v1 fights and turn those fights into 2v1 fights (at low ranks, there's no shortage of these folks on both teams, so there are plenty of 1v1 fights you can control by making them 2v1).

All of these things I've mentioned amount to actually trying to help your team instead of going for flashy PotG setups or trying to dominate the entire team by yourself. If you say this in the abstract, it seems obvious, but most people think they're better than their rating. A survey of people's perception of their own skill level vs. their actual rating found that 1% of people thought they were overrated and 32% of people thought they were rated accurately, with the rest thinking they were underrated. It doesn't help that OW is designed to make people think they're doing well when they're not, and that the best way to get "medals" or "play of the game" is to play in a way that severely reduces your odds of actually winning each game.

Outside of obvious gameplay mistakes, the other big thing that loses games is when someone tilts and either starts playing terribly or flips out and says something to enrage someone else on the team, who then starts playing terribly. I don't think you can do much about this in others directly, but you can make sure you never do it yourself, which means only 5/6ths of your team will do this at some base rate, whereas 6/6ths of the other team will. Like all of the above, this won't cause you to win all of your games, but everything you do that increases your win rate makes a difference.

Poker players have the right attitude when they talk about leaks. The goal isn't to win every hand, it's to increase your EV by avoiding bad blunders (at high levels, it's about more than avoiding bad blunders, but we're talking about getting out of below median ranks, not becoming GM here). You're going to have terrible games where you get 5 people instalocking DPS. Your odds of winning such a game are low, say 10%. If you get mad and pick DPS too, reducing your odds even further (say, to 2%), all that does is create a leak in your win rate during games when your teammates are being silly.

If you gain/lose 25 rating per game for a win or a loss, your average rating change from a game is 25 * (W_rate - L_rate) = 25 * (2 * W_rate - 1), since L_rate = 1 - W_rate. Let's say 1/40 games are these silly games where your team decides to go all DPS. The per-game SR difference of trying to win these vs. soft throwing is maybe something like 1/40 * 25 * (2 * 0.08) = 0.1. That doesn't sound like much and these numbers are just guesses, but everyone outside of very high-level games is full of leaks like these, and they add up. And if you look at a 60% win rate, which is pretty good considering that your influence is limited because you're only one person on a 6 person team, that only translates to an average of 5 SR per game, so it doesn't actually take that many small leaks to really move your average SR gain or loss.
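
Here's that arithmetic written out, as a small sketch using the same guessed numbers as above:

# expected SR change per game at a given win rate, with +/- 25 SR per win/loss
sr_per_game <- function(win_rate, sr_swing = 25) {
  sr_swing * (2 * win_rate - 1)
}

sr_per_game(0.60)  # a 60% win rate averages out to only +5 SR per game

# leak from soft throwing the 1-in-40 "all DPS" games: win probability drops
# from a guessed 10% to a guessed 2%
(1 / 40) * (sr_per_game(0.10) - sr_per_game(0.02))  # ~0.1 SR per game, averaged over all games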

Appendix: general comments on online gaming, 20 years ago vs. today

Since I'm unlikely to write another blog post on gaming any time soon, here are some other random thoughts that won't fit with any other post. My last serious experience with online games was with a game from the 90s. Even though I'd heard that things were a lot worse, I was still surprised by it. IRL, the only time I encounter the same level and rate of pointless nastiness in a recreational activity is down at the bridge club (casual bridge games tend to be very nice). When I say pointless nastiness, I mean things like getting angry and then making nasty comments to a teammate mid-game. Even if your "criticism" is correct (and, if you review OW games or bridge hands, you'll see that these kinds of angry comments are almost never correct), this has virtually no chance of getting your partner to change their behavior and it has a pretty good chance of tilting them and making them play worse. If you're trying to win, there's no reason to do this and good reason to avoid this.

If you look at the online commentary for this, it's common to see people blaming kids, but this doesn't match my experience at all. For one thing, when I was playing video games in the 90s, a huge fraction of the online gaming population was made up of kids, and online game communities were nicer than they are today. Saying that "kids nowadays" are worse than kids used to be is a pastime that goes back thousands of years, but it's generally not true and there doesn't seem to be any reason to think that it's true here.

Additionally, this simply doesn't match what I saw. If I just look at comments over audio chat, there were a couple of times when some kids were nasty, but almost all of the comments are from people who sound like adults. Moreover, if I look at when I played games that were bad, a disproportionately large number of those games were late (after 2am eastern time, on the central/east server), where the relative population of adults is larger.

And if we look at bridge, the median age of an ACBL member is in the 70s, with an increase in age of a whopping 0.4 years per year.

Sure, maybe people tend to get more mature as they age, but in any particular activity, that effect seems to be dominated by other factors. I don't have enough data at hand to make a good guess as to what happened, but I'm entertained by the idea that this might have something to do with it:

I’ve said this before, but one of the single biggest culture shocks I’ve ever received was when I was talking to someone about five years younger than I was, and she said “Wait, you play video games? I’m surprised. You seem like way too much of a nerd to play video games. Isn’t that like a fratboy jock thing?”

Appendix: FAQ

Here are some responses to the most common online comments.

Plat? You suck at Overwatch

Yep. But I sucked roughly equally on both accounts (actually somewhat more on the masculine account because it was rated higher and I was playing a bit out of my depth). Also, that's not a question.

This is just a blog post, it's not an academic study, the results are crap.

There's nothing magic about academic papers. I have my name on a few publications, including one that won best paper award at the top conference in its field. My median blog post is more rigorous than my median paper or, for that matter, the median paper that I read.

When I write a paper, I have to deal with co-authors who push for putting in false or misleading material that makes the paper look good and my ability to push back against this has been fairly limited. On my blog, I don't have to deal with that and I can write up results that are accurate (to the best of my ability) even if it makes the result look less interesting or less likely to win an award.

Gamers have always been toxic, that's just nostalgia talking.

If I pull game logs for subspace, this seems to be false. YMMV depending on what games you played, I suppose. FWIW, airmash seems to be the modern version of subspace, and (until the game died), it was much more toxic than subspace even if you just compare on a per-game basis despite having much smaller games (25 people for a good sized game in airmash, vs. 95 for subspace).

This is totally invalid because you didn't talk on voice chat.

At the ranks I played, not talking on voice was the norm. It would be nice to have talking or not talking on voice chat be an independent variable, but that would require playing even more games to get data for another set of conditions, and if I wasn't going to do that, choosing the condition that's most common doesn't make the entire experiment invalid, IMO.

Some people report that, post "endorsements" patch, talking on voice chat is much more common. I tested this out by playing 20 (non-comp) games just after the "Paris" patch. Three had comments on voice chat. One was someone playing random music clips, one had someone screaming at someone else for playing incorrectly, and one had useful callouts on voice chat. It's possible I'd see something different with more games or in comp, but I don't think it's obvious that voice chat is common for most people after the "endorsements" patch.

Appendix: code and data

If you want to play with this data and model yourself, experiment with different priors, run a posterior predictive check, etc., here's a snippet of R code that embeds the data:

library(brms)
library(modelr)
library(tidybayes)
library(tidyverse)
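
# In the data below, xplain is the number of games with at least one comment
# telling me I was playing wrong or should switch characters, and games is the
# total number of games played in that condition.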

d <- tribble(
  ~game_type, ~gender, ~xplain, ~games,
  "comp", "female", 7, 35,
  "comp", "male", 1, 23,
  "qp", "female", 6, 149,
  "qp", "male", 2, 132
)

d <- d %>% mutate(female = ifelse(gender == "female", 1, 0), comp = ifelse(game_type == "comp", 1, 0))


result <-
  brm(data = d, family = binomial,
      xplain | trials(games) ~ female + comp,
      prior = c(set_prior("normal(0,10)", class = "b")),
      iter = 25000, warmup = 500, cores = 4, chains = 4)

The model here is simple enough that I wouldn't expect the version of software used to significantly affect results, but in case you're curious, this was done with brms 2.7.0, rstan 2.18.2, on R 3.5.1.
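
If you want to check the model's fitted rates against the table above, one way to do it (a sketch; the exact columns returned depend on the brms version) is to pull the 50% interval out of fitted() and convert the expected counts to per-game percentages:

fit <- fitted(result, probs = c(0.25, 0.75))
cbind(d[, c("game_type", "gender")], round(100 * fit / d$games, 1))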

Thanks to Leah Hanson, Sean Talts and Sean's math/stats reading group, Annie Cherkaev, Robert Schuessler, Wesley Aptekar-Cassels, Julia Evans, Paul Gowder, Jonathan Dahan, Bradley Boccuzzi, Akiva Leffert, and one or more anonymous commenters for comments/corrections/discussion.

Tue, 19 Feb 2019 00:00:00 +0000


Computer latency: 1977-2017

I've had this nagging feeling that the computers I use today feel slower than the computers I used as a kid. As a rule, I don’t trust this kind of feeling because human perception has been shown to be unreliable in empirical studies, so I carried around a high-speed camera and measured the response latency of devices I’ve run into in the past few months. Here are the results:

computer                   latency (ms)  year  clock    # T
apple 2e                   30            1983  1 MHz    3.5k
ti 99/4a                   40            1981  3 MHz    8k
custom haswell-e 165Hz     50            2014  3.5 GHz  2G
commodore pet 4016         60            1977  1 MHz    3.5k
sgi indy                   60            1993  .1 GHz   1.2M
custom haswell-e 120Hz     60            2014  3.5 GHz  2G
thinkpad 13 chromeos       70            2017  2.3 GHz  1G
imac g4 os 9               70            2002  .8 GHz   11M
custom haswell-e 60Hz      80            2014  3.5 GHz  2G
mac color classic          90            1993  16 MHz   273k
powerspec g405 linux 60Hz  90            2017  4.2 GHz  2G
macbook pro 2014           100           2014  2.6 GHz  700M
thinkpad 13 linux chroot   100           2017  2.3 GHz  1G
lenovo x1 carbon 4g linux  110           2016  2.6 GHz  1G
imac g4 os x               120           2002  .8 GHz   11M
custom haswell-e 24Hz      140           2014  3.5 GHz  2G
lenovo x1 carbon 4g win    150           2016  2.6 GHz  1G
next cube                  150           1988  25 MHz   1.2M
powerspec g405 linux       170           2017  4.2 GHz  2G
packet around the world    190
powerspec g405 win         200           2017  4.2 GHz  2G
symbolics 3620             300           1986  5 MHz    390k

These are tests of the latency between a keypress and the display of a character in a terminal (see appendix for more details). The results are sorted from quickest to slowest. In the latency column, the background goes from green to yellow to red to black as devices get slower. No devices are green. When multiple OSes were tested on the same machine, the OS is in bold. When multiple refresh rates were tested on the same machine, the refresh rate is in italics.

In the year column, the background gets darker and purple-er as devices get older. If older devices were slower, we’d see the year column get darker as we read down the chart.

The next two columns show the clock speed and number of transistors in the processor. Smaller numbers are darker and blue-er. As above, if slower clocks and smaller chips correlated with longer latency, the columns would get darker as we go down the table but, if anything, it seems to be the other way around.

For reference, the latency of a packet going around the world through fiber from NYC back to NYC via Tokyo and London is inserted in the table.

If we look at overall results, the fastest machines are ancient. Newer machines are all over the place. Fancy gaming rigs with unusually high refresh-rate displays are almost competitive with machines from the late 70s and early 80s, but “normal” modern computers can’t compete with thirty to forty year old machines.

We can also look at mobile devices. In this case, we’ll look at scroll latency in the browser:

device                  latency (ms)  year
ipad pro 10.5" pencil   30            2017
ipad pro 10.5"          70            2017
iphone 4s               70            2011
iphone 6s               70            2015
iphone 3gs              70            2009
iphone x                80            2017
iphone 8                80            2017
iphone 7                80            2016
iphone 6                80            2014
gameboy color           80            1998
iphone 5                90            2012
blackberry q10          100           2013
huawei honor 8          110           2016
google pixel 2 xl       110           2017
galaxy s7               120           2016
galaxy note 3           120           2016
moto x                  120           2013
nexus 5x                120           2015
oneplus 3t              130           2016
blackberry key one      130           2017
moto e (2g)             140           2015
moto g4 play            140           2017
moto g4 plus            140           2016
google pixel            140           2016
samsung galaxy avant    150           2014
asus zenfone3 max       150           2016
sony xperia z5 compact  150           2015
htc one m4              160           2013
galaxy s4 mini          170           2013
lg k4                   180           2016
packet                  190
htc rezound             240           2011
palm pilot 1000         490           1996
kindle oasis 2          570           2017
kindle paperwhite 3     630           2015
kindle 4                860           2011

As above, the results are sorted by latency and color-coded from green to yellow to red to black as devices get slower. Also as above, the year gets purple-er (and darker) as the device gets older.

If we exclude the game boy color, which is a different class of device than the rest, all of the quickest devices are Apple phones or tablets. The next quickest device is the blackberry q10. Although we don’t have enough data to really tell why the blackberry q10 is unusually quick for a non-Apple device, one plausible guess is that it’s helped by having actual buttons, which are easier to implement with low latency than a touchscreen. The other two devices with actual buttons are the gameboy color and the kindle 4.

After the iphones and non-kindle button devices, we have a variety of Android devices of various ages. At the bottom, we have the ancient palm pilot 1000 followed by the kindles. The palm is hamstrung by a touchscreen and display created in an era with much slower touchscreen technology, and the kindles use e-ink displays, which are much slower than the displays used on modern phones, so it’s not surprising to see those devices at the bottom.

Why is the apple 2e so fast?

Compared to a modern computer that’s not the latest ipad pro, the apple 2 has significant advantages on both the input and the output, and it also has an advantage between the input and the output for all but the most carefully written code since the apple 2 doesn’t have to deal with context switches, buffers involved in handoffs between different processes, etc.

On the input, if we look at modern keyboards, it’s common to see them scan their inputs at 100 Hz to 200 Hz (e.g., the ergodox claims to scan at 167 Hz). By comparison, the apple 2e effectively scans at 556 Hz. See appendix for details.

If we look at the other end of the pipeline, the display, we can also find latency bloat there. I have a display that advertises 1 ms switching on the box, but if we look at how long it takes for the display to actually show a character from when you can first see the trace of it on the screen until the character is solid, it can easily be 10 ms. You can even see this effect with some high-refresh-rate displays that are sold on their allegedly good latency.

At 144 Hz, each frame takes 7 ms. A change to the screen will have 0 ms to 7 ms of extra latency as it waits for the next frame boundary before getting rendered (on average, we expect half of the maximum latency, or 3.5 ms). On top of that, even though my display at home advertises a 1 ms switching time, it actually appears to take 10 ms to fully change color once the display has started changing color. When we add up the latency from waiting for the next frame and the latency of an actual color change, we get an expected latency of 7/2 + 10 = 13.5 ms.

With the old CRT in the apple 2e, we’d expect half of a 60 Hz refresh (16.7 ms / 2) plus a negligible delay, or 8.3 ms. That’s hard to beat today: a state of the art “gaming monitor” can get the total display latency down into the same range, but in terms of marketshare, very few people have such displays, and even displays that are advertised as being fast aren’t always actually fast.
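
Putting numbers on the two cases above (a quick sketch using the frame times and the ~10 ms pixel transition measured earlier):

# expected display latency in ms: half a frame of waiting for the next refresh,
# plus the time for the pixels to finish changing
(1000 / 144) / 2 + 10  # ~13.5 ms for the 144 Hz LCD described above
(1000 / 60) / 2        # ~8.3 ms for the apple 2e's 60 Hz CRT, ignoring the negligible phosphor delay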

iOS rendering pipeline

If we look at what’s happening between the input and the output, the differences between a modern system and an apple 2e are too many to describe without writing an entire book. To get a sense of the situation in modern machines, here’s former iOS/UIKit engineer Andy Matuschak’s high-level sketch of what happens on iOS, which he says should be presented with the disclaimer that “this is my out of date memory of out of date information”:

Andy says “the actual amount of work happening here is typically quite small. A few ms of CPU time. Key overhead comes from:”

By comparison, on the Apple 2e, there basically aren’t handoffs, locks, or process boundaries. Some very simple code runs and writes the result to the display memory, which causes the display to get updated on the next scan.

Refresh rate vs. latency

One thing that’s curious about the computer results is the impact of refresh rate. We get a 90 ms improvement from going from 24 Hz to 165 Hz. At 24 Hz each frame takes 41.67 ms and at 165 Hz each frame takes 6.061 ms. As we saw above, if there weren’t any buffering, we’d expect the average latency added by frame refreshes to be 20.8 ms in the former case and 3.03 ms in the latter case (because we’d expect to arrive at a uniform random point in the frame and have to wait between 0 ms and the full frame time), which is a difference of about 18 ms. But the difference is actually 90 ms, implying we have latency equivalent to (90 - 18) / (41.67 - 6.061) = 2 buffered frames.

If we plot the results from the other refresh rates on the same machine (not shown), we can see that they’re roughly in line with a “best fit” curve that we get if we assume that, for that machine running powershell, we get 2.5 frames worth of latency regardless of refresh rate. This lets us estimate what the latency would be if we equipped this low latency gaming machine with an infinity Hz display -- we’d expect latency to be 140 - 2.5 * 41.67 = 36 ms, almost as fast as quick but standard machines from the 70s and 80s.
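
Here's the buffered-frame arithmetic from the last two paragraphs written out as a sketch:

frame_24hz <- 1000 / 24    # ~41.67 ms per frame
frame_165hz <- 1000 / 165  # ~6.06 ms per frame

# with no buffering, the expected latency added by the display is half a frame,
# so the predicted difference between 24 Hz and 165 Hz would be:
predicted_diff <- frame_24hz / 2 - frame_165hz / 2  # ~17.8 ms
observed_diff <- 90                                 # ms, from the table above

# implied number of buffered frames
(observed_diff - predicted_diff) / (frame_24hz - frame_165hz)  # ~2

# estimated latency with an "infinity Hz" display, assuming 2.5 frames of buffering
140 - 2.5 * frame_24hz  # ~36 ms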

Complexity

Almost every computer and mobile device that people buy today is slower than common models of computers from the 70s and 80s. Low-latency gaming desktops and the ipad pro can get into the same range as quick machines from thirty to forty years ago, but most off-the-shelf devices aren’t even close.

If we had to pick one root cause of latency bloat, we might say that it’s because of “complexity”. Of course, we all know that complexity is bad. If you’ve been to a non-academic non-enterprise tech conference in the past decade, there’s a good chance that there was at least one talk on how complexity is the root of all evil and we should aspire to reduce complexity.

Unfortunately, it's a lot harder to remove complexity than to give a talk saying that we should remove complexity. A lot of the complexity buys us something, either directly or indirectly. When we looked at the input of a fancy modern keyboard vs. the apple 2 keyboard, we saw that using a relatively powerful and expensive general purpose processor to handle keyboard inputs can be slower than dedicated logic for the keyboard, which would both be simpler and cheaper. However, using the processor gives people the ability to easily customize the keyboard, and also pushes the problem of “programming” the keyboard from hardware into software, which reduces the cost of making the keyboard. The more expensive chip increases the manufacturing cost, but considering how much of the cost of these small-batch artisanal keyboards is the design cost, it seems like a net win to trade manufacturing cost for ease of programming.

We see this kind of tradeoff in every part of the pipeline. One of the biggest examples of this is the OS you might run on a modern desktop vs. the loop that’s running on the apple 2. Modern OSes let programmers write generic code that can deal with having other programs simultaneously running on the same machine, and do so with pretty reasonable general performance, but we pay a huge complexity cost for this and the handoffs involved in making this easy result in a significant latency penalty.

A lot of the complexity might be called accidental complexity, but most of that accidental complexity is there because it’s so convenient. At every level from the hardware architecture to the syscall interface to the I/O framework we use, we take on complexity, much of which could be eliminated if we could sit down and re-write all of the systems and their interfaces today, but it’s too inconvenient to re-invent the universe to reduce complexity and we get benefits from economies of scale, so we live with what we have.

For those reasons and more, in practice, the solution to poor performance caused by “excess” complexity is often to add more complexity. In particular, the gains we’ve seen that get us back to the quickness of the quickest machines from thirty to forty years ago have come not from listening to exhortations to reduce complexity, but from piling on more complexity.

The ipad pro is a feat of modern engineering; the engineering that went into increasing the refresh rate on both the input and the output as well as making sure the software pipeline doesn’t have unnecessary buffering is complex! The design and manufacture of high-refresh-rate displays that can push system latency down is also non-trivially complex in ways that aren’t necessary for bog standard 60 Hz displays.

This is actually a common theme when working on latency reduction. A common trick to reduce latency is to add a cache, but adding a cache to a system makes it more complex. For systems that generate new data and can’t tolerate a cache, the solutions are often even more complex. An example of this might be large scale RoCE deployments. These can push remote data access latency from the millisecond range down to the microsecond range, which enables new classes of applications. However, this has come at a large cost in complexity. Early large-scale RoCE deployments easily took tens of person years of effort to get right and also came with a tremendous operational burden.

Conclusion

It’s a bit absurd that a modern gaming machine running at 4,000x the speed of an apple 2, with a CPU that has 500,000x as many transistors (and a GPU that has 2,000,000x as many transistors), can maybe manage the same latency as an apple 2 in very carefully coded applications if we have a monitor with nearly 3x the refresh rate. It’s perhaps even more absurd that the default configuration of the powerspec g405, which had the fastest single-threaded performance you could get until October 2017, had more latency from keyboard to screen (approximately 3 feet, maybe 10 feet of actual cabling) than sending a packet around the world (16187 mi from NYC to Tokyo to London back to NYC; the actual distance the packet travels is even greater because it's not economical to run fiber along the shortest possible path).

On the bright side, we’re arguably emerging from the latency dark ages and it’s now possible to assemble a computer or buy a tablet with latency that’s in the same range as you could get off-the-shelf in the 70s and 80s. This reminds me a bit of the screen resolution & density dark ages, where CRTs from the 90s offered better resolution and higher pixel density than affordable non-laptop LCDs until relatively recently. 4k displays have now become normal and affordable 8k displays are on the horizon, blowing past anything we saw on consumer CRTs. I don’t know that we’ll see the same kind of improvement with respect to latency, but one can hope. There are individual developers improving the experience for people who use certain, very carefully coded, applications, but it's not clear what force could cause a significant improvement in the default experience most users see.

Appendix: why measure latency?

Latency matters! For very simple tasks, people can perceive latencies down to 2 ms or less. Moreover, increasing latency is not only noticeable to users, it causes users to execute simple tasks less accurately. If you want a visual demonstration of what latency looks like and you don’t have a super-fast old computer lying around, check out this MSR demo on touchscreen latency.

The most commonly cited document on response time is the nielsen group article on response times, which claims that latencies below 100 ms feel equivalent and are perceived as instantaneous. One easy way to see that this is false is to go into your terminal and try sleep 0; echo "pong" vs. sleep 0.1; echo "pong" (or for that matter, try playing an old game that doesn't have latency compensation, like quake 1, with 100 ms ping, or even 30 ms ping, or try typing in a terminal with 30 ms ping). For more info on this and other latency fallacies, see this document on common misconceptions about latency.

Throughput also matters, but this is widely understood and measured. If you go to pretty much any mainstream review or benchmarking site, you can find a wide variety of throughput measurements, so there’s less value in writing up additional throughput measurements.

Appendix: apple 2 keyboard

The apple 2e, instead of using a programmed microcontroller to read the keyboard, uses a much simpler custom chip designed for reading keyboard input, the AY 3600. If we look at the AY 3600 datasheet, we can see that the scan time is (90 * 1/f) and the debounce time is listed as strobe_delay. These quantities are determined by some capacitors and a resistor, which appear to be 47pf, 100k ohms, and 0.022uf for the Apple 2e. Plugging these numbers into the AY3600 datasheet, we can see that f = 50 kHz, giving us a 1.8 ms scan delay and a 6.8 ms debounce delay (assuming the values are accurate -- capacitors can degrade over time, so we should expect the real delays to be shorter on our old Apple 2e), giving us less than 8.6 ms for the internal keyboard logic.

Comparing to a keyboard with a 167 Hz scan rate that scans two extra times to debounce, the equivalent figure is 3 * 6 ms = 18 ms. With a 100Hz scan rate, that becomes 3 * 10 ms = 30 ms. 18 ms to 30 ms of keyboard scan plus debounce latency is in line with what we saw when we did some preliminary keyboard latency measurements.
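
The same keyboard arithmetic as a snippet (a sketch; the apple 2e numbers are the datasheet-derived values above):

# apple 2e keyboard chip (AY 3600): scan delay is 90 cycles of a ~50 kHz clock,
# plus the ~6.8 ms debounce (strobe) delay
f_ay3600 <- 50e3
scan_ms <- (90 / f_ay3600) * 1000  # ~1.8 ms
scan_ms + 6.8                      # < 8.6 ms total for the keyboard logic

# a modern keyboard scanning at 167 Hz or 100 Hz, counting two extra debounce scans
3 * (1000 / 167)  # ~18 ms
3 * (1000 / 100)  # 30 ms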

For reference, the ergodox uses a 16 MHz microcontroller with ~80k transistors and the apple 2e CPU is a 1 MHz chip with 3.5k transistors.

Appendix: why should android phones have higher latency than old apple phones?

As we've seen, raw processing power doesn't help much with many of the causes of latency in the pipeline, like handoffs between different processes, so an android phone with a 10x more powerful processor than an ancient iphone isn't guaranteed to be quicker to respond, even if it can render javascript heavy pages faster.

If you talk to people who work on non-Apple mobile CPUs, you'll find that they run benchmarks like dhrystone (a synthetic benchmark that was irrelevant even when it was created, in 1984) and SPEC2006 (an updated version of a workstation benchmark that was relevant in the 90s and perhaps even as late as the early 2000s if you care about workstation workloads, which are completely different from mobile workloads). This is a common problem: the vendor who makes the component has an intermediate target that's only weakly correlated with the actual user experience. I've heard that there are people working on the pixel phones who care about end-to-end latency, but it's difficult to get good latency when you have to use components that are optimized for things like dhrystone and SPEC2006.

If you talk to people at Apple, you'll find that they're quite cagey, but that they've been targeting the end-to-end user experience for quite a long time and that they can do "full stack" optimizations that are difficult for android vendors to pull off. They're not literally impossible, but making a change to a chip that has to be threaded up through the OS is something you're very unlikely to see unless google is doing the optimization, and google hasn't really been serious about the end-to-end experience until recently.

Having relatively poor performance in aspects that aren't measured is a common theme and one we saw when we looked at terminal latency. Prior to examining terminal latency, public benchmarks were all throughput oriented and the terminals that prioritized performance worked on increasing throughput, even though increasing terminal throughput isn't really useful. After those terminal latency benchmarks, some terminal authors looked into their latency and found places they could trim down buffering and remove latency. You get what you measure.

Appendix: experimental setup

Most measurements were taken with the 240fps camera (4.167 ms resolution) in the iPhone SE. Devices with response times below 40 ms were re-measured with a 1000fps camera (1 ms resolution), the Sony RX100 V in PAL mode. Results in the tables are the results of multiple runs and are rounded to the nearest 10 ms to avoid the impression of false precision. For desktop results, results are measured from when the key started moving until the screen finished updating. Note that this is different from most key-to-screen-update measurements you can find online, which typically use a setup that effectively removes much or all of the keyboard latency, which, as an end-to-end measurement, is only realistic if you have a psychic link to your computer (this isn't to say the measurements aren't useful -- if, as a programmer, you want a reproducible benchmark, it's nice to reduce measurement noise from sources that are beyond your control, but that's not relevant to end users). People often advocate measuring from one of: {the key bottoming out, the tactile feel of the switch}. Other than for measurement convenience, there appears to be no reason to do either of these, but people often claim that's when the user expects the keyboard to "really" work. However, these are independent of when the switch actually fires. Both the distance between the key bottoming out and activation and the distance between feeling feedback and activation are arbitrary and can be tuned. See this post on keyboard latency measurements for more info on keyboard fallacies.

Another significant difference is that measurements were done with settings as close to the default OS settings as possible since approximately 0% of users will futz around with display settings to reduce buffering, disable the compositor, etc. Waiting until the screen has finished updating is also different from what most end-to-end measurements do -- most consider the update "done" when any movement has been detected on the screen. Waiting until the screen is finished changing is analogous to webpagetest's "visually complete" time.

Computer results were taken using the “default” terminal for the system (e.g., powershell on windows, lxterminal on lubuntu), which could easily cause a 20 ms to 30 ms difference between a fast terminal and a slow terminal. Between measuring time in a terminal and measuring the full end-to-end time, measurements in this article should be slower than measurements in other, similar, articles (which tend to measure time to first change in games).

The powerspec g405 baseline result is using integrated graphics (the machine doesn’t come with a graphics card) and the 60 Hz result is with a cheap video card. The baseline result was at 30 Hz because the integrated graphics only supports hdmi output and the display it was attached to only runs at 30 Hz over hdmi.

Mobile results were done by using the default browser, browsing to https://danluu.com, and measuring the latency from finger movement until the screen first updates to indicate that scrolling has occurred. In the cases where this didn’t make sense, (kindles, gameboy color, etc.), some action that makes sense for the platform was taken (changing pages on the kindle, pressing the joypad on the gameboy color in a game, etc.). Unlike with the desktop/laptop measurements, this end-time for the measurement was on the first visual change to avoid including many frames of scrolling. To make the measurement easy, the measurement was taken with a finger on the touchscreen and the timer was started when the finger started moving (to avoid having to determine when the finger first contacted the screen).

In the case of “ties”, results are ordered by the unrounded latency as a tiebreaker, but this shouldn’t be considered significant. Differences of 10 ms should probably also not be considered significant.

The custom haswell-e was tested with gsync on and there was no observable difference. The year for that box is somewhat arbitrary, since the CPU is from 2014, but the display is newer (I believe you couldn’t get a 165 Hz display until 2015).

The number of transistors for some modern machines is a rough estimate because exact numbers aren’t public. Feel free to ping me if you have a better estimate!

The color scales for latency and year are linear and the color scales for clock speed and number of transistors are log scale.

All Linux results were done with a pre-KPTI kernel. It's possible that KPTI will impact user perceivable latency.

Measurements were done as cleanly as possible (without other things running on the machine/device when possible, with a device that was nearly full on battery for devices with batteries). Latencies when other software is running on the device or when devices are low on battery might be much higher.

If you want a reference to compare the kindle against, a moderately quick page turn in a physical book appears to be about 200 ms.

This is a work in progress. I expect to get benchmarks from a lot more old computers the next time I visit Seattle. If you know of old computers I can test in the NYC area (that have their original displays or something like them), let me know! If you have a device you’d like to donate for testing, feel free to mail it to

Dan Luu
Recurse Center
455 Broadway, 2nd Floor
New York, NY 10013

Thanks to RC, David Albert, Bert Muthalaly, Christian Ternus, Kate Murphy, Ikhwan Lee, Peter Bhat Harkins, Leah Hanson, Alicia Thilani Singham Goodwin, Amy Huang, Dan Bentley, Jacquin Mininger, Rob, Susan Steinman, Raph Levien, Max McCrea, Peter Town, Jon Cinque, Anonymous, and Jonathan Dahan for donating devices to test and thanks to Leah Hanson, Andy Matuschak, Milosz Danczak, amos (@fasterthanlime), @emitter_coupled, Josh Jordan, mrob, and David Albert for comments/corrections/discussion.

Sun, 24 Dec 2017 00:00:00 +0000


How good are decisions?

A statement I commonly hear in tech-utopian circles is that some seeming inefficiency can’t actually be inefficient because the market is efficient and inefficiencies will quickly be eliminated. A contentious example of this is the claim that companies can’t be discriminating because the market is too competitive to tolerate discrimination. A less contentious example is that when you see a big company doing something that seems bizarrely inefficient, maybe it’s not inefficient and you just lack the information necessary to understand why the decision was efficient.

Unfortunately, arguments like this are difficult to settle because, even in retrospect, it’s usually not possible to get enough information to determine the precise “value” of a decision. Even in cases where the decision led to an unambiguous success or failure, there are so many factors that led to the result that it’s difficult to figure out precisely why something happened.

One nice thing about sports is that they often have detailed play-by-play data and well-defined win criteria, which let us tell, on average, what the expected value of a decision is. In this post, we’ll look at the cost of bad decision making in one sport and then briefly discuss why decision quality in sports might be the same as or better than decision quality in other fields.

Just to have a concrete example, we’re going to look at baseball, but you could do the same kind of analysis for football, hockey, basketball, etc., and my understanding is that you’d get a roughly similar result in all of those cases.

We’re going to model baseball as a state machine, both because that makes it easy to understand the expected value of particular decisions and because this lets us talk about the value of decisions without having to go over most of the rules of baseball.

We can treat each baseball game as an independent event. In each game, two teams play against each other and the team that scores more runs (points) wins. Each game is split into 9 “innings” and in each inning each team will get one set of chances on offense. In each inning, each team will play until it gets 3 “outs”. Any given play may or may not result in an out.

One chunk of state in our state machine is the number of outs and the inning. The other chunks of state we’re going to track are who’s “on base” and which player is “at bat”. Each team defines some order of batters for its active players and after each player bats once this repeats in a loop until the team collects 3 outs and the inning is over. The state of who is at bat is saved between innings. Just for example, you might see batters 1-5 bat in the first inning, 6-9 and then 1 again in the second inning, 2- … etc.

When a player is at bat, the player may advance to a base and players who are on base may also advance, depending on what happens. When a player advances 4 bases (that is, through 1B, 2B, 3B, to what would be 4B except that it isn’t called that) a run is scored and the player is removed from the base. As mentioned above, various events may cause a player to be out, in which case they also stop being on base.

An example state from our state machine is:

{1B, 3B; 2 outs}

This says that there’s a player on 1B, a player on 3B, there are two outs. Note that this is independent of the score, who’s actually playing, and the inning.

Another state is:

{--; 0 outs}

With a model like this, if we want to determine the expected value of the above state, we just need to look up the total number of runs across all innings played in a season divided by the number of innings to find the expected number of runs from the state above (ignoring the 9th inning because a quirk of baseball rules distorts statistics from the 9th inning). If we do this, we find that, from the above state, a team will score .555 runs in expectation.
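To make this concrete, here's a minimal sketch (in Python) of that computation. The function name and the (state, runs to end of inning) record layout are made up for illustration; real play-by-play data (e.g., Retrosheet event files) needs more parsing than this.

    from collections import defaultdict

    def run_expectancy(plays):
        # plays: iterable of (state, runs_to_end_of_inning) pairs, one per plate
        # appearance, where state is e.g. ("1B,3B", 2) and runs_to_end_of_inning
        # is how many runs scored from that point until the inning ended. In
        # practice you'd derive runs_to_end_of_inning by walking each (non-9th)
        # inning's play-by-play once.
        totals = defaultdict(float)
        counts = defaultdict(int)
        for state, runs_to_end in plays:
            totals[state] += runs_to_end
            counts[state] += 1
        return {state: totals[state] / counts[state] for state in totals}

    # Toy example: three innings that started from {--; 0 outs} and ended up
    # scoring 0, 1, and 2 runs give an expectation of 1.0 runs from that state.
    print(run_expectancy([(("--", 0), 0), (("--", 0), 1), (("--", 0), 2)]))

Running the same averaging over every state observed in the play-by-play gives a table like the one below.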

We can then compute the expected number of runs for all of the other states:

bases \ outs        0       1       2
--               .555    .297    .117
1B               .953    .573    .251
2B              1.189    .725    .344
3B              1.482    .983    .387
1B,2B           1.573    .971    .466
1B,3B           1.904   1.243    .538
2B,3B           2.052   1.467    .634
1B,2B,3B        2.417   1.650    .815

In this table, each entry is the expected number of runs from the remainder of the inning from some particular state. Each column shows the number of outs and each row shows the state of the bases. The color coding scheme is: the starting state (.555 runs) has a white background. States with higher run expectation are more blue and states with lower run expectation are more red.

This table and the other stats in this post come from The Book by Tango et al., which mostly discussed baseball between 1999 and 2002. See the appendix if you're curious about how things change if we use a more detailed model.

The state we’re tracking for an inning here is who’s on base and the number of outs. Innings start with nobody on base and no outs.

As above, we see that we start the inning with .555 runs in expectation. If a play puts someone on 1B without getting an out, we now have .953 runs in expectation, i.e., putting someone on first without an out is worth .953 - .555 = .398 runs.

This immediately gives us the value of some decisions, e.g., trying to “steal” 2B with no outs and someone on first. If we look at cases where the batter’s state doesn’t change, a successful steal moves us to the {2B, 0 outs} state, i.e., it gives us 1.189 - .953 = .236 runs. A failed steal moves us to the {--, 1 out} state, i.e., it gives us .953 - .297 = -.656 runs. To break even, we need to succeed .656 / .236 = 2.78x more often than we fail, i.e., we need a .735 success rate to break even. If we want to compute the average value of a stolen base, we can compute the weighted sum over all states, but for now, let’s just say that it’s possible to do so and that you need something like a .735 success rate for stolen bases to make sense.
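As a sanity check, here's the same break-even arithmetic as a tiny sketch, using only the values already quoted from the table above:

    RE = {("1B", 0): 0.953, ("2B", 0): 1.189, ("--", 1): 0.297}

    gain = RE[("2B", 0)] - RE[("1B", 0)]  # successful steal: +0.236 runs
    loss = RE[("1B", 0)] - RE[("--", 1)]  # caught stealing: 0.656 runs lost

    # break-even success rate p satisfies p * gain - (1 - p) * loss = 0
    break_even = loss / (gain + loss)
    print(round(gain, 3), round(loss, 3), round(break_even, 3))  # 0.236 0.656 0.735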

We can then look at the stolen base success rate of teams to see that, in any given season, maybe 5-10 teams are doing better than breakeven, leaving 20-25 teams at breakeven or below (mostly below). If we look at a bad but not historically bad stolen-base team of that era, they might have a .6 success rate. It wouldn’t be unusual for a team from that era to make between 100 and 200 attempts. Just so we can compute an approximation, if we assume they were all attempts from the {1B, 0 outs} state, the average run value per attempt would be .4 * (-.656) + .6 * .236 = -0.12 runs per attempt. Another first-order approximation is that a delta of 10 runs is worth 1 win, so at 100 attempts we have -1.2 wins and at 200 attempts we have -2.4 wins.

If we run the math across actual states instead of using the first order approximation, we see that the average failed steal is worth -.467 runs and the average successful steal is worth .175 runs. In that case, a steal attempt with a .6 success rate is worth .4 * (-.467) + .6 * .175 = -0.082 runs. With this new approximation, our estimate for the approximate cost in wins of stealing “as normal” vs. having a “no stealing” rule for a team that steals badly and often is .82 to 1.64 wins per season. Note that this underestimates the cost of stealing since getting into position to steal increases the odds of a successful “pickoff”, which we haven’t accounted for. From our state-machine standpoint, a pickoff is almost equivalent to a failed steal, but the analysis necessary to compute the difference in pickoff probability is beyond the scope of this post.
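The per-attempt and per-season arithmetic, as a sketch that uses only the numbers quoted above plus the ~10 runs per win rule of thumb:

    fail_value, success_value = -0.467, 0.175  # avg failed / successful steal, in runs
    success_rate = 0.6
    runs_per_win = 10.0

    runs_per_attempt = (1 - success_rate) * fail_value + success_rate * success_value
    # roughly -0.082 runs per attempt

    for attempts in (100, 200):
        runs_lost = attempts * runs_per_attempt
        print(attempts, round(runs_lost, 1), round(runs_lost / runs_per_win, 2))
    # 100 attempts: about -8.2 runs, -0.82 wins; 200 attempts: about -16.4 runs, -1.64 wins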

We can also do this for other plays coaches can cause (or prevent). For the “intentional walk”, we see that an intentional walk appears to be worth .102 runs for the opposing team. In 2002, a team that issued “a lot” of intentional walks might have issued 50, resulting in 50 * .102 runs for the opposing team, giving a loss of roughly 5 runs or .5 wins.

If we optimistically assume a “sac bunt” never fails, the cost of a sac bunt is .027 runs per attempt. If we look at the league where pitchers don’t bat, a team that was heavy on sac bunts might’ve done 49 sac bunts (we do this to avoid “pitcher” bunts, which add complexity to the approximation), costing a total of 49 * .027 = 1.32 runs or .132 wins.
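The same kind of first-order approximation for the intentional walk and sac bunt numbers above:

    runs_per_win = 10.0

    iw_runs = 50 * 0.102    # ~5.1 runs given to the opposing team
    sac_runs = 49 * 0.027   # ~1.3 runs given up, optimistically assuming no failed bunts

    print(round(iw_runs / runs_per_win, 2), round(sac_runs / runs_per_win, 2))
    # ~0.51 wins from intentional walks, ~0.13 wins from sac bunts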

Another decision that’s made by a coach is setting the batting order. Players bat (take a turn) in order, 1-9, mod 9. That is, when the 10th “player” is up, we actually go back around and the 1st player bats. At some point the game ends, so not everyone on the team ends up with the same number of “at bats”.

There’s a just-so story that justifies putting the fastest player first, someone with a high “batting average” second, someone pretty good third, your best batter fourth, etc. This story, or something like it, has been standard for over 100 years.

I’m not going to walk through the math for computing a better batting order because I don’t think there’s a short, easy to describe, approximation. It turns out that if we compute the difference between an “optimal” order and a “typical” order justified by the story in the previous paragraph, using an optimal order appears to be worth between 1 and 2 wins per season.

These approximations all leave out important information. In three out of the four cases, we assumed an average player at all times and didn’t look at who was at bat. The information above actually takes this into account to some extent, but not fully. How exactly this differs from a better approximation is a long story and probably too much detail for a post that’s using baseball to talk about decisions outside of baseball, so let’s just say that we have a pretty decent but not amazing approximation. It says that a coach who makes bad decisions following conventional wisdom, in the normal range of bad decisions during a baseball season, might cost their team something like 1 + 1.2 + .5 + .132 = 2.83 wins on these four decisions alone vs. a decision rule that says “never do these actions that, on average, have negative value”. If we compare to a better decision rule such as “do these actions when they have positive value and not when they have negative value” or a manager that generally makes good decisions, let’s conservatively estimate that’s maybe worth 3 wins.

We’ve looked at four decisions (sac bunt, steal, intentional walk, and batting order). But there are a lot of other decisions! Let’s arbitrarily say that if we look at all decisions and not just these four decisions, having a better heuristic for all decisions might be worth 4 or 5 wins per season.

What does 4 or 5 wins per season really mean? One way to look at it is that baseball teams play 162 games, so an “average” team wins 81 games. If we look at the seasons covered, the number of wins that teams that made the playoffs had was {103, 94, 103, 99, 101, 97, 98, 95, 95, 91, 116, 102, 88, 93, 93, 92, 95, 97, 95, 94, 87, 91, 91, 95, 103, 100, 97, 97, 98, 95, 97, 94}. Because of the structure of the system, we can’t name a single number for a season and say that N wins are necessary to make the playoffs and that teams with fewer than N wins won’t make the playoffs, but we can say that 95 wins gives a team decent odds of making the playoffs. 95 - 81 = 14. 5 wins is more than a third of the difference between an average team and a team that makes the playoffs. This is a huge deal both in terms of prestige and also direct economic value.

If we want to look at it at the margin instead of on average, the smallest delta in wins between teams that made the playoffs and teams that didn’t in each league was {1, 7, 8, 1, 6, 2, 6, 3}. For teams that are on the edge, a delta of 5 wins wouldn’t always be the difference between a successful season (making playoffs) and an unsuccessful season (not making playoffs), but there are teams within a 5 win delta of making the playoffs in most seasons. If we were actually running a baseball team, we’d want to use a much more fine-grained model, but as a first approximation we can say that in-game decisions are a significant factor in team performance and that, using some kind of computation, we can determine the expected cost of non-optimal decisions.

Another way to look at what 5 wins is worth is to look at what it costs to get a player who’s not a pitcher and who’s 5 wins above average (WAA) (we look at non-pitchers because non-pitchers tend to play in every game and pitchers tend to play in parts of some games, making a comparison between pitchers and non-pitchers more complicated). Of the 8 non-pitcher positions (we look at non-pitcher positions because it makes comparisons simpler), there are 30 teams, so we have 240 team-position pairs. In 2002, of these 240 team-position pairs, there are two that were >= 5 WAA, Texas-SS (Alex Rodriguez, paid $22m) and SF-LF (Barry Bonds, paid $15m). If we look at the other seasons in the range of dates we’re looking at, there are either 2 or 3 team-position pairs where a team is able to get >= 5 WAA in a season. These aren’t stable across seasons because player performance is volatile, so it’s not as easy as finding someone great and paying them $15m. For example, in 2002, there were 7 non-pitchers paid $14m or more and only two of them were worth 5 WAA or more. For reference, the average total team payroll (teams have 26 players each) in 2002 was $67m, with a minimum of $34m and a max of $126m. At the time, a $1m salary for a manager would’ve been considered generous, making a 5 WAA manager an incredible deal.

5 WAA assumes typical decision making lining up with events in a bad, but not worst-case way. A more typical case might be that a manager costs a team 3 wins. In that case, in 2002, there were 25 team-position pairs out of 240 where a single player could make up for the loss caused from management by conventional wisdom. Players who provide that much value and who aren’t locked up in artificially cheap deals with particular teams due to the mechanics of player transfers are still much more expensive than managers.

If we look at how teams have adopted data analysis in order to improve both in-game decision making and team-composition decisions, it’s been a slow, multi-decade, process. Moneyball describes part of the shift from using intuition and observation to select players to incorporating statistics into the process. Stats nerds had been talking about how you could do this since at least 1971, but no team really took it seriously until the 90s, and the ideas didn’t really become mainstream until the mid 2000s, after a bestseller had been published.

If we examine how much teams have improved at the in-game decisions we looked at here, the process has been even slower. It’s still true today that statistics-driven decisions aren’t mainstream. Things are getting better, and if we look at the aggregate cost of the non-optimal decisions mentioned here, the aggregate cost has been getting lower over the past couple decades as intuition-driven decisions slowly converge to more closely match what stats nerds have been saying for decades. For example, if we look at the total number of sac bunts recorded across all teams from 1999 until now, we see:

year        1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009  2010  2011  2012  2013  2014  2015  2016  2017
sac bunts   1604  1628  1607  1633  1626  1731  1620  1651  1540  1526  1635  1544  1667  1479  1383  1343  1200  1025   925

Despite decades of statistical evidence that sac bunts are overused, we didn’t really see a decline across all teams until 2012 or so. The reasons vary on a team-by-team and case-by-case basis, but the fundamental story that’s been repeated over and over again, both for statistically-driven team composition and statistically driven in-game decisions, is that the people who have the power to make decisions often stick to conventional wisdom instead of using “radical” statistically-driven ideas. There are a number of reasons as to why this happens. One high-level reason is that the change we’re talking about was a cultural change and cultural change is slow. It doesn’t surprise people when it takes a generation for scientific consensus to shift, so why should baseball be any different?

One specific lower-level reason “obviously” non-optimal decisions can persist for so long is that there’s a lot of noise in team results. You sometimes see a manager make some radical decisions (not necessarily statistics-driven), followed by some poor results, causing management to fire the manager. There’s so much volatility that you can’t really judge players or managers based on small samples, but this doesn’t stop people from doing so. The combination of volatility and skepticism of radical ideas heavily disincentivizes going against conventional wisdom.

Among the many consequences of this noise is the fact that the winner of the "world series" (the baseball championship) is heavily determined by randomness. Whether or not a team makes the playoffs is determined over 162 games, which isn't enough to remove all randomness, but is enough that the result isn't mostly determined by randomness. This isn't true of the playoffs, which are too short for the outcome to be primarily determined by the difference in the quality of teams. Once a team wins the world series, people come up with all kinds of just-so stories to justify why the team should've won, but if we look across all games, we can see that the stories are just stories. This is, perhaps, not so different from listening to people tell you why their startup was successful.

There are metrics we can use that are better predictors of future wins and losses (i.e., are less volatile than wins and losses), but, until recently, convincing people that those metrics were meaningful was also a radical idea.

If we think about the general case, what’s happening is that decisions have probabilistic payoffs. There’s very high variance in actual outcomes (wins and losses), so it’s possible to make good decisions and not see the direct effect of them for a long time. Even if there are metrics that give us a better idea of what the “true” value of a decision is, if you’re operating in an environment where your management doesn’t believe in those metrics, you’re going to have a hard time keeping your job (or getting a job in the first place) if you want to do something radical whose value is only demonstrated by some obscure-sounding metric unless they take a chance on you for a year or two. There have been some major phase changes in what metrics are accepted, but they’ve taken decades.

If we look at business or engineering decisions, the situation is much messier. If we look at product or infrastructure success as a “win”, there seems to be much more noise in whether or not a team gets a “win”. Moreover, unlike in baseball, the sort of play-by-play or even game data that would let someone analyze “wins” and “losses” to determine the underlying cause isn’t recorded, so it’s impossible to determine the true value of decisions. And even if the data were available, there are so many more factors that determine whether or not something is a “win” that it’s not clear if we’d be able to determine the expected value of decisions even if we had the data.

We’ve seen that in a field where one can sit down and determine the expected value of decisions, it can take decades for this kind of analysis to influence some important decisions. If we look at fields where it’s more difficult to determine the true value of decisions, how long should we expect it to take for “good” decision making to surface? It seems like it would be a while, perhaps forever, unless there’s something about the structure of baseball and other sports that makes it particularly difficult to remove a poor decision maker and insert a better decision maker.

One might argue that baseball is different because there are a fixed number of teams and it’s quite unusual for a new team to enter the market, but if you look at things like public clouds, operating systems, search engines, car manufacturers, etc., the situation doesn’t look that different. If anything, it appears to be much cheaper to take over a baseball team and replace management (you sometimes see baseball teams sell for roughly a billion dollars) and there are more baseball teams than there are competitive products in the markets we just discussed, at least in the U.S. One might also argue that, if you look at the structure of baseball teams, it’s clear that positions are typically not handed out based on decision-making merit and that other factors tend to dominate, but this doesn’t seem obviously more true in baseball than in engineering fields.

This isn’t to say that we expect obviously bad decisions everywhere. You might get that idea if you hung out on baseball stats nerd forums before Moneyball was published (and for quite some time after), but if you looked at formula 1 (F1) around the same time, you’d see teams employing PhDs who are experts in economics and game theory to make sure they were making reasonable decisions. This doesn’t mean that F1 teams always make perfect decisions, but they at least avoided making decisions that interested amateurs could identify as inefficient for decades. There are some fields where competition is cutthroat and you have to do rigorous analysis to survive and there are some fields where competition is more sedate. In living memory, there was a time when training for sports was considered ungentlemanly and someone who trained with anything resembling modern training techniques would’ve had a huge advantage. Over the past decade or so, we’re seeing the same kind of shift but for statistical techniques in baseball instead of training in various sports.

If we want to look at the quality of decision making, it's too simplistic to say that we expect a firm to make good decisions because they're exposed to markets and there's economic value in making good decisions and people within the firm will probably be rewarded greatly if they make good decisions. You can't even tell if this is happening by asking people if they're making rigorous, data-driven, decisions. If you'd asked people in baseball whether they were using data in their decisions, they would've said yes throughout the 70s and 80s. Baseball has long been known as a sport where people track all kinds of numbers and then use those numbers. It's just that people didn't backtest their predictions, let alone backtest their predictions with holdouts.

The paradigm shift of using data effectively to drive decisions has been hitting different fields at different rates over the past few decades, both inside and outside of sports. Why this change happened in F1 before it happened in baseball is due to a combination of the difference in incentive structure in F1 teams vs. baseball teams and the difference in institutional culture. We may take a look at this in a future post, but this turns out to be a fairly complicated issue that requires a lot more background. We’ll try to explore the necessary background in future posts.

Appendix: non-idealities in our baseball analysis

In order to make this a short blog post and not a book, there are a lot of simplifications in the approximation we discussed. One major simplification is the idea that all runs are equivalent. This is close enough to true that this is a decent approximation. But there are situations where the approximation isn’t very good, such as when it’s the 9th inning and the game is tied. In that case, a decision that increases the probability of scoring 1 run but decreases the probability of scoring multiple runs is actually the right choice.

This is often given as a justification for a relatively late-game sac bunt. But if we look at the probability of a successful sac bunt, we see that it goes down in later innings. We didn’t talk about how the defense is set up, but defenses can set up in ways that reduce the probability of a successful sac bunt but increase the probability of success of non-bunts and vice versa. Before the last inning, this actually makes the sac bunt worse late in the game and not better! If we take all of that into account in the last inning of a tie game, the probability that a sac bunt is a good idea then depends on something else we haven’t discussed: the batter at the plate.

In our simplified model, we computed the expected value in runs across all batters. But at any given time, a particular player is batting. A successful sac bunt advances runners and increases the number of outs by one. The alternative is to let the batter “swing away”, which will result in some random outcome. The better the batter, the higher the probability of an outcome that’s better than the outcome of a sac bunt. To determine the optimal decision, we not only need to know how good the current batter is but how good the subsequent batters are. One common justification for the sac bunt is that pitchers are terrible hitters and they’re not bad at sac bunting because they have so much practice doing it (because they’re terrible hitters), but it turns out that pitchers are also below average sac bunters and that the argument that we should expect pitchers to sac because they’re bad hitters doesn’t hold up if we look at the data in detail.

Another reason to sac bunt (or bunt in general) is that the tendency to sometimes do this induces changes in defense which make non-bunt plays work better.

A full computation should also take into account the number of balls and strikes the current batter has (a piece of state we haven’t discussed at all), as well as the speed of the batter and of the players on base, the particular stadium the game is being played in, the opposing pitcher, and the quality of the defense. All of this can be done, even on a laptop -- this is all “small data” as far as computers are concerned, but walking through the analysis even for one particular decision would be substantially longer than everything in this post combined, including this disclaimer. It’s perhaps a little surprising that taking all of these non-idealities into account doesn’t overturn the general result, but it turns out that it doesn’t (it finds that there are many situations in which sac bunts have positive expected value, but that sac bunts were still heavily overused for decades).

There’s a similar situation for intentional walks, where the non-idealities in our analysis appear to support issuing intentional walks. In particular, the two main conventional justifications for an intentional walk are

  1. By walking the current batter, we can set up a “force” or a “double play” (increase the probability of getting one out or two outs in one play). If the game is tied in the last inning, putting another player on base has little downside and has the upside of increasing the probability of allowing zero runs and continuing the tie.
  2. By walking the current batter, we can get to the next, worse batter.

An example situation where people apply the justification in (1) is in the {1B, 3B; 2 out} state. The team that’s on defense will lose if the player at 3B advances one base. The reasoning goes, walking a player and changing the state to {1B, 2B, 3B; 2 out} won’t increase the probability that the player at 3B will score and end the game if the current batter “puts the ball into play”, and putting another player on base increases the probability that the defense will be able to get an out.

The hole in this reasoning is that the batter won’t necessarily put the ball into play. After the state is {1B, 2B, 3B; 2 out}, the pitcher may issue an unintentional walk, causing each runner to advance and losing the game. It turns out that being in this state doesn’t affect the probability of an unintentional walk very much. The pitcher tries very hard to avoid a walk but, at the same time, the batter tries very hard to induce a walk!

On (2), the two situations where the justification tends to be applied are when the current player at bat is good or great, or when the current player is batting just before the pitcher. Let’s look at these two separately.

Barry Bonds’s seasons from 2001, 2002, and 2004 were some of the statistically best seasons of all time and are as extreme a case as one can find in modern baseball. If we run our same analysis and account for the quality of the players batting after Bonds, we find that it’s sometimes the correct decision for the opposing team to intentionally walk Bonds, but it was still the case that most situations do not warrant an intentional walk and that Bonds was often intentionally walked in a situation that didn’t warrant an intentional walk. In the case of a batter who is not having one of the statistically best seasons on record in modern baseball, intentional walks are even less good.

In the case of the pitcher batting, doing the same kind of analysis as above also reveals that there are situations where an intentional walk is appropriate (not-late game, {1B, 2B; 2 out}, when the pitcher is not a significantly above average batter for a pitcher). Even though it’s not always the wrong decision to issue an intentional walk, the intentional walk is still grossly overused.

One might argue that the fact that our simple analysis has all of these non-idealities that could have invalidated the analysis is a sign that decision making in baseball wasn’t so bad after all, but I don’t think that holds. A first-order approximation that someone could do in an hour or two finds that decision making seems quite bad, on average. If a team was interested in looking at data, that ought to lead them into doing a more detailed analysis that takes into account the conventional-wisdom based critiques of the obvious one-hour analysis. It appears that this wasn’t done, at least not for decades.

The problem is that before people started running the data, all we had to go by were stories. Someone would say "with 2 outs, you should walk the batter before the pitcher [in some situations] to get to the pitcher and get the guaranteed out". Someone else might respond "we obviously shouldn't do that late game because the pitcher will get subbed out for a pinch hitter and early game, we shouldn't do it because even if it works and we get the easy out, it sets the other team up to lead off the next inning with their #1 hitter instead of an easy out". Which of these stories is the right story turns out to be an empirical question. The thing that I find most unfortunate is that, after people started running the numbers and the argument became one of stories vs. data, people persisted in sticking with the story-based argument for decades. We see the same thing in business and engineering, but it's arguably more excusable there because decisions in those areas tend to be harder to quantify. Even if you can reduce something to a simple engineering equation, someone can always argue that the engineering decision isn't what really matters and this other business concern that's hard to quantify is the most important thing.

Appendix: possession

Something I find interesting is that statistical analysis in football, baseball, and basketball has found that teams have overwhelmingly undervalued possessions for decades. Baseball doesn't have the concept of possession per se, but if you look at being on offense as "having possession" and getting 3 outs as "losing possession", it's quite similar.

In football, we see that maintaining possession is such a big deal that it is usually an error to punt on 4th down, but this hasn't stopped teams from punting by default basically forever. And in basketball, players who shoot a lot with a low shooting percentage were (and arguably still are) overrated.

I don't think this is fundamental -- that possessions are as valuable as they are comes out of the rules of each game. It's arbitrary. I still find it interesting, though.

Appendix: other analysis of management decisions

Bloom et al., Does management matter? Evidence from India looks at the impact of management interventions and the effect on productivity.

Other work by Bloom.

DellaVigna et al., Uniform pricing in US retail chains allegedly finds a significant amount of money left on the table by retail chains (seven percent of profits) and explores why that might happen and what the impacts are.

The upside of work like this vs. sports work is that it attempts to quantify the impact of things outside of a contrived game. The downside is that the studies are on things that are quite messy and it's hard to tell what the study actually means. Just for example, if you look at studies on innovation, economists often use patents as a proxy for innovation and then come to some conclusion based on some variable vs. number of patents. But if you're familiar with engineering patents, you'll know that number of patents is an incredibly poor proxy for innovation. In the hardware world, IBM is known for cranking out a very large number of useless patents (both in the sense of useless for innovation and also in the narrow sense of being useless as a counter-attack in patent lawsuits) and there are some companies that get much more mileage out of filing many fewer patents.

AFAICT, our options here are to know a lot about decisions in a context that's arguably completely irrelevant, or to have ambiguous information and probably know very little about a context that seems relevant to the real world. I'd love to hear about more studies in either camp (or even better, studies that don't have either problem).

Thanks to Leah Hanson, David Turner, Milosz Dan, Andrew Nichols, Justin Blank, @hoverbikes, Kate Murphy, Ben Kuhn, Patrick Collison, and an anonymous commenter for comments/corrections/discussion.

Tue, 21 Nov 2017 00:00:00 +0000


How out of date are Android devices?

It's common knowledge that Android devices tend to be more out of date than iOS devices, but what does this actually mean? Let’s look at Android marketshare data to see how old devices in the wild are. The x axis of the plot below is date, and the y axis is Android marketshare. The share of all devices sums to 100% (with some artifacts because the public data Google provides is low precision).

Color indicates age:

If we look at the graph, we see a number of reverse-S shaped contours; between each pair of contours, devices get older as we go from left to right. Each contour corresponds to the release of a new android version and the associated devices running that android version. As time passes, devices on that version get older. When a device is upgraded, it’s effectively moved from one contour into a new contour and the color changes to a less outdated color.

Marketshare of outdated Android devices is increasing

There are three major ways in which this graph understates the number of outdated devices:

First, we’re using API version data for this and don’t have access to the marketshare of point releases and minor updates, so we assume that all devices on the same API version are up to date until the moment a new API version is released, but many (and perhaps most) devices won’t receive updates within an API version.

Second, this graph shows marketshare, but the number of Android devices has dramatically increased over time. For example, if we look at the 80%-ile most outdated devices (i.e., draw a line 20% up from the bottom), the 80%-ile device today is a few months more outdated than it was in 2014. The huge growth of Android means that there are many, many more outdated devices now than there were in 2014.

Third, this data comes from scraping Google Play Store marketshare info. That data shows marketshare of devices that have visited the Play Store in the last 7 days. In general, it seems reasonable to believe that devices that visit the Play Store are more up to date than devices that don’t, so we should expect an unknown amount of bias in this data that causes the graph to show that devices are newer than they actually are. This seems plausible both for devices that are used as conventional mobile devices as well as for mobile devices that have replaced things like traditionally embedded devices, PoS boxes, etc.

If we're looking at this from a security standpoint, some devices will receive updates without updating their major version, which would skew this data to make devices look more outdated than they actually are. However, when researchers have used more fine-grained data to see which devices are taking updates, they found that this was not a large effect.

One thing we can see from that graph is that, as time goes on, the world accumulates a larger fraction of old devices over time. This makes sense and we could have figured this out without looking at the data. After all, back at the beginning of 2010, Android phones couldn’t be much more than a year old, and now it’s possible to have Android devices that are nearly a decade old.

Something that wouldn’t have been obvious without looking at the data is that the uptake of new versions seems to be slowing down -- we can see this by looking at the last few contour lines at the top right of the graph, corresponding to the most recent Android releases. These lines have a shallower slope than the contour lines for previous releases. Unfortunately, with this data alone, we can’t tell why the slope is shallower. Some possible reasons might be:

Without more data, it’s impossible to tell how much each of these is contributing to the problem. BTW, let me know if you know of a reasonable source for the active number of Android devices going back to 2010! I’d love to produce a companion graph of the total number of outdated devices.

But even with the data we have, we can take a guess at how many outdated devices are in use. In May 2017, Google announced that there are over two billion active Android devices. If we look at the latest stats (the far right edge), we can see that nearly half of these devices are two years out of date. At this point, we should expect that there are more than one billion devices that are two years out of date! Given Android's update model, we should expect approximately 0% of those devices to ever get updated to a modern version of Android.

Percentiles

Since there’s a lot going on in the graph, we might be able to see something if we look at some subparts of the graph. If we look at a single horizontal line across the graph, that corresponds to the device age at a certain percentile:

Over time, the Nth percentile out of date device is getting more out of date

In this graph, the date is on the x axis and the age in months is on the y axis. Each line corresponds to a different percentile (higher percentile is older), which corresponds to a horizontal slice of the top graph at that percentile.

Each individual line seems to have two large phases (with some other stuff, too). There’s one phase where devices for that percentile get older as quickly as time is passing, followed by a phase where, on average, devices only get slightly older. In the second phase, devices sometimes get younger as new releases push younger versions into a certain percentile, but this doesn’t happen often enough to counteract the general aging of devices. Taken as a whole, this graph indicates that, if current trends continue, we should expect to see proportionally more old Android devices as time goes on, which is exactly what we’d expect from the first, busier, graph.
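For concreteness, here's one way a percentile line like the ones in this graph could be computed, assuming the scraped data has already been reduced to (release date of the API version, share) pairs for each snapshot date. The data layout and values here are made up for illustration and, as discussed above, using the API version's release date understates how out of date devices really are.

    from datetime import date

    def age_at_percentile(snapshot_date, version_shares, pct):
        # version_shares: (api_version_release_date, share) pairs for one
        # snapshot, with shares summing to roughly 1. A device's age is
        # approximated as the age of the API version it runs.
        ages = sorted(((snapshot_date - released).days / 30.4, share)
                      for released, share in version_shares)
        cumulative = 0.0
        for age_months, share in ages:
            cumulative += share
            if cumulative >= pct:
                return age_months
        return ages[-1][0]

    # Hypothetical snapshot: 30% of devices on a release from 8/2017, 55% on a
    # release from 8/2016, 15% on a release from 11/2014.
    shares = [(date(2017, 8, 21), 0.30), (date(2016, 8, 22), 0.55), (date(2014, 11, 3), 0.15)]
    print(round(age_at_percentile(date(2017, 11, 12), shares, 0.80), 1))  # ~14.7 months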

Dates

Another way to look at the graph is to look at a vertical slice instead of a horizontal slice. In that case, each slice corresponds to looking at the ages of devices at one particular date:

In this plot, the x axis indicates the age percentile and the y axis indicates the raw age in months. Each line is one particular date, with older dates being lighter / yellower and newer dates being darker / greener.

As with the other views of the same data, we can see that Android devices appear to be getting more out of date as time goes on. This graph would be too busy to read if we plotted data for all of the dates that are available, but we can see it as an animation:

iOS

For reference, iOS 11 was released two months ago and it now has just under 50% iOS marketshare despite November’s numbers coming before the release of the iPhone X (this is compared to < 1% marketshare for the latest Android version, which was released in August). It’s overwhelmingly likely that, by the start of next year, iOS 11 will have more than 50% marketshare and there’s an outside chance that it will have 75% marketshare, i.e., it’s likely that the corresponding plot for iOS would have the 50%-ile (red) line in the second plot at age = 0 and it’s not implausible that the 75%-ile (orange) line would sometimes dip down to 0. As is the case with Android, there are some older devices that stubbornly refuse to update; iOS 9.3, released a bit over two years ago, sits at just a bit above 5% marketshare. This means that, in the iOS version of the plot, it’s plausible that we’d see the corresponding 99%-ile (green) line in the second plot at a bit over two years (half of what we see for the Android plot).

Windows XP

People sometimes compare Android to Windows XP because there are a large number of both in the wild and in both cases, most devices will not get security updates. However, this is tremendously unfair to Windows XP, which was released on 10/2001 and got security updates until 4/2014, twelve and a half years later. Additionally, Microsoft has released at least one security update after the official support period (there was an update in 5/2017 in response to the WannaCry ransomware). It's unfortunate that Microsoft decided to end support for XP while there are still so many XP boxes in the wild, but supporting an old OS for over twelve years and then issuing an emergency security patch more than fifteen years after release puts Microsoft into a completely different league than Google and Apple when it comes to device support.

Another difference between Android and Windows is that Android's scale is unprecedented in the desktop world. There were roughly 200 million PCs sold in 2017. Samsung alone has been selling that many mobile devices per year since 2008. Of course, those weren't Android devices in 2008, but Android's dominance in the non-iOS mobile space means that, overall, those have mostly been Android devices. Today, we still see nearly 50 year old PDP-11 devices in use. There are few enough PDPs around that running into one is a cute, quaint surprise (0.6 million PDP-11s were sold). Desktop boxes age out of service more quickly than PDPs and mobile devices age out of service even more quickly, but the sheer difference in number of devices caused by the ubiquity of modern computing devices means that we're going to see many more XP-era PCs in use 50 years after the release of XP and it's plausible we'll see even more mobile devices around 50 years from now. Many of these ancient PDP, VAX, DOS, etc. boxes are basically safe because they're run in non-networked configurations, but it looks like the same thing is not going to be true for many of these old XP and Android boxes that are going to stay in service for decades.

Conclusion

We’ve seen that Android devices appear to be getting more out of date over time. This makes it difficult for developers to target “new” Android API features, where new means anything introduced in the past few years. It also means that there are a lot of Android devices out there that are behind in terms of security. This is true both in absolute terms and also relative to iOS.

Until recently, Android was directly tied to the hardware it ran on, making it very painful to keep old devices up to date because that requires a custom Android build with phone-specific (or at least SoC-specific) work. Google claims that this problem is fixed in the latest Android version (8.0, Oreo). People who remember Google's "Android update alliance" announcement in 2011 may be a bit skeptical of the more recent announcement. In 2011, Google and U.S. carriers announced that they'd keep devices up to date for 18 months, which mostly didn't happen. However, even if the current announcement isn't smoke and mirrors and the latest version of Android solves the update problem, we've seen that it takes years for Android releases to get adopted and we've also seen that the last few Android releases have significantly slower uptake than previous releases. Additionally, even though this is supposed to make updates easier, it looks like Android is still likely to stay behind iOS in terms of updates for a while. Google has promised that its latest phone (Pixel 2, 10/2017) will get updates for three years. That seems like a step in the right direction, but as we’ve seen from the graphs above, extending support by a year isn’t nearly enough to keep most Android devices up to date. By contrast, if you have an iPhone, the latest version of iOS (released 9/2017) works on devices back to the iPhone 5S (released 9/2013).

If we look at the newest Android release (8.0, 8/2017), it looks like you’re quite lucky if you have a two year old device that will get the latest update. The oldest “Google” phone supported is the Nexus 6P (9/2015), giving it just under two years of support.

If you look back at devices that were released around when the iPhone 5S was, the situation looks even worse. Back then, I got a free Moto X for working at Google; the Moto X was about as close to an official Google phone as you could get at the time (this was back when Google owned Moto). The Moto X was released on 8/2013 (a month before the iPhone 5S) and the latest version of Android it supports is 5.1, which was released on 2/2015, a little more than a year and a half later. For an Android phone of its era, the Moto X was supported for an unusually long time. It's a good sign that things look worse as we look further back in time, but at the rate things are improving, it will be years before there's a decently supported Android device released and then years beyond those years before that Android version is in widespread use. It's possible that Fuchsia will fix this, but Fuchsia is also many years away from widespread use.

In a future post, we'll look at Android response latency, which is also quite interesting. It’s much more variable between phones than iOS response latency is between different models of iPhone.

The main thing I’m missing from my analysis of phone latency is older phones. If you have an old phone I haven’t tested and want to donate it for testing, you can mail it to:

Dan Luu
Recurse Center
455 Broadway, 2nd Floor
New York, NY 10013

Thanks to Leah Hanson, Kate Murphy, Daniel Thomas, Marek Majkowski, @zofrex, @Aissn, Chris Palmer, JonLuca De Caro, and an anonymous person for comments/corrections/related discussion.

Also, thanks to Victorien Villard for making the data these graphs were based on available!

Sun, 12 Nov 2017 00:00:00 +0000


UI backwards compatibility

About once a month, an app that I regularly use will change its UI in a way that breaks muscle memory, basically tricking the user into doing things they don’t want.

Zulip

In recent memory, Zulip (a slack competitor) changed its newline behavior so that ctrl + enter sends a message instead of inserting a new line. After this change, I sent a number of half-baked messages and it seemed like some other people did too.

Around the time they made that change, they made another change such that a series of clicks that would cause you to send a private message to someone would instead cause you to send a private message to the alphabetically first person who was online. Most people didn’t notice that this was a change, but when I mentioned that this had happened to me a few times in the past couple weeks, multiple people immediately said that the exact same thing happened to them. Some people also mentioned that the behavior of navigation shortcut keys was changed in a way that could cause people to broadcast a message instead of sending a private message. In both cases, some people blamed themselves and didn’t know why they’d just started making mistakes that caused them to send messages to the wrong place.

Doors

A while back, I was at Black Seed Bagel, which has a door that looks 75% like a “push” door from both sides when it’s actually a push door from the outside and a pull door from the inside. An additional clue that makes it seem even more like a "push" door from the inside is that most businesses have outward opening doors (this is required for exit doors in the U.S. when the room occupancy is above 50 and many businesses in smaller spaces voluntarily follow the same convention). During the course of an hour long conversation, I saw a lot of people go in and out and my guess is that ten people failed on their first attempt to use the door while exiting. When people were travelling in pairs or groups, the person in front would often say something like “I’m dumb. We just used this door a minute ago”. But the people were not, in fact, acting dumb. If anything is dumb, it’s designing doors such that users have to memorize which doors act like “normal” doors and which doors have their cues reversed.

If you’re interested in the physical world, The Design of Everyday Things gives many real-world examples where users are subtly nudged into doing the wrong thing. It also discusses general principles in a way that allows you to see the general idea, apply it, and avoid the same issues when designing software.

Facebook

Last week, FB changed its interface so that my normal sequence of clicks to hide a story saves the story instead of hiding it. Saving is pretty much the opposite of hiding! It’s the opposite both from the perspective of the user and also as a ranking signal to the feed ranker. The really “great” thing about a change like this is that it A/B tests incredibly well if you measure new feature “engagement” by number of clicks because many users will accidentally save a story when they meant to hide it. Earlier this year, twitter did something similar by swapping the location of “moments” and “notifications”.

Even if the people making the change didn’t create the tricky interface in order to juice their engagement numbers, this kind of change is still problematic because it poisons analytics data. While it’s technically possible to build a model to separate out accidental clicks vs. purposeful clicks, that’s quite rare (I don’t know of any A/B tests where people have done that) and even in cases where it’s clear that users are going to accidentally trigger an action, I still see devs and PMs justify a feature because of how great it looks on naive statistics like DAU/MAU.

API backwards compatibility

When it comes to software APIs, there’s a school of thought that says that you should never break backwards compatibility for some classes of widely used software. A well-known example is Linus Torvalds:

People should basically always feel like they can update their kernel and simply not have to worry about it.

I refuse to introduce "you can only update the kernel if you also update that other program" kind of limitations. If the kernel used to work for you, the rule is that it continues to work for you. … I have seen, and can point to, lots of projects that go "We need to break that use case in order to make progress" or "you relied on undocumented behavior, it sucks to be you" or "there's a better way to do what you want to do, and you have to change to that new better way", and I simply don't think that's acceptable outside of very early alpha releases that have experimental users that know what they signed up for. The kernel hasn't been in that situation for the last two decades. ... We do API breakage inside the kernel all the time. We will fix internal problems by saying "you now need to do XYZ", but then it's about internal kernel API's, and the people who do that then also obviously have to fix up all the in-kernel users of that API. Nobody can say "I now broke the API you used, and now you need to fix it up". Whoever broke something gets to fix it too. ... And we simply do not break user space.

Raymond Chen quoting Colen:

Look at the scenario from the customer’s standpoint. You bought programs X, Y and Z. You then upgraded to Windows XP. Your computer now crashes randomly, and program Z doesn’t work at all. You’re going to tell your friends, "Don’t upgrade to Windows XP. It crashes randomly, and it’s not compatible with program Z." Are you going to debug your system to determine that program X is causing the crashes, and that program Z doesn’t work because it is using undocumented window messages? Of course not. You’re going to return the Windows XP box for a refund. (You bought programs X, Y, and Z some months ago. The 30-day return policy no longer applies to them. The only thing you can return is Windows XP.)

While this school of thought is a minority, it’s a vocal minority with a lot of influence. It’s much rarer to hear this kind of case made for UI backwards compatibility. You might argue that this is fine -- people are forced to upgrade nowadays, so it doesn’t matter if stuff breaks. But even if users can’t escape, it’s still a bad user experience.

The counterargument to this school of thought is that maintaining compatibility creates technical debt. It’s true! Just for example, Linux is full of slightly to moderately wonky APIs due to the “do not break user space” dictum. One example is int recvmmsg(int sockfd, struct mmsghdr *msgvec, unsigned int vlen, unsigned int flags, struct timespec *timeout); . You might expect the timeout to fire if you don’t receive a packet, but the manpage reads:

The timeout argument points to a struct timespec (see clock_gettime(2)) defining a timeout (seconds plus nanoseconds) for the receive operation (but see BUGS!).

The BUGS section reads:

The timeout argument does not work as intended. The timeout is checked only after the receipt of each datagram, so that if up to vlen-1 datagrams are received before the timeout expires, but then no further datagrams are received, the call will block forever.

This is arguably not even the worst mis-feature of recvmmsg, which returns an ssize_t into a field of size int.

If you have a policy like “we simply do not break user space”, this sort of technical debt sticks around forever. But it seems to me that it’s not a coincidence that the most widely used desktop, laptop, and server operating systems in the world bend over backwards to maintain backwards compatibility.

The case for UI backwards compatibility is arguably stronger than the case for API backwards compatibility because breaking API changes can be mechanically fixed and, with the proper environment, all callers can be fixed at the same time as the API changes. There's no equivalent way to reach into people's brains and change user habits, so a breaking UI change inevitably results in pain for some users.

The case for UI backwards compatibility is arguably weaker than the case for API backwards compatibility because API backwards compatibility has a lower cost -- if some API is problematic, you can make a new API and then document the old API as something that shouldn’t be used (you’ll see lots of these if you look at Linux syscalls). This doesn’t really work with GUIs since UI elements compete with each other for a small amount of screen real-estate. An argument that I think is underrated is that changing UIs isn’t as great as most companies seem to think -- very dated looking UIs that haven’t been refreshed to keep up with trends can be successful (e.g., plentyoffish and craigslist). Companies can even become wildly successful without any significant UI updates, let alone UI redesigns -- a large fraction of linkedin’s rocketship growth happened in a period where the UI was basically frozen. I’m told that freezing the UI wasn’t a deliberate design decision; instead, it was a side effect of severe technical debt, and that the UI was unfrozen the moment a re-write allowed people to confidently change the UI. Linkedin has managed to add a lot of dark patterns since they unfroze their front-end, but the previous UI seemed to work just fine in terms of growth.

Despite the success of a number of UIs which aren’t always updated to track the latest trends, at most companies, it’s basically impossible to make the case that UIs shouldn’t be arbitrarily changed without adding functionality, let alone make the case that UIs shouldn’t push out old functionality with new functionality.

UI deprecation

A case that might be easier to make is that shortcuts and shortcut-like UI elements can be deprecated before removal, similar to the way evolving APIs will add deprecation warnings before making breaking changes. Instead of regularly changing UIs so that users’ muscle memory is used against them and causes users to do the opposite of what they want, UIs can be changed so that doing the previously trained set of actions causes nothing to happen. For example, FB could have moved “hide post” down and inserted a no-op item in the old location, and then after people had gotten used to not clicking in the old “hide post” location for “hide post”, they could have then put “save post” in the old location for “hide post”.

Zulip could’ve done something similar, so that the series of actions that used to send a private message to the person you wanted would cause no message to be sent at all, instead of sending a private message to the alphabetically first person on the online list.

These solutions aren’t ideal because the user still has to retrain their muscle memory on the new thing, but it’s still a lot better than the current situation, where many UIs regularly introduce arbitrary-seeming changes that sow confusion and chaos.

In some cases (e.g., the no-op menu item), this presents a pretty strange interface to new users. Users don’t expect to see a menu item that does nothing with an arrow that says to click elsewhere on the menu instead. This can be fixed by only rolling out deprecation “warnings” to users who regularly use the old shortcut or shortcut-like path. If there are multiple changes being deprecated, this results in a combinatorial explosion of possibilities, but if you're regularly deprecating multiple independent items, that's pretty extreme and users are probably going to be confused regardless of how it's handled. Given the amount of effort made to avoid user hostile changes and the dominance of the “move fast and break things” mindset, the case for adding this kind of complexity just to avoid giving users a bad experience probably won’t hold at most companies, but this at least seems plausible in principle.

Breaking existing user workflows arguably doesn’t matter for an app like FB, which is relatively sticky as a result of its dominance in its area, but most applications are more like Zulip than FB. Back when Zulip and Slack were both young, Zulip messages couldn’t be edited or deleted. This was on purpose -- messages were immutable and everyone I know who suggested allowing edits was shot down because mutable messages didn’t fit into the immutable model. Back then, if there was a UI change or bug that caused users to accidentally send a public message instead of a private message, that was basically permanent. I saw people accidentally send public messages often enough that I got into the habit of moving private message conversations to another medium. That didn’t bother me too much since I’m used to quirky software, but I know people who tried Zulip back then and, to this day, still refuse to use Zulip due to UI issues they hit back then. That’s a bit of an extreme case, but the general idea that users will tend to avoid apps that repeatedly cause them pain isn’t much of a stretch.

In studies on user retention, it appears to be the case that an additional 500ms of page-load latency negatively impacts retention. If that's the case, it seems like switching the UI around so that the user has to spend 5s undoing an action, or broadcasts a private message publicly in a way that can't be undone, should have a noticeable impact on retention, although I don't know of any public studies that look at this.

Conclusion

If I worked on UI, I might have some suggestions or a call to action. But as an outsider, I’m wary of making actual suggestions -- programmers seem especially prone to coming into an area they’re not familiar with and telling experts how they should solve their problems. While this occasionally works, the most likely outcome is that the outsider either re-invents something that’s been known for decades or completely misses the most important parts of the problem.

It sure would be nice if shortcuts didn’t break so often that I spend as much time consciously stopping myself from using shortcuts as I do actually using the app. But there are probably reasons this is difficult to test/enforce. The huge number of platforms that need to be tested for robust UI testing makes testing hard even without adding this extra kind of test. And, even when we’re talking about functional correctness problems, “move fast and break things” is much trendier than “try to break relatively few things”. Since UI “correctness” often has even lower priority than functional correctness, it’s not clear how someone could successfully make a case for spending more effort on it.

On the other hand, despite all these disclaimers, Google sometimes does the exact things described in this post. Chrome recently removed backspace to go backwards; if you hit backspace, you get a note telling you to use alt+left instead and when maps moved some items around a while back, they put in no-op placeholders that pointed people to the new location. This doesn't mean that Google always does this well -- on April fools day of 2016, gmail replaced send and archive with send and attach a gif that's offensive in some contexts -- but these examples indicate that maintaining backwards compatibility through significant changes isn't just a hypothetical idea, it can and has been done.

Thanks to Leah Hanson, Allie Jones, Randall Koutnik, Kevin Lynagh, David Turner, Christian Ternus, Ted Unangst, Michael Bryc, Tony Finch, Stephen Tigner, Steven McCarthy, Julia Evans, @BaudDev, and an anonymous person who has a moral objection to public acknowledgements for comments/corrections/discussion.

If you're curious why "anon" is against acknowledgements, it's because they first saw these in Paul Graham's writing, whose acknowledgements are sort of a who's who of SV. anon's belief is that these sorts of lists serve as a kind of signalling. I won't claim that's wrong, but I get a lot of help with my writing both from people reading drafts and also from the occasional helpful public internet comment and I think it's important to make it clear that this isn't a one-person effort to combat what Bunnie Huang calls "the idol effect".

In a future post, we'll look at empirical work on how line length affects readability. I've read every study I could find, but I might be missing some. If you know of a good study you think I should include, please let me know.

Thu, 09 Nov 2017 00:00:00 +0000


Filesystem error handling

We’re going to reproduce some results from papers on filesystem robustness that were written up roughly a decade ago: Prabhakaran et al.'s SOSP 05 paper, which injected errors below the filesystem, and Gunawi et al.'s FAST 08 paper, which looked at how often filesystems failed to check return codes of functions that can return errors.

Prabhakaran et al. injected errors at the block device level (just underneath the filesystem) and found that ext3, reiserfs, ntfs, and jfs mostly handled read errors reasonably but ext3, ntfs, and jfs mostly ignored write errors. While the paper is interesting, someone installing Linux on a system today is much more likely to use ext4 than any of the now-dated filesystems tested by Prabhakaran et al. We’ll try to reproduce some of the basic results from the paper on more modern filesystems like ext4 and btrfs, some legacy filesystems like exfat, ext3, and jfs, as well as on overlayfs.

Gunawi et al. found that errors weren’t checked most of the time. After we look at error injection on modern filesystems, we’ll look at how much (or little) filesystems have improved their error handling code.

Error injection

A cartoon view of a file read might be: pread syscall -> OS generic filesystem code -> filesystem specific code -> block device code -> device driver -> device controller -> disk. Once the disk gets the request, it sends the data back up: disk -> device controller -> device driver -> block device code -> filesystem specific code -> OS generic filesystem code -> pread. We’re going to look at error injection at the block device level, right below the file system.

Let’s look at what happened when we injected errors in 2017 vs. what Prabhakaran et al. found in 2005.

            2005                       2017 (file)                2017 (mmap)
            read    write   silent     read    write   silent     read    write   silent
btrfs                                  prop    prop    prop       prop    prop    prop
exfat                                  prop    prop    ignore     prop    prop    ignore
ext3        prop    ignore  ignore     prop    prop    ignore     prop    prop    ignore
ext4                                   prop    prop    ignore     prop    prop    ignore
fat                                    prop    prop    ignore     prop    prop    ignore
jfs         prop    ignore  ignore     prop    ignore  ignore     prop    prop    ignore
reiserfs    prop    prop    ignore
xfs                                    prop    prop    ignore     prop    prop    ignore

Each row shows results for one filesystem. read and write indicate reading and writing data, respectively, where the block device returns an error indicating that the operation failed. silent indicates a read failure (incorrect data) where the block device didn’t indicate an error. This could happen if there’s disk corruption, a transient read failure, or a transient write failure that silently caused bad data to be written. file indicates that the operation was done on a file opened with open and mmap indicates that the test was done on a file mapped with mmap. ignore indicates that the error was ignored, prop indicates that the error was propagated and that the pread or pwrite syscall returned an error code, and fix indicates that the error was corrected. No errors were corrected in this table. Blank entries indicate configurations that weren’t tested.

From the table, we can see that, in 2005, ext3 and jfs ignored write errors even when the block device indicated that the write failed. Things have improved since then: any filesystem you’re likely to use today will correctly tell you that a write failed. jfs hasn’t improved, but jfs is now rarely used outside of legacy installations.

No tested filesystem other than btrfs handled silent failures correctly. The other filesystems tested neither duplicate nor checksum data, making it impossible for them to detect silent failures. zfs would probably also handle silent failures correctly but wasn’t tested. apfs, despite post-dating btrfs and zfs, made the explicit decision to not checksum data and silently fail on silent block device errors. We’ll discuss this more later.

In all cases tested where errors were propagated, file reads and writes returned EIO from pread or pwrite, respectively; mmap reads and writes caused the process to receive a SIGBUS signal.
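
To make the two failure modes concrete, here's a minimal userspace sketch (this isn't our test harness; the path and sizes are made up) of what each one looks like to a program: a failed pread shows up as a -1 return with errno set to EIO, while a failed read through an mmap'd region arrives asynchronously as a SIGBUS signal.

#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void bus_handler(int sig)
{
    (void)sig;
    /* A failed read through the mapping lands here rather than returning an error code. */
    write(STDERR_FILENO, "SIGBUS: I/O error on mapped read\n", 33);
    _exit(1);
}

int main(void)
{
    int fd = open("/mnt/fserror_test/test.txt", O_RDONLY);  /* hypothetical path */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char buf[4096];
    if (pread(fd, buf, sizeof(buf), 0) < 0)
        fprintf(stderr, "pread: %s\n", strerror(errno));  /* e.g., EIO */

    signal(SIGBUS, bus_handler);
    char *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    volatile char c = p[0];  /* a block-level read error here raises SIGBUS */
    (void)c;
    return 0;
}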

The 2017 tests above used an 8k file where the first block that contained file data either returned an error at the block device level or was corrupted, depending on the test. The table below tests the same thing, but with a 445 byte file instead of an 8k file. The choice of 445 was arbitrary.

            2005                       2017 (file)                2017 (mmap)
            read    write   silent     read    write   silent     read    write   silent
btrfs                                  fix     fix     fix        fix     fix     fix
exfat                                  prop    prop    ignore     prop    prop    ignore
ext3        prop    ignore  ignore     prop    prop    ignore     prop    prop    ignore
ext4                                   prop    prop    ignore     prop    prop    ignore
fat                                    prop    prop    ignore     prop    prop    ignore
jfs         prop    ignore  ignore     prop    ignore  ignore     prop    prop    ignore
reiserfs    prop    prop    ignore
xfs                                    prop    prop    ignore     prop    prop    ignore

In the small file test table, all the results are the same, except for btrfs, which returns correct data in every case tested. What’s happening here is that the filesystem was created on a rotational disk and, by default, btrfs duplicates filesystem metadata on rotational disks (it can be configured to do so on SSDs, but that’s not the default). Since the file was tiny, btrfs packed the file into the metadata and the file was duplicated along with the metadata, allowing the filesystem to fix the error when one block either returned bad data or reported a failure.

Overlay

Overlayfs allows one file system to be “overlaid” on another. As explained in the initial commit, one use case might be to put an (upper) read-write directory tree on top of a (lower) read-only directory tree, where all modifications go to the upper, writable layer.

Although not listed on the tables, we also tested every filesystem other than fat as the lower filesystem with overlayfs (ext4 was the upper filesystem for all tests). Every filesystem tested showed the same results when used as the bottom layer in overlay as when used alone. fat wasn’t tested because mounting fat resulted in a filesystem not supported error.

Error correction

btrfs doesn’t, by default, duplicate metadata on SSDs because the developers believe that redundancy wouldn’t provide protection against errors on SSD (which is the same reason apfs doesn’t have redundancy). SSDs do a kind of write coalescing, which is likely to cause writes which happen consecutively to fall into the same block. If that block has a total failure, the redundant copies would all be lost, so redundancy doesn’t provide as much protection against failure as it would on a rotational drive.

I’m not sure that this means that redundancy wouldn’t help -- individual flash cells degrade with operation and lose charge as they age. SSDs have built-in wear-leveling and error-correction that’s designed to reduce the probability that a block returns bad data, but over time, some blocks will develop so many errors that the error-correction won’t be able to fix the error and the block will return bad data. In that case, a read should return some bad bits along with mostly good bits. AFAICT, the publicly available data on SSD error rates seems to line up with this view.

Error detection

Relatedly, it appears that apfs doesn’t checksum data because “[apfs] engineers contend that Apple devices basically don’t return bogus data”. Publicly available studies on SSD reliability have not found a model that doesn’t sometimes return bad data. It’s commonly believed that SSDs are less likely to return bad data than rotational disks, but when Google studied this across their drives, they found:

The annual replacement rates of hard disk drives have previously been reported to be 2-9% [19,20], which is high compared to the 4-10% of flash drives we see being replaced in a 4 year period. However, flash drives are less attractive when it comes to their error rates. More than 20% of flash drives develop uncorrectable errors in a four year period, 30-80% develop bad blocks and 2-7% of them develop bad chips. In comparison, previous work [1] on HDDs reports that only 3.5% of disks in a large population developed bad sectors in a 32 months period – a low number when taking into account that the number of sectors on a hard disk is orders of magnitudes larger than the number of either blocks or chips on a solid state drive, and that sectors are smaller than blocks, so a failure is less severe.

While there is one sense in which SSDs are more reliable than rotational disks, there’s also a sense in which they appear to be less reliable. It’s not impossible that Apple uses some kind of custom firmware on its drive that devotes more bits to error correction than you can get in publicly available disks, but even if that’s the case, you might plug a non-apple drive into your apple computer and want some kind of protection against data corruption.

Internal error handling

Now that we’ve reproduced some tests from Prabhakaran et al., we’re going to move on to Gunawi et al. Since the paper is fairly involved, we’re just going to look at one small part of it, the part where they examined three function calls, filemap_fdatawait, filemap_fdatawrite, and sync_blockdev, to see how often errors weren’t checked for these functions.

Their justification for looking at these functions is given as:

As discussed in Section 3.1, a function could return more than one error code at the same time, and checking only one of them suffices. However, if we know that a certain function only returns a single error code and yet the caller does not save the return value properly, then we would know that such call is really a flaw. To find real flaws in the file system code, we examined three important functions that we know only return single error codes: sync_blockdev, filemap_fdatawrite, and filemap_fdatawait. A file system that does not check the returned error codes from these functions would obviously let failures go unnoticed in the upper layers.

Ignoring errors from these functions appears to have fairly serious consequences. The documentation for filemap_fdatawait says:

filemap_fdatawait — wait for all under-writeback pages to complete ... Walk the list of under-writeback pages of the given address space and wait for all of them. Check error status of the address space and return it. Since the error status of the address space is cleared by this function, callers are responsible for checking the return value and handling and/or reporting the error.

The comment next to the code for sync_blockdev reads:

Write out and wait upon all the dirty data associated with a block device via its mapping. Does not take the superblock lock.

In both of these cases, it appears that ignoring the error code could mean that data would fail to get written to disk without notifying the writer that the data wasn’t actually written.
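
As a toy illustration of what's at stake (writeback_and_wait is a made-up stand-in, not real kernel code), here's the difference between a caller that propagates the single error code these functions return and one that silently drops it:

#include <stdio.h>

/* Toy sketch, not actual kernel code: writeback_and_wait() stands in for a
   function like filemap_fdatawait that returns a single error code. */
static int writeback_and_wait(void)
{
    return -5;  /* pretend the device reported an I/O error (-EIO) */
}

/* Checked: the caller learns that its data may not have hit the disk. */
static int flush_checked(void)
{
    int err = writeback_and_wait();
    if (err)
        return err;
    return 0;
}

/* Unchecked: the error is dropped and the writer is never told. This is the
   kind of pattern Gunawi et al. counted as a flaw. */
static int flush_unchecked(void)
{
    writeback_and_wait();
    return 0;
}

int main(void)
{
    printf("checked:   %d\n", flush_checked());   /* -5: caller can react */
    printf("unchecked: %d\n", flush_unchecked()); /*  0: the failure is invisible */
    return 0;
}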

Let’s look at how often calls to these functions didn’t completely ignore the error code:

function             2008      %     2017      %
filemap_fdatawait    7 / 29    24    12 / 17   71
filemap_fdatawrite   17 / 47   36    13 / 22   59
sync_blockdev        6 / 21    29    7 / 23    30

This table is for all code in linux under fs. Each row shows data for calls of one function. For each year, the leftmost cell shows the number of calls that do something with the return value over the total number of calls. The cell to the right shows the percentage of calls that do something with the return value. “Do something” is used very loosely here -- branching on the return value and then failing to handle the error in either branch, returning the return value and having the caller fail to handle the return value, as well as saving the return value and then ignoring it are all considered doing something for the purposes of this table.

For example, Gunawi et al. noted that cifs/transport.c had

int SendReceive () {
    int rc;
    rc = cifs_sign_smb(); // rc saved here, but never checked ...
    ...
    rc = smb_send();      // ... before being overwritten here
}

Although cifs_sign_smb returned an error code, it was never checked before being overwritten by smb_send, which counted as being used for our purposes even though the error wasn’t handled.

Overall, the table appears to show that many more errors are handled now than were handled in 2008 when Gunawi et al. did their analysis, but it’s hard to say what this means from looking at the raw numbers because it might be ok for some errors not to be handled and different lines of code are executed with different probabilities.

Conclusion

Filesystem error handling seems to have improved. Reporting an error on a pwrite if the block device reports an error is perhaps the most basic error propagation a robust filesystem should do; few filesystems reported that error correctly in 2005. Today, most filesystems will correctly report an error when the simplest possible error condition that doesn’t involve the entire drive being dead occurs if there are no complicating factors.

Most filesystems don’t have checksums for data and leave error detection and correction up to userspace software. When I talk to server-side devs at big companies, their answer is usually something like “who cares? All of our file accesses go through a library that checksums things anyway and redundancy across machines and datacenters takes care of failures, so we only need error detection and not correction”. While that’s true for developers at certain big companies, there’s a lot of software out there that isn’t written robustly and just assumes that filesystems and disks don’t have errors.

This was a joint project with Wesley Aptekar-Cassels; the vast majority of the work for the project was done while pair programming at RC. We also got a lot of help from Kate Murphy. Both Wesley (w.aptekar@gmail.com) and Kate (hello@kate.io) are looking for work. They’re great and I highly recommend talking to them if you’re hiring!

Appendix: error handling in C

A fair amount of effort has been applied to get error handling right. But C makes it very easy to get things wrong, even when you apply a fair amount of effort and even apply extra tooling. One example of this in the code is the submit_one_bio function. If you look at the definition, you can see that it’s annotated with __must_check, which will cause a compiler warning when the result is ignored. But if you look at calls of submit_one_bio, you’ll see that its callers aren’t annotated and can ignore errors. If you dig around enough you’ll find one path of error propagation that looks like:

submit_one_bio
submit_extent_page
__extent_writepage
extent_write_full_page
write_cache_pages
generic_writepages
do_writepages
__filemap_fdatawrite_range
__filemap_fdatawrite
filemap_fdatawrite

Nine levels removed from submit_one_bio, we see our old friend, filemap_fdatawrite, which we know often doesn’t get checked for errors.
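
To see why the __must_check annotation doesn't save us here, consider the following sketch. The macro definition mirrors the kernel's (a wrapper around gcc's warn_unused_result attribute), but the functions are simplified stand-ins; the compiler only warns at call sites of the annotated function, so the warning disappears as soon as the value passes through an unannotated caller.

/* Sketch of why __must_check doesn't protect a whole call chain. */
#define __must_check __attribute__((warn_unused_result))

static int __must_check submit_one_bio(void)
{
    return -5;  /* pretend -EIO */
}

/* Not annotated: once the error passes through here, callers are free to
   drop it without any warning. */
static int submit_extent_page(void)
{
    return submit_one_bio();
}

int main(void)
{
    submit_one_bio();      /* warning: result of __must_check function ignored */
    submit_extent_page();  /* no warning: the same error is silently dropped */
    return 0;
}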

There's a very old debate over how to prevent things like this from accidentally happening. One school of thought, which I'll call the Uncle Bob (UB) school, believes that we can't fix these kinds of issues with tools or processes and simply need to be better programmers in order to avoid bugs. You'll often hear people of the UB school say things like, "you can't get rid of all bugs with better tools (or processes)". In his famous and well-regarded talk, Simple Made Easy, Rich Hickey says

What's true of every bug found in the field?

[Audience reply: Someone wrote it?] [Audience reply: It got written.]

It got written. Yes. What's a more interesting fact about it? It passed the type checker.

[Audience laughter]

What else did it do?

[Audience reply: (Indiscernible)]

It passed all the tests. Okay. So now what do you do? Right? I think we're in this world I'd like to call guardrail programming. Right? It's really sad. We're like: I can make change because I have tests. Who does that? Who drives their car around banging against the guardrail saying, "Whoa! I'm glad I've got these guardrails because I'd never make it to the show on time."

[Audience laughter]

If you watch the talk, Rich uses "simplicity" the way Uncle Bob uses "discipline". The way these statements are used, they're roughly equivalent to Ken Thompson saying "Bugs are bugs. You write code with bugs because you do". The UB school throws tools and processes under the bus, saying that it's unsafe to rely solely on tools or processes.

Rich's rhetorical trick is brilliant -- I've heard that line quoted tens of times since the talk to argue against tests or tools or types. But, like guardrails, most tools and processes aren't about eliminating all bugs, they're about reducing the severity or probability of bugs. If we look at this particular function call, we can see that a static analysis tool failed to find this bug. Does that mean that we should give up on static analysis tools? A static analysis tool could look for all calls of submit_one_bio and show you the cases where the error is propagated up N levels only to be dropped. Gunawi et al. did exactly that and found a lot of bugs. A person basically can't do the same thing without tooling. They could try, but people are lucky if they get 95% accuracy when manually digging through things like this. The sheer volume of code guarantees that a human doing this by hand would make mistakes.

Even better than a static analysis tool would be a language that makes it harder to accidentally forget about checking for an error. One of the issues here is that it's sometimes valid to drop an error. There are a number of places where there's no interface that allows an error to get propagated out of the filesystem, making it correct to drop the error, modulo changing the interface. In the current situation, as an outsider reading the code, if you look at a bunch of calls that drop errors, it's very hard to say, for all of them, which of those is a bug and which of those is correct. If the default is that we have a kind of guardrail that says "this error must be checked", people can still incorrectly ignore errors, but you at least get an annotation that the omission was on purpose. For example, if you're forced to specifically write code that indicates that you're ignoring an error, and in code that's intended to be robust, like filesystem code, code that drops an error on purpose is relatively likely to be accompanied by a comment explaining why the error was dropped.
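
As a sketch of what that kind of guardrail could look like in C (this is a general pattern, not something the kernel actually enforces), the opt-out can be made explicit and grep-able:

#define __must_check __attribute__((warn_unused_result))

/* gcc historically doesn't let a bare (void) cast silence warn_unused_result,
   so route the value through a variable; the comment at the call site says
   why the drop is ok. */
#define ignore_error(call) do { int ignored__ = (call); (void)ignored__; } while (0)

static int __must_check write_back(void)
{
    return -5;  /* pretend -EIO */
}

int main(void)
{
    /* Intentionally dropped: no interface to report this error from here. */
    ignore_error(write_back());
    return 0;
}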

Appendix: why wasn't this done earlier?

After all, it would be nice if we knew if modern filesystems could do basic tasks correctly. Filesystem developers probably know this stuff, but since I don't follow LKML, I had no idea whether or not things had improved since 2005 until we ran the experiment.

The papers we looked at here came out of Andrea and Remzi Arpaci-Dusseau's research lab. Remzi has a talk where he mentioned that grad students don't want to reproduce and update old work. That's entirely reasonable, given the incentives they face. And I don't mean to pick on academia here -- this work came out of academia, not industry. It's possible this kind of work simply wouldn't have happened if not for the academic incentive system.

In general, it seems to be quite difficult to fund work on correctness. There are a fair number of papers on new ways to find bugs, but there's relatively little work on applying existing techniques to existing code. In academia, that seems to be hard to get a good publication out of; in the open source world, that seems to be less interesting to people than writing new code. That's also entirely reasonable -- people should work on what they want, and even if they enjoy working on correctness, that's probably not a great career decision in general. I was at the RC career fair the other night and my badge said I was interested in testing. The first person who chatted me up opened with "do you work in QA?". Back when I worked in hardware, that wouldn't have been a red flag, but in software, "QA" is code for a low-skill, tedious, and poorly paid job. Much of industry considers testing and QA to be an afterthought. As a result, open source projects that companies rely on are often woefully underfunded. Google funds some great work (like afl-fuzz), but that's the exception and not the rule, even within Google, and most companies don't fund any open source work. The work in this post was done by a few people who are intentionally temporarily unemployed, which isn't really a scalable model.

Occasionally, you'll see someone spend a lot of effort on improving correctness, but that's usually done as a massive amount of free labor. Kyle Kingsbury might be the canonical example of this -- my understanding is that he worked on the Jepsen distributed systems testing tool on nights and weekends for years before turning that into a consulting business. It's great that he did that -- he showed that almost every open source distributed system had serious data loss or corruption bugs. I think that's great, but stories about heroic effort like that always worry me because heroism doesn't scale. If Kyle hadn't come along, would most of the bugs that he and his tool found still plague open source distributed systems today? That's a scary thought.

If I knew how to fund more work on correctness, I'd try to convince you that we should switch to this new model, but I don't know of a funding model that works. I've set up a patreon (donation account), but it would be quite extraordinary if that was sufficient to actually fund a significant amount of work. If you look at how much programmers make off of donations: if I made two orders of magnitude less than I could if I took a job in industry, that would already put me in the top 1% of programmers on patreon. If I made one order of magnitude less than I'd make in industry, that would be extraordinary. Off the top of my head, the only programmers who make more than that off of patreon either make something with much broader appeal (like games) or are Evan You, who makes one of the most widely used front-end libraries in existence. And if I actually made as much as I can make in industry, I suspect that would make me the highest grossing programmer on patreon, even though, by industry standards, my compensation hasn't been anything special.

If I had to guess, I'd say that part of the reason it's hard to fund this kind of work is that consumers don't incentivize companies to fund this sort of work. If you look at "big" tech companies, two of them are substantially more serious about correctness than their competitors. This results in many fewer horror stories about lost emails and documents as well as lost entire accounts. If you look at the impact on consumers, it might be something like the difference between 1% of people seeing lost/corrupt emails vs. 0.001%. I think that's pretty significant if you multiply that cost across all consumers, but the vast majority of consumers aren't going to make decisions based on that kind of difference. If you look at an area where correctness problems are much more apparent, like databases or backups, you'll find that even the worst solutions have defenders who will pop into any discussions and say "works for me". A backup solution that works 90% of the time is quite bad, but if you have one that works 90% of the time, it will still have staunch defenders who drop into discussions to say things like "I've restored from backup three times and it's never failed! You must be making stuff up!". I don't blame companies for rationally responding to consumers, but I do think that the result is unfortunate for consumers.

Just as an aside, one of the great wonders of doing open work for free is that the more free work you do, the more people complain that you didn't do enough free work. As David MacIver has said, doing open source work is like doing normal paid work, except that you get paid in complaints instead of cash. It's basically guaranteed that the most common comment on this post, for all time, will be that we didn't test someone's pet filesystem because we're btrfs shills or just plain lazy, even though we include a link to a repo that lets anyone add tests as they please. Pretty much every time I've done any kind of free experimental work, people who obviously haven't read the experimental setup or the source code complain that the experiment couldn't possibly be right because of [thing that isn't true that anyone could see by looking at the setup] and that it's absolutely inexcusable that I didn't run the experiment on the exact pet thing they wanted to see. Having played video games competitively in the distant past, I'm used to much more intense internet trash talk, but in general, this incentive system seems to be backwards.

Appendix: experimental setup

For the error injection setup, a high-level view of the experimental setup is that dmsetup was used to simulate bad blocks on the disk.

A list of the commands run looks something like:

cp images/btrfs.img.gz /tmp/tmpeas9efr6.gz
gunzip -f /tmp/tmpeas9efr6.gz
losetup -f
losetup /dev/loop19 /tmp/tmpeas9efr6
blockdev --getsize /dev/loop19
#        0 74078 linear /dev/loop19 0
#        74078 1 error
#        74079 160296 linear /dev/loop19 74079
dmsetup create fserror_test_1508727591.4736078
mount /dev/mapper/fserror_test_1508727591.4736078 /mnt/fserror_test_1508727591.4736078/
mount -t overlay -o lowerdir=/mnt/fserror_test_1508727591.4736078/,upperdir=/tmp/tmp4qpgdn7f,workdir=/tmp/tmp0jn83rlr overlay /tmp/tmpeuot7zgu/
./mmap_read /tmp/tmpeuot7zgu/test.txt
umount /tmp/tmpeuot7zgu/
rm -rf /tmp/tmp4qpgdn7f
rm -rf /tmp/tmp0jn83rlr
umount /mnt/fserror_test_1508727591.4736078/
dmsetup remove fserror_test_1508727591.4736078
losetup -d /dev/loop19
rm /tmp/tmpeas9efr6

See this github repo for the exact set of commands run to execute tests.

Note that all of these tests were done on linux, so fat means the linux fat implementation, not the windows fat implementation. zfs and reiserfs weren’t tested because they couldn’t be trivially tested in the exact same way that we tested other filesystems (one of us spent an hour or two trying to get zfs to work, but its configuration interface is inconsistent with all of the filesystems tested; reiserfs appears to have a consistent interface but testing it requires doing extra work for a filesystem that appears to be dead). ext3 support is now provided by the ext4 code, so what ext3 means now is different from what it meant in 2005.

All tests were run on both ubuntu 17.04, 4.10.0-37, as well as on arch, 4.12.8-2. We got the same results on both machines. All filesystems were configured with default settings. For btrfs, this meant duplicated metadata without duplicated data and, as far as we know, the settings wouldn't have made a difference for other filesystems.

The second part of this doesn’t have much experimental setup to speak of. The setup was to grep the linux source code for the relevant functions.

Thanks to Leah Hanson, David Wragg, Ben Kuhn, Wesley Aptekar-Cassels, Joel Borggrén-Franck, and Dan Puttick for comments/corrections on this post.

Mon, 23 Oct 2017 00:00:00 +0000


Keyboard latency

If you look at “gaming” keyboards, a lot of them sell for $100 or more on the promise that they’re fast. Ad copy that you’ll see includes:

Despite all of these claims, I can only find one person who’s publicly benchmarked keyboard latency and they only tested two keyboards. In general, my belief is that if someone makes performance claims without benchmarks, the claims probably aren’t true, just like how code that isn’t tested (or otherwise verified) should be assumed broken.

The situation with gaming keyboards reminds me a lot of talking to car salesmen:

Salesman: this car is super safe! It has 12 airbags!
Me: that’s nice, but how does it fare in crash tests?
Salesman: 12 airbags!

Sure, gaming keyboards have 1000Hz polling, but so what?

Two obvious questions are:

  1. Does keyboard latency matter?
  2. Are gaming keyboards actually quicker than other keyboards?

Does keyboard latency matter?

A year ago, if you’d asked me if I was going to build a custom setup to measure keyboard latency, I would have said that’s silly, and yet here I am, measuring keyboard latency with a logic analyzer.

It all started because I had this feeling that some old computers feel much more responsive than modern machines. For example, an iMac G4 running macOS 9 or an Apple 2 both feel quicker than my 4.2 GHz Kaby Lake system. I never trust feelings like this because there’s decades of research showing that users often have feelings that are the literal opposite of reality, so I got a high-speed camera and started measuring actual keypress-to-screen-update latency as well as mouse-move-to-screen-update latency. It turns out the machines that feel quick are actually quick, much quicker than my modern computer -- computers from the 70s and 80s commonly have keypress-to-screen-update latencies in the 30ms to 50ms range out of the box, whereas modern computers are often in the 100ms to 200ms range when you press a key in a terminal. It’s possible to get down to the 50ms range in well optimized games with a fancy gaming setup, and there’s one extraordinary consumer device that can easily get below 50ms, but the default experience is much slower. Modern computers have much better throughput, but their latency isn’t so great.

Anyway, at the time I did these measurements, my 4.2 GHz kaby lake had the fastest single-threaded performance of any machine you could buy but had worse latency than a quick machine from the 70s (roughly 6x worse than an Apple 2), which seems a bit curious. To figure out where the latency comes from, I started measuring keyboard latency because that’s the first part of the pipeline. My plan was to look at the end-to-end pipeline and start at the beginning, ruling out keyboard latency as a real source of latency. But it turns out keyboard latency is significant! I was surprised to find that the median keyboard I tested has more latency than the entire end-to-end pipeline of the Apple 2. If this doesn’t immediately strike you as absurd, consider that an Apple 2 has 3500 transistors running at 1MHz and an Atmel employee estimates that the core used in a number of high-end keyboards today has 80k transistors running at 16MHz. That's 20x the transistors running at 16x the clock speed -- keyboards are often more powerful than entire computers from the 70s and 80s! And yet, the median keyboard today adds as much latency as the entire end-to-end pipeline as a fast machine from the 70s.

Let’s look at the measured keypress-to-USB latency on some keyboards:

keyboard                latency (ms)   connection   gaming
apple magic (usb)       15             USB FS
hhkb lite 2             20             USB FS
MS natural 4000         20             USB
das 3                   25             USB
logitech k120           30             USB
unicomp model M         30             USB FS
pok3r vortex            30             USB FS
filco majestouch        30             USB
dell OEM                30             USB
powerspec OEM           30             USB
kinesis freestyle 2     30             USB FS
chinfai silicone        35             USB FS
razer ornata chroma     35             USB FS       Yes
olkb planck rev 4       40             USB FS
ergodox                 40             USB FS
MS comfort 5000         40             wireless
easterntimes i500       50             USB FS       Yes
kinesis advantage       50             USB FS
genius luxemate i200    55             USB
topre type heaven       55             USB FS
logitech k360           60             "unifying"

The latency measurements are the time from when the key starts moving to the time when the USB packet associated with the key makes it out onto the USB bus. Numbers are rounded to the nearest 5 ms in order to avoid giving a false sense of precision. The easterntimes i500 is also sold as the tomoko MMC023.

The connection column indicates the connection used. USB FS stands for the usb full speed protocol, which allows up to 1000Hz polling, a feature commonly advertised by high-end keyboards. USB is the usb low speed protocol, which is the protocol most keyboards use. The ‘gaming’ column indicates whether or not the keyboard is branded as a gaming keyboard. wireless indicates some kind of keyboard-specific dongle and unifying is logitech's wireless device standard.

We can see that, even with the limited set of keyboards tested, there can be as much as a 45ms difference in latency between keyboards. Moreover, a modern computer with one of the slower keyboards attached can’t possibly be as responsive as a quick machine from the 70s or 80s because the keyboard alone is slower than the entire response pipeline of some older computers.

That establishes the fact that modern keyboards contribute to the latency bloat we’ve seen over the past forty years. The other half of the question is, does the latency added by a modern keyboard actually make a difference to users? From looking at the table, we can see that among the keyboards tested, we can get up to a 40ms difference in average latency. Is 40ms of latency noticeable? Let’s take a look at some latency measurements for keyboards and then look at the empirical research on how much latency users notice.

There’s a fair amount of empirical evidence on this and we can see that, for very simple tasks, people can perceive latencies down to 2ms or less. Moreover, increasing latency is not only noticeable to users, it causes users to execute simple tasks less accurately. If you want a visual demonstration of what latency looks like and you don’t have a super-fast old computer lying around, check out this MSR demo on touchscreen latency.

Are gaming keyboards faster than other keyboards?

I’d really like to test more keyboards before making a strong claim, but from the preliminary tests here, it appears that gaming keyboards aren’t generally faster than non-gaming keyboards.

Gaming keyboards often claim to have features that reduce latency, like connecting over USB FS and using 1000Hz polling. The USB low speed spec states that the minimum time between packets is 10ms, or 100 Hz. However, it’s common to see USB devices round this down to the nearest power of two and run at 8ms, or 125Hz. With 8ms polling, the average latency added from having to wait until the next polling interval is 4ms. With 1ms polling, the average latency from USB polling is 0.5ms, giving us a 3.5ms delta. While that might be a significant contribution to latency for a quick keyboard like the Apple magic keyboard, it’s clear that other factors dominate keyboard latency for most keyboards and that the gaming keyboards tested here are so slow that shaving off 3.5ms won’t save them.
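
If you want to check that 3.5ms figure, it's just the arithmetic from the paragraph above: on average, a keypress lands halfway through a polling interval, so the added latency is half the interval.

#include <stdio.h>

/* Average latency added by polling is half the polling interval, assuming
   the keypress arrives at a uniformly random time within the interval. */
int main(void)
{
    double low_speed_ms = 8.0;   /* typical 125Hz polling */
    double full_speed_ms = 1.0;  /* 1000Hz polling */
    printf("avg added @ 125Hz:  %.1f ms\n", low_speed_ms / 2.0);                   /* 4.0 */
    printf("avg added @ 1000Hz: %.1f ms\n", full_speed_ms / 2.0);                  /* 0.5 */
    printf("delta:              %.1f ms\n", (low_speed_ms - full_speed_ms) / 2.0); /* 3.5 */
    return 0;
}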

Conclusion

Most keyboards add enough latency to make the user experience noticeably worse, and keyboards that advertise speed aren’t necessarily faster. The two gaming keyboards we measured weren’t faster than non-gaming keyboards, and the fastest keyboard measured was a minimalist keyboard from Apple that’s marketed more on design than speed.

Previously, we've seen that terminals can add significant latency, up to 100ms in mildly pessimistic conditions if you choose the "right" terminal. In a future post, we'll look at the entire end-to-end pipeline to see other places latency has crept in and we'll also look at how some modern devices keep latency down.

Appendix: where is the latency coming from?

A major source of latency is key travel time. It’s not a coincidence that the quickest keyboard measured also has the shortest key travel distance by a large margin. The video setup I’m using to measure end-to-end latency is a 240 fps camera, which means that frames are 4ms apart. When videoing “normal” keypresses and typing, it takes 4-8 frames for a key to become fully depressed. Most switches will start firing before the key is fully depressed, but the key travel time is still significant and can easily add 10ms of delay (or more, depending on the switch mechanism). Contrast this to the Apple "magic" keyboard measured, where the key travel is so short that it can’t be captured with a 240 fps camera, indicating that the key travel time is < 4ms.

Note that, unlike the other measurement I was able to find online, this measurement was from the start of the keypress instead of the switch activation. This is because, as a human, you don't activate the switch, you press the key. A measurement that starts from switch activation time misses this large component of latency. If, for example, you're playing a game and you switch from moving forward to moving backwards when you see something happen, you have to pay the cost of the key movement, which is different for different keyboards. A common response to this is that "real" gamers will preload keys so that they don't have to pay the key travel cost, but if you go around with a high speed camera and look at how people actually use their keyboards, the fraction of keypresses that are significantly preloaded is basically zero even when you look at gamers. It's possible you'd see something different if you look at high-level competitive gamers, but even then, just for example, people who use a standard wasd or esdf layout will typically not preload a key when going from back to forward. Also, the idea that it's fine that keys have a bunch of useless travel because you can pre-depress the key before really pressing the key is just absurd. That's like saying latency on modern computers is fine because some people build gaming boxes that, when run with unusually well optimized software, get 50ms response time. Normal, non-hardcore-gaming users simply aren't going to do this. Since that's the vast majority of the market, even if all "serious" gamers did this, that would still be a rounding error.

The other large sources of latency are scanning the keyboard matrix and debouncing. Neither of these delays is inherent -- keyboards use a matrix that has to be scanned instead of having a wire per key because it saves a few bucks, and most keyboards scan the matrix at such a slow rate that it induces human-noticeable delays because that saves a few bucks, but a manufacturer willing to spend a bit more on manufacturing a keyboard could push those delays far below the threshold of human perception. See below for debouncing delay.

Although we didn't discuss throughput in this post, when I measure my typing speed, I find that I can type faster with the low-travel Apple keyboard than with any of the other keyboards. There's no way to do a blinded experiment for this, but Gary Bernhardt and others have also observed the same thing. Some people claim that key travel doesn't matter for typing speed because they use the minimum amount of travel necessary and that this therefore can't matter, but as with the above claims on keypresses, if you walk around with a high speed camera and observe what actually happens when people type, it's very hard to find someone who actually does this.

Appendix: counter-arguments to common arguments that latency doesn’t matter

Before writing this up, I read what I could find about latency and it was hard to find non-specialist articles or comment sections that didn’t have at least one of the arguments listed below:

Computers and devices are fast

The most common response to questions about latency is that input latency is basically zero, or so close to zero that it’s a rounding error. For example, two of the top comments on this slashdot post asking about keyboard latency are that keyboards are so fast that keyboard speed doesn’t matter. One person even says

There is not a single modern keyboard that has 50ms latency. You (humans) have that sort of latency.

As far as response times, all you need to do is increase the poll time on the USB stack

As we’ve seen, some devices do have latencies in the 50ms range. This quote as well as other comments in the thread illustrate another common fallacy -- that input devices are limited by the speed of the USB polling. While that’s technically possible, most devices are nowhere near being fast enough to be limited by USB polling latency.

Unfortunately, most online explanations of input latency assume that the USB bus is the limiting factor.

Humans can’t notice 100ms or 200ms latency

Here’s a “cognitive neuroscientist who studies visual perception and cognition” who refers to the fact that human reaction time is roughly 200ms, and then throws in a bunch more scientific mumbo jumbo to say that no one could really notice latencies below 100ms. This is a little unusual in that the commenter claims some kind of special authority and uses a lot of terminology, but it’s common to hear people claim that you can’t notice 50ms or 100ms of latency because human reaction time is 200ms. This doesn’t actually make sense because these are independent quantities. This line of argument is like saying that you wouldn’t notice a flight being delayed by an hour because the duration of the flight is six hours.

Another problem with this line of reasoning is that the full pipeline from keypress to screen update is quite long and if you say that it’s always fine to add 10ms here and 10ms there, you end up with a much larger amount of bloat through the entire pipeline, which is how we got where we are today, where you can buy a system with the CPU that gives you the fastest single-threaded performance money can buy and still get 6x the latency of a machine from the 70s.

It doesn’t matter because the game loop runs at 60 Hz

This is fundamentally the same fallacy as above. If you have a delay that’s half the duration of a clock period, there’s a 50% chance the delay will push the event into the next processing step. That’s better than a 100% chance, but it’s not clear to me why people think that you’d need a delay as long as the clock period for the delay to matter. And for reference, the 45ms delta between the slowest and fastest keyboard measured here corresponds to 2.7 frames at 60fps.

Keyboards can’t possibly respond more quickly than 5ms/10ms/20ms due to debouncing

Even without going through contortions to optimize the switch mechanism, if you’re willing to put hysteresis into the system, there’s no reason that the keyboard can’t assume a keypress (or release) is happening the moment it sees an edge. This is commonly done for other types of systems and AFAICT there’s no reason keyboards couldn’t do the same thing (and perhaps some do). The debounce time might limit the repeat rate of the key, but there’s no inherent reason that it has to affect the latency. And if we're looking at the repeat rate, imagine we have a 5ms limit on the rate of change of the key state due to introducing hysteresis. That gives us one full keypress cycle (press and release) every 10ms, or 100 keypresses per second per key, which is well beyond the capacity of any human. You might argue that this introduces a kind of imprecision, which might matter in some applications (music, rhythm games), but that's limited by the switch mechanism. Using a debouncing mechanism with hysteresis doesn't make us any worse off than we were before.
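
Here's a sketch of what that looks like (not taken from any real firmware; the 5ms window and scan period are assumptions): the state changes on the first edge seen, and further changes are ignored until the debounce window expires, so the debounce time bounds the repeat rate rather than the latency.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define DEBOUNCE_US 5000u  /* assumed 5ms window */

struct key_state {
    bool pressed;
    uint32_t last_change_us;
};

/* raw is the current (possibly bouncing) switch reading at time now_us.
   Returns true if a key event should be reported this scan. */
static bool debounce_update(struct key_state *k, uint32_t now_us, bool raw)
{
    if (raw != k->pressed && now_us - k->last_change_us >= DEBOUNCE_US) {
        k->pressed = raw;            /* act on the edge right away */
        k->last_change_us = now_us;
        return true;
    }
    return false;                    /* change inside the window: hold state */
}

int main(void)
{
    struct key_state k = { false, 0 };
    /* A press at t=10000us followed by a bounce: only the first edge is reported. */
    bool readings[] = { true, false, true, true };
    for (int i = 0; i < 4; i++) {
        uint32_t t = 10000 + 200 * (uint32_t)i;  /* one reading per 200us scan */
        if (debounce_update(&k, t, readings[i]))
            printf("key %s at %uus\n", k.pressed ? "down" : "up", (unsigned)t);
    }
    return 0;
}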

An additional problem with debounce delay is that most keyboard manufacturers seem to have confounded scan rate and debounce delay. It's common to see keyboards with scan rates in the 100 Hz to 200 Hz range. This is justified by statements like "there's no point in scanning faster because the debounce delay is 5ms", which combines two fallacies mentioned above. If you pull out the schematics for the Apple 2e, you can see that the scan rate is roughly 50 kHz. Its debounce time is roughly 6ms, which corresponds to a frequency of 167 Hz. Why scan so quickly? The fast scan allows the keyboard controller to start the clock on the debounce time almost immediately (after at most 20 microseconds), as opposed to a modern keyboard that scans at 167 Hz, which might not start the clock on debouncing for 6ms, or after 300x as much time.

Apologies for not explaining terminology here, but I think that anyone making this objection should understand the explanation :-).

Appendix: experimental setup

The USB measurement setup was a cut-open USB cable with the data lines wired to a logic analyzer. Cutting open the cable damages the signal integrity and I found that, with a very long cable, some keyboards that weakly drive the data lines didn't drive them strongly enough to get a good signal with the cheap logic analyzer I used.

The start-of-input was measured by pressing two keys at once -- one key on the keyboard and a button that was also connected to the logic analyzer. This introduces some jitter as the two buttons won’t be pressed at exactly the same time. To calibrate the setup, we used two identical buttons connected to the logic analyzer. The median jitter was < 1ms and the 90%-ile jitter was roughly 5ms. This is enough that tail latency measurements for quick keyboards aren’t really possible with this setup, but average latency measurements like the ones done here seem like they should be ok. The input jitter could probably be reduced to a negligible level by building a device to both trigger the logic analyzer and press a key on the keyboard under test at the same time. Average latency measurements would also get better with such a setup (because it would be easier to run a large number of measurements).

If you want to know the exact setup, an E-switch LL1105AF065Q switch was used. Power and ground were supplied by an arduino board. There’s no particular reason to use this setup. In fact, it’s a bit absurd to use an entire arduino to provide power, but this was done with spare parts that were lying around and this stuff just happened to be stuff that RC had in their lab, with the exception of the switches. There weren’t two identical copies of any switch, so we bought a few switches so we could do calibration measurements with two identical switches. The exact type of switch isn’t important here; any low-resistance switch would do.

Tests were done by pressing the z key and then looking for byte 29 on the USB bus and then marking the end of the first packet containing the appropriate information. But, as above, any key would do.

I don't actually trust this setup and I'd like to build a completely automated setup before testing more keyboards. While the measurements are in line with the one other keyboard measurement I could find online, this setup has an inherent imprecision that's probably in the 1ms to 10ms range. While averaging across multiple measurements reduces that imprecision, since the measurements are done by a human, it's not guaranteed and perhaps not even likely that the errors are independent and will average out.

I’m looking for more phones and machines to measure and would love to run a quick benchmark with a high-speed camera if you have a machine or device that’s not on this list! If you’re not in the area and want to donate a device for me to test, feel free to mail the device to

Dan Luu
Recurse Center
455 Broadway, 2nd Floor
New York, NY 10013

This project was done with help from Wesley Aptekar-Cassels, Leah Hanson, and Kate Murphy. BTW, Wesley is looking for work. In addition to knowing "normal" programming stuff, he’s also familiar with robotics, controls, electronics, and general "low-level" programming.

Thanks to RC, Ahmad Jarara, Raph Levien, Peter Bhat Harkins, Brennan Chesley, Dan Bentley, Kate Murphy, Christian Ternus, Sophie Haskins, and Dan Puttick, for letting us use their keyboards for testing.

Thanks to Leah Hanson, Mark Feeney, and Zach Allaun for comments/corrections/discussion on this post.

Mon, 16 Oct 2017 00:00:00 +0000


Branch prediction

This is a pseudo-transcript for a talk on branch prediction given at Two Sigma on 8/22/2017 to kick off "localhost", a talk series organized by RC.

How many of you use branches in your code? Could you please raise your hand if you use if statements or pattern matching?

Most of the audience raises their hands

I won’t ask you to raise your hands for this next part, but my guess is that if I asked, how many of you feel like you have a good understanding of what your CPU does when it executes a branch and what the performance implications are, and how many of you feel like you could understand a modern paper on branch prediction, fewer people would raise their hands.

The purpose of this talk is to explain how and why CPUs do “branch prediction” and then explain enough about classic branch prediction algorithms that you could read a modern paper on branch prediction and basically know what’s going on.

Before we talk about branch prediction, let’s talk about why CPUs do branch prediction. To do that, we’ll need to know a bit about how CPUs work.

For the purposes of this talk, you can think of your computer as a CPU plus some memory. The instructions live in memory and the CPU executes a sequence of instructions from memory, where instructions are things like “add two numbers” or “move a chunk of data from memory to the processor”. Normally, after executing one instruction, the CPU will execute the instruction that’s at the next sequential address. However, there are instructions called “branches” that let you change the address the next instruction comes from.

Here’s an abstract diagram of a CPU executing some instructions. The x-axis is time and the y-axis distinguishes different instructions.

Instructions executing sequentially

Here, we execute instruction A, followed by instruction B, followed by instruction C, followed by instruction D.

One way you might design a CPU is to have the CPU do all of the work for one instruction, then move on to the next instruction, do all of the work for the next instruction, and so on. There’s nothing wrong with this; a lot of older CPUs did this, and some modern very low-cost CPUs still do this. But if you want to make a faster CPU, you might make a CPU that works like an assembly line. That is, you break the CPU up into two parts, so that half the CPU can do the “front half” of the work for an instruction while half the CPU works on the “back half” of the work for an instruction, like an assembly line. This is typically called a pipelined CPU.

Instructions with overlapping execution

If you do this, the execution might look something like the above. After the first half of instruction A is complete, the CPU can work on the second half of instruction A while the first half of instruction B runs. And when the second half of A finishes, the CPU can start on both the second half of B and the first half of C. In this diagram, you can see that the pipelined CPU can execute twice as many instructions per unit time as the unpipelined CPU above.

There’s no reason that a CPU can only be broken up into two parts. We could break the CPU into three parts, and get a 3x speedup, or four parts and get a 4x speedup. This isn’t strictly true, and we generally get less than a 3x speedup for a three-stage pipeline or 4x speedup for a 4-stage pipeline because there’s overhead in breaking the CPU up into more parts and having a deeper pipeline.

One source of overhead is how branches are handled. One of the first things the CPU has to do for an instruction is to get the instruction; to do that, it has to know where the instruction is. For example, consider the following code:

if (x == 0) {
  // Do stuff
} else {
  // Do other stuff (things)
}
  // Whatever happens later

This might turn into assembly that looks something like

branch_if_not_equal x, 0, else_label
// Do stuff
goto end_label
else_label:
// Do things
end_label:
// whatever happens later

In this example, we compare x to 0. If x is not equal to 0, we branch to else_label and execute the code in the else block. If x is 0, we fall through, execute the code in the if block, and then jump to end_label in order to avoid executing the code in the else block.

The particular sequence of instructions that’s problematic for pipelining is

branch_if_not_equal x, 0, else_label
???

The CPU doesn’t know if this is going to be

branch_if_not_equal x, 0, else_label
// Do stuff

or

branch_if_not_equal x, 0, else_label
// Do things

until the branch has finished (or nearly finished) executing. Since one of the first things the CPU needs to do for an instruction is to get the instruction from memory, and we don’t know which instruction ??? is going to be, we can’t even start on ??? until the previous instruction is nearly finished.

Earlier, when we said that we’d get a 3x speedup for a 3-stage pipeline or a 4x speedup for a 4-stage pipeline, that assumed that you could start a new instruction every cycle, but in this case the two instructions are nearly serialized.

non-overlapping execution due to branch stall

One way around this problem is to use branch prediction. When a branch shows up, the CPU will guess whether or not the branch will be taken.

speculating about a branch result

In this case, the CPU predicts that the branch won’t be taken and starts executing the first half of stuff while it’s executing the second half of the branch. If the prediction is correct, the CPU will execute the second half of stuff and can start another instruction while it’s executing the second half of stuff, like we saw in the first pipeline diagram.

overlapped execution after a correct prediction

If the prediction is wrong, when the branch finishes executing, the CPU will throw away the result from the first half of stuff (stuff.1 in the diagram) and start executing the correct instructions instead of the wrong instructions. Since we would’ve stalled the processor and not executed any instructions if we didn’t have branch prediction, we’re no worse off than we would’ve been had we not made a prediction (at least at the level of detail we’re looking at).

aborted prediction

What’s the performance impact of doing this? To make an estimate, we’ll need a performance model and a workload. For the purposes of this talk, our cartoon model of a CPU will be a pipelined CPU where non-branches take an average of one cycle, unpredicted or mispredicted branches take 20 cycles, and correctly predicted branches take one cycle.

If we look at the most commonly used benchmark of “workstation” integer workloads, SPECint, the composition is maybe 20% branches and 80% other operations. Without branch prediction, we then expect the “average” instruction to take non_branch_pct * 1 + branch_pct * 20 = 0.8 * 1 + 0.2 * 20 = 0.8 + 4 = 4.8 cycles. With perfect, 100% accurate, branch prediction, we’d expect the average instruction to take 0.8 * 1 + 0.2 * 1 = 1 cycle, a 4.8x speedup! Another way to look at it is that if we have a pipeline with a 20-cycle branch misprediction penalty, we lose nearly 5x relative to the ideal pipelining speedup from branches alone.
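
Since we'll be plugging different prediction accuracies into this same cost model for the rest of the talk, here's a small Python helper that captures it (this is just my sketch of the cartoon model above, not anything from a real CPU):

def cycles_per_instruction(accuracy, branch_frac=0.2, mispredict_penalty=20):
    # non-branches cost 1 cycle; branches cost 1 cycle when predicted
    # correctly and mispredict_penalty cycles when mispredicted
    non_branch_cost = (1 - branch_frac) * 1
    branch_cost = branch_frac * (accuracy * 1 + (1 - accuracy) * mispredict_penalty)
    return non_branch_cost + branch_cost

print(cycles_per_instruction(0.0))  # no prediction (every branch pays the penalty): 4.8
print(cycles_per_instruction(1.0))  # perfect prediction: 1.0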

Let’s see what we can do about this. We’ll start with the most naive things someone might do and work our way up to something better.

Predict taken

Instead of predicting randomly, we could look at all branches in the execution of all programs. If we do this, we’ll see that taken and not-taken branches aren’t exactly balanced -- there are substantially more taken branches than not-taken branches. One reason for this is that loop branches are often taken.

If we predict that every branch is taken, we might get 70% accuracy, which means we’ll pay the misprediction cost for 30% of branches, making the cost of an average instruction (0.8 + 0.7 * 0.2) * 1 + 0.3 * 0.2 * 20 = 0.94 + 1.2 = 2.14 cycles. If we compare always predicting taken to no prediction and perfect prediction, always predicting taken gets a large fraction of the benefit of perfect prediction despite being a very simple algorithm.

2.14 cycles per instruction

Backwards taken forwards not taken (BTFNT)

Predicting branches as taken works well for loops, but not so great for all branches. If we look at whether or not branches are taken based on whether or not the branch is forward (skips over code) or backwards (goes back to previous code), we can see that backwards branches are taken more often than forward branches, so we could try a predictor which predicts that backward branches are taken and forward branches aren’t taken (BTFNT). If we implement this scheme in hardware, compiler writers will conspire with us to arrange code such that branches the compiler thinks will be taken will be backwards branches and branches the compiler thinks won’t be taken will be forward branches.
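
Just as a sketch of the decision rule (the idea, not how the hardware is actually wired up), BTFNT only needs to compare the branch target against the branch's own address:

def btfnt_predict(branch_address, target_address):
    # backwards branch (target at a lower address) -> predict taken
    # forwards branch (target at a higher address) -> predict not taken
    return target_address < branch_address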

If we do this, we might get something like 80% prediction accuracy, making our cost function (0.8 + 0.8 * 0.2) * 1 + 0.2 * 0.2 * 20 = 0.96 + 0.8 = 1.76 cycles per instruction.

1.76 cycles per instruction

Used by

One-bit

So far, we’ve looked at schemes that don’t store any state, i.e., schemes where the prediction ignores the program’s execution history. These are called static branch prediction schemes in the literature. These schemes have the advantage of being simple but they have the disadvantage of being bad at predicting branches whose behavior changes over time. If you want an example of a branch whose behavior changes over time, you might imagine some code like

if (flag) {
  // things
}

Over the course of the program, we might have one phase of the program where the flag is set and the branch is taken and another phase of the program where flag isn’t set and the branch isn’t taken. There’s no way for a static scheme to make good predictions for a branch like that, so let’s consider dynamic branch prediction schemes, where the prediction can change based on the program history.

One of the simplest things we might do is to make a prediction based on the last result of the branch, i.e., we predict taken if the branch was taken last time and we predict not taken if the branch wasn’t taken last time.

Since having one bit for every possible branch is too many bits to feasibly store, we’ll keep a table of some number of branches we’ve seen and their last results. For this talk, let’s store not taken as 0 and taken as 1.

prediction table with 1-bit entries indexed by low bits of branch address

In this case, just to make things fit on a diagram, we have a 64-entry table, which means that we can index into the table with 6 bits, so we index into the table with the low 6 bits of the branch address. After we execute a branch, we update the entry in the prediction table (highlighted below) and the next time the branch is executed, we index into the same entry and use the updated value for the prediction.

indexed entry changes on update

It’s possible that we’ll observe aliasing, where two branches at two different addresses map to the same table entry. This isn’t ideal, but there’s a tradeoff between table speed & cost vs. size that effectively limits the size of the table.
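
Here's a sketch of the one-bit scheme in Python (a software model of the idea, not a hardware description); note how two branches whose addresses share the same low 6 bits alias to the same entry:

class OneBitPredictor:
    def __init__(self, table_size=64):
        self.table = [0] * table_size  # 0 = predict not taken, 1 = predict taken
    def index(self, branch_address):
        return branch_address % len(self.table)  # low 6 bits for a 64-entry table
    def predict(self, branch_address):
        return self.table[self.index(branch_address)] == 1
    def update(self, branch_address, taken):
        self.table[self.index(branch_address)] = 1 if taken else 0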

If we use a one-bit scheme, we might get 85% accuracy, a cost of (0.8 + 0.85 * 0.2) * 1 + 0.15 * 0.2 * 20 = 0.97 + 0.6 = 1.57 cycles per instruction.

1.57 cycles per instruction

Used by

Two-bit

A one-bit scheme works fine for patterns like TTTTTTTT… or NNNNNNN… but will have a misprediction for a stream of branches that’s mostly taken but has one branch that’s not taken, ...TTTNTTT... This can be fixed by adding a second bit for each address and implementing a saturating counter. Let’s arbitrarily say that we count down when a branch is not taken and count up when it’s taken. If we look at the binary values, we’ll then end up with:

00: predict Not
01: predict Not
10: predict Taken
11: predict Taken

The “saturating” part of saturating counter means that if we count down from 00, instead of underflowing, we stay at 00, and similar for counting up from 11 staying at 11. This scheme is identical to the one-bit scheme, except that each entry in the prediction table is two bits instead of one bit.
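
A sketch of the same table in Python, where the only changes relative to a one-bit predictor are the entry width and the update rule:

class TwoBitPredictor:
    def __init__(self, table_size=64):
        self.table = [0] * table_size  # 2-bit saturating counters, 00 through 11
    def predict(self, branch_address):
        return self.table[branch_address % len(self.table)] >= 2  # 10 and 11 predict taken
    def update(self, branch_address, taken):
        i = branch_address % len(self.table)
        if taken:
            self.table[i] = min(self.table[i] + 1, 3)  # saturate at 11
        else:
            self.table[i] = max(self.table[i] - 1, 0)  # saturate at 00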

same as 1-bit, except that the table has 2 bits

Compared to a one-bit scheme, a two-bit scheme can have half as many entries at the same size/cost (if we only consider the cost of storage and ignore the cost of the logic for the saturating counter), but even so, for most reasonable table sizes a two-bit scheme provides better accuracy.

Despite being simple, this works quite well, and we might expect to see something like 90% accuracy for a two bit predictor, which gives us a cost of 1.38 cycles per instruction.

1.38 cycles per instruction

One natural thing to do would be to generalize the scheme to an n-bit saturating counter, but it turns out that adding more bits has a relatively small effect on accuracy. We haven’t really discussed the cost of the branch predictor, but going from 2 bits to 3 bits per branch increases the table size by 1.5x for little gain, which makes it not worth the cost in most cases. The simplest and most common things that we won’t predict well with a two-bit scheme are patterns like NTNTNTNTNT... or NNTNNTNNT…, but going to n-bits won’t let us predict those patterns well either!

Used by

Two-level adaptive, global (1991)

If we think about code like

for (int i = 0; i < 3; ++i) {
  // code here.
}

That code will produce a pattern of branches like TTTNTTTNTTTN....

If we know the last three executions of the branch, we should be able to predict the next execution of the branch:

TTT:N
TTN:T
TNT:T
NTT:T

The previous schemes we’ve considered use the branch address to index into a table that tells us if the branch is, according to recent history, more likely to be taken or not taken. That tells us which direction the branch is biased towards, but it can’t tell us that we’re in the middle of a repetitive pattern. To fix that, we’ll store the history of the most recent branches as well as a table of predictions.

Use global branch history and branch address to index into prediction table

In this example, we concatenate 4 bits of branch history together with 2 bits of branch address to index into the prediction table. As before, the prediction comes from a 2-bit saturating counter. We don’t want to only use the branch history to index into our prediction table since, if we did that, any two branches with the same history would alias to the same table entry. In a real predictor, we’d probably have a larger table and use more bits of branch address, but in order to fit the table on a slide, we have an index that’s only 6 bits long.

Below, we’ll see what gets updated when we execute a branch.

Update changes index because index uses bits from branch history

The bolded parts are the parts that were updated. In this diagram, we shift new bits of branch history in from right to left, updating the branch history. Because the branch history is updated, the low bits of the index into the prediction table are updated, so the next time we take the same branch again, we’ll use a different entry in the table to make the prediction, unlike in previous schemes where the index is fixed by the branch address. The old entry’s value is updated so that the next time we take the same branch again with the same branch history, we’ll have the updated prediction.
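
Here's a sketch with the bit widths from the diagram (4 bits of global history in the low bits of the index, 2 bits of branch address in the high bits); as before, this is a software model of the idea rather than how the hardware is built:

class GlobalTwoLevelPredictor:
    HISTORY_BITS = 4
    ADDRESS_BITS = 2
    def __init__(self):
        self.history = 0  # last 4 branch outcomes, newest outcome in the low bit
        self.table = [0] * (1 << (self.HISTORY_BITS + self.ADDRESS_BITS))  # 2-bit counters
    def index(self, branch_address):
        address_bits = branch_address % (1 << self.ADDRESS_BITS)
        return (address_bits << self.HISTORY_BITS) | self.history
    def predict(self, branch_address):
        return self.table[self.index(branch_address)] >= 2
    def update(self, branch_address, taken):
        i = self.index(branch_address)
        self.table[i] = min(self.table[i] + 1, 3) if taken else max(self.table[i] - 1, 0)
        # shift the new outcome into the global history, keeping only 4 bits
        self.history = ((self.history << 1) | (1 if taken else 0)) % (1 << self.HISTORY_BITS)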

Since the history in this scheme is global, this will correctly predict patterns like NTNTNTNT… in inner loops, but may not always make correct predictions for higher-level branches because the history is global and will be contaminated with information from other branches. However, the tradeoff here is that keeping a global history is cheaper than keeping a table of local histories. Additionally, using a global history lets us correctly predict correlated branches. For example, we might have something like:

if x > 0:
  x -= 1
if y > 0:
  y -= 1
if x * y > 0:
  foo()

If either the first branch or the second branch isn't taken, then the third branch definitely will not be taken (assuming x and y are non-negative).

With this scheme, we might get 93% accuracy, giving us 1.27 cycles per instruction.

1.27 cycles per instruction

Used by

Two-level adaptive, local [1992]

As mentioned above, an issue with the global history scheme is that the branch history for local branches that could be predicted cleanly gets contaminated by other branches.

One way to get good local predictions is to keep separate branch histories for separate branches.

keep a table of per-branch histories instead of a global history

Instead of keeping a single global history, we keep a table of local histories, indexed by the branch address. This scheme is identical to the global scheme we just looked at, except that we keep multiple branch histories. One way to think about this is that having a global history is a special case of local history, where the number of histories we keep track of is 1.
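
The change relative to the global scheme is small: the single history register becomes a table of histories indexed by the branch address (the sizes here are just made up to keep the sketch small):

class LocalTwoLevelPredictor:
    HISTORY_BITS = 4
    ADDRESS_BITS = 2
    def __init__(self, num_histories=64):
        self.histories = [0] * num_histories  # one recent-outcome history per entry
        self.table = [0] * (1 << (self.HISTORY_BITS + self.ADDRESS_BITS))  # 2-bit counters
    def index(self, branch_address):
        history = self.histories[branch_address % len(self.histories)]
        address_bits = branch_address % (1 << self.ADDRESS_BITS)
        return (address_bits << self.HISTORY_BITS) | history
    def predict(self, branch_address):
        return self.table[self.index(branch_address)] >= 2
    def update(self, branch_address, taken):
        i = self.index(branch_address)
        self.table[i] = min(self.table[i] + 1, 3) if taken else max(self.table[i] - 1, 0)
        h = branch_address % len(self.histories)
        self.histories[h] = ((self.histories[h] << 1) | (1 if taken else 0)) % (1 << self.HISTORY_BITS)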

With this scheme, we might get something like 94% accuracy, which gives us a cost of 1.23 cycles per instruction.

1.23 cycles per instruction

Used by

gshare

One tradeoff a global two-level scheme has to make is that, for a prediction table of a fixed size, bits must be dedicated to either the branch history or the branch address. We’d like to give more bits to the branch history because that allows correlations across greater “distance” as well as tracking more complicated patterns and we’d like to give more bits to the branch address to avoid interference between unrelated branches.

We can try to get the best of both worlds by hashing the branch history and the branch address together instead of concatenating them. One of the simplest reasonable things one might do, and the first proposed mechanism, was to xor them together. This two-level adaptive scheme, where we xor the bits together, is called gshare.
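
Relative to the global two-level scheme, the only change is how the index is computed; something like:

def gshare_index(branch_address, global_history, index_bits=6):
    # xor the history with the branch address and keep index_bits bits,
    # instead of concatenating a few bits of each
    mask = (1 << index_bits) - 1
    return (branch_address ^ global_history) & mask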

hash branch address and branch history instead of appending

With this scheme, we might see something like 94% accuracy. That’s the accuracy we got from the local scheme we just looked at, but gshare avoids having to keep a large table of local histories; getting the same accuracy while having to track less state is a significant improvement.

Used by

agree (1997)

One reason for branch mispredictions is interference between different branches that alias to the same location. There are many ways to reduce interference between branches that alias to the same predictor table entry. In fact, the reason this talk only gets as far as schemes invented in the 90s is that a wide variety of interference-reducing schemes were proposed and there are too many to cover in half an hour.

We’ll look at one scheme which might give you an idea of what an interference-reducing scheme could look like, the “agree” predictor. When two branch-history pairs collide, the predictions either match or they don’t. If they match, we’ll call that neutral interference and if they don’t, we’ll call that negative interference. The idea is that most branches tend to be strongly biased (that is, if we use two-bit entries in the predictor table, we expect that, without interference, most entries will be 00 or 11 most of the time, not 01 or 10). For each branch in the program, we’ll store one bit, which we call the “bias”. The table of predictions will, instead of storing the absolute branch predictions, store whether or not the prediction matches the bias.

predict whether or not a branch agrees with its bias as opposed to whether or not it's taken

If we look at how this works, the predictor is identical to a gshare predictor, except that we make the changes mentioned above -- the prediction is agree/disagree instead of taken/not-taken and we have a bias bit that’s indexed by the branch address, which gives us something to agree or disagree with. In the original paper, they propose using the first thing you see as the bias and other people have proposed using profile-guided optimization (basically running the program and feeding the data back to the compiler) to determine the bias.

Note that, when we execute a branch and then later come back around to the same branch, we’ll use the same bias bit because the bias is indexed by the branch address, but we’ll use a different predictor table entry because that’s indexed by both the branch address and the branch history.

updating uses the same bias but a different meta-prediction table entry

If it seems weird that this would do anything, let’s look at a concrete example. Say we have two branches, branch A which is taken with 90% probability and branch B which is taken with 10% probability. If those two branches alias and we assume the probabilities that each branch is taken are independent, the probability that they disagree and negatively interfere is P(A taken) * P(B not taken) + P(A not taken) * P(B taken) = (0.9 * 0.9) + (0.1 * 0.1) = 0.82.

If we use the agree scheme, we can re-do the calculation above, but the probability that the two branches disagree and negatively interfere is P(A agree) * P(B disagree) + P(A disagree) * P(B agree) = P(A taken) * P(B taken) + P(A not taken) * P(B not taken) = (0.9 * 0.1) + (0.1 * 0.9) = 0.18. Another way to look at it is that, to have destructive interference, one of the branches must disagree with its bias. By definition, if we’ve correctly determined the bias, that’s unlikely to happen.
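
If you want to double-check the arithmetic, here's the same calculation written out (my own sanity check, using the same independence assumption as above):

p_a_taken, p_b_taken = 0.9, 0.1
# without agree bits, the entries fight whenever the two branches go in different directions
p_negative_plain = p_a_taken * (1 - p_b_taken) + (1 - p_a_taken) * p_b_taken
print(p_negative_plain)  # 0.82
# with agree bits (A's bias is taken, B's bias is not taken), the entries only
# fight when one branch goes against its own bias
p_negative_agree = p_a_taken * p_b_taken + (1 - p_a_taken) * (1 - p_b_taken)
print(p_negative_agree)  # 0.18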

With this scheme, we might get something like 95% accuracy, giving us 1.19 cycles per instruction.

1.19 cycles per instruction

Used by

Hybrid (1993)

As we’ve seen, local predictors can predict some kinds of branches well (e.g., inner loops) and global predictors can predict some kinds of branches well (e.g., some correlated branches). One way to try to get the best of both worlds is to have both predictors, then have a meta-predictor that predicts if the local or the global predictor should be used. A simple way to do this is to have the meta-predictor use the same scheme as the two-bit predictor above, except that instead of predicting taken or not taken, it predicts whether to use the local predictor or the global predictor.

predict which of two predictors is correct instead of predicting if the branch is taken

Just as there are many possible interference-reducing schemes, of which the agree predictor above is one, there are many possible hybrid schemes. We could use any two predictors, not just a local and global predictor, and we could even use more than two predictors.
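
As a sketch of what one simple hybrid might look like (assuming we have a local predictor and a global predictor with predict/update methods like the sketches earlier; the chooser is just another table of 2-bit saturating counters):

class HybridPredictor:
    def __init__(self, local_predictor, global_predictor, table_size=64):
        self.local = local_predictor
        self.glob = global_predictor
        self.chooser = [2] * table_size  # >= 2 means "trust the global predictor"
    def predict(self, branch_address):
        if self.chooser[branch_address % len(self.chooser)] >= 2:
            return self.glob.predict(branch_address)
        return self.local.predict(branch_address)
    def update(self, branch_address, taken):
        i = branch_address % len(self.chooser)
        local_right = self.local.predict(branch_address) == taken
        global_right = self.glob.predict(branch_address) == taken
        # nudge the chooser toward whichever predictor was right (no change on ties)
        if global_right and not local_right:
            self.chooser[i] = min(self.chooser[i] + 1, 3)
        elif local_right and not global_right:
            self.chooser[i] = max(self.chooser[i] - 1, 0)
        self.local.update(branch_address, taken)
        self.glob.update(branch_address, taken)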

If we use a local and global predictor, we might get something like 96% accuracy, giving us 1.15 cycles per instruction.

1.15 cycles per instruction

Used by

Not covered

There are a lot of things we didn’t cover in this talk! As you might expect, the set of material that we didn’t cover is much larger than what we did cover. I’ll briefly describe a few things we didn’t cover, with references, so you can look them up if you’re interested in learning more.

One major thing we didn’t talk about is how to predict the branch target. Note that this needs to be done even for some unconditional branches (that is, branches that don’t need directional prediction because they’re always taken), since (some) unconditional branches have unknown branch targets.

Branch target prediction is expensive enough that some early CPUs had a branch prediction policy of “always predict not taken” because a branch target isn’t necessary when you predict the branch won’t be taken! Always predicting not taken has poor accuracy, but it’s still better than making no prediction at all.

Among the interference reducing predictors we didn't discuss are bi-mode, gskew, and YAGS. Very briefly, bi-mode is somewhat like agree in that it tries to separate out branches based on direction, but the mechanism used in bi-mode is that we keep multiple predictor tables and a third predictor based on the branch address is used to predict which predictor table gets used for the particular combination of branch and branch history. Bi-mode appears to be more successful than agree in that it's seen wider use. With gskew, we keep at least three predictor tables and use a different hash to index into each table. The idea is that, even if two branches alias, those two branches will only alias in one of the tables, so we can take a vote and the results from the other two tables will override the potentially bad result from the aliasing table. I don't know how to describe YAGS very briefly :-).

Because we didn't talk about speed (as in latency), a prediction strategy we didn't discuss is to have a small/fast predictor that can be overridden by a slower and more accurate predictor when the slower predictor computes its result.

Some modern CPUs have completely different branch predictors; AMD Zen (2017) and AMD Bulldozer (2011) chips appear to use perceptron based branch predictors. Perceptrons are single-layer neural nets.

It’s been argued that Intel Haswell (2013) uses a variant of a TAGE predictor. TAGE stands for TAgged GEometric history length predictor. If we look at the predictors we’ve covered and look at actual executions of programs to see which branches we’re not predicting correctly, one major class of branches are branches that need a lot of history -- a significant number of branches need tens or hundreds of bits of history and some even need more than a thousand bits of branch history. If we have a single predictor or even a hybrid predictor that combines a few different predictors, it’s counterproductive to keep a thousand bits of history because that will make predictions worse for the branches which need a relatively small amount of history (especially relative to the cost), which is most branches. One of the ideas in the TAGE predictor is that, by keeping a geometric series of history lengths, each branch can use the appropriate history. That explains the GE. The TA part is that branches are tagged, which is a mechanism we don’t discuss that the predictor uses to track which branches should use which set of history.

Modern CPUs often have specialized predictors, e.g., a loop predictor can accurately predict loop branches in cases where a generalized branch predictor couldn’t reasonably store enough history to make perfect predictions for every iteration of the loop.

We didn’t talk at all about the tradeoff between using up more space and getting better predictions. Not only does changing the size of the table change the performance of a predictor, it also changes which predictors are better relative to each other.

We also didn’t talk at all about how different workloads affect different branch predictors. Predictor performance varies not only based on table size but also based on which particular program is run.

We’ve also talked about branch misprediction cost as if it’s a fixed thing, but it is not, and for that matter, the cost of non-branch instructions also varies widely between different workloads.

I tried to avoid introducing non-self-explanatory terminology when possible, so if you read the literature, terminology will be somewhat different.

Conclusion

We’ve looked at a variety of classic branch predictors and very briefly discussed a couple of newer predictors. Some of the classic predictors we discussed are still used in CPUs today, and if this were an hour long talk instead of a half-hour long talk, we could have discussed state-of-the-art predictors. I think that a lot of people have an idea that CPUs are mysterious and hard to understand, but I think that CPUs are actually easier to understand than software. I might be biased because I used to work on CPUs, but I think that this is not a result of my bias but something fundamental.

If you think about the complexity of software, the main limiting factor on complexity is your imagination. If you can imagine something in enough detail that you can write it down, you can make it. Of course there are cases where that’s not the limiting factor and there’s something more practical (e.g., the performance of large scale applications), but I think that most of us spend most of our time writing software where the limiting factor is the ability to create and manage complexity.

Hardware is quite different from this in that there are forces that push back against complexity. Every chunk of hardware you implement costs money, so you want to implement as little hardware as possible. Additionally, performance matters for most hardware (whether that’s absolute performance or performance per dollar or per watt or per other cost), and adding complexity makes hardware slower, which limits performance. Today, you can buy an off-the-shelf CPU for $300 which can be overclocked to 5 GHz. At 5 GHz, one unit of work is one-fifth of one nanosecond. For reference, light travels roughly one foot in one nanosecond. Another limiting factor is that people get pretty upset when CPUs don’t work perfectly all of the time. Although CPUs do have bugs, the rate of bugs is much lower than in almost all software, i.e., the standard to which they’re verified/tested is much higher. Adding complexity makes things harder to test and verify. Because CPUs are held to a higher correctness standard than most software, adding complexity creates a much higher test/verification burden on CPUs, which makes adding a similar amount of complexity much more expensive in hardware than in software, even ignoring the other factors we discussed.

A side effect of these factors that push back against chip complexity is that, for any particular “high-level” general purpose CPU feature, it is generally conceptually simple enough that it can be described in a half-hour or hour-long talk. CPUs are simpler than many programmers think! BTW, I say “high-level” to rule out things like how transistors work and circuit design, which can require a fair amount of low-level (physics or solid-state) background to understand.

CPU internals series

Thanks to Leah Hanson, Hari Angepat, and Nick Bergson-Shilcock for reviewing practice versions of the talk and to Fred Clausen Jr for finding a typo in this post. Apologies for the somewhat slapdash state of this post -- I wrote it quickly so that people who attended the talk could refer to the “transcript” soon afterwards and look up references, but this means that there are probably more than the usual number of errors and that the organization isn’t as nice as it would be for a normal blog post. In particular, things that were explained using a series of animations in the talk are not explained in the same level of detail and on skimming this, I notice that there’s less explanation of what sorts of branches each predictor doesn’t handle well, and hence less motivation for each predictor. I may try to go back and add more motivation, but I’m unlikely to restructure the post completely and generate a new set of graphics that better convey concepts when there are a couple of still graphics next to text. Thanks to Julien Vivenot, Ralph Corderoy, Vaibhav Sagar, Mindy Preston, and Uri Shaked for catching typos in this hastily written post.

Wed, 23 Aug 2017 00:00:00 +0000


Sattolo's algorithm

I recently had a problem where part of the solution was to do a series of pointer accesses that would walk around a chunk of memory in pseudo-random order. Sattolo's algorithm provides a solution to this because it produces a permutation of a list with exactly one cycle, which guarantees that we will reach every element of the list even though we're traversing it in random order.

However, the explanations of why the algorithm worked that I could find online either used some kind of mathematical machinery (Stirling numbers, assuming familiarity with cycle notation, etc.), or used logic that was hard for me to follow. I find that this is common for explanations of concepts that could, but don't have to, use a lot of mathematical machinery. I don't think there's anything wrong with using existing mathematical methods per se -- it's a nice mental shortcut if you're familiar with the concepts, but I think it's unfortunate that it's hard to find a relatively simple explanation that doesn't require any background. When I was looking for a simple explanation, I also found a lot of people who were using Sattolo's algorithm in places where it wasn't appropriate and also people who didn't know that Sattolo's algorithm is what they were looking for, so here's an attempt at an explanation of why the algorithm works that doesn't assume an undergraduate combinatorics background.

Before we look at Sattolo's algorithm, let's look at Fisher-Yates, which is an in-place algorithm that produces a random permutation of an array/vector, where every possible permutation occurs with uniform probability.

We'll look at the code for Fisher-Yates and then how to prove that the algorithm produces the intended result.

import random

def shuffle(a):
    n = len(a)
    for i in range(n - 1):  # i from 0 to n-2, inclusive.
        j = random.randrange(i, n)  # j from i to n-1, inclusive.
        a[i], a[j] = a[j], a[i]  # swap a[i] and a[j].

shuffle takes an array and produces a permutation of the array, i.e., it shuffles the array. We can think of this loop as placing each element of the array, a, in turn, from a[0] to a[n-2]. On some iteration, i, we choose one of n-i elements to swap with and swap element i with some random element. The last element in the array, a[n-1], is skipped because it would always be swapped with itself. One way to see that this produces every possible permutation with uniform probability is to write down the probability that each element will end up in any particular location1. Another way to do it is to observe two facts about this algorithm:

  1. Every output that Fisher-Yates produces is produced with uniform probability
  2. Fisher-Yates produces as many outputs as there are permutations (and each output is a permutation)

(1) For each random choice we make in the algorithm, if we make a different choice, we get a different output. For example, if we look at the resultant a[0], the only way to place the element that was originally in a[k] (for some k) in the resultant a[0] is to swap a[0] with a[k] in iteration 0. If we choose a different element to swap with, we'll end up with a different resultant a[0]. Once we place a[0] and look at the resultant a[1], the same thing is true of a[1] and so on for each a[i]. Additionally, each choice reduces the range by the same amount -- there's a kind of symmetry, in that although we place a[0] first, we could have placed any other element first; every choice has the same effect. This is vaguely analogous to the reason that you can pick an integer uniformly at random by picking digits uniformly at random, one at a time.

(2) How many different outputs does Fisher-Yates produce? On the first iteration, we fix one of n possible choices for a[0], then given that choice, we fix one of n-1 choices for a[1], then one of n-2 for a[2], and so on, so there are n * (n-1) * (n-2) * ... 2 * 1 = n! possible different outputs.

This is exactly the same number of possible permutations of n elements, by pretty much the same reasoning. If we want to count the number of possible permutations of n elements, we first pick one of n possible elements for the first position, n-1 for the second position, and so on resulting in n! possible permutations.

Since Fisher-Yates only produces unique permutations and there are exactly as many outputs as there are permutations, Fisher-Yates produces every possible permutation. Since Fisher-Yates produces each output with uniform probability, it produces all possible permutations with uniform probability.

Now, let's look at Sattolo's algorithm, which is almost identical to Fisher-Yates and also produces a shuffled version of the input, but produces something quite different:

def sattolo(a):
    n = len(a)
    for i in range(n - 1):
        j = random.randrange(i+1, n)  # i+1 instead of i
        a[i], a[j] = a[j], a[i]

Instead of picking an element at random to swap with, like we did in Fisher-Yates, we pick an element at random that is not the element being placed, i.e., we do not allow an element to be swapped with itself. One side effect of this is that no element ends up where it originally started.

Before we talk about why this produces the intended result, let's make sure we're on the same page regarding terminology. One way to look at an array is to view it as a description of a graph where the index indicates the node and the value indicates where the edge points to. For example, if we have the list 0 2 3 1, this can be thought of as a directed graph where each index points to its value, which is a graph with the following edges:

0 -> 0
1 -> 2
2 -> 3
3 -> 1

Node 0 points to itself (because the value at index 0 is 0), node 1 points to node 2 (because the value at index 1 is 2), and so on. If we traverse this graph, we see that there are two cycles. 0 -> 0 -> 0 ... and 1 -> 2 -> 3 -> 1....
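
To make the cycle tracing concrete, here's a small helper (my addition, not from the original explanation) that extracts the cycles of a permutation represented this way:

def cycles(perm):
    # returns the cycles of a permutation given as a list of destinations
    seen = [False] * len(perm)
    result = []
    for start in range(len(perm)):
        if not seen[start]:
            cycle = []
            i = start
            while not seen[i]:
                seen[i] = True
                cycle.append(i)
                i = perm[i]
            result.append(cycle)
    return result

print(cycles([0, 2, 3, 1]))  # [[0], [1, 2, 3]] -- two cycles
print(cycles([3, 2, 0, 1]))  # [[0, 3, 1, 2]] -- one cycle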

Let's say we swap the element in position 0 with some other element. It could be any element, but let's say that we swap it with the element in position 2. Then we'll have the list 3 2 0 1, which can be thought of as the following graph:

0 -> 3
1 -> 2
2 -> 0
3 -> 1

If we traverse this graph, we see the cycle 0 -> 3 -> 1 -> 2 -> 0.... This is an example of a permutation with exactly one cycle.

If we swap two elements that belong to different cycles, we'll merge the two cycles into a single cycle. One way to see this is that when we swap two elements in the list, we're essentially picking up the arrow-heads pointing to each element and swapping where they point (rather than the arrow-tails, which stay put). Tracing the result of this is like tracing a figure-8. For example, if we swap 0 with an arbitrary element of the other cycle, say element 2, we'll end up with 3 2 0 1, whose only cycle is 0 -> 3 -> 1 -> 2 -> 0.... Note that this operation is reversible -- if we do the same swap again, we end up with two cycles again. In general, if we swap two elements from the same cycle, we break the cycle into two separate cycles.

If we feed a list consisting of 0 1 2 ... n-1 to Sattolo's algorithm, we'll get a permutation with exactly one cycle. Furthermore, we have the same probability of generating any permutation that has exactly one cycle. Let's look at why Sattolo's algorithm generates exactly one cycle. Afterwards, we'll figure out why it produces all possible cycles with uniform probability.

For Sattolo's algorithm, let's say we start with the list 0 1 2 3 ... n-1, i.e., a list with n cycles of length 1. On each iteration, we do one swap. If we swap elements from two separate cycles, we'll merge the two cycles, reducing the number of cycles by 1. We'll then do n-1 iterations, reducing the number of cycles from n to n - (n-1) = 1.

Now let's see why it's safe to assume we always swap elements from different cycles. In each iteration of the algorithm, we swap some element with index > i with the element at index i and then increment i. Since i gets incremented, the element that gets placed into index i can never be swapped again, i.e., each swap puts one of the two elements that was swapped into its final position, i.e., for each swap, we take two elements that were potentially swappable and render one of them unswappable.

When we start, we have n cycles of length 1, each with 1 element that's swappable. When we swap the initial element with some random element, we'll take one of the swappable elements and render it unswappable, creating a cycle of length 2 with 1 swappable element and leaving us with n-2 other cycles, each with 1 swappable element.

The key invariant that's maintained is that each cycle has exactly 1 swappable element. The invariant holds in the beginning when we have n cycles of length 1. And as long as this is true, every time we merge two cycles of any length, we'll take the swappable element from one cycle and swap it with the swappable element from the other cycle, rendering one of the two elements unswappable and creating a longer cycle that still only has one swappable element, maintaining the invariant.

Since we cannot swap two elements from the same cycle, we merge two cycles with every swap, reducing the number of cycles by 1 with each iteration until we've run n-1 iterations and have exactly one cycle remaining.

To see that we generate each cycle with equal probability, note that there's only one way to produce each output, i.e., changing any particular random choice results in a different output. In the first iteration, we randomly choose one of n-1 placements, then n-2, then n-3, and so on, so for any particular cycle, we produce it with probability 1 / ((n-1) * (n-2) * (n-3) * ... * 2 * 1) = 1/(n-1)!. If we can show that there are (n-1)! permutations with exactly one cycle, then we'll know that we generate every permutation with exactly one cycle with uniform probability.

Let's say we have an arbitrary list of length n that has exactly one cycle and we add a single element, there are n ways to extend that to become a cycle of length n+1 because there are n places we could add in the new element and keep the cycle, which means that the number of cycles of length n+1, cycles(n+1), is n * cycles(n).

For example, say we have a cycle that produces the path 0 -> 1 -> 2 -> 0 ... and we want to add a new element, 3. We can substitute -> 3 -> for any -> and get a cycle of length 4 instead of length 3.

In the base case, there's one cycle of length 2, the permutation 1 0 (the other permutation of length two, 0 1, has two cycles of length one instead of having a cycle of length 2), so we know that cycles(2) = 1. If we apply the recurrence above, we get that cycles(n) = (n-1)!, which is exactly the number of different permutations that Sattolo's algorithm generates, which means that we generate all possible permutations with one cycle. Since we know that we generate each cycle with uniform probability, we now know that we generate all possible one-cycle permutations with uniform probability.

An alternate way to see that there are (n-1)! permutations with exactly one cycle, is that we rotate each cycle around so that 0 is at the start and write it down as 0 -> i -> j -> k -> .... The number of these is the same as the number of permutations of elements to the right of the 0 ->, which is (n-1)!.
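
If you'd rather check the counting argument numerically than trust the algebra, a brute-force enumeration (my own check) agrees that the number of single-cycle permutations of n elements is (n-1)!:

import itertools
import math

def has_one_cycle(perm):
    # walk from index 0 and count how many elements we visit before returning
    i, count = perm[0], 1
    while i != 0:
        i = perm[i]
        count += 1
    return count == len(perm)

for n in range(2, 8):
    count = sum(1 for p in itertools.permutations(range(n)) if has_one_cycle(p))
    assert count == math.factorial(n - 1)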

Conclusion

We've looked at two algorithms that are identical, except for a two character change. These algorithms produce quite different results -- one algorithm produces a random permutation and the other produces a random permutation with exactly one cycle. I think these algorithms are neat because they're so simple, just a for loop with a swap.

In practice, you probably don't "need" to know how these algorithms work because the standard library for most modern languages will have some way of producing a random shuffle. And if you have a function that will give you a shuffle, you can produce a permutation with exactly one cycle if you don't mind a non-in-place algorithm that takes an extra pass. I'll leave that as an exercise for the reader, but if you want a hint, one way to do it parallels the "alternate" way to see that there are (n-1)! permutations with exactly one cycle.

Although I said that you probably don't need to know this stuff, you do actually need to know it if you're going to implement a custom shuffling algorithm! That may sound obvious, but there's a long history of people implementing incorrect shuffling algorithms. This was common in games and on online gambling sites in the 90s and even the early 2000s and you still see the occasional mis-implemented shuffle, e.g., when Microsoft implemented a bogus shuffle and failed to properly randomize a browser choice poll. At the time, the top Google hit for javascript random array sort was the incorrect algorithm that Microsoft ended up using. That site has been fixed, but you can still find incorrect tutorials floating around online.

Appendix: generating a random derangement

A permutation where no element ends up in its original position is called a derangement. Sattolo's algorithm generates derangements, but it only generates derangements with exactly one cycle, and there are derangements with more than one cycle (e.g., 3 2 1 0), so it can't possibly generate random derangements with uniform probability.

One way to generate random derangements is to generate random shuffles using Fisher-Yates and then retry until we get a derangement:

def derangement(n):
    assert n != 1, "can't have a derangement of length 1"
    a = list(range(n))
    while not is_derangement(a):
        shuffle(a)
    return a
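
is_derangement isn't defined above; a minimal version (my addition, checking against the original order 0, 1, ..., n-1 that the list starts in) might look like:

def is_derangement(a):
    # a derangement has no element in its original position
    return all(value != index for index, value in enumerate(a))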

This algorithm is simple, and is overwhelmingly likely to eventually return a derangement (for n != 1), but it's not immediately obvious how long we should expect this to run before it returns a result. Maybe we'll get a derangement on the first try and run shuffle once, or maybe it will take 100 tries and we'll have to do 100 shuffles before getting a derangement.

To figure this out, we'll want to know the probability that a random permutation (shuffle) is a derangement. To get that, we'll want to know, given a list of length n, how many permutations there are and how many derangements there are.

Since we're deep in the appendix, I'll assume that you know that the number of permutations of n elements is n!, what binomial coefficients are, and are comfortable with Taylor series.

To count the number of derangements, we can start with the number of permutations, n!, and subtract off the permutations where an element remains in its starting position, (n choose 1) * (n - 1)!. That isn't quite right because this double subtracts permutations where two elements remain in their starting positions, so we'll have to add back (n choose 2) * (n - 2)!. That isn't quite right either because we've now overcorrected for permutations where three elements remain in their starting positions, so we'll have to adjust for those, and so on and so forth, resulting in ∑ (−1)ᵏ (n choose k)(n−k)!. If we expand this out, divide by n!, and cancel things out, we get ∑ (−1)ᵏ (1 / k!). If we look at the limit as the number of elements goes to infinity, this is just the Taylor series for e^x where x = -1, i.e., 1/e. So, in the limit, we expect the fraction of permutations that are derangements to be 1/e, which means we expect to have to do e times as many shuffles to generate a derangement as we do to generate a random permutation. Like many alternating series, this series converges quickly -- it gets within 7 significant figures of 1/e when k = 10.
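
Here's a quick numerical check (my addition) that the fraction of permutations that are derangements really does approach 1/e:

import itertools
import math

def derangement_fraction(n):
    count = sum(1 for p in itertools.permutations(range(n))
                if all(v != i for i, v in enumerate(p)))
    return count / math.factorial(n)

for n in range(2, 9):
    print(n, derangement_fraction(n))  # approaches 1/e ~= 0.3679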

One silly thing about our algorithm is that, if we place the first element in the first location, we already know that we don't have a derangement, but we continue placing elements until we've created an entire permutation. If we reject illegal placements, we can do even better than a factor of e overhead. It's also possible to come up with a non-rejection based algorithm, but I really enjoy the naive rejection based algorithm because I find it delightful when basic randomized algorithms that consist of "keep trying again" work well.

Appendix: wikipedia's explanation of Sattolo's algorithm

I wrote this explanation because I found the explanation in Wikipedia relatively hard to follow, but if you find the explanation above difficult to understand, maybe you'll prefer wikipedia's version:

The fact that Sattolo's algorithm always produces a cycle of length n can be shown by induction. Assume by induction that after the initial iteration of the loop, the remaining iterations permute the first n - 1 elements according to a cycle of length n - 1 (those remaining iterations are just Sattolo's algorithm applied to those first n - 1 elements). This means that tracing the initial element to its new position p, then the element originally at position p to its new position, and so forth, one only gets back to the initial position after having visited all other positions. Suppose the initial iteration swapped the final element with the one at (non-final) position k, and that the subsequent permutation of first n - 1 elements then moved it to position l; we compare the permutation π of all n elements with that remaining permutation σ of the first n - 1 elements. Tracing successive positions as just mentioned, there is no difference between σ and π until arriving at position k. But then, under π the element originally at position k is moved to the final position rather than to position l, and the element originally at the final position is moved to position l. From there on, the sequence of positions for π again follows the sequence for σ, and all positions will have been visited before getting back to the initial position, as required.

As for the equal probability of the permutations, it suffices to observe that the modified algorithm involves (n-1)! distinct possible sequences of random numbers produced, each of which clearly produces a different permutation, and each of which occurs--assuming the random number source is unbiased--with equal probability. The (n-1)! different permutations so produced precisely exhaust the set of cycles of length n: each such cycle has a unique cycle notation with the value n in the final position, which allows for (n-1)! permutations of the remaining values to fill the other positions of the cycle notation

Thanks to Mathieu Guay-Paquet, Leah Hanson, Rudi Chen, Kamal Marhubi, Michael Robert Arntzenius, Heath Borders, Shreevatsa R, and David Turner for comments/corrections/discussion.


  1. a[0] is placed on the first iteration of the loop. Assuming randrange generates integers with uniform probability in the appropriate range, the original a[0] has 1/n probability of being swapped with any element (including itself), so the resultant a[0] has a 1/n chance of being any element from the original a, which is what we want.

    a[1] is placed on the second iteration of the loop. At this point, a[0] is some element from the array before it was mutated. Let's call the unmutated array original. a[0] is original[k], for some k. For any particular value of k, it contains original[k] with probability 1/n. We then swap a[1] with some element from the range [1, n-1].

    If we want to figure out the probability that a[1] is some particular element from original, we might think of this as follows: a[0] is original[k_0] for some k_0. a[1] then becomes original[k_1] for some k_1 where k_1 != k_0. Since k_0 was chosen uniformly at random, if we integrate over all k_0, k_1 is also uniformly random.

    Another way to look at this is that it's arbitrary that we place a[0] and choose k_0 before we place a[1] and choose k_1. We could just have easily placed a[1] and chosen k_1 first so, over all possible choices, the choice of k_0 cannot bias the choice of k_1.

    [return]

Wed, 09 Aug 2017 00:00:00 +0000


Terminal latency

There’s a great MSR demo from 2012 that shows the effect of latency on the experience of using a tablet. If you don’t want to watch the three minute video, they basically created a device which could simulate arbitrary latencies down to a fraction of a millisecond. At 100ms (1/10th of a second), which is typical of consumer tablets, the experience is terrible. At 10ms (1/100th of a second), the latency is noticeable, but the experience is ok, and at < 1ms the experience is great, as good as pen and paper. If you want to see a mini version of this for yourself, you can try a random Android tablet with a stylus vs. the current generation iPad Pro with the Apple stylus. The Apple device has well above 10ms end-to-end latency, but the difference is still quite dramatic -- it’s enough that I’ll actually use the new iPad Pro to take notes or draw diagrams, whereas I find Android tablets unbearable as a pen-and-paper replacement.

You can also see something similar if you try VR headsets with different latencies. 20ms feels fine, 50ms feels laggy, and 150ms feels unbearable.

Curiously, I rarely hear complaints about keyboard and mouse input being slow. One reason might be that keyboard and mouse input actually are quick and that inputs are reflected nearly instantaneously, but I don’t think that’s true -- people often tell me it is, but I think it’s just the opposite. The idea that computers respond quickly to input, so quickly that humans can’t notice the latency, is the most common performance-related fallacy I hear from professional programmers.

When people measure actual end-to-end latency for games on normal computer setups, they usually find latencies in the 100ms range.

If we look at Robert Menzel’s breakdown of the end-to-end pipeline for a game, it’s not hard to see why we expect to see 100+ ms of latency:

Note that this assumes a gaming mouse and a pretty decent LCD; it’s common to see substantially slower latency for the mouse and for pixel switching.

It’s possible to tune things to get into the 40ms range, but the vast majority of users don’t do that kind of tuning, and even if they do, that’s still quite far from the 10ms to 20ms range, where tablets and VR start to feel really “right”.

Keypress-to-display measurements are mostly done in games because gamers care more about latency than most people, but I don’t think that most applications are all that different from games in terms of latency. While games often do much more work per frame than “typical” applications, they’re also much better optimized than “typical” applications. Menzel budgets 33ms to the game, half for game logic and half for rendering. How much time do non-game applications take? Pavel Fatin measured this for text editors and found latencies ranging from a few milliseconds to hundreds of milliseconds. He did this with an app he wrote that uses java.awt.Robot to generate keypresses and do screen captures, which we can also use to measure the latency of other applications.

Personally, I’d like to see the latency of different terminals and shells for a couple of reasons. First, I spend most of my time in a terminal and usually do editing in a terminal, so the latency I see is at least the latency of the terminal. Second, the most common terminal benchmark I see cited (by at least two orders of magnitude) is the rate at which a terminal can display output, often measured by running cat on a large file. This is pretty much as useless a benchmark as I can think of. I can’t recall the last task I did which was limited by the speed at which I can cat a file to stdout on my terminal (well, unless I’m using eshell in emacs), nor can I think of any task for which that sub-measurement is useful. The closest thing that I care about is the speed at which I can ^C a command when I’ve accidentally output too much to stdout, but as we’ll see when we look at actual measurements, a terminal’s ability to absorb a lot of input to stdout is only weakly related to its responsiveness to ^C. The speed at which I can scroll up or down an entire page sounds related, but in actual measurements the two are not highly correlated (e.g., emacs-eshell is quick at scrolling but extremely slow at sinking stdout). Another thing I care about is latency, but knowing that a particular terminal has high stdout throughput tells me little to nothing about its latency.

Let’s look at some different terminals to see if any terminals add enough latency that we’d expect the difference to be noticeable. If we measure the latency from keypress to internal screen capture on my laptop, we see the following latencies for different terminals

Plot of terminal tail latency Plot of terminal tail latency

These graphs show the distribution of latencies for each terminal. The y-axis has the latency in milliseconds. The x-axis is the percentile (e.g., 50 represents the 50%-ile keypress, i.e., the median keypress). Measurements are with macOS unless otherwise stated. The graph on the left is when the machine is idle, and the graph on the right is under load. If we just look at median latencies, some setups don’t look too bad -- terminal.app and emacs-eshell are at roughly 5ms unloaded, small enough that many people wouldn’t notice. But most terminals (st, alacritty, hyper, and iterm2) are in the range where you might expect people to notice the additional latency even when the machine is idle. If we look at the tail when the machine is idle, say the 99.9%-ile latency, every terminal gets into the range where the additional latency ought to be perceptible, according to studies on user interaction. For reference, the internally generated keypress to GPU memory trip for some terminals is slower than the time it takes to send a packet from Boston to Seattle and back, about 70ms.

All measurements were done with input only happening on one terminal at a time, with full battery and running off of A/C power. The loaded measurements were done while compiling Rust (as before, with full battery and running off of A/C power, and in order to make the measurements reproducible, each measurement started 15s after a clean build of Rust after downloading all dependencies, with enough time between runs to avoid thermal throttling interference across runs).

If we look at median loaded latencies, other than emacs-term, most terminals don’t do much worse than at idle. But as we look at tail measurements, like 90%-ile or 99.9%-ile measurements, every terminal gets much slower. Switching between macOS and Linux makes some difference, but the difference is different for different terminals.

These measurements aren't anywhere near the worst case (if we run off of battery when the battery is low, and wait 10 minutes into the compile in order to exacerbate thermal throttling, it’s easy to see latencies that are multiple hundreds of ms) but even so, every terminal has tail latency that should be observable. Also, recall that this is only a fraction of the total end-to-end latency.

Why don’t people complain about keyboard-to-display latency the way they complain about stylus-to-display latency or VR latency? My theory is that, for both VR and tablets, people have a lot of experience with a much lower latency application. For tablets, the “application” is pen-and-paper, and for VR, the “application” is turning your head without a VR headset on. But input-to-display latency is so bad for every application that most people just expect terrible latency.

An alternate theory might be that keyboard and mouse input are fundamentally different from tablet input in a way that makes latency less noticeable. Even without any data, I’d find that implausible because, when I access a remote terminal in a way that adds tens of milliseconds of extra latency, I find typing to be noticeably laggy. And it turns out that when extra latency is A/B tested, people can and do notice latency in the range we’re discussing here.

Just so we can compare the most commonly used benchmark (throughput of stdout) to latency, let’s measure how quickly different terminals can sink input on stdout:

terminal            stdout (MB/s)   idle50 (ms)   load50 (ms)   idle99.9 (ms)   load99.9 (ms)   mem (MB)   ^C
alacritty           39              31            28            36              56              18         ok
terminal.app        20              6             13            25              30              45         ok
st                  14              25            27            63              111             2          ok
alacritty tmux      14
terminal.app tmux   13
iterm2              11              44            45            60              81              24         ok
hyper               11              32            31            49              53              178        fail
emacs-eshell        0.05            5             13            17              32              30         fail
emacs-term          0.03            13            30            28              49              30         ok

The relationship between the rate at which a terminal can sink stdout and its latency is non-obvious. For that matter, the relationship between the rate at which a terminal can sink stdout and how fast it looks is also non-obvious. During this test, terminal.app looked very slow. The text that scrolls by jumps a lot, as if the screen is rarely updating. Also, hyper and emacs-term both had problems with this test. Emacs-term can’t really keep up with the output and it takes a few seconds for the display to finish updating after the test is complete (the status bar that shows how many lines have been output appears to be up to date, so it finishes incrementing before the test finishes). Hyper falls further behind and pretty much doesn’t update the screen after flickering a couple of times. The Hyper Helper process gets pegged at 100% CPU for about two minutes and the terminal is totally unresponsive for that entire time.

Alacritty was tested with tmux because alacritty doesn’t support scrolling back up, and the docs indicate that you should use tmux if you want to be able to scroll up. Just to have another reference, terminal.app was also tested with tmux. For most terminals, tmux doesn’t appear to reduce stdout speed, but alacritty and terminal.app are fast enough that they’re actually limited by the speed of tmux.

Emacs-eshell is technically not a terminal, but I also tested eshell because it can be used as a terminal alternative for some use cases. Emacs, with both eshell and term, is actually slow enough that I care about the speed at which it can sink stdout. When I’ve used eshell or term in the past, I find that I sometimes have to wait for a few thousand lines of text to scroll by if I run a command with verbose logging to stdout or stderr. Since that happens very rarely, it’s not really a big deal to me unless it’s so slow that I end up waiting half a second or a second when it happens, and no other terminal is slow enough for that to matter.

Conversely, I type individual characters often enough that I’ll notice tail latency. Say I type at 120wpm and that results in 600 characters per minute, or 10 characters per second of input. Then I’d expect to see the 99.9% tail (1 in 1000) every 100 seconds!
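As a sanity check on that arithmetic, here's a tiny Python sketch (the typing speed and percentile are illustrative assumptions, not measurements from this post) that converts a typing speed into how often you'd expect to hit a given latency percentile:

# How often a typist hits a given tail-latency percentile (illustrative numbers only).
def seconds_between_tail_events(wpm, percentile):
    chars_per_second = wpm * 5 / 60.0   # roughly 5 characters per word
    tail_fraction = 1 - percentile      # e.g., 0.001 for the 99.9th percentile
    keystrokes_per_event = 1 / tail_fraction
    return keystrokes_per_event / chars_per_second

print(seconds_between_tail_events(120, 0.999))  # 100.0 seconds, matching the estimate above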

Anyway, the cat “benchmark” that I care about more is whether or not I can ^C a process when I’ve accidentally run a command that outputs millions of lines to the screen instead of thousands of lines. For that benchmark, every terminal is fine except for hyper and emacs-eshell, both of which hung for at least ten minutes (I killed each process after ten minutes, rather than waiting for the terminal to catch up).

Memory usage at startup is also included in the table for reference because that's the other measurement I see people benchmark terminals with. While I think that it's a bit absurd that terminals can use 40MB at startup, even the three-year-old hand-me-down laptop I'm using has 16GB of RAM, so squeezing that 40MB down to 2MB doesn't have any appreciable effect on user experience. Heck, even the $300 chromebook we recently got has 16GB of RAM.

Conclusion

Most terminals have enough latency that the user experience could be improved if the terminals concentrated more on latency and less on other features or other aspects of performance. However, when I search for terminal benchmarks, I find that terminal authors, if they benchmark anything, benchmark the speed of sinking stdout or memory usage at startup. This is unfortunate because most “low performance” terminals can already sink stdout many orders of magnitude faster than humans can keep up with, so further optimizing stdout throughput has a relatively small impact on actual user experience for most users. Likewise for reducing memory usage when an idle terminal uses 0.01% of the memory on my old and now quite low-end laptop.

If you work on a terminal, perhaps consider spending relatively more effort optimizing latency and interactivity (e.g., responsiveness to ^C) and relatively less optimizing throughput and idle memory usage.

Update: In response to this post, the author of alacritty explains where alacritty's latency comes from and describes how alacritty could reduce its latency.

Appendix: negative results

Tmux and latency: I tried tmux and various terminals and found that the differences were within the range of measurement noise.

Shells and latency: I tried a number of shells and found that, even in the quickest terminal, the difference between shells was within the range of measurement noise. Powershell was somewhat problematic to test with the setup I was using because it doesn’t handle colors correctly (the first character typed shows up with the color specified by the terminal, but other characters are yellow regardless of setting, which appears to be an open issue), which confused the image recognition setup I used. Powershell also doesn’t consistently put the cursor where it should be -- it jumps around randomly within a line, which also confused the image recognition setup I used. However, despite its other problems, powershell had comparable performance to other shells.

Shells and stdout throughput: As above, the speed difference between different shells was within the range of measurement noise.

Single-line vs. multiline text and throughput: Although some text editors bog down with extremely long lines, throughput was similar when I shoved a large file into a terminal whether the file was all one line or was line broken every 80 characters.

Head of line blocking / coordinated omission: I ran these tests with input at a rate of 10.3 characters per second. But it turns out this doesn't matter much: at input rates that humans are capable of, the latencies are quite similar to doing input once every 10.3 seconds. It's possible to overwhelm a terminal, and hyper is the first to start falling over at high input rates, but the speed necessary to make the tail latency worse is beyond the rate at which any human I know of can type.

Appendix: experimental setup

All tests were done on a dual core 2.6GHz 13” Mid-2014 MacBook Pro. The machine has 16GB of RAM and a 2560x1600 screen. The OS X version was 10.12.5. Some tests were done in Linux (Lubuntu 16.04) to get a comparison between macOS and Linux. 10k keypresses were used for each latency measurement.

Latency measurements were done with the . key and throughput was done with default base32 output, which is all plain ASCII text. George King notes that different kinds of text can change output speed:

I’ve noticed that Terminal.app slows dramatically when outputting non-latin unicode ranges. I’m aware of three things that might cause this: having to load different font pages, and having to parse code points outside of the BMP, and wide characters.

The first probably boils down to a very complicated mix of lazy loading of font glyphs, font fallback calculations, and caching of the glyph pages or however that works.

The second is a bit speculative, but I would bet that Terminal.app uses Cocoa’s UTF16-based NSString, which almost certainly hits a slow path when code points are above the BMP due to surrogate pairs.

Terminals were fullscreened before running tests. This affects test results, and resizing the terminal windows can and does significantly change performance (e.g., it’s possible to get hyper to be slower than iterm2 by changing the window size while holding everything else constant). st on macOS was running as an X client under XQuartz. To see if XQuartz is inherently slow, I tried runes, another "native" Linux terminal that uses XQuartz; runes had much better tail latency than st and iterm2.

The “idle” latency tests were done on a freshly rebooted machine. All terminals were running, but input was only fed to one terminal at a time.

The “loaded” latency tests were done with rust compiling in the background, 15s after the compilation started.

Terminal bandwidth tests were done by creating a large, pseudo-random, text file with

timeout 64 sh -c 'cat /dev/urandom | base32 > junk.txt'

and then running

timeout 8 sh -c 'cat junk.txt | tee junk.term_name'

Terminator and urxvt weren’t tested because they weren’t completely trivial to install on mac and I didn’t want to futz around to make them work. Terminator was easy to build from source, but it hung on startup and didn’t get to a shell prompt. Urxvt installed through brew, but one of its dependencies (also installed through brew) was the wrong version, which prevented it from starting.

Thanks to Kamal Marhubi, Leah Hanson, Wesley Aptekar-Cassels, David Albert, Vaibhav Sagar, Indradhanush Gupta, Rudi Chen, Laura Lindzey, Ahmad Jarara, George King, Tim Dierks, Nikith Naide, Veit Heller, and Nick Bergson-Shilcock for comments/corrections/discussion.

Tue, 18 Jul 2017 00:00:00 +0000


Keyboard v. mouse

Which is faster, keyboard or mouse? A large number of programmers believe that the keyboard is faster for all (programming-related) tasks. However, there are a few widely cited webpages that claim that “studies” show that using the mouse is faster than using the keyboard for everything and that people who think that using the keyboard is faster are just deluding themselves. This might sound extreme, but, just for example, one page says that the author has “never seen [the keyboard] outperform the mouse”.

But it can’t be the case that the mouse is faster for everything -- almost no one is faster at clicking on an on-screen keyboard with a mouse than typing at a physical keyboard. Conversely, there are tasks for which mice are much better suited than keyboards (e.g., aiming in FPS games). For someone without an agenda, the question shouldn’t be, which is faster at all tasks, but which tasks are faster with a keyboard, which are faster with a mouse, and which are faster when both are used?

You might ask if any of this matters. It depends! One of the best programmers I know is a hunt-and-peck typist, so it's clearly possible to be a great programmer without having particularly quick input speed. But I'm in the middle of an easy data munging task where I'm limited by the speed at which I can type in a large amount of boring code. If I were quicker, this task would be quicker, and there are tasks I don't do now that I might do if I were faster. I can type at > 100 wpm, which isn't bad, but I can talk at > 400 wpm and I can think much faster than I can talk. I'm often rate limited even when talking; typing is much worse, and the half-a-second here and one-second there I spend on navigation certainly doesn't help. When I first got started in tech, I had a mundane test/verification/QA role where my primary job was to triage test failures. Even before I started automating tasks, I could triage nearly twice as many bugs per day as other folks in the same role because I took being efficient at basic navigation tasks seriously. Nowadays, my jobs aren't 90% rote anymore, but my guess is that about a third of the time I spend in front of a computer is spent on mindless tasks that are rate-limited by my input and navigation speed. If I could get faster at those mundane tasks and have to spend less time on them and more time doing things that are fun, that would be great.

Anyway, to start, let’s look at the cited studies to see where the mouse is really faster. Most references on the web, when followed all the way back, point to AskTog, a site by Bruce Tognazzini, who describes himself as a “recognized leader in human/computer interaction design”.

The most cited AskTog page on the topic claims that they've spent $50M on R&D and done all kinds of studies; the page claims that, among other things, the $50M in R&D showed “Test subjects consistently report that keyboarding is faster than mousing” and “The stopwatch consistently proves mousing is faster than keyboarding.” The claim is that this both proves that the mouse is faster than the keyboard and explains why programmers think the keyboard is faster than the mouse even though it’s slower. However, the result is unreproducible: “Tog” not only doesn’t cite the details of the experiments, he doesn’t even describe the experiments; he just makes a blanket claim.

The second widely cited AskTog page is in response to a response to the previous page, and it simply repeats that the first page showed that keyboard shortcuts are slower. While there’s a lot of sarcasm, like “Perhaps we have all been misled these years. Perhaps the independent studies that show over and over again that Macintosh users are more productive, can learn quicker, buy more software packages, etc., etc., etc., are somehow all flawed. Perhaps....”, no actual results are cited, as before. There is, however, a pseudo-scientific explanation of why the mouse is faster than the keyboard:

Command Keys Aren’t Faster. As you know from my August column, it takes just as long to decide upon a command key as it does to access the mouse. The difference is that the command-key decision is a high-level cognitive function of which there is no long-term memory generated. Therefore, subjectively, keys seem faster when in fact they usually take just as long to use.

Since mouse acquisition is a low-level cognitive function, the user need not abandon cognitive process on the primary task during the acquisition period. Therefore, the mouse acquirer achieves greater productivity.

One question this raises is, why should typing on the keyboard be any different from using command keys? There certainly are people who aren’t fluent at touch typing who have to think about which key they’re going to press when they type. Those people are very slow typists, perhaps even slower than someone who’s quick at using the mouse to type via an on screen keyboard. But there are also people who are fluent with the keyboard and can type without consciously thinking about which keys they’re going to press. The implicit claim here is that it’s not possible to be fluent with command keys in the same way it’s possible to be fluent with the keyboard for typing. It’s possible that’s true, but I find the claim to be highly implausible, both in principle, and from having observed people who certainly seem to be fluent with command keys, and the claim has no supporting evidence.

The third widely cited AskTog page cites a single experiment, where the author typed a paragraph and then had to replace every “e” with a “|”, either using cursor keys or the mouse. The author found that the average time for using cursor keys was 99.43 seconds and the average time for the mouse was 50.22 seconds. No information about the length of the paragraph or the number of “e”s was given. The third page was in response to a user who cited specific editing examples where they found that they were faster with a keyboard than with a mouse.

My experience with benchmarking is that the vast majority of microbenchmarks have wrong or misleading results because they’re difficult to set up properly, and even when set up properly, understanding how the microbenchmark results relate to real-world results requires a deep understanding of the domain. As a result, I’m deeply skeptical of broad claims that come from microbenchmarks unless the author has a demonstrated, deep understanding of benchmarking their particular domain, and even then I’ll ask why they believe their result generalizes. The opinion that microbenchmarks are very difficult to interpret properly is widely shared among people who understand benchmarking.

The e -> | replacement task described is not only a microbenchmark, it's a bizarrely artificial microbenchmark.

Based on the times given in the result, the task was either for very naive users, or disallowed any kind of search and replace functionality. This particular AskTog column is in response to a programmer who mentioned editing tasks, so the microbenchmark is meaningless unless that programmer is trapped in an experiment where they’re not allowed to use their editor’s basic functionality. Moreover, the replacement task itself is unrealistic -- how often do people replace e with |?

I timed this task with the bizarre no-search-and-replace restriction removed and got the following results:

The first result was from using a keyboard shortcut. The second result is something I might do if I were in someone else’s emacs setup, which has different keyboard shortcuts mapped; emacs lets you run a command by hitting “M-x” and typing the entire name of the command. That’s much slower than using a keyboard shortcut directly, but still faster than using the mouse (at least for me, here). Does this mean that keyboards are great and mice are terrible? No, the result is nearly totally meaningless because I spend almost none of my time doing single-character search-and-replace, making the speed of single-character search-and-replace irrelevant.

Also, since I’m used to using the keyboard, the mouse speed here is probably unusually slow. That’s doubly true here because my normal editor setup (emacs -nw) doesn’t allow for mouse usage, so I ended up using an unfamiliar editor, TextEdit, for the mouse test. I did each task once in order to avoid “practicing” the exact task, which could unrealistically make the keyboard-shortcut version nearly instantaneous because it’s easy to hit a practiced sequence of keys very quickly. However, this meant that I was using an unfamiliar mouse in an unfamiliar set of menus for the mouse test. Furthermore, like many people who’ve played FPS games in the distant past, I’m used to having “mouse acceleration” turned off, but the Mac has this on by default and I didn’t go through the rigmarole necessary to disable mouse acceleration. Additionally, the recording program I used (QuickTime) made the entire machine laggy, which probably affects mousing speed more than keyboard speed, and the menu setup for the program I happened to use forced me to navigate through two levels of menus.

That being said, despite not being used to the mouse, if I want to find a microbenchmark where I’m faster with the mouse than with the keyboard, that’s easy: let me try selecting a block of text that’s on the screen but not near my cursor:

I tend to do selection of blocks in emacs by searching for something at the start of the block, setting a mark, and then searching for something at the end of the block. I typically type three characters to make sure that I get a unique chunk of text (and I’ll type more if it’s text where I don’t think three characters will cut it). This makes the selection task somewhat slower than the replacement task because the replacement task used single characters and this task used multiple characters.

The mouse is so much better suited for selecting a block of text that even with an unfamiliar mouse setup where I end up having to make a correction instead of being able to do the selection in one motion, the mouse is over twice as fast. But, if I wanted to select something that was off screen and the selection was so large that it wouldn’t fit on one screen, the keyboard time wouldn’t change and the mouse time would get much slower, making the keyboard faster.

In addition to doing the measurements, I also (informally) polled people to ask if they thought the keyboard or the mouse would be faster for specific tasks. Both search-and-replace and select-text are tasks where the result was obvious to most people. But not all tasks are obvious; scrolling was one where people didn’t have strong opinions one way or another. Let’s look at scrolling, which is a task both the keyboard and the mouse are well suited for. To have something concrete, let’s look at scrolling down 4 pages:

There’s some difference, and I suspect that if I repeated the experiment enough times I could get a statistically significant result, but the difference is small enough that it isn’t of practical significance.

Contra Tog’s result, which was that everyone believes the keyboard is faster even though the mouse is faster, I find that people are pretty good at estimating which device is faster for which tasks and also at estimating when both devices will give a similar result. One possible reason is that I’m polling programmers, and in particular, programmers at RC, who are probably a different population than whoever Tog might’ve studied in his studies. He was in a group that was looking at how to design the UI for a general purpose computer in the 80s, where it would actually have been unreasonable to focus on studying people who grew up using computers and then chose a career where they use computers all day. The equivalent population would’ve had to start using computers in the 60s or even earlier, but even if they had, input devices were quite different (the ball mouse wasn’t invented until 1972, and it certainly wasn’t in wide use the moment it was invented). There’s nothing wrong with studying populations who aren’t relatively expert at using computer input devices, but there is something wrong with generalizing those results to people who are relatively expert.

Unlike claims by either keyboard or mouse advocates, when I do experiments myself, the results are mixed. Some tasks are substantially faster if I use the keyboard and some are substantially faster if I use the mouse. Moreover, most of the results are easily predictable (and when the results are similar, people predict that neither device will have a clear advantage). If we look at the most widely cited, authoritative, results on the web, we find that they make very strong claims that the mouse is much faster than the keyboard but back up the claim with nothing but a single, bogus, experiment. It’s possible that some of the vaunted $50M in R&D went into valid experiments, but those experiments, if they exist, aren’t cited.

I spent some time reviewing the literature on the subject, but couldn’t find anything conclusive. Rather than do a point-by-point summary of each study (like I did here for another controversial topic), I’ll mention the high-level issues that make the studies irrelevant to me. All studies I could find had at least one of the issues listed below; if you have a link to a study that isn’t irrelevant for one of the following reasons, I’d love to hear about it!

  1. Age of study: it’s unclear how a study on interacting with computers from the mid-80s transfers to how people interact with computers today. Even ignoring differences in editing programs, there are large differences in the interface. Mice are more precise and a decent modern optical mouse can be moved as fast as a human can move it without the tracking becoming erratic, something that isn’t true of any mouse I’ve tried from the 80s and was only true of high quality mice from the 90s when the balls were recently cleaned and the mouse was on a decent quality mousepad. Keyboards haven’t improved as much, but even so, I can type substantially faster on a modern, low-travel keyboard than on any keyboard I’ve tried from the 80s.
  2. Narrow microbenchmarking: not all of these are as irrelevant as the e -> | without search and replace task, but even in the case of tasks that aren’t obviously irrelevant, it’s not clear what the impact of the result is on actual work I might do.
  3. Not keyboard vs. mouse: a tiny fraction of published studies are on keyboard vs. mouse interaction. When a study is on device interaction, it’s often about some new kind of device or a new interaction model.
  4. Vague description: a lot of studies will say something like they found a 7.8% improvement, with results being significant with p < 0.005, without providing enough information to tell if the results are actually significant or merely statistically significant (recall that the practically insignificant scrolling result was a 0.08s difference, which could also be reported as a 16.3% improvement).
  5. Unskilled users: in one, typical, paper, they note that it can take users as long as two seconds to move the mouse from one side of the screen to a scrollbar on the other side of the screen. While there’s something to be said for doing studies on unskilled users in order to figure out what sorts of interfaces are easiest for users who have the hardest time, a study on users who take 2 seconds to get their mouse onto the scrollbar doesn’t appear to be relevant to my user experience. When I timed this for myself, it took 0.21s to get to the scrollbar from the other side of the screen and scroll a short distance, despite using an unfamiliar mouse with different sensitivity than I’m used to and running a recording program which made mousing more difficult than usual.
  6. Seemingly unreasonable results: some studies claim to show large improvements in overall productivity when switching from one type of device to another (e.g., a 20% total productivity gain from switching types of mice).

Conclusion

It’s entirely possible that the mysterious studies Tog’s org spent $50M on prove that the mouse is faster than the keyboard for all tasks other than raw text input, but there doesn’t appear to be enough information to tell what the actual studies were. There are many public studies on user input, but I couldn’t find any that are relevant to whether or not I should use the mouse more or less at the margin.

When I look at various tasks myself, the results are mixed, and they’re mixed in the way that most programmers I polled predicted. This result is so boring that it would barely be worth mentioning if not for the large groups of people who believe that either the keyboard is always faster than the mouse or vice versa.

Please let me know if there are relevant studies on this topic that I should read! I’m not familiar with the relevant fields, so it’s possible that I’m searching with the wrong keywords and reading the wrong papers.

Appendix: note to self

I didn't realize that scrolling was so fast relative to searching (not explicitly mentioned in the blog post, but 1/2 of the text selection task). I tend to use search to scroll to things that are offscreen, but it appears that I should consider scrolling instead when I don't want to drop my cursor in a specific position.

Thanks to Leah Hanson, Quentin Pradet, Alex Wilson, and Gaxun for comments/corrections on this post and to Annie Cherkaev, Chris Ball, Stefan Lesser, and David Isaac Lee for related discussion.

Tue, 13 Jun 2017 00:00:00 +0000


Options v. cash

I often talk to startups that claim that their compensation package has a higher expected value than the equivalent package at a place like Facebook, Google, Twitter, or Snapchat. One thing I don’t understand about this claim is, if the claim is true, why shouldn’t the startup go to an investor, sell their options for what they claim their options to be worth, and then pay me in cash? The non-obvious value of options combined with their volatility is a barrier for recruiting.

Additionally, given my risk function and the risk function of VCs, this appears to be a better deal for everyone. Like most people, extra income gives me diminishing utility, but VCs have an arguably nearly linear utility in income. Moreover, even if VCs shared my risk function, because VCs hold a diversified portfolio of investments, the same options would be worth more to them than they are to me because they can diversify away downside risk much more effectively than I can. If these startups are making a true claim about the value of their options, there should be a trade here that makes all parties better off.

In a classic series of essays written a decade ago, seemingly aimed at convincing people to either found or join startups, Paul Graham stated, “Risk and reward are always proportionate.” This assertion is used to back the claim that people can make more money, in expectation, by joining startups and taking risky equity packages than they can by taking jobs that pay cash or cash plus public equity. However, the premise -- that risk and reward are always proportionate -- isn’t true in the general case. Only assets whose risk cannot be diversified away carry a risk premium (on average). Since VCs can and do diversify risk away, there’s no reason to believe that an individual employee who “invests” in startup options by working at a startup is getting a deal because of the risk involved. And by the way, when you look at historical returns, VC funds don’t appear to outperform other investment classes even though they get to buy a kind of startup equity that has less downside risk than the options you get as a normal employee.

So how come startups can’t or won’t take on more investment and pay their employees in cash? Let’s start by looking at some cynical reasons, followed by some less cynical reasons.

Cynical reasons

One possible answer, perhaps the simplest possible answer, is that options aren’t worth what startups claim they’re worth and startups prefer options because their lack of value is less obvious than it would be with cash. A simplistic argument that this might be the case is, if you look at the amount investors pay for a fraction of an early-stage or mid-stage startup and look at the extra cash the company would have been able to raise if they gave their employee option pool to investors, it usually isn’t enough to pay employees competitive compensation packages. Given that VCs don’t, on average, have outsized returns, this seems to imply that employee options aren’t worth as much as startups often claim. Compensation is much cheaper if you can convince people to take an arbitrary number of lottery tickets in a lottery of unknown value instead of cash.

Some common ways that employee options are misrepresented are:

Strike price as value

A company that gives you 1M options with a strike price of $10 might claim that those are “worth” $10M. However, if the share price stays at $10 for the lifetime of the option, the options will end up being worth $0 because an option with a $10 strike price is an option to buy the stock at $10, which is not the same as a grant of actual shares worth $10 a piece.
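To make that concrete, here's a minimal Python sketch of the payoff arithmetic; the prices are hypothetical and the sketch ignores taxes and the cost of exercising:

# Payoff of an option grant vs. a grant of actual shares (hypothetical numbers).
def option_payoff(num_options, strike, share_price):
    return num_options * max(share_price - strike, 0)

def share_payoff(num_shares, share_price):
    return num_shares * share_price

print(option_payoff(1000000, 10, 10))  # 0: the stock stayed at the strike price
print(option_payoff(1000000, 10, 15))  # 5000000: only the gain above the strike counts
print(share_payoff(1000000, 10))       # 10000000: what the "worth $10M" pitch implies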

Public valuation as value

Let’s say a company raised $300M by selling 30% of the company, giving the company an implied valuation of $1B. The most common misrepresentation I see is that the company will claim that because they’re giving an option for, say, 0.1% of the company, your option is worth $1B * 0.001 = $1M. A related, common, misrepresentation is that the company raised money last year and has increased in value since then, e.g., the company has since doubled in value, so your option is worth $2M. Even if you assume the strike price was $0 and go with the last valuation at which the company raised money, the implied value of your option isn’t $1M because investors buy a different class of stock than you get as an employee.

There are a lot of differences between the preferred stock that VCs get and the common stock that employees get; let’s look at a couple of concrete scenarios.

Let’s say those investors that paid $300M for 30% of the company have a straight (1x) liquidation preference, and the company sells for $500M. The 1x liquidation preference means that the investors will get 1x of their investment back before lowly common stock holders get anything, so the investors will get $300M for their 30% of the company. The other 70% of equity will split $200M: your 0.1% common stock option with a $0 strike price is worth $285k (instead of the $500k you might expect it to be worth if you multiply $500M by 0.001).

The preferred stock VCs get usually has at least a 1x liquidation preference. Let’s say the investors had a 2x liquidation preference in the above scenario. They would get 2x their investment back before the common stockholders split the rest of the company. Since 2 * $300M is greater than $500M, the investors would get everything and the remaining equity holders would get $0.
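Here's a minimal sketch of the waterfall arithmetic above, assuming the simplified scenario in this post (a single preferred round that takes its liquidation preference and nothing else) and ignoring strike prices, taxes, and every other term a real cap table would have:

# Simplified liquidation waterfall: preferred takes its preference off the top,
# then common splits whatever is left.
def common_payout(sale_price, invested, preference_multiple, preferred_pct, your_pct):
    preference = min(invested * preference_multiple, sale_price)
    remaining_for_common = sale_price - preference
    # your_pct is your share of the whole company; common collectively holds (1 - preferred_pct)
    return remaining_for_common * (your_pct / (1 - preferred_pct))

print(common_payout(500e6, 300e6, 1, 0.30, 0.001))  # ~285,714 with a 1x preference
print(common_payout(500e6, 300e6, 2, 0.30, 0.001))  # 0 with a 2x preference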

Another difference between your common stock and preferred stock is that preferred stock sometimes comes with an anti-dilution clause, which you have no chance of getting as a normal engineering hire. Let’s look at an actual example of dilution at a real company. Mayhar got 0.4% of a company when it was valued at $5M. By the time the company was worth $1B, Mayhar’s share of the company was diluted by 8x, which made his share of the company worth less than $500k (minus the cost of exercising his options) instead of $4M (minus the cost of exercising his options).
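A minimal sketch of that dilution arithmetic, using the numbers above (in reality, dilution happens round by round as new shares are issued, rather than as a single multiplier):

# Dilution: your percentage of the company shrinks as new shares are issued.
def diluted_value(initial_pct, dilution_factor, exit_valuation):
    return (initial_pct / dilution_factor) * exit_valuation

print(diluted_value(0.004, 8, 1e9))  # 500,000 -- vs. the naive 0.004 * 1e9 = 4,000,000
# (and that's before subtracting the cost of exercising the options)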

This story has a few additional complications which illustrate other reasons options are often worth less than they seem. Mayhar couldn’t afford to exercise his options (by paying the strike price times the number of shares he had an option for) when he joined, which is common for people who take startup jobs out of college who don’t come from wealthy families. When he left four years later, he could afford to pay the cost of exercising the options, but due to a quirk of U.S. tax law, he either couldn’t afford the tax bill or didn’t want to pay that cost for what was still a lottery ticket -- when you exercise your options, you’re effectively taxed on the difference between the current valuation and the strike price. Even if the company has a successful IPO for 10x as much in a few years, you’re still liable for the tax bill the year you exercise (and if the company stays private indefinitely or fails, you get nothing but a future tax deduction). Because, like most options, Mayhar’s option has a 90-day exercise window, he didn’t get anything from his options.

While that’s more than the average amount of dilution, there are much worse cases, for example, cases where investors and senior management basically get to keep their equity and everyone else gets diluted to the point where their equity is worthless.

Those are just a few of the many ways in which the differences between preferred and common stock can cause the value of options to be wildly different from a value naively calculated from a public valuation. I often see both companies and employees use public preferred stock valuations as a benchmark in order to precisely value common stock options, but this isn’t possible, even in principle, without access to a company’s cap table (which shows how much of the company different investors own) as well as access to the specific details of each investment. Even if you can get that (which you usually can’t), determining the appropriate numbers to plug into a model that will give you the expected value is non-trivial because it requires answering questions like “what’s the probability that, in an acquisition, upper management will collude with investors to keep everything and leave the employees with nothing?”

Black-Scholes valuation as value

Because of the issues listed above, people will sometimes try to use a model to estimate the value of options. Black-Scholes is the model most commonly used because it’s well known and has an easy-to-use closed-form solution. Unfortunately, most of the major assumptions behind Black-Scholes are false for startup options, making the relationship between the output of Black-Scholes and the actual value of your options non-obvious.
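For reference, here's the textbook Black-Scholes price for a European call in Python; this is the standard formula, not anything specific to startup options, and the inputs below are hypothetical. The point of the paragraph above is that plugging startup numbers into it quietly assumes things like a liquid underlying asset, log-normal returns, a known volatility, and a single share class, none of which hold for startup options:

import math

# Textbook Black-Scholes price for a European call option.
def black_scholes_call(spot, strike, t_years, rate, volatility):
    norm_cdf = lambda x: 0.5 * (1 + math.erf(x / math.sqrt(2)))
    d1 = (math.log(spot / strike) + (rate + volatility ** 2 / 2) * t_years) / (volatility * math.sqrt(t_years))
    d2 = d1 - volatility * math.sqrt(t_years)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * t_years) * norm_cdf(d2)

# Hypothetical inputs: $10 share value, $10 strike, 4-year horizon, 2% rate, 70% volatility.
print(black_scholes_call(10, 10, 4, 0.02, 0.7))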

Options are often free to the company

A large fraction of options get returned to the employee option pool when employees leave, either voluntarily or involuntarily. I haven’t been able to find comprehensive numbers on this, but anecdotally, I hear that more than 50% of options end up getting taken back from employees and returned to the general pool. Dan McKinley points out an (unvetted) analysis that shows that only 5% of employee grants are exercised. Even with a conservative estimate, a 50% discount on options granted sounds pretty good. A 20x discount sounds amazing, and would explain why companies like options so much.

Present value of a future sum of money

When someone says that a startup’s compensation package is worth as much as Facebook’s, they often mean that the total value paid out over N years is similar. But a fixed nominal amount of money is worth more the sooner you get it because you can (at a minimum) invest it in a low-risk asset, like Treasury bonds, and get some return on the money.

That’s an abstract argument you’ll hear in an econ 101 class, but in practice, if you live somewhere with a relatively high cost of living, like SF or NYC, there’s an even greater value to getting paid sooner rather than later because it lets you live in a relatively nice place (however you define nice) without having to cram into a space with more roommates than would be considered reasonable elsewhere in the U.S. Many startups from the last two generations seem to be putting off their IPOs; for folks in those companies with contracts that prevent them from selling options on a secondary market, that could easily mean that the majority of their potential wealth is locked up for the first decade of their working life. Even if the startup’s compensation package is worth more when adjusting for inflation and interest, it’s not clear if that’s a great choice for most people who aren’t already moderately well off.
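The discounting argument above is just the standard present-value formula; here's a minimal sketch with hypothetical numbers:

# Present value of a payment received n years from now, discounted at annual rate r.
def present_value(future_amount, r, n):
    return future_amount / (1 + r) ** n

# Hypothetical: $100k of equity value locked up for 8 years, discounted at a 3% risk-free rate,
# is worth less than $80k of cash today -- before accounting for any startup-specific risk.
print(present_value(100000, 0.03, 8))  # ~78,941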

Non-cynical reasons

We’ve looked at some cynical reasons companies might want to offer options instead of cash, namely that they can claim that their options are worth more than they’re actually worth. Now, let’s look at some non-cynical reasons companies might want to give out stock options.

From an employee standpoint, one non-cynical reason might have been stock option backdating, at least until that loophole was mostly closed. Up until the early 2000s, many companies backdated the date of options grants. Let’s look at this example, explained by Jesse M. Fried:

Options covering 1.2 million shares were given to Reyes. The reported grant date was October 1, 2001, when the firm's stock was trading at around $13 per share, the lowest closing price for the year. A week later, the stock was trading at $20 per share, and a month later the stock closed at almost $26 per share.

Brocade disclosed this grant to investors in its 2002 proxy statement in a table titled "Option Grants in the Last Fiscal Year," prepared in the format specified by SEC rules. Among other things, the table describes the details of this and other grants to executives, including the number of shares covered by the option grants, the exercise price, and the options' expiration date. The information in this table is used by analysts, including those assembling Standard & Poor's well-known ExecuComp database, to calculate the Black Scholes value for each option grant on the date of grant. In calculating the value, the analysts assumed, based on the firm's representations about its procedure for setting exercise prices, that the options were granted at-the-money. The calculated value was then widely used by shareholders, researchers, and the media to estimate the CEO's total pay. The Black Scholes value calculated for Reyes' 1.2 million stock option grant, which analysts assumed was at-the-money, was $13.2 million.

However, the SEC has concluded that the option grant to Reyes was backdated, and the market price on the actual date of grant may have been around $26 per share. Let us assume that the stock was in fact trading at $26 per share when the options were actually granted. Thus, if Brocade had adhered to its policy of giving only at-the-money options, it should have given Reyes options with a strike price of $26 per share. Instead, it gave Reyes options with a strike price of $13 per share, so that the options were $13 in the money. And it reported the grant as if it had given Reyes at-the-money options when the stock price was $13 per share.

Had Brocade given Reyes at-the-money options at a strike price of $26 per share, the Black Scholes value of the option grant would have been approximately $26 million. But because the options were $13 million in the money, they were even more valuable. According to one estimate, they were worth $28 million. Thus, if analysts had been told that Reyes received options with a strike price of $13 when the stock was trading for $26, they would have reported their value as $28 million rather than $13.2 million. In short, backdating this particular option grant, in the scenario just described, would have enabled Brocade to give Reyes $2 million more in options (Black Scholes value) while reporting an amount that was $15 million less.

While stock options backdating isn’t (easily) possible anymore, there might be other loopholes or consequences of tax law that make options a better deal than cash. I could only think of one reason off the top of my head, so I spent a couple weeks asking folks (including multiple founders) for their non-cynical reasons why startups might prefer options to an equivalent amount of cash.

Tax benefit of ISOs

In the U.S., Incentive stock options (ISOs) have the property that, if held for one year after the exercise date and two years after the grant date, the owner of the option pays long-term capital gains tax instead of ordinary income tax on the difference between the sale price and the strike price.

This isn’t quite as good as it sounds because the difference between the fair market value at exercise and the strike price is subject to the Alternative Minimum Tax (AMT). I don’t find this personally relevant since I prefer to sell employer stock as quickly as possible in order to be as diversified as possible, but if you’re interested in figuring out how the AMT affects your tax bill when you exercise ISOs, see this explanation for more details.

Tax benefit of QSBS

There’s a certain class of stock that is exempt from federal capital gains tax and state tax in many states (though not in CA). This is interesting, but it seems like people rarely take advantage of this when eligible, and many startups aren’t eligible.

Tax benefit of other options

The IRS says:

Most nonstatutory options don't have a readily determinable fair market value. For nonstatutory options without a readily determinable fair market value, there's no taxable event when the option is granted but you must include in income the fair market value of the stock received on exercise, less the amount paid, when you exercise the option. You have taxable income or deductible loss when you sell the stock you received by exercising the option. You generally treat this amount as a capital gain or loss.

Valuations are bogus

One quirk of stock options is that, to qualify as ISOs, the strike price must be at least the fair market value. That’s easy to determine for public companies, but the fair market value of a share in a private company is somewhat arbitrary. For ISOs, my reading of the requirement is that companies must make “an attempt, made in good faith” to determine the fair market value. For other types of options, there’s other regulation which determines the definition of fair market value. Either way, startups usually go to an outside firm between 1 and N times a year to get an estimate of the fair market value for their common stock. This results in at least two possible gaps between a hypothetical “real” valuation and the fair market value for options purposes.

First, the valuation is updated relatively infrequently. A common pitch I’ve heard is that the company hasn’t had its valuation updated for ages, and the company is worth twice as much now, so you’re basically getting a 2x discount.

Second, the firms doing the valuations are poorly incentivized to produce “correct” valuations. The firms are paid by startups, which gain something when the legal valuation is as low as possible.

I don’t really believe that these things make options amazing, because I hear these exact things from startups and founders, which means that their offers take these into account and are priced accordingly. However, if there’s a large gap between the legal valuation and the “true” valuation and this allows companies to effectively give out higher compensation, the way stock option backdating did, I could see how this would tilt companies towards favoring options.

Control

Even if employees got the same class of stock that VCs get, founders would retain less control if they transferred the equity from employees to VCs because employee-owned equity is spread between a relatively large number of people.

Retention

This answer was commonly given to me as a non-cynical reason. The idea is that, if you offer employees options and have a clause that prevents them from selling options on a secondary market, many employees won’t be able to leave without walking away from the majority of their compensation. Personally, this strikes me as a cynical reason, but that’s not how everyone sees it. For example, Andreessen Horowitz managing partner Scott Kupor recently proposed a scheme under which employees would lose their options under all circumstances if they leave before a liquidity event, supposedly in order to help employees.

Whether or not you view employers being able to lock in employees for indeterminate lengths of time as good or bad, options lock-in appears to be a poor retention mechanism -- companies that pay cash seem to have better retention. Just for example, Netflix pays salaries that are comparable to the total compensation in the senior band at places like Google and, anecdotally, they seem to have less attrition than trendy Bay Area startups. In fact, even though Netflix makes a lot of noise about showing people the door if they’re not a good fit, they don’t appear to have a higher involuntary attrition rate than trendy Bay Area startups -- they just seem more honest about it, something which they can do because their recruiting pitch doesn’t involve you walking away with below-market compensation if you leave. If you think this comparison is unfair because Netflix hasn’t been a startup in recent memory, you can compare to finance startups, e.g. Headlands, which was founded in the same era as Uber, Airbnb, and Stripe. They (and some other finance startups) pay out hefty sums of cash and this does not appear to result in higher attrition than similarly aged startups which give out illiquid option grants.

In the cases where this results in the employee staying longer than they otherwise would, options lock-in is often a bad deal for all parties involved. The situation is obviously bad for employees and, on average, companies don’t want unhappy people who are just waiting for a vesting cliff or liquidity event.

Incentive alignment

Another commonly stated reason is that, if you give people options, they’ll work harder because they’ll do well when the company does well. This was the reason that was given most vehemently (“you shouldn’t trust someone who’s only interested in a paycheck”, etc.)

However, as far as I can tell, paying people in options almost totally decouples job performance and compensation. If you look at companies that have made a lot of people rich, like Microsoft, Google, Apple, and Facebook, almost none of the employees who became rich had an instrumental role in the company’s success. Google and Microsoft each made thousands of people rich, but the vast majority of those folks just happened to be in the right place at the right time and could have just as easily taken a different job where they didn't get rich. Conversely, the vast majority of startup option packages end up being worth little to nothing, but nearly none of the employees whose options end up being worthless were instrumental in causing their options to become worthless.

If options are a large fraction of compensation, choosing a company that’s going to be successful is much more important than working hard. For reference, Microsoft is estimated to have created roughly 10^3 millionaires by 1992 (adjusted for inflation, 1992's $1M is roughly $1.75M today). The stock then went up by more than 20x. Microsoft was legendary for making people who didn't particularly do much rich; all told, it's been estimated that they made 10^4 people rich by the late 90s. The vast majority of those people were no different from people in similar roles at Microsoft's competitors. They just happened to pick a winning lottery ticket. This is the opposite of what founders claim they get out of giving options. As above, companies that pay cash, like Netflix, don’t seem to have a problem with employee productivity.

By the way, a large fraction of the people who were made rich by working at Microsoft joined after their IPO, which was in 1986. The same is true of Google, and while Facebook is too young for us to have a good idea what the long-term post-IPO story is, the folks who joined a year or two after the IPO (5 years ago, in 2012) have done quite well for themselves. People who joined pre-IPO have done better, but as mentioned above, most people have diminishing returns to individual wealth. The same power-law-like distribution that makes VC work also means that it's entirely plausible that Microsoft alone made more post-IPO people rich from 1986-1999 than all pre-IPO tech companies combined during that period. Something similar is plausibly true for Google from 2004 until FB's IPO in 2012, even including the people who got rich from FB's IPO as people who were made rich by a pre-IPO company, and you can do a similar calculation for Apple.

VC firms vs. the market

There are several potential counter-arguments to the statement that VC returns (and therefore startup equity) don’t beat the market.

One argument is, when people say that, they typically mean that after VCs take their fees, returns to VC funds don’t beat the market. As an employee who gets startup options, you don’t (directly) pay VC fees, which means you can beat the market by keeping the VC fees for yourself.

Another argument is that some investors (like YC) seem to consistently do pretty well. If you join a startup that’s funded by savvy investors, you too can do pretty well. For this to make sense, you have to realize that the company is worth more than “expected” while the company doesn’t share that realization, because you need the company to give you an option package without properly accounting for its value. For you to have that expectation and get a good deal, this requires the founders to not only not be overconfident in the company’s probability of success, but actually requires that the founders are underconfident. While this isn’t impossible, the majority of startup offers I hear about have the opposite problem.

Conclusion

There are a number of factors that can make options more or less valuable than they seem. From an employee standpoint, the factors that make options more valuable than they seem can cause equity to be worth tens of percent more than a naive calculation. The factors that make options less valuable than they seem do so in ways that mostly aren’t easy to quantify.

Whether the factors that make options relatively more valuable or the factors that make them relatively less valuable dominate is an empirical question. My intuition is that the factors that make options relatively less valuable are stronger, but that’s just a guess. A way to get an idea about this from public data would be to go through successful startup S-1 filings. Since this post is already ~5k words, I’ll leave that for another post, but I’ll note that in my preliminary skim of a handful of 99%-ile exits (> $1B), the median employee seems to do worse than someone who’s on the standard Facebook/Google/Amazon career trajectory.

From a company standpoint, there are a couple factors that allow companies to retain more leverage/control by giving relatively more options to employees and relatively less equity to investors.

All of this sounds fine for founders and investors, but I don’t see what’s in it for employees. If you have additional reasons that I’m missing, I’d love to hear them.

If you liked this post, you may also like this other post on the tradeoff between working at a big company and working at a startup.

Appendix: caveats

Many startups don’t claim that their offers are financially competitive. As time goes on, I hear less “If you wanted to get rich, how would you do it? I think your best bet would be to start or join a startup. That's been a reliable way to get rich for hundreds of years.” (that’s an actual Paul Graham quote) and more “we’re not financially competitive with Facebook, but…”. I’ve heard from multiple founders that joining as an early employee is an incredibly bad deal when you compare early-employee equity and workload vs. founder equity and workload.

Some startups are giving out offers that are actually competitive with large company offers. Something I’ve seen from startups that are trying to give out compelling offers is that, for “senior” folks, they’re willing to pay substantially higher salaries than public companies because it’s understood that options aren’t great for employees because of their timeline, risk profile, and expected value.

There’s a huge amount of variation in offers, much of which is effectively random. I know of cases where an individual got a more lucrative offer from a startup (that doesn’t tend to give particularly strong offers) than from Google, and if you ask around you’ll hear about a lot of cases like that. It’s not always true that startup offers are lower than Google/Facebook/Amazon offers, even at startups that don’t pay competitively (on average).

Anything in this post that’s related to taxes is U.S. specific. For example, I’m told that in Canada, “you can defer the payment of taxes when exercising options whose strike price is way below fair market valuation until disposition, as long as the company is Canadian-controlled and operated in Canada”.

You might object that the same line of reasoning we looked at for options can be applied to RSUs, even RSUs for public companies. That’s true: although the largest downsides of startup options are mitigated or non-existent with public-company RSUs, cash still has significant advantages to employees over RSUs. Unfortunately, the only non-finance company I know of that uses this to their advantage in recruiting is Netflix; please let me know if you can think of other tech companies that use the same compensation model.

Some startups have a sliding scale that lets you choose different amounts of option/salary compensation. I haven't seen an offer that will let you put the slider to 100% cash and 0% options (or 100% options and 0% cash), but someone out there will probably be willing to give you an all-cash offer.

In the current environment, looking at public exits may bias the data towards less successful companies. The most successful startups from the last couple generations of startups that haven't exited by acquisition have so far chosen not to IPO. It's possible that, once all the data are in, the average returns to joining a startup will look quite different (although I doubt the median return will change much).

BTW, I don't have anything against taking a startup offer, even if it's low. When I graduated from college, I took the lowest offer I had, and my partner recently took the lowest offer she got (nearly a 2x difference over the highest offer). There are plenty of reasons you might want to take an offer that isn't the best possible financial offer. However, I think you should know what you're getting into and not take an offer that you think is financially great when it's merely mediocre or even bad.

Appendix: non-counterarguments

The most common objection I’ve heard to this is that most startups don’t have enough money to pay equivalent cash and couldn’t raise that much money by selling off what would “normally” be their employee option pool. Maybe so, but that’s not a counter-argument -- it’s an argument that most startups don’t have options that are valuable enough to be exchanged for the equivalent sum of money, i.e., that the options simply aren’t as valuable as claimed. This argument can be phrased in a variety of ways (e.g., paying salary instead of options increases burn rate, reduces runway, makes the startup default dead, etc.), but arguments of this form are fundamentally equivalent to admitting that startup options aren’t worth much, because the arguments wouldn't hold up if the options were worth enough that a typical compensation package was worth as much as a typical "senior" offer at Google or Facebook.

If you don't buy this, imagine a startup with a typical valuation that's at a stage where they're giving out 0.1% equity in options to new hires. Now imagine that some irrational bystander is willing to make a deal where they take 0.1% of the company for $1B. Is it worth it to take the money and pay people out of the $1B cash pool instead of paying people with 0.1% slices of the option pool? Your answer should be yes, unless you believe that the ratio between the value of cash on hand and equity is nearly infinite. Absolute statements like "options are preferred to cash because paying cash increases burn rate, making the startup default dead" at any valuation are equivalent to stating that the correct ratio is infinity. That's clearly nonsensical; there's some correct ratio, and we might disagree over what the correct ratio is, but for typical startups it should not be the case that the correct ratio is infinite. Since this was such a common objection, if you have this objection, my question to you is, why don't you argue that startups should pay even less cash and even more options? Is the argument that the current ratio is exactly optimal, and if so, why? Also, why does the ratio vary so much between different companies at the same stage which have raised roughly the same amount of money? Are all of those companies giving out optimal deals?

The second most common objection is that startup options are actually worth a lot, if you pick the right startup and use a proper model to value the options. Perhaps, but if that’s true, why couldn’t they have raised a bit more money by giving away more equity to VCs at its true value, and then pay cash?

Another common objection is something like "I know lots of people who've made $1m from startups". Me too, but I also know lots of people who've made much more than that working at public companies. This post is about the relative value of compensation packages, not the absolute value.

Acknowledgements

Thanks to Leah Hanson, Ben Kuhn, Tim Abbott, David Turner, Nick Bergson-Shilcock, Peter Fraenkel, Joe Ardent, Chris Ball, Anton Dubrau, Sean Talts, Danielle Sucher, Dan McKinley, Bert Muthalaly, Dan Puttick, Indradhanush Gupta, and Gaxun for comments and corrections.

Wed, 07 Jun 2017 00:00:00 +0000


Web bloat

A couple years ago, I took a road trip from Wisconsin to Washington and mostly stayed in rural hotels on the way. I expected the internet in rural areas too sparse to have cable internet to be slow, but I was still surprised that a large fraction of the web was inaccessible. Some blogs with lightweight styling were readable, as were pages by academics who hadn’t updated the styling on their website since 1995. But very few commercial websites were usable (other than Google). When I measured my connection, I found that the bandwidth was roughly comparable to what I got with a 56k modem in the 90s. The latency and packetloss were significantly worse than the average day on dialup: latency varied between 500ms and 1000ms and packetloss varied between 1% and 10%. Those numbers are comparable to what I’d see on dialup on a bad day.

Despite my connection being only a bit worse than it was in the 90s, the vast majority of the web wouldn’t load. Why shouldn’t the web work with dialup or a dialup-like connection? It would be one thing if I tried to watch youtube and read pinterest. It’s hard to serve videos and images without bandwidth. But my online interests are quite boring from a media standpoint. Pretty much everything I consume online is plain text, even if it happens to be styled with images and fancy javascript. In fact, I recently tried using w3m (a terminal-based web browser that, by default, doesn’t support css, javascript, or even images) for a week and it turns out there are only two websites I regularly visit that don’t really work in w3m (twitter and zulip, both fundamentally text based sites, at least as I use them)1.

More recently, I was reminded of how poorly the web works for people on slow connections when I tried to read a joelonsoftware post while using a flaky mobile connection. The HTML loaded but either one of the five CSS requests or one of the thirteen javascript requests timed out, leaving me with a broken page. Instead of seeing the article, I saw three entire pages of sidebar, menu, and ads before getting to the title because the page required some kind of layout modification to display reasonably. Pages are often designed so that they're hard or impossible to read if some dependency fails to load. On a slow connection, it's quite common for at least one dependency to fail. After refreshing the page twice, the page loaded as it was supposed to and I was able to read the blog post, a fairly compelling post on eliminating dependencies.

Complaining that people don’t care about performance like they used to and that we’re letting bloat slow things down for no good reason is “old man yells at cloud” territory; I probably sound like that dude who complains that his word processor, which used to take 1MB of RAM, takes 1GB of RAM. Sure, that could be trimmed down, but there’s a real cost to spending time doing optimization and even a $300 laptop comes with 2GB of RAM, so why bother? But it’s not quite the same situation -- it’s not just nerds like me who care about web performance. In the U.S., AOL alone had over 2 million dialup users in 2015. Outside of the U.S., there are even more people with slow connections. I recently chatted with Ben Kuhn, who spends a fair amount of time in Africa, about his internet connection:

I've seen ping latencies as bad as ~45 sec and packet loss as bad as 50% on a mobile hotspot in the evenings from Jijiga, Ethiopia. (I'm here now and currently I have 150ms ping with no packet loss but it's 10am). There are some periods of the day where it ~never gets better than 10 sec and ~10% loss. The internet has gotten a lot better in the past ~year; it used to be that bad all the time except in the early mornings.

Speedtest.net reports 2.6 mbps download, 0.6 mbps upload. I realized I probably shouldn't run a speed test on my mobile data because bandwidth is really expensive.

Our server in Ethiopia has a fiber uplink, but it frequently goes down and we fall back to a 16kbps satellite connection, though I think normal people would just stop using the Internet in that case.

If you think browsing on a 56k connection is bad, try a 16k connection from Ethiopia!

Everything we’ve seen so far is anecdotal. Let’s load some websites that programmers might frequent with a variety of simulated connections to get data on page load times. webpagetest lets us see how long it takes a web site to load (and why it takes that long) from locations all over the world. It even lets us simulate different kinds of connections as well as load sites on a variety of mobile devices. The times listed in the table below are the time until the page is “visually complete”; as measured by webpagetest, that’s the time until the above-the-fold content stops changing.

   #  URL                        Size   C   Load time in seconds
                                 (MB)       FIOS  Cable  LTE   3G    2G    Dial  Bad   😱
   0  http://bellard.org         0.01   5   0.40  0.59   0.60  1.2   2.9   1.8   9.5   7.6
   1  http://danluu.com          0.02   2   0.20  0.20   0.40  0.80  2.7   1.6   6.4   7.6
   2  news.ycombinator.com       0.03   1   0.30  0.49   0.69  1.6   5.5   5.0   14    27
   3  danluu.com                 0.03   2   0.20  0.40   0.49  1.1   3.6   3.5   9.3   15
   4  http://jvns.ca             0.14   7   0.49  0.69   1.2   2.9   10    19    29    108
   5  jvns.ca                    0.15   4   0.50  0.80   1.2   3.3   11    21    31    97
   6  fgiesen.wordpress.com      0.37  12   1.0   1.1    1.4   5.0   16    66    68    FAIL
   7  google.com                 0.59   6   0.80  1.8    1.4   6.8   19    94    96    236
   8  joelonsoftware.com         0.72  19   1.3   1.7    1.9   9.7   28    140   FAIL  FAIL
   9  bing.com                   1.3   12   1.4   2.9    3.3   11    43    134   FAIL  FAIL
  10  reddit.com                 1.3   26   7.5   6.9    7.0   20    58    179   210   FAIL
  11  signalvnoise.com           2.1    7   2.0   3.5    3.7   16    47    173   218   FAIL
  12  amazon.com                 4.4   47   6.6   13     8.4   36    65    265   300   FAIL
  13  steve-yegge.blogspot.com   9.7   19   2.2   3.6    3.3   12    36    206   188   FAIL
  14  blog.codinghorror.com      23    24   6.5   15     9.5   83    235   FAIL  FAIL  FAIL

Each row is a website. For sites that support both plain HTTP as well as HTTPS, both were tested; URLs are HTTPS except where explicitly specified as HTTP. The Size and C columns show the amount of data transferred over the wire in MB (which includes headers, handshaking, compression, etc.) and the number of TCP connections made. The rest of the columns show the time in seconds to load the page on a variety of connections from fiber (FIOS) to less good connections. “Bad” has the bandwidth of dialup, but with 1000ms ping and 10% packetloss, which is roughly what I saw when using the internet in small rural hotels. “😱” simulates a 16kbps satellite connection from Jijiga, Ethiopia. Rows are sorted by the measured amount of data transferred.
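For intuition about why the “Bad” and “😱” columns are so brutal, here is a rough estimate of what latency and packetloss alone do to a single TCP connection, using the well-known Mathis et al. approximation (throughput is bounded by roughly MSS / (RTT * sqrt(loss))). This is a sketch for intuition, not a model of the actual webpagetest runs:

  from math import sqrt

  def tcp_throughput_bound(mss_bytes, rtt_seconds, loss_rate):
      # Mathis et al. approximation: rough upper bound on steady-state TCP throughput.
      return mss_bytes / (rtt_seconds * sqrt(loss_rate))   # bytes per second

  mss = 1460                                                # typical segment size
  bad = tcp_throughput_bound(mss, rtt_seconds=1.0, loss_rate=0.10)
  print(f"'Bad' profile (1000ms RTT, 10% loss): ~{bad * 8 / 1000:.0f} kbps per connection")
  # Roughly 37 kbps: even before hitting the dialup bandwidth cap, latency and
  # packetloss by themselves keep a single connection below 56k modem speeds.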

The timeout for tests was 6 minutes; anything slower than that is listed as FAIL. Pages that failed to load are also listed as FAIL. A few things that jump out from the table are:

  1. A large fraction of the web is unusable on a bad connection. Even on a good (0% packetloss, no ping spike) dialup connection, some sites won’t load.
  2. Some sites will use a lot of data!

The web on bad connections

As commercial websites go, Google is basically as good as it gets for people on a slow connection. On dialup, the 50%-ile page load time is a minute and a half. But at least it loads -- when I was on a slow, shared, satellite connection in rural Montana, virtually no commercial websites would load at all. I could view websites that only had static content via Google cache, but the live site had no hope of loading.

Some sites will use a lot of data

Although only two really big sites were tested here, there are plenty of sites that will use 10MB or 20MB of data. If you’re reading this from the U.S., maybe you don’t care, but if you’re browsing from Mauritania, Madagascar, or Vanuatu, loading codinghorror once will cost you more than 10% of the daily per capita GNI.
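The arithmetic behind that claim is simple. In the sketch below, the data price and GNI figures are made-up placeholders (the real numbers vary by country and by plan); only the page weight comes from the table above:

  # The data price and GNI below are hypothetical placeholders, not real figures.
  page_mb = 23                      # measured weight of blog.codinghorror.com (see table)
  usd_per_mb = 0.05                 # hypothetical prepaid mobile data price
  annual_gni_per_capita = 1500      # hypothetical annual GNI per capita, in USD

  daily_income = annual_gni_per_capita / 365
  cost_of_one_load = page_mb * usd_per_mb
  print(f"one page load: ${cost_of_one_load:.2f} "
        f"({cost_of_one_load / daily_income:.0%} of a day's per-capita income)")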

Page weight matters

Despite the best efforts of Maciej, the meme that page weight doesn’t matter keeps getting spread around. AFAICT, the top HN link of all time on web page optimization is to an article titled “Ludicrously Fast Page Loads - A Guide for Full-Stack Devs”. At the bottom of the page, the author links to another one of his posts, titled “Page Weight Doesn’t Matter”.

Usually, the boogeyman that gets pointed at is bandwidth: users in low-bandwidth areas (3G, developing world) are getting shafted. But the math doesn’t quite work out. Akamai puts the global connection speed average at 3.9 megabits per second.

The “ludicrously fast” guide fails to display properly on dialup or slow mobile connections because the images time out. On reddit, it also fails under load: "Ironically, that page took so long to load that I closed the window.", "a lot of … gifs that do nothing but make your viewing experience worse", "I didn't even make it to the gifs; the header loaded then it just hung.", etc.

The flaw in the “page weight doesn’t matter because average speed is fast” argument is that if you average the connection of someone in my apartment building (which is wired for 1Gbps internet) and someone on 56k dialup, you get an average speed of 500 Mbps. That doesn’t mean the person on dialup is actually going to be able to load a 5MB website. The average speed of 3.9 Mbps comes from a 2014 Akamai report, but it’s just an average. If you look at Akamai’s 2016 report, you can find entire countries where more than 90% of IP addresses are slower than that!
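The same arithmetic, spelled out (units are kbps):

  from statistics import mean

  fiber_user = 1_000_000      # 1 Gbps apartment connection, in kbps
  dialup_user = 56            # 56k modem, in kbps
  print(f"average: {mean([fiber_user, dialup_user]) / 1000:,.0f} Mbps")   # ~500 Mbps
  print(f"what the dialup user actually gets: {dialup_user} kbps")
  # A mean -- whether this one or Akamai's 3.9 Mbps -- says nothing about how
  # long the slow tail waits for a 5MB page.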

Yes, there are a lot of factors besides page weight that matter, and yes, it's possible to create a contrived page that's very small but loads slowly (or a huge page that loads okay because the weight isn't blocking), but total page weight is still pretty decently correlated with load time.

Since its publication, the "ludicrously fast" guide was updated with some javascript that only loads images if you scroll down far enough. That makes it look a lot better on webpagetest if you're looking at the page size number (if webpagetest isn't being scripted to scroll), but it's a worse user experience for people on slow connections who want to read the page. If you're going to read the entire page anyway, the weight increases, and you can no longer preload images by loading the site. Instead, if you're reading, you have to stop for a few minutes at every section to wait for the images from that section to load. And that's if you're lucky and the javascript for loading images didn't fail to load.

The average user fallacy

Just like many people develop with an average connection speed in mind, many people have a fixed view of who a user is. Maybe they think there are customers with a lot of money and fast connections, and customers who won't spend money and have slow connections. That is, very roughly speaking, perhaps true on average, but sites don't operate on averages; they operate in particular domains. Jamie Brandon writes the following about his experience with Airbnb:

I spent three hours last night trying to book a room on airbnb through an overloaded wifi and presumably a satellite connection. OAuth seems to be particularly bad over poor connections. Facebook's OAuth wouldn't load at all and Google's sent me round a 'pick an account' -> 'please reenter you password' -> 'pick an account' loop several times. It took so many attempts to log in that I triggered some 2fa nonsense on airbnb that also didn't work (the confirmation link from the email led to a page that said 'please log in to view this page') and eventually I was just told to send an email to account.disabled@airbnb.com, who haven't replied.

It's particularly galling that airbnb doesn't test this stuff, because traveling is pretty much the whole point of the site so they can't even claim that there's no money in servicing people with poor connections.

What about tail latency?

My original plan for this post was to show 50%-ile, 90%-ile, 99%-ile, etc., tail load times. But the 50%-ile results are so bad that I don’t know if there’s any point to showing the other results. If you were to look at the 90%-ile results, you’d see that most pages fail to load on dialup and that the “Bad” and “😱” connections are hopeless for almost all sites.

HTTP vs HTTPS

   #  URL                   Size   C   Load time in seconds
                            (kB)       FIOS  Cable  LTE   3G    2G    Dial  Bad   😱
   1  http://danluu.com     21.1   2   0.20  0.20   0.40  0.80  2.7   1.6   6.4   7.6
   3  https://danluu.com    29.3   2   0.20  0.40   0.49  1.1   3.6   3.5   9.3   15

You can see that for a very small site that doesn’t load many blocking resources, HTTPS is noticeably slower than HTTP, especially on slow connections. Practically speaking, this doesn’t matter today because virtually no sites are that small, but if you design a web site as if people with slow connections actually matter, this is noticeable.
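Most of that gap is round trips: a full TLS 1.2 handshake adds roughly two extra round trips (plus a few kB of certificates) before the first byte of the page arrives, and on a high-latency link those round trips dominate. Here is a crude sketch of just the round-trip cost, with illustrative RTTs rather than the measured ones; TLS 1.3 and session resumption reduce the overhead:

  def time_to_first_byte(rtt_seconds, tls_round_trips=0):
      # 1 RTT for the TCP handshake + 1 RTT for request/response, plus TLS round trips.
      # This ignores DNS, server time, and transfer time; it's only the handshake cost.
      return (1 + 1 + tls_round_trips) * rtt_seconds

  for label, rtt in [("cable (~30ms)", 0.03), ("dialup (~150ms)", 0.15), ("'Bad' (1000ms)", 1.0)]:
      plain = time_to_first_byte(rtt)
      tls = time_to_first_byte(rtt, tls_round_trips=2)   # full TLS 1.2 handshake
      print(f"{label}: HTTP ~{plain:.2f}s vs HTTPS ~{tls:.2f}s before any content arrives")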

How to make pages usable on slow connections

The long version is, to really understand what’s going on, consider reading High Performance Browser Networking, a great book on web performance that’s available for free.

The short version is that most sites are so poorly optimized that someone who has no idea what they’re doing can get a 10x improvement in page load times for a site whose job is to serve up text with the occasional image. When I started this blog in 2013, I used Octopress because Jekyll/Octopress was the most widely recommended static site generator back then. A plain blog post with one or two images took 11s to load on a cable connection because the Octopress defaults included multiple useless javascript files in the header (for never-used-by-me things like embedding flash videos and delicious integration), which blocked page rendering. Just moving those javascript includes to the footer halved page load time, and making a few other tweaks decreased page load time by another order of magnitude. At the time I made those changes, I knew nothing about web page optimization beyond what I’d heard in a 2-minute blurb from a 40-minute talk on how the internet works, and I was still able to get a 20x speedup on my blog in a few hours. You might argue that I’ve now gone too far and removed too much CSS, but I got a 20x speedup for people on fast connections before making changes that affected the site’s appearance (and the speedup on slow connections was much larger).

That’s normal. Popular themes for many different kinds of blogging software and CMSs contain anti-optimizations so blatant that any programmer, even someone with no front-end experience, can find large gains by just pointing webpagetest at their site and looking at the output.
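For example, here is a minimal sketch of kicking off a run against the public WebPageTest API from Python. The endpoint, parameters, and response field (runtest.php, url, k, location, f, userUrl) are based on my reading of the public API and you need your own API key, so double-check them against the current docs before relying on this:

  import requests

  # Endpoint and parameter names per my reading of the public WebPageTest API.
  resp = requests.get(
      "https://www.webpagetest.org/runtest.php",
      params={
          "url": "https://example.com",      # site to test (placeholder)
          "k": "YOUR_API_KEY",               # placeholder API key
          "location": "Dulles:Chrome.3G",    # location:browser.connectivity
          "f": "json",                       # return JSON instead of a redirect
      },
      timeout=30,
  )
  resp.raise_for_status()
  print("results:", resp.json()["data"]["userUrl"])   # link to the results page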

What about browsers?

While it's easy to blame page authors because there's a lot of low-hanging fruit on the page side, there's just as much low-hanging fruit on the browser side. Why does my browser open up 6 TCP connections to try to download six images at once when I'm on a slow satellite connection? That just guarantees that all six images will time out! Even if I tweak the timeout on the client side, servers that are configured to protect against DoS attacks won't allow long-lived connections that aren't doing anything. I can sometimes get some images to load by refreshing the page a few times (and waiting ten minutes each time), but why shouldn't the browser handle retries for me? If you think about it for a few minutes, there are a lot of optimizations that browsers could do for people on slow connections, but because they don't, the best current solution for users appears to be: use w3m when you can, and then switch to a browser with ad-blocking when that doesn't work. But why should users have to use two entirely different programs, one of which has a text-based interface only computer nerds will find palatable?
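For comparison, retry-with-backoff is a few lines in a plain Python HTTP client using the requests and urllib3 libraries; this is a sketch of the general technique, not of anything browsers actually ship:

  import requests
  from requests.adapters import HTTPAdapter
  from urllib3.util.retry import Retry

  session = requests.Session()
  retries = Retry(
      total=5,                                     # retry each request up to 5 times
      backoff_factor=2,                            # exponential backoff between attempts
      status_forcelist=[429, 500, 502, 503, 504],  # also retry on these status codes
  )
  session.mount("https://", HTTPAdapter(max_retries=retries))
  session.mount("http://", HTTPAdapter(max_retries=retries))

  # On a flaky connection, a timed-out image or script gets retried automatically
  # instead of leaving the page permanently broken.
  resp = session.get("https://example.com/some-image.png", timeout=60)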

Conclusion

When I was at Google, someone told me a story about a time that “they” completed a big optimization push only to find that measured page load times increased. When they dug into the data, they found that the reason load times had increased was that they got a lot more traffic from Africa after doing the optimizations. The team’s product went from being unusable for people with slow connections to usable, which caused so many users with slow connections to start using the product that load times actually increased.

Last night, at a presentation on the websockets protocol, Gary Bernhardt made the observation that the people who designed the websockets protocol did things like using a variable-length field for frame length to save a few bytes. By contrast, if you look at the Alexa top 100 sites, almost all of them have a huge amount of slop in them; it’s plausible that the total bandwidth used for those 100 sites is greater than the total bandwidth for all websockets connections combined. Despite that, if we just look at the three top-35 sites tested in this post, two send uncompressed javascript over the wire, two redirect the bare domain to the www subdomain, and two send a lot of extraneous information by not compressing images as much as they could be compressed without sacrificing quality. If you look at twitter, which isn’t in our table but was mentioned above, they actually do an anti-optimization where, if you upload a PNG that isn’t even particularly well optimized, they’ll re-encode it as a jpeg which is larger and has visible artifacts!

“Use bcrypt” has become the mantra for a reasonable default if you’re not sure what to do when storing passwords. The web would be a nicer place if “use webpagetest” caught on in the same way. It’s not always the best tool for the job, but it sure beats the current defaults.

Appendix: experimental caveats

The above tests were done by repeatedly loading pages via a private webpagetest image in AWS west 2, on a c4.xlarge VM, with simulated connections on a first page load in Chrome with no other tabs open and nothing running on the VM other than the webpagetest software and the browser. This is unrealistic in many ways.

In relative terms, this disadvantages sites that have a large edge presence. When I was in rural Montana, I ran some tests and found that I had noticeably better latency to Google than to basically any other site. This is not reflected in the test results. Furthermore, this setup means that pages are nearly certain to be served from a CDN cache. That shouldn't make any difference for sites like Google and Amazon, but it reduces the page load time of less-trafficked sites that aren't "always" served out of cache. For example, when I don't have a post trending on social media, between 55% and 75% of traffic is served out of a CDN cache, and when I do have something trending on social media, it's more like 90% to 99%. But the test setup means that the CDN cache hit rate during the test is likely to be > 99% for my site and other blogs which aren't so widely read that they'd normally always have a cached copy available.

All tests were run assuming a first page load, but it’s entirely reasonable for sites like Google and Amazon to assume that many or most of their assets are cached. Testing first page load times is perhaps reasonable for sites with a traffic profile like mine, where much of the traffic comes from social media referrals of people who’ve never visited the site before.

A c4.xlarge is a fairly powerful machine. Today, most page loads come from mobile and even the fastest mobile devices aren’t as fast as a c4.xlarge; most mobile devices are much slower than the fastest mobile devices. Most desktop page loads will also be from a machine that’s slower than a c4.xlarge. Although the results aren’t shown, I also ran a set of tests using a t2.micro instance: for simple sites, like mine, the difference was negligible, but for complex sites, like Amazon, page load times were as much as 2x worse. As you might expect, for any particular site, the difference got smaller as the connection got slower.

As Joey Hess pointed out, many dialup providers attempt to do compression or other tricks to reduce the effective weight of pages and none of these tests take that into account.

Firefox, IE, and Edge often have substantially different performance characteristics from Chrome. For that matter, different versions of Chrome can have different performance characteristics. I just used Chrome because it’s the most widely used desktop browser, and running this set of tests took over a full day of VM time with a single browser.

The simulated bad connections add a constant latency and fixed (10%) packetloss. In reality, poor connections have highly variable latency with peaks that are much higher than the simulated latency, and periods of much higher packetloss that can last for minutes, hours, or days. Putting 😱 at the rightmost side of the table may make it seem like the worst possible connection, but packetloss can get much worse.

Similarly, while codinghorror happens to be at the bottom of the table, it's nowhere near being the slowest loading page. Just for example, I originally considered including slashdot in the table, but it was so slow that it caused a significant increase in total test run time because it timed out at six minutes so many times. Even on FIOS it takes 15s to load, making a whopping 223 requests over 100 TCP connections despite weighing in at "only" 1.9MB. Amazingly, slashdot also pegs the CPU at 100% for 17 entire seconds while loading on FIOS. In retrospect, this might have been a good site to include because it's pathologically mis-optimized sites like slashdot that allow the "page weight doesn't matter" meme to sound reasonable.

The websites compared don't do the same thing. Just looking at the blogs, some blogs put entire blog entries on the front page, which is more convenient in some ways, but also slower. Commercial sites are even more different -- they often can't reasonably be static sites and have to have relatively large javascript payloads in order to work well.

Appendix: irony

The main table in this post is almost 50kB of HTML (without compression or minification); that’s larger than everything else in this post combined. That table is curiously large because I used a library (pandas) to generate the table instead of just writing a script to do it by hand, and as we know, the default settings for most libraries generate a massive amount of bloat. It didn’t even save time because every single built-in time-saving feature that I wanted to use was buggy, which forced me to write all of the heatmap/gradient/styling code myself anyway! Due to laziness, I left the pandas table-generating scaffolding code in place, resulting in a table that looks like it’s roughly an order of magnitude larger than it needs to be.

This isn't a criticism of pandas. Pandas is probably quite good at what it's designed for; it's just not designed to produce slim websites. The CSS class names are huge, which is reasonable if you want to avoid accidental name collisions for generated CSS. Almost every td, th, and tr element is tagged with a redundant rowspan=1 or colspan=1, which is reasonable for generated code if you don't care about size. Each cell has its own CSS class, even though many cells share styling with other cells; again, this probably simplified things on the code generation side. Every piece of bloat is totally reasonable. And unfortunately, there's no tool that I know of that will take a bloated table and turn it into a slim table. A pure HTML minifier can't change the class names because it doesn't know that some external CSS or JS doesn't depend on the class name. An HTML minifier could theoretically determine that different cells have the same styling and merge them, except for the aforementioned problem with potential but non-existent external dependencies, but that's beyond the capability of the tools I know of.
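You can see the difference directly. Here is a minimal sketch comparing pandas' default HTML output to a hand-rolled table with the same content; the exact sizes depend on the pandas version, and the Styler output used for the heatmap in this post is more verbose still:

  import pandas as pd

  df = pd.DataFrame({"site": ["danluu.com", "google.com"], "MB": [0.02, 0.59]})

  generated = df.to_html()   # default output: border attribute, classes, thead/tbody, etc.

  rows = "".join(
      f"<tr><td>{site}</td><td>{mb}</td></tr>"
      for site, mb in zip(df["site"], df["MB"])
  )
  handmade = f"<table><tr><th>site</th><th>MB</th></tr>{rows}</table>"

  print(len(generated), len(handmade))   # the generated markup is several times larger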

For another level of irony, consider that while I think of a 50kB table as bloat, this page is 12kB when gzipped, even with all of the bloat. Google's AMP currently has > 100kB of blocking javascript that has to load before the page loads! There's no reason for me to use AMP pages because AMP is slower than my current setup of pure HTML with a few lines of embedded CSS and the occasional image, but, as a result, I'm penalized by Google (relative to AMP pages) for not "accelerating" (decelerating) my page with AMP.

Thanks to Leah Hanson, Jason Owen, and Lindsey Kuper for comments/corrections


  1. excluding internal Microsoft stuff that’s required for work. Many of the sites are IE only and don’t even work in Edge. I didn’t try those sites in w3m but I doubt they’d work! In fact, I doubt that even half of the non-IE specific internal sites would work in w3m.

Wed, 08 Feb 2017 00:00:00 +0000


HN: the good parts

HN comments are terrible. On any topic I’m informed about, the vast majority of comments are pretty clearly wrong. Most of the time, there are zero comments from people who know anything about the topic and the top comment is reasonable sounding but totally incorrect. Additionally, many comments are gratuitously mean. You'll often hear mean comments backed up with something like "this is better than the other possibility, where everyone just pats each other on the back with comments like 'this is great'", as if being an asshole is some sort of talisman against empty platitudes. I've seen people push back against that; when pressed, people often say that it’s either impossible or inefficient to teach someone without being mean, as if telling someone that they're stupid somehow helps them learn. It's as if people learned how to explain things by watching Simon Cowell and can't comprehend the concept of an explanation that isn't littered with personal insults. Paul Graham has said, "Oh, you should never read Hacker News comments about anything you write”. Most of the negative things you hear about HN comments are true.

And yet, I haven’t found a public internet forum with better technical commentary. On topics I'm familiar with, while it's rare that a thread will have even a single comment that's well-informed, when those comments appear, they usually float to the top. On other forums, well-informed comments either don't exist or, when they do appear (which is even more rarely than on HN), get buried by reasonable-sounding but totally wrong comments.

By volume, there are probably more interesting technical “posts” in comments than in links. Well, that depends on what you find interesting, but that’s true for my interests. If I see a low-level optimization comment from nkurz, a comment on business from patio11, or a comment on how companies operate from nostrademons, I know that I’m almost certainly going to read an interesting comment. There are maybe 20 to 30 people I can think of who don’t blog much, but write great comments on HN, and I doubt I even know of half the people who are writing great comments on HN1.

I compiled a very abbreviated list of comments I like because comments seem to get lost. If you write a blog post, people will refer to it years later, but comments mostly disappear. I think that’s sad -- there’s a lot of great material on HN (and yes, even more not-so-great material).

What’s the deal with MS Word’s file format?

Basically, the Word file format is a binary dump of memory. I kid you not. They just took whatever was in memory and wrote it out to disk. We can try to reason why (maybe it was faster, maybe it made the code smaller), but I think the overriding reason is that the original developers didn't know any better.

Later as they tried to add features they had to try to make it backward compatible. This is where a lot of the complexity lies. There are lots of crazy workarounds for things that would be simple if you allowed yourself to redesign the file format. It's pretty clear that this was mandated by management, because no software developer would put themselves through that hell for no reason.

Later they added a fast-save feature (I forget what it is actually called). This appends changes to the file without changing the original file. The way they implemented this was really ingenious, but complicates the file structure a lot.

One thing I feel I must point out (I remember posting a huge thing on slashdot when this article was originally posted) is that 2 way file conversion is next to impossible for word processors. That's because the file formats do not contain enough information to format the document. The most obvious place to see this is pagination. The file format does not say where to paginate a text flow (unless it is explicitly entered by the user). It relies of the formatter to do it. Each word processor formats text completely differently. Word, for example famously paginates footnotes incorrectly. They can't change it, though, because it will break backwards compatibility. This is one of the only reasons that Word Perfect survives today -- it is the only word processor that paginates legal documents the way the US Department of Justice requires.

Just considering the pagination issue, you can see what the problem is. When reading a Word document, you have to paginate it like Word -- only the file format doesn't tell you what that is. Then if someone modifies the document and you need to resave it, you need to somehow mark that it should be paginated like Word (even though it might now have features that are not in Word). If it was only pagination, you might be able to do it, but practically everything is like that.

I recommend reading (a bit of) the XML Word file format for those who are interested. You will see large numbers of flags for things like "Format like Word 95". The format doesn't say what that is -- because it's pretty obvious that the authors of the file format don't know. It's lost in a hopeless mess of legacy code and nobody can figure out what it does now.

Fun with NULL

Here's another example of this fine feature:

  #include <stdio.h>
  #include <string.h>
  #include <stdlib.h>
  #define LENGTH 128

  int main(int argc, char **argv) {
      char *string = NULL;
      int length = 0;
      if (argc > 1) {
          string = argv[1];
          length = strlen(string);
          if (length >= LENGTH) exit(1);
      }

      char buffer[LENGTH];
      memcpy(buffer, string, length);
      buffer[length] = 0;

      if (string == NULL) {
          printf("String is null, so cancel the launch.\n");
      } else {
          printf("String is not null, so launch the missiles!\n");
      }

      printf("string: %s\n", string);  // undefined for null but works in practice

      #if SEGFAULT_ON_NULL
      printf("%s\n", string);          // segfaults on null when bare "%s\n"
      #endif

      return 0;
  }

  nate@skylake:~/src$ clang-3.8 -Wall -O3 null_check.c -o null_check
  nate@skylake:~/src$ null_check
  String is null, so cancel the launch.
  string: (null)

  nate@skylake:~/src$ icc-17 -Wall -O3 null_check.c -o null_check
  nate@skylake:~/src$ null_check
  String is null, so cancel the launch.
  string: (null)

  nate@skylake:~/src$ gcc-5 -Wall -O3 null_check.c -o null_check
  nate@skylake:~/src$ null_check
  String is not null, so launch the missiles!
  string: (null)

It appear that Intel's ICC and Clang still haven't caught up with GCC's optimizations. Ouch if you were depending on that optimization to get the performance you need! But before picking on GCC too much, consider that all three of those compilers segfault on printf("string: "); printf("%s\n", string) when string is NULL, despite having no problem with printf("string: %s\n", string) as a single statement. Can you see why using two separate statements would cause a segfault? If not, see here for a hint: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25609

How do you make sure the autopilot backup is paying attention?

Good engineering eliminates users being able to do the wrong thing as much as possible. . . . You don't design a feature that invites misuse and then use instructions to try to prevent that misuse.

There was a derailment in Australia called the Waterfall derailment [1]. It occurred because the driver had a heart attack and was responsible for 7 deaths (a miracle it was so low, honestly). The root cause was the failure of the dead-man's switch.

In the case of Waterfall, the driver had 2 dead-man switches he could use - 1) the throttle handle had to be held against a spring at a small rotation, or 2) a bar on the floor could be depressed. You had to do 1 of these things, the idea being that you prevent wrist or foot cramping by allowing the driver to alternate between the two. Failure to do either triggers an emergency brake.

It turns out that this driver was fat enough that when he had a heart attack, his leg was able to depress the pedal enough to hold the emergency system off. Thus, the dead-man's system never triggered with a whole lot of dead man in the driver's seat.

I can't quite remember the specifics of the system at Waterfall, but one method to combat this is to require the pedal to be held halfway between released and fully depressed. The idea being that a dead leg would fully depress the pedal so that would trigger a brake, and a fully released pedal would also trigger a brake. I don't know if they had that system but certainly that's one approach used in rail.

Either way, the problem is equally possible in cars. If you lose consciousness and your foot goes limp, a heavy enough leg will be able to hold the pedal down a bit depending on where it's positioned relative to the pedal and the leverage it has on the floor.

The other major system I'm familiar with for ensuring drivers are alive at the helm is called 'vigilance'. The way it works is that periodically, a light starts flashing on the dash and the driver has to acknowledge that. If they do not, a buzzer alarm starts sounding. If they still don't acknowledge it, the train brakes apply and the driver is assumed incapacitated. Let me tell you some stories of my involvement in it.

When we first started, we had a simple vigi system. Every 30 seconds or so (for example), the driver would press a button. Ok cool. Except that then drivers became so hard-wired to pressing the button every 30 seconds that we were having instances of drivers falling asleep/dozing off and still pressing the button right on every 30 seconds because it was so ingrained into them that it was literally a subconscious action.

So we introduced random-timing vigilance, where the time varies 30-60 seconds (for example) and you could only acknowledge it within a small period of time once the light started flashing. Again, drivers started falling asleep/semi asleep and would hit it as soon as the alarm buzzed, each and every time.

So we introduced random-timing, task-linked vigilance and that finally broke the back of the problem. Now, the driver has to press a button, or turn a knob, or do a number of different activities and they must do that randomly-chosen activity, at a randomly-chosen time, for them to acknowledge their consciousness. It was only at that point that we finally nailed out driver alertness.

See also.

Prestige

Curious why he would need to move to a more prestigious position? Most people realize by their 30s that prestige is a sucker's game; it's a way of inducing people to do things that aren't much fun and they wouldn't really want to do on their own, by lauding them with accolades from people they don't really care about.

Why is FedEx based in Mephis?

. . . we noticed that we also needed:
(1) A suitable, existing airport at the hub location.
(2) Good weather at the hub location, e.g., relatively little snow, fog, or rain.
(3) Access to good ramp space, that is, where to park and service the airplanes and sort the packages.
(4) Good labor supply, e.g., for the sort center.
(5) Relatively low cost of living to keep down prices.
(6) Friendly regulatory environment.
(7) Candidate airport not too busy, e.g., don't want arriving planes to have to circle a long time before being able to land.
(8) Airport with relatively little in cross winds and with more than one runway to pick from in case of winds.
(9) Runway altitude not too high, e.g., not high enough to restrict maximum total gross take off weight, e.g., rule out Denver.
(10) No tall obstacles, e.g., mountains, near the ends of the runways.
(11) Good supplies of jet fuel.
(12) Good access to roads for 18 wheel trucks for exchange of packages between trucks and planes, e.g., so that some parts could be trucked to the hub and stored there and shipped directly via the planes to customers that place orders, say, as late as 11 PM for delivery before 10 AM.
So, there were about three candidate locations, Memphis and, as I recall, Cincinnati and Kansas City.
The Memphis airport had some old WWII hangers next to the runway that FedEx could use for the sort center, aircraft maintenance, and HQ office space. Deal done -- it was Memphis.

Why etherpad joined Wave, and why it didn’t work out as expected

The decision to sell to Google was one of the toughest decisions I and my cofounders ever had to wrestle with in our lives. We were excited by the Wave vision though we saw the flaws in the product. The Wave team told us about how they wanted our help making wave simpler and more like etherpad, and we thought we could help with that, though in the end we were unsuccessful at making wave simpler. We were scared of Google as a competitor: they had more engineers and more money behind this project, yet they were running it much more like an independent startup than a normal big-company department. The Wave office was in Australia and had almost total autonomy. And finally, after 1.5 years of being on the brink of failure with AppJet, it was tempting to be able to declare our endeavor a success and provide a decent return to all our investors who had risked their money on us.

In the end, our decision to join Wave did not work out as we had hoped. The biggest lessons learned were that having more engineers and money behind a project can actually be more harmful than helpful, so we were wrong to be scared of Wave as a competitor for this reason. It seems obvious in hindsight, but at the time it wasn't. Second, I totally underestimated how hard it would be to iterate on the Wave codebase. I was used to rewriting major portions of software in a single all-nighter. Because of the software development process Wave was using, it was practically impossible to iterate on the product. I should have done more diligence on their specific software engineering processes, but instead I assumed because they seemed to be operating like a startup, that they would be able to iterate like a startup. A lot of the product problems were known to the whole Wave team, but we were crippled by a large complex codebase built on poor technical choices and a cumbersome engineering process that prevented fast iteration.

The accuracy of tech news

When I've had inside information about a story that later breaks in the tech press, I'm always shocked at how differently it's perceived by readers of the article vs. how I experienced it. Among startups & major feature launches I've been party to, I've seen: executives that flat-out say that they're not working on a product category when there's been a whole department devoted to it for a year; startups that were founded 1.5 years before the dates listed in Crunchbase/Wikipedia; reporters that count the number of people they meet in a visit and report that as a the "team size", because the company refuses to release that info; funding rounds that never make it to the press; acquisitions that are reported as "for an undisclosed sum" but actually are less than the founders would've made if they'd taken a salaried job at the company; project start dates that are actually when the project was staffed up to its current size and ignore the year or so that a small team spent working on the problem (or the 3-4 years that other small teams spent working on the problem); and algorithms or other technologies that are widely reported as being the core of the company's success, but actually aren't even used by the company.

Self-destructing speakers from Dell

As the main developer of VLC, we know about this story since a long time, and this is just Dell putting crap components on their machine and blaming others. Any discussion was impossible with them. So let me explain a bit...

In this case, VLC just uses the Windows APIs (DirectSound), and sends signed integers of 16bits (s16) to the Windows Kernel.

VLC allows amplification of the INPUT above the sound that was decoded. This is just like replay gain, broken codecs, badly recorded files or post-amplification and can lead to saturation.

But this is exactly the same if you put your mp3 file through Audacity and increase it and play with WMP, or if you put a DirectShow filter that amplifies the volume after your codec output. For example, for a long time, VLC ac3 and mp3 codecs were too low (-6dB) compared to the reference output.

At worse, this will reduce the dynamics and saturate a lot, but this is not going to break your hardware.

VLC does not (and cannot) modify the OUTPUT volume to destroy the speakers. VLC is a Software using the OFFICIAL platforms APIs.

The issue here is that Dell sound cards output power (that can be approached by a factor of the quadratic of the amplitude) that Dell speakers cannot handle. Simply said, the sound card outputs at max 10W, and the speakers only can take 6W in, and neither their BIOS or drivers block this.

And as VLC is present on a lot of machines, it's simple to blame VLC. "Correlation does not mean causation" is something that seems too complex for cheap Dell support…

Learning on the job, startups vs. big companies

Working for someone else's startup, I learned how to quickly cobble solutions together. I learned about uncertainty and picking a direction regardless of whether you're sure it'll work. I learned that most startups fail, and that when they fail, the people who end up doing well are the ones who were looking out for their own interests all along. I learned a lot of basic technical skills, how to write code quickly and learn new APIs quickly and deploy software to multiple machines. I learned how quickly problems of scaling a development team crop up, and how early you should start investing in automation.

Working for Google, I learned how to fix problems once and for all and build that culture into the organization. I learned that even in successful companies, everything is temporary, and that great products are usually built through a lot of hard work by many people rather than great ah-ha insights. I learned how to architect systems for scale, and a lot of practices used for robust, high-availability, frequently-deployed systems. I learned the value of research and of spending a lot of time on a single important problem: many startups take a scattershot approach, trying one weekend hackathon after another and finding nobody wants any of them, while oftentimes there are opportunities that nobody has solved because nobody wants to put in the work. I learned how to work in teams and try to understand what other people want. I learned what problems are really painful for big organizations. I learned how to rigorously research the market and use data to make product decisions, rather than making decisions based on what seems best to one person.

We failed this person, what are we going to do differently?

Having been in on the company's leadership meetings where departures were noted with a simple 'regret yes/no' flag it was my experience that no single departure had any effect. Mass departures did, trends did, but one person never did, even when that person was a founder.

The rationalizations always put the issue back on the departing employee, "They were burned out", "They had lost their ability to be effective", "They have moved on", "They just haven't grown with the company" never was it "We failed this person, what are we going to do differently?"

AWS’s origin story

Anyway, the SOA effort was in full swing when I was there. It was a pain, and it was a mess because every team did things differently and every API was different and based on different assumptions and written in a different language.

But I want to correct the misperception that this lead to AWS. It didn't. S3 was written by its own team, from scratch. At the time I was at Amazon, working on the retail site, none of Amazon.com was running on AWS. I know, when AWS was announced, with great fanfare, they said "the services that power Amazon.com can now power your business!" or words to that effect. This was a flat out lie. The only thing they shared was data centers and a standard hardware configuration. Even by the time I left, when AWS was running full steam ahead (and probably running Reddit already), none of Amazon.com was running on AWS, except for a few, small, experimental and relatively new projects. I'm sure more of it has been adopted now, but AWS was always a separate team (and a better managed one, from what I could see.)

Why is Windows so slow?

I (and others) have put a lot of effort into making the Linux Chrome build fast. Some examples are multiple new implementations of the build system (http://neugierig.org/software/chromium/notes/2011/02/ninja.h... ), experimentation with the gold linker (e.g. measuring and adjusting the still off-by-default thread flags https://groups.google.com/a/chromium.org/group/chromium-dev/... ) as well as digging into bugs in it, and other underdocumented things like 'thin' ar archives.

But it's also true that people who are more of Windows wizards than I am a Linux apprentice have worked on Chrome's Windows build. If you asked me the original question, I'd say the underlying problem is that on Windows all you have is what Microsoft gives you and you can't typically do better than that. For example, migrating the Chrome build off of Visual Studio would be a large undertaking, large enough that it's rarely considered. (Another way of phrasing this is it's the IDE problem: you get all of the IDE or you get nothing.)

When addressing the poor Windows performance people first bought SSDs, something that never even occurred to me ("your system has enough RAM that the kernel cache of the file system should be in memory anyway!"). But for whatever reason on the Linux side some Googlers saw it fit to rewrite the Linux linker to make it twice as fast (this effort predated Chrome), and all Linux developers now get to benefit from that. Perhaps the difference is that when people write awesome tools for Windows or Mac they try to sell them rather than give them away.

Why is Windows so slow, an insider view

I'm a developer in Windows and contribute to the NT kernel. (Proof: the SHA1 hash of revision #102 of [Edit: filename redacted] is [Edit: hash redacted].) I'm posting through Tor for obvious reasons.

Windows is indeed slower than other operating systems in many scenarios, and the gap is worsening. The cause of the problem is social. There's almost none of the improvement for its own sake, for the sake of glory, that you see in the Linux world.

Granted, occasionally one sees naive people try to make things better. These people almost always fail. We can and do improve performance for specific scenarios that people with the ability to allocate resources believe impact business goals, but this work is Sisyphean. There's no formal or informal program of systemic performance improvement. We started caring about security because pre-SP3 Windows XP was an existential threat to the business. Our low performance is not an existential threat to the business.

See, component owners are generally openly hostile to outside patches: if you're a dev, accepting an outside patch makes your lead angry (due to the need to maintain this patch and to justify in in shiproom the unplanned design change), makes test angry (because test is on the hook for making sure the change doesn't break anything, and you just made work for them), and PM is angry (due to the schedule implications of code churn). There's just no incentive to accept changes from outside your own team. You can always find a reason to say "no", and you have very little incentive to say "yes".

What’s the probability of a successful exit by city?

See link for giant table :-).

The hiring crunch

Broken record: startups are also probably rejecting a lot of engineering candidates that would perform as well or better than anyone on their existing team, because tech industry hiring processes are folkloric and irrational.

Too long to excerpt. See the link!

Should you leave a bad job?

I am 42-year-old very successful programmer who has been through a lot of situations in my career so far, many of them highly demotivating. And the best advice I have for you is to get out of what you are doing. Really. Even though you state that you are not in a position to do that, you really are. It is okay. You are free. Okay, you are helping your boyfriend's startup but what is the appropriate cost for this? Would he have you do it if he knew it was crushing your soul?

I don't use the phrase "crushing your soul" lightly. When it happens slowly, as it does in these cases, it is hard to see the scale of what is happening. But this is a very serious situation and if left unchecked it may damage the potential for you to do good work for the rest of your life.

The commenters who are warning about burnout are right. Burnout is a very serious situation. If you burn yourself out hard, it will be difficult to be effective at any future job you go to, even if it is ostensibly a wonderful job. Treat burnout like a physical injury. I burned myself out once and it took at least 12 years to regain full productivity. Don't do it.

  • More broadly, the best and most creative work comes from a root of joy and excitement. If you lose your ability to feel joy and excitement about programming-related things, you'll be unable to do the best work. That this issue is separate from and parallel to burnout! If you are burned out, you might still be able to feel the joy and excitement briefly at the start of a project/idea, but they will fade quickly as the reality of day-to-day work sets in. Alternatively, if you are not burned out but also do not have a sense of wonder, it is likely you will never get yourself started on the good work.

  • The earlier in your career it is now, the more important this time is for your development. Programmers learn by doing. If you put yourself into an environment where you are constantly challenged and are working at the top threshold of your ability, then after a few years have gone by, your skills will have increased tremendously. It is like going to intensively learn kung fu for a few years, or going into Navy SEAL training or something. But this isn't just a one-time constant increase. The faster you get things done, and the more thorough and error-free they are, the more ideas you can execute on, which means you will learn faster in the future too. Over the long term, programming skill is like compound interest. More now means a LOT more later. Less now means a LOT less later.

So if you are putting yourself into a position that is not really challenging, that is a bummer day in and day out, and you get things done slowly, you aren't just having a slow time now. You are bringing down that compound interest curve for the rest of your career. It is a serious problem. If I could go back to my early career I would mercilessly cut out all the shitty jobs I did (and there were many of them).

Creating change when politically unpopular

A small anecdote. An acquaintance related a story of fixing the 'drainage' in their back yard. They were trying to grow some plants that were sensitive to excessive moisture, and the plants were dying. Not watering them, watering them a little, didn't seem to change. They died. A professional gardner suggested that their problem was drainage. So they dug down about 3' (where the soil was very very wet) and tried to build in better drainage. As they were on the side of a hill, water table issues were not considered. It turned out their "problem" was that the water main that fed their house and the houses up the hill, was so pressurized at their property (because it had maintain pressure at the top of the hill too) that the pipe seams were leaking and it was pumping gallons of water into the ground underneath their property. The problem wasn't their garden, the problem was that the city water supply was poorly designed.

While I have never been asked if I was an engineer on the phone, I have experienced similar things to Rachel in meetings and with regard to suggestions. Co-workers will create an internal assessment of your value and then respond based on that assessment. If they have written you off they will ignore you, if you prove their assessment wrong in a public forum they will attack you. These are management issues, and something which was sorely lacking in the stories.

If you are the "owner" of a meeting, and someone is trying to be heard and isn't. It is incumbent on you to let them be heard. By your position power as "the boss" you can naturally interrupt a discussion to collect more data from other members. Its also important to ask questions like "does anyone have any concerns?" to draw out people who have valid input but are too timid to share it.

In a highly political environment there are two ways to create change, one is through overt manipulation, which is to collect political power to yourself and then exert it to enact change, and the other is covert manipulation, which is to enact change subtly enough that the political organism doesn't react. (sometimes called "triggering the antibodies").

The problem with the latter is that if you help make positive change while keeping everyone not pissed off, no one attributes it to you (which is good for the change agent because if they knew the anti-bodies would react, but bad if your manager doesn't recognize it). I asked my manager what change he wanted to be 'true' yet he (or others) had been unsuccessful making true, he gave me one, and 18 months later that change was in place. He didn't believe that I was the one who had made the change. I suggested he pick a change he wanted to happen and not tell me, then in 18 months we could see if that one happened :-). But he also didn't understand enough about organizational dynamics to know that making change without having the source of that change point back at you was even possible.

How to get tech support from Google

Heavily relying on Google product? ✓
Hitting a dead-end with Google's customer service? ✓
Have an existing audience you can leverage to get some random Google employee's attention? ✓
Reach front page of Hacker News? ✓
Good news! You should have your problem fixed in 2-5 business days. The rest of us suckers relying on google services get to stare at our inboxes helplessly, waiting for a response to our support ticket (which will never come). I feel like it's almost a [rite] of passage these days to rely heavily on a Google service, only to have something go wrong and be left out in the cold.

Taking funding

IIRC PayPal was very similar - it was sold for $1.5B, but Max Levchin's share was only about $30M, and Elon Musk's was only about $100M. By comparison, many early Web 2.0 darlings (Del.icio.us, Blogger, Flickr) sold for only $20-40M, but their founders had only taken small seed rounds, and so the vast majority of the purchase price went to the founders. 75% of a $40M acquisition = 3% of a $1B acquisition.

Something for founders to think about when they're taking funding. If you look at the gigantic tech fortunes - Gates, Page/Brin, Omidyar, Bezos, Zuckerburg, Hewlett/Packard - they usually came from having a company that was already profitable or was already well down the hockey-stick user growth curve and had a clear path to monetization by the time they sought investment. Companies that fight tooth & nail for customers and need lots of outside capital to do it usually have much worse financial outcomes.

StackOverflow vs. Experts-Exchange

A lot of the people who were involved in some way in Experts-Exchange don't understand Stack Overflow.

The basic value flow of EE is that "experts" provide valuable "answers" for novices with questions. In that equation there's one person asking a question and one person writing an answer.

Stack Overflow recognizes that for every person who asks a question, 100 - 10,000 people will type that same question into Google and find an answer that has already been written. In our equation, we are a community of people writing answers that will be read by hundreds or thousands of people. Ours is a project more like wikipedia -- collaboratively creating a resource for the Internet at large.

Because that resource is provided by the community, it belongs to the community. That's why our data is freely available and licensed under creative commons. We did this specifically because of the negative experience we had with EE taking a community-generated resource and deciding to slap a paywall around it.

The attitude of many EE contributors, like Greg Young who calculates that he "worked" for half a year for free, is not shared by the 60,000 people who write answers on SO every month. When you talk to them you realize that on Stack Overflow, answering questions is about learning. It's about creating a permanent artifact to make the Internet better. It's about helping someone solve a problem in five minutes that would have taken them hours to solve on their own. It's not about working for free.

As soon as EE introduced the concept of money they forced everybody to think of their work on EE as just that -- work.

Making money from amazon bots

I saw that one of my old textbooks was selling for a nice price, so I listed it along with two other used copies. I priced it $1 cheaper than the lowest price offered, but within an hour both sellers had changed their prices to $.01 and $.02 cheaper than mine. I reduced it two times more by $1, and each time they beat my price by a cent or two. So what I did was reduce my price by a few dollars every hour for one day until everybody was priced under $5. Then I bought their books and changed my price back.

What running a business is like

While I like the sentiment here, I think the danger is that engineers might come to the mistaken conclusion that making pizzas is the primary limiting reagent to running a successful pizzeria. Running a successful pizzeria is more about schlepping to local hotels and leaving them 50 copies of your menu to put at the front desk, hiring drivers who will both deliver pizzas in a timely fashion and not embezzle your (razor-thin) profits while also costing next-to-nothing to employ, maintaining a kitchen in sufficient order to pass your local health inspector's annual visit (and dealing with 47 different pieces of paper related to that), being able to juggle priorities like "Do I take out a bank loan to build a new brick-oven, which will make the pizza taste better, in the knowledge that this will commit $3,000 of my cash flow every month for the next 3 years, or do I hire an extra cook?", sourcing ingredients such that they're available in quantity and quality every day for a fairly consistent price, setting prices such that they're locally competitive for your chosen clientele but generate a healthy gross margin for the business, understanding why a healthy gross margin really doesn't imply a healthy net margin and that the rent still needs to get paid, keeping good-enough records such that you know whether your business is dying before you can't make payroll and such that you can provide a reasonably accurate picture of accounts for the taxation authorities every year, balancing 50% off medium pizza promotions with the desire to not cannibalize the business of your regulars, etc etc, and by the way tomato sauce should be tangy but not sour and cheese should melt with just the faintest whisp of a crust on it.

Do you want to write software for a living? Google is hiring. Do you want to run a software business? Godspeed. Software is now 10% of your working life.

How to handle mismanagement?

The way I prefer to think of it is: it is not your job to protect people (particularly senior management) from the consequences of their decisions. Make your decisions in your own best interest; it is up to the organization to make sure that your interest aligns with theirs.

Google used to have a severe problem where code refactoring & maintenance was not rewarded in performance reviews while launches were highly regarded, which led to the effect of everybody trying to launch things as fast as possible and nobody cleaning up the messes left behind. Eventually launches started getting slowed down, Larry started asking "Why can't we have nice things?", and everybody responded "Because you've been paying us to rack up technical debt." As a result, teams were formed with the express purpose of code health & maintenance, those teams that were already working on those goals got more visibility, and refactoring contributions started counting for something in perf. Moreover, many ex-Googlers who were fed up with the situation went to Facebook and, I've heard, instituted a culture there where grungy engineering maintenance is valued by your peers.

None of this would've happened if people had just heroically fallen on their own sword and burnt out doing work nobody cared about. Sometimes it takes highly visible consequences before people with decision-making power realize there's a problem and start correcting it. If those consequences never happen, they'll keep believing it's not a problem and won't pay much attention to it.

Some downsides of immutability

People who aren’t exactly lying

It took me too long to figure this out. There are some people who truly, and passionately, believe something they say to you, but realistically they personally can't make it happen, so you can't really bank on that 'promise.'

I used to think those people were lying to take advantage, but as I've gotten older I have come to recognize that these 'yes' people get promoted a lot. And for some of them, they really do believe what they are saying.

As an engineer I've found that once I can 'calibrate' someone's 'yes-ness' I can then work with them, understanding that they only make 'wishful' commitments rather than 'reasoned' commitments.

So when someone, like Steve Jobs, says "we're going to make it an open standard!", my first question then is "Great, I've got your support in making this an open standard, so I can count on you to wield your positional influence to aid me when folks line up against that effort, right?" If the answer to that question is no, then they were lying.

The difference is subtle of course but important. Steve clearly doesn't go to standards meetings and vote etc, but if Manager Bob gets push back from accounting that he's going to exceed his travel budget by sending 5 guys to the Open Video Chat Working Group which is championing the Facetime protocol as an open standard, then Manager Bob goes to Steve and says "I need your help here, these 5 guys are needed to argue this standard and keep it from being turned into a turd by the 5 guys from Google who are going to attend." and then Steve whips off a one-liner to accounting that says "Get off this guy's back we need this." Then it's all good. If on the other hand he says "We gotta save money, send one guy." well in that case I'm more sympathetic to the accusation of prevarication.

What makes engineers productive?

For those who work inside Google, it's well worth it to look at Jeff & Sanjay's commit history and code review dashboard. They aren't actually all that much more productive in terms of code written than a decent SWE3 who knows his codebase.

The reason they have a reputation as rockstars is that they can apply this productivity to things that really matter; they're able to pick out the really important parts of the problem and then focus their efforts there, so that the end result ends up being much more impactful than what the SWE3 wrote. The SWE3 may spend his time writing a bunch of unit tests that catch bugs that wouldn't really have happened anyway, or migrating from one system to another that isn't really a large improvement, or going down an architectural dead end that'll just have to be rewritten later. Jeff or Sanjay (or any of the other folks operating at that level) will spend their time running a proposed API by clients to ensure it meets their needs, or measuring the performance of subsystems so they fully understand their building blocks, or mentally simulating the operation of the system before building it so they can rapidly test out alternatives. They don't actually write more code than a junior developer (oftentimes, they write less), but the code they do write gives them more information, which helps ensure that they write the right code.

I feel like this point needs to be stressed a whole lot more than it is, as there's a whole mythology that's grown up around 10x developers that's not all that helpful. In particular, people need to realize that these developers rapidly become 1x developers (or worse) if you don't let them make their own architectural choices - the reason they're excellent in the first place is because they know how to determine if certain work is going to be useless and avoid doing it in the first place. If you dictate that they do it anyway, they're going to be just as slow as any other developer.

Do the work, be a hero

I got the hero speech too, once. If anyone ever mentions the word "heroic" again and there isn't a burning building involved, I will start looking for new employment immediately. It seems that in our industry it is universally a code word for "We're about to exploit you because the project is understaffed and under budgeted for time and that is exactly as we planned it so you'd better cowboy up."

Maybe it is different if you're writing Quake, but I guarantee you the 43rd best selling game that year also had programmers "encouraged onwards" by tales of the glory that awaited after the death march.

Learning English from watching movies

I was once speaking to a good friend of mine here, in English.
"Do you want to go out for yakitori?"
"Go fuck yourself!"
"... switches to Japanese Have I recently done anything very major to offend you?"
"No, of course not."
"Oh, OK, I was worried. So that phrase, that's something you would only say under extreme distress when you had maximal desire to offend me, or I suppose you could use it jokingly between friends, but neither you nor I generally talk that way."
"I learned it from a movie. I thought it meant ‘No.’"

Being smart and getting things done

True story: I went to a talk given by one of the 'engineering elders' (these were low Emp# engineers who were considered quite successful and were to be emulated by the workers :-) This person stated that when they came to work at Google they were given the XYZ system to work on (sadly I'm prevented from disclosing the actual system). They remarked how they spent a couple of days looking over the system, which was complicated and creaky; they couldn't figure it out, so they wrote a new system. Yup, and they committed that. This person is a coding God are they not? (sarcasm) I asked what happened to the old system (I knew but was interested in their perspective) and they said it was still around because a few things still used it, but (quite proudly) nearly everything else had moved to their new system.

So if you were reading carefully, this person created a new system to 'replace' an existing system which they didn't understand and got nearly everyone to move to the new system. That made them uber because they got something big to put on their internal resume, and a whole crapload of folks had to write new code to adapt from the old system to this new system, which imperfectly recreated the old system (remember they didn't understand the original), such that those parts of the system that relied on the more obscure bits had yet to be converted (because nobody understood either the dependent code or the old system, apparently).

Was this person smart? Blindingly brilliant according to some of their peers. Did they get things done? Hell yes, they wrote the replacement for the XYZ system from scratch! One person? Can you imagine? Would I hire them? Not unless they were the last qualified person in my pool and I was out of time.

That anecdote encapsulates the dangerous side of smart people who get things done.

Public speaking tips

Some kids grow up on football. I grew up on public speaking (as behavioral therapy for a speech impediment, actually). If you want to get radically better in a hurry:

Too long to excerpt. See the link.

A reason a company can be a bad fit

I can relate to this, but I can also relate to the other side of the question. Sometimes it isn't me, it's you. Take someone who gets things done and suddenly in your organization they aren't delivering. Could be them, but it could also be you.

I had this experience working at Google. I had a horrible time getting anything done there. Now, I spent a bit of time evaluating that, since it had never been the case in my career up to that point that I was unable to move the ball forward, and I really wanted to understand why. The short answer was that Google had developed a number of people who spent much, if not all, of their time preventing change. It took me a while to figure out what motivated someone to be anti-change.

The fear was about risk and safety. Folks moved around a lot, so you had people in charge of systems they didn't build, didn't understand all the moving parts of, and were apt to get a poor rating if those systems broke. When dealing with people in that situation one could either educate them and bring them along, or steamroll over them. Education takes time, and during that time the 'teacher' doesn't get anything done. This favors steamrolling evolutionarily :-)

So you can hire someone who gets stuff done, but if getting stuff done in your organization requires them to be an asshole, and they aren't up for that, well they aren't going to be nearly as successful as you would like them to be.

What working at Google is like

I can tell that this was written by an outsider, because it focuses on the perks and rehashes several cliches that have made their way into the popular media but aren't all that accurate.

Most Googlers will tell you that the best thing about working there is having the ability to work on really hard problems, with really smart coworkers, and lots of resources at your disposal. I remember asking my interviewer whether I could use things like Google's index if I had a cool 20% idea, and he was like "Sure. That's encouraged. Oftentimes I'll just grab 4000 or so machines and run a MapReduce to test out some hypothesis." My phone screener, when I asked him what it was like to work there, said "It's a place where really smart people go to be average," which has turned out to be both true and honestly one of the best things that I've gained from working there.

NSA vs. Black Hat

This entire event was a staged press op. Keith Alexander is a ~30 year veteran of SIGINT, electronic warfare, and intelligence, and a Four-Star US Army General --- which is a bigger deal than you probably think it is. He's a spy chief in the truest sense and a master politician. Anyone who thinks he walked into that conference hall in Caesars without a near perfect forecast of the outcome of the speech is kidding themselves.

Heckling Alexander played right into the strategy. It gave him an opportunity to look reasonable compared to his detractors, and, more generally (and alarmingly), to have the NSA look more reasonable compared to opponents of NSA surveillance. It allowed him to "split the vote" with audience reactions, getting people who probably have serious misgivings about NSA programs to applaud his calm and graceful handling of shouted insults; many of those people probably applauded simply to protest the hecklers, who after all were making it harder for them to follow what Alexander was trying to say.

There was no serious Q&A on offer at the keynote. The questions were pre-screened; all attendees could do was vote on them. There was no possibility that anything would come of this speech other than an effectively unchallenged full-throated defense of the NSA's programs.

Are deadlines necessary?

Interestingly one of the things that I found most amazing when I was working for Google was a nearly total inability to grasp the concept of 'deadline.' For so many years the company just shipped it by committing it to the release branch and having the code deploy over the course of a small number of weeks to the 'fleet'.

Sure there were 'processes', like "Canary it in some cluster and watch the results for a few weeks before turning it loose on the world." but being completely vertically integrated is a unique sort of situation.

Debugging on Windows vs. Linux

Being a very experienced game developer who tried to switch to Linux, I have posted about this before (and gotten flamed heavily by reactionary Linux people).

The main reason is that debugging is terrible on Linux. gdb is just bad to use, and all these IDEs that try to interface with gdb to "improve" it do it badly (mainly because gdb itself is not good at being interfaced with). Someone needs to nuke this site from orbit and build a new debugger from scratch, and provide a library-style API that IDEs can use to inspect executables in rich and subtle ways.

Productivity is crucial. If the lack of a reasonable debugging environment costs me even 5% of my productivity, that is too much, because games take so much work to make. At the end of a project, I just don't have 5% effort left any more. It requires everything. (But the current Linux situation is way more than a 5% productivity drain. I don't know exactly what it is, but if I were to guess, I would say it is something like 20%.)

What happens when you become rich?

What is interesting is that people don't even know they have a complex about money until they get "rich." I've watched many people, perhaps a hundred, go from "working to pay the bills" to "holy crap I can pay all my current and possibly my future bills with the money I now have." That doesn't include the guy who lived in our neighborhood and won the CA lottery one year.

It affects people in ways they don't expect. If it's sudden (like a lottery win or a sudden IPO surge) it can be difficult to process. But it is an important thing to realize that one is processing an exceptional event. Like having a loved one die or a spouse suddenly divorcing you.

Not everyone feels "guilty", not everyone feels "smug." A lot of millionaires and billionaires in the Bay Area are outwardly unchanged. But the bottom line is that the emotion comes from the cognitive dissonance between values and reality. What do you value? What is reality?

One woman I knew at Google was massively conflicted when she started work at Google. She always felt that she would help the homeless folks she saw, if she had more money than she needed. Upon becoming rich (on Google stock value), she found that she wanted to save the money she had for her future kids' education and needs. Was she a bad person? Before? After? Do your kids hate you if you give away their college education to the local foodbank? Do your peers hate you because you could close the current food gap at the foodbank and you don't?

Microsoft’s Skype acquisition

This is Microsoft's ICQ moment. Overpaying for a company at the moment when its core competency is becoming a commodity. Does anyone have the slightest bit of loyalty to Skype? Of course not. They're going to use whichever video chat comes built into their SmartPhone, tablet, computer, etc. They're going to use FaceBook's eventual video chat service or something Google offers. No one is going to actively seek out Skype when so many alternatives exist and are deeply integrated into the products/services they already use. Certainly no one is going to buy a Microsoft product simply because it has Skype integration. Who cares if it's FaceTime, FaceBook Video Chat, Google Video Chat? It's all the same to the user.

With $7B they should have just given away about 15 million Windows Mobile phones in the form of an epic PR stunt. It's not a bad product -- they just need to make people realize it exists. If they want to flush money down the toilet they might as well engage users in the process right?

What happened to Google Fiber?

I worked briefly on the Fiber team when it was very young (basically from 2 weeks before to 2 weeks after launch - I was on loan from Search specifically so that they could hit their launch goals). The bottleneck when I was there was local government regulations, and in fact Kansas City was chosen because it had a unified city/county/utility regulatory authority that was very favorable to Google. To lay fiber to the home, you either need rights-of-way on the utility poles (which are owned by Google's competitors) or you need permission to dig up streets (which requires a mess of permitting from the city government). In either case, the cable & phone companies were in very tight with local regulators, and so you had hostile gatekeepers whose approval you absolutely needed.

The technology was awesome (1G Internet and HDTV!), the software all worked great, and the economics of hiring contractors to lay the fiber itself actually worked out. The big problem was regulatory capture.

With Uber & AirBnB's success in hindsight, I'd say that the way to crack the ISP business is to provide your customers with the tools to break the law en masse. For example, you could imagine an ISP startup that basically says "Here's a box, a wire, and a map of other customers' locations. Plug into their jack, and if you can convince others to plug into yours, we'll give you a discount on your monthly bill based on how many you sign up." But Google in general is not willing to break laws - they'll go right up to the boundary of what the law allows, but if a regulatory agency says "No, you can't do that", they won't do it rather than fight the agency.

Indeed, Fiber is being phased out in favor of Google's acquisition of WebPass, which does basically exactly that but with wireless instead of fiber. WebPass only requires the building owner's consent, and leaves the city out of it.

What it's like to talk at Microsoft's TechEd

I've spoken at TechEds in the US and Europe, and been in the top 10 for attendee feedback twice.

I'd never speak at TechEd again, and I told Microsoft the same thing, same reasons. The event staff is overly demanding and inconsiderate of speaker time. They repeatedly dragged me into mandatory virtual and in-person meetings to cover inane details that should have been covered via email. They mandated the color of pants speakers wore. Just ridiculously micromanaged.

Why did Hertz suddenly become so flaky?

Hertz laid off nearly the entirety of their rank and file IT staff earlier this year.

In order to receive our severance, we were forced to train our IBM replacements, who were in India. Hertz's strategy of IBM and Austerity is the new SMT's solution for a balance sheet that's in shambles, yet they have rewarded themselves by increasing executive compensation 35% over the prior year, including a $6 million bonus to the CIO.

I personally landed in an Alphabet company, received a giant raise, and now I get to work on really amazing stuff, so I'm doing fine. But to this day I'm sad to think how our once-amazing Hertz team, staffed with really smart people, led by the best boss I ever had, and really driving the innovation at Hertz, was just thrown away like yesterday's garbage.

Before startups put clauses in contracts forbidding sales, they sometimes blocked sales via backchannel communications

Don't count on definitely being able to sell the stock to finance the taxes. I left after seven years in very good standing (I believed) but when I went to sell the deal was shut down [1]. Luckily I had a backup plan and I was ok [2].

[1] Had a handshake deal with an investor in the company, then the investor went silent on me. When I followed up he said the deal was "just much too small." I reached out to the company for help, and they said they'd actually told him not to buy from me. I never would have known if they hadn't decided to tell me for some reason. The takeaway is that the markets for private company stock tend to be small, and the buyers care more about their relationships with the company than they do about having your shares. That's true even if the stock terms allow them to buy, and they might not.

An Amazon pilot program designed to reduce the cost of interviewing

I took the first test just like the OP, the logical reasoning part seemed kind of irrelevant and a waste of time for me. That was nothing compared to the second online test.

The environment of the second test was like a scenario out of Black Mirror. Not only did they want to have the webcam and microphone on the entire time, I also had to install their custom software so the proctors could monitor my screen and control my computer. They opened up the macOS system preferences so they could disable all shortcuts to take screenshots, and they also manually closed all the background services I had running (even f.lux!).

Then they asked me to pick up my laptop and show them around my room with the webcam. They specifically asked to see the contents of my desk and the walls and ceiling of my room. I had some pencil and paper on my desk to use as scratch paper for the obvious reasons and they told me that wasn't allowed. Obviously that made me a little upset because I use it to sketch out examples and concepts. They also saw my phone on the desk and asked me to put it out of arm's reach.

After that they told me I couldn't leave the room until the 5 minute bathroom break allowed half-way through the test. I had forgotten to tell my roommate I was taking this test and he was making a bit of a ruckus playing L4D2 online (obviously a bit distracting). I asked the proctor if I could briefly leave the room to ask him to quiet down. They said I couldn't leave until the bathroom break so there was nothing I could do. Later on, I was busy thinking about a problem and had adjusted how I was sitting in my chair and moved my face slightly out of the camera's view. The proctor messaged me again telling me to move so they could see my entire face.

Amazon interviews, part 2

The first part of the interview was exactly like the linked experience. No coding questions, just reasoning. The second part, I had to use ProctorU instead of Proctorio. Personally, I thought the experience was super weird but understandable (I'll get to that later): somebody watched me through my webcam the entire time with my microphone on. They needed to check my ID before the test. They needed me to show them the entire room I was in (which was my bedroom). My desktop computer was on behind my laptop so I turned off my computer (I don't remember if I offered to or if they asked me to), but they also asked me to cover my monitors up with something, which I thought was silly after I turned them off, so I covered them with a towel. They then used LogMeIn to remote into my machine so they could check running programs. I quit all my personal chat programs and pretty much only had the Chrome window running.

...

I didn't talk to a real person who actually worked at Amazon (by email or through webcam) until I received an offer.

What's getting acquired by Oracle like?

[M]y company got acquired by Oracle. We thought things would be OK. Nothing changed immediately. Slowly but surely they turned the screws. 5 year laptop replacement policy. You get the corporate standard laptop and you'll like it. Sales? Oh those guys can buy new Macs every two years, they get whatever they want. Then you understand where Software Engineers rank in the company hierarchy. Oracle took the average price of our product from $100k to $5 million for the same size deals. Our sales went from $5-7m to more than $40m with no increase in engineering headcount (team of 15). Didn't matter when bonus time came, we all got stack-ranked and some people got nothing. As a top performer I got a few options, worth maybe $5k.

Oracle exists to extract the maximum amount of money possible from the Fortune 1000. Everyone else can fuck off. Your impotent internet rage is meaningless. If it doesn't piss off the CTO of $X then it doesn't matter. If it gets that CTO to cut a bigger check then it will be embraced with extreme enthusiasm.

The culture wears down a lot (but not all) of the good people, who then leave. What's left is a lot of mediocrity and architecture astronauts. The more complex the product the better - it means extra consulting dollars!

My relative works at a business dependent on Micros. When Oracle announced the acquisition I told them to start on the backup plan immediately because Oracle was going to screw them sooner or later. A few years on and that is proving true: Oracle is slowly excising the Micros dealers and ISVs out of the picture, gobbling up all the revenue while hiking prices.

How do you avoid hiring developers who do negative work?

In practice, we have to face that our quest for more stringent hiring standards is not really selecting the best, but just selecting fewer people, in ways that might, or might not, have anything to do with being good at a job. Let's go through a few examples in my career:

A guy that was the most prolific developer I have ever seen: He'd rewrite entire subsystems over a weekend. The problem is that said subsystems were not necessarily better than what they replaced, trading bugs for bugs, and anyone that wanted to work on them would have to relearn that programmer's idiosyncrasies of the week. He easily cost his project 12 man-months of work in 4 months, the length of time it took for management to realize that he had to be let go.

A company's big UI framework was quite broken, and a new developer came in and fixed it. Great, right? Well, he was handed code-review veto over changes to the framework, and his standards and his demeanor made people stop contributing after two or three attempts. In practice, the framework died as people found it antiquated, and they decided to build a new one: Well, the same developer was tasked with building the new framework, which was made mandatory for 200+ developers to use. Total contribution was clearly negative.

A developer that was very fast, and wrote working code, had been managing a rather large 500K line codebase, and received some developers as help. He didn't believe in internal documentation or in keeping interfaces stable. He also didn't believe in writing code that wasn't brittle, or in unit tests: Code changes from the new developers often broke things, the veteran would come in, fix everything in the middle of the emergency, and look absolutely great, while all the other developers looked to management as if they were incompetent. They were not, however: they were quite successful when moved to other teams. It just happens that the original developer made sure nobody else could touch anything. Eventually, the experiment was retried after the original developer was sent to do other things. It took a few months, but the new replacement team managed to modularize the code, and new people could actually modify the codebase productively.

All of those negative-value developers could probably be very valuable in very specific conditions, and they'd look just fine in a tough job interview. They were still terrible hires. In my experience, if anything, a harder process that demands that people appear smarter or work faster in an interview has the opposite effect of what I'd want: it ends up selecting for people who think less and do things more quickly, building debt faster.

My favorite developers ever all do badly in your typical stringent Silicon Valley interview. They work slower, do more thinking, and consider every line of code they write technical debt. They won't have a million algorithms memorized: They'll go look at sources more often than not, and will spend a lot of time on tests that might as well be documentation. Very few of those traits are positive in an interview, but I think they are vital in creating good teams, yet few interview processes select for them at all.

Linux and the demise of Solaris

I worked on Solaris for over a decade, and for a while it was usually a better choice than Linux, especially due to price/performance (which includes how many instances it takes to run a given workload). It was worth fighting for, and I fought hard. But Linux has now become technically better in just about every way. Out-of-box performance, tuned performance, observability tools, reliability (on patched LTS), scheduling, networking (including TCP feature support), driver support, application support, processor support, debuggers, syscall features, etc. Last I checked, ZFS worked better on Solaris than Linux, but it's an area where Linux has been catching up. I have little hope that Solaris will ever catch up to Linux, and I have even less hope for illumos: Linux now has around 1,000 monthly contributors, whereas illumos has about 15.

In addition to technology advantages, Linux has a community and workforce that's orders of magnitude larger, staff with invested skills (re-education is part of a TCO calculation), companies with invested infrastructure (rewriting automation scripts is also part of TCO), and also much better future employment prospects (a factor that can influence people wanting to work at your company on that OS). Even with my considerable and well-known Solaris expertise, the employment prospects with Solaris are bleak and getting worse every year. With my Linux skills, I can work at awesome companies like Netflix (which I highly recommend), Facebook, Google, SpaceX, etc.

Large technology-focused companies, like Netflix, Facebook, and Google, have the expertise and appetite to make a technology-based OS decision. We have dedicated teams for the OS and kernel with deep expertise. On Netflix's OS team, there are three staff who previously worked at Sun Microsystems and have more Solaris expertise than they do Linux expertise, and I believe you'll find similar people at Facebook and Google as well. And we are choosing Linux.

The choice of an OS includes many factors. If an OS came along that was better, we'd start with a thorough internal investigation, involving microbenchmarks (including an automated suite I wrote), macrobenchmarks (depending on the expected gains), and production testing using canaries. We'd be able to come up with a rough estimate of the cost savings based on price/performance. Most microservices we have run hot in user-level applications (think 99% user time), not the kernel, so it's difficult to find large gains from the OS or kernel. Gains are more likely to come from off-CPU activities, like task scheduling and TCP congestion, and indirect, like NUMA memory placement: all areas where Linux is leading. It would be very difficult to find a large gain by changing the kernel from Linux to something else. Just based on CPU cycles, the target that should have the most attention is Java, not the OS. But let's say that somehow we did find an OS with a significant enough gain: we'd then look at the cost to switch, including retraining staff, rewriting automation software, and how quickly we could find help to resolve issues as they came up. Linux is so widely used that there's a good chance someone else has found an issue, had it fixed in a certain version or documented a workaround.

What's left where Solaris/SmartOS/illumos is better? 1. There's more marketing of the features and people. Linux develops great technologies and has some highly skilled kernel engineers, but I haven't seen any serious effort to market these. Why does Linux need to? And 2. Enterprise support. Large enterprise companies where technology is not their focus (eg, a breakfast cereal company) and who want to outsource these decisions to companies like Oracle and IBM. Oracle still has Solaris enterprise support that I believe is very competitive compared to Linux offerings.

Why wasn't RethinkDB more successful?

I'd argue that where RethinkDB fell down is on a step you don't list, "Understand the context of the problem", which you'd ideally do before figuring out how many people it's a problem for. Their initial idea was a MySQL storage engine for SSDs - the environmental change was that SSD prices were falling rapidly, SSDs have wildly different performance characteristics from disk, and so they figured there was an opportunity to catch the next wave. Only problem is that the biggest corporate buyers of SSDs are gigantic tech companies (eg. Google, Amazon) with large amounts of proprietary software, and so a generic MySQL storage engine isn't going to be useful to them anyway.

Unfortunately they'd already taken funding, built a team, and written a lot of code by the time they found that out, and there's only so far you can pivot when you have an ecosystem like that.

On falsehoods programmers believe about X

This unfortunately follows the conventions of the genre called "Falsehoods programmers believe about X": ...

I honestly think this genre is horrible and counterproductive, even though the writer's intentions are good. It gives no examples, no explanations, no guidelines for proper implementations - just a list of condescending gotchas, showing off the superior intellect and perception of the author.

What does it mean if a company rescinds an offer because you tried to negotiate?

It happens sometimes. Usually it's because of one of two situations:

1) The company was on the fence about wanting you anyway, and negotiating takes you from the "maybe kinda sorta want to work with" to the "don't want to work with" pile.

2) The company is looking for people who don't question authority and don't stick up for their own interests.

Both of these are red flags. It's not really a matter of ethics - they're completely within their rights to withdraw an offer for any reason - but it's a matter of "Would you really want to work there anyway?" For both corporations and individuals, it usually leads to a smoother life if you only surround yourself with people who really value you.

HN comments

I feel like this is every HN discussion about "rates---comma---raising them": a mean-spirited attempt to convince the audience on the site that high rates aren't really possible, because if they were, the person telling you they're possible would be wealthy beyond the dreams of avarice. Once again: Patrick is just offering a more refined and savvy version of advice me and my Matasano friends gave him, and our outcomes are part of the record of a reasonably large public company.

This, by the way, is why I'll never write this kind of end-of-year wrap-up post (and, for the same reasons, why I'll never open source code unless I absolutely have to). It's also a big part of what I'm trying to get my hands around for the Starfighter wrap-up post. When we started Starfighter, everyone said "you're going to have such an amazing time because of all the HN credibility you have". But pretty much every time Starfighter actually came up on HN, I just wanted to hide under a rock. Even when the site is civil, it's still committed to grinding away any joy you take either in accomplishing something neat or even in just sharing something interesting you learned. You could sort of understand an atavistic urge to shit all over someone sharing an interesting experience that was pleasant or impressive. There's a bad Morrissey song about that. But look what happens when you share an interesting story that obviously involved significant unpleasantness and an honest accounting of one's limitations: a giant thread full of people piling on to question your motives and life choices. You can't win.

On the journalistic integrity of Quartz

I was the first person to be interviewed by this journalist (Michael Thomas @curious_founder). He approached me on Twitter to ask questions about digital nomad and remote work life (as I founded Nomad List and have been doing it for years).

I told him it'd be great to see more honest depictions, as most articles are heavily idealized, making it sound all great when it's not necessarily. It's ups and downs (just like regular life, really).

What happened next may surprise you. He wrote a hit piece on me, changing the entire story I told him over Skype into a clickbait article about how digital nomadism doesn't work and how one of the main people doing it for a while (in public) had even settled down and given up altogether.

I didn't settle down. I spent the summer in Amsterdam. Cause you know, it's a nice place! But he needed to say this to make a polarized hit piece with an angle. And that piece went viral, resulting in me having to tell people daily that I hadn't settled down and getting lots of flak. You may understand it doesn't help if your entire startup is about something and a journalist writes a viral piece about how you yourself don't even believe in it anymore. I contacted the journalist and Quartz but they didn't change a thing.

It's great that this was his journalistic breakthrough, but it hurt me in the process.

I'd argue journalists like this are the whole problem we have these days. The articles they write can't be balanced because they need to get pageviews. Every potentially interesting story quickly turns into clickbait. It turned me off from being interviewed ever again. Doing my own PR by posting in the comment sections of Hacker News or Reddit seems like a better idea (also see how Elon Musk does exactly this; it seems smarter).

How did Click and Clack always manage to solve the problem?

Hope this doesn't ruin it for you, but I knew someone who had a problem presented on the show. She called in and reached an answering machine. Someone called her and qualified the problem. Then one of the brothers called and talked to her for a while. Then a few weeks later (there might have been some more calls, I don't know) both brothers called her and talked to her for a while. Her parts of that last call were edited into the radio show so it sounded like she had called and they just figured out the answer on the spot.

Why are so many people down on blockchain?

Blockchain is the world's worst database, created entirely to maintain the reputations of venture capital firms who injected hundreds of millions of dollars into a technology whose core defining insight was "You can improve on a Ponzi scam by making it self-organizing and distributed; that gets vastly more distribution, reduces the single point of failure, and makes it censorship-resistant."

That's more robust than I usually phrase things on HN, but you did ask. In slightly more detail:

Databases are wonderful things. We have a number which are actually employed in production, at a variety of institutions. They run the world. Meaningful applications run on top of Postgres, MySQL, Oracle, etc etc.

No meaningful applications run on top of "blockchain", because it is a marketing term. You cannot install blockchain just like you cannot install database. (Database sounds much cooler without the definite article, too.) If you pick a particular instantiation of a blockchain-style database, it is a horrible, horrible database.

Can I pick on Bitcoin? Let me pick on Bitcoin. Bitcoin is claimed to be a global financial network and ready for production right now. Bitcoin cannot sustain 5 transactions per second, worldwide.

You might be sensibly interested in Bitcoin governance if, for some reason, you wanted to use Bitcoin. Bitcoin is a software artifact; it matters to users who makes changes to it and by what process. (Bitcoin is a software artifact, not a protocol, even though the Bitcoin community will tell you differently. There is a single C++ codebase which matters. It is essentially impossible to interoperate with Bitcoin without bugs-and-all replicating that codebase.) Bitcoin governance is captured by approximately 5 people. This is a robust claim and requires extraordinary evidence.

Ordinary evidence would be pointing you, in a handwavy fashion, at the depth of acrimony with regards to raising the block size, which would let Bitcoin scale to the commanding heights of 10 or, nay, 100 transactions per second worldwide.

Extraordinary evidence might be pointing you to the time where the entire Bitcoin network was de-facto shut down based on the consensus of N people in an IRC channel. c.f. https://news.ycombinator.com/item?id=9320989 This was back in 2013. Long story short: a software update went awry so they rolled back global state by a few hours by getting the right two people to agree to it on a Skype call.

But let's get back to discussing that sole technical artifact. Bitcoin has a higher cost-to-value ratio than almost any technology conceivable; the cost to date is the market capitalization of Bitcoin. Because Bitcoin enters through a seigniorage mechanism, every Bitcoin existing was minted as compensation for "securing the integrity of the blockchain" (by doing computationally expensive makework).

This cost is high. Today, routine maintenance of the Bitcoin network will cost the network approximately $1.5 million. That's on the order of $3 per write on a maximum committed capacity basis. It will cost another $1.5 million tomorrow, exchange rate depending.
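
The "$3 per write" figure is easy to check from the two numbers already given in this comment (the ~5 transactions/second capacity cap and ~$1.5 million/day in maintenance); the sketch below is just that arithmetic, not a measurement of the live network:

    # Back-of-envelope check using the figures from the comment above.
    tx_per_second = 5                 # claimed maximum committed capacity
    daily_cost_usd = 1_500_000        # claimed daily cost of "routine maintenance"

    writes_per_day = tx_per_second * 60 * 60 * 24    # 432,000 writes/day at best
    cost_per_write = daily_cost_usd / writes_per_day
    print(round(cost_per_write, 2))                  # ~3.47, i.e. "on the order of $3 per write"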

(Bitcoin has successfully shifted much of the cost of operating its database to speculators rather than people who actually use Bitcoin for transaction processing. That game of musical chairs has gone on for a while.)

Bitcoin has some properties which one does not associate with many databases. One is that write acknowledgments average 5 minutes. Another is that they can stop, non-deterministically, for more than an hour at a time, worldwide, for all users simultaneously. This behavior is by design.

How big is the proprietary database market?

  1. The database market is NOT closed. In fact, we are in a database boom. Since 2009 (the year RethinkDB was founded), there have been over 100 production grade databases released in the market. These span document stores, Key/Value, time series, MPP, relational, in-memory, and the ever increasing "multi model databases."

  2. Since 2009, over $600 million (publicly announced) has been invested in these database companies (RethinkDB represents $12.2M, or about 2%). That's aside from money invested in the bigger established databases.

  3. Almost all of the companies that have raised funding in this period generate revenue from one or more of the following areas:

a) exclusive hosting (meaning AWS et al. do not offer this product) b) multi-node/cluster support c) product enhancements d) enterprise support

Looking at each of the above revenue paths as executed by RethinkDB:

a) RethinkDB never offered a hosted solution. Compose offered a hosted solution in October of 2014. b) RethinkDB didn't support true high availability until the 2.1 release in August 2015. It was released as open source and to my knowledge was not monetized. c/d) I've heard that an enterprise version of RethinkDB was offered near the end. Enterprise Support is, empirically, a bad approach for a venture backed company. I don't know that RethinkDB ever took this avenue seriously. Correct me if I am wrong.

A model that is not popular among RECENT databases but is popular among traditional databases is a standard licensing model (e.g. Oracle, Microsoft SQL Server). Even these are becoming more rare with the advent of A, but never underestimate the licensing market.

Again, this is complete conjecture, but I believe RethinkDB failed for a few reasons:

1) not pursuing one of the above revenue models early enough. This has serious effects on the ordering of feature enhancements (for instance, the HA released in 2015 could have been released earlier at a premium or to help facilitate a hosted solution).

2) incorrect priority of enhancements:

2a) general database performance never reached the point it needed to. RethinkDB struggled with both write and read performance well into 2015. There was no clear value add in this area compared to many write or read focused databases released around this time.

2b) lack of (proper) High Availability for too long.

2c) ReQL was not necessary - most developers use ORMs when interacting with SQL. When you venture into analytical queries, there actually seems to be great effort to provide SQL: look at the number of projects or companies that exist to bring SQL to databases and filesystems that don't support it (Hive, Pig, Slam Data, etc).

2d) push notifications. This has not been demonstrated to be a clear market need yet. There are a small handful of companies promoting development stacks around this, but no database company is doing the same.

2e) lack of focus. What was RethinkDB REALLY good at? It pushed ReQL and joins at first, but it lacked HA until 2015 and struggled with high write or read loads into 2015. It then started to focus on real time notifications. Again, there just aren't many databases focusing on these areas.

My final thought is that RethinkDB didn't raise enough capital. Perhaps this is because of previous points, but without capital, the above can't be corrected. RethinkDB actually raised far less money than basically any other venture backed company in this space during this time.

Again, I've never run a database company so my thoughts are just from an outsider. However, I am the founder of a company that provides database integration products so I monitor this industry like a hawk. I simply don't agree that the database market has been "captured."

I expect to see even bigger growth in databases in the future. I'm happy to share my thoughts about what types of databases are working and where the market needs solutions. Additionally, companies are increasingly relying on third-party cloud services for data they previously captured themselves. Anything from payment processing, order fulfillment, traffic analytics, etc., is now being handled by someone else.

???

How did HN get the commenter base that it has? If you read HN, on any given week, there are at least as many good, substantial, comments as there are posts. This is different from every other modern public news aggregator I can find out there, and I don’t really know what the ingredients are that make HN successful.

For the last couple years (ish?), the moderation regime has been really active in trying to get a good mix of stories on the front page and in tamping down on gratuitously mean comments. But there was a period of years where the moderation could be described as sparse, arbitrary, and capricious, and while there are fewer “bad” comments now, it doesn’t seem like good moderation actually generates more “good” comments.

The ranking scheme seems to penalize posts that have a lot of comments on the theory that flamebait topics will draw a lot of comments. That sometimes prematurely buries stories with good discussion, but much more often, it buries stories that draw pointless flamewars. If you just read HN, it’s hard to see the effect, but if you look at forums that use comments as a positive factor in ranking, the difference is dramatic -- those other forums that boost topics with many comments (presumably on the theory that vigorous discussion should be highlighted) often have content-free flame wars pinned at the top for long periods of time.

Something else that HN does that’s different from most forums is that user flags are weighted very heavily. On reddit, a downvote only cancels out an upvote, which means that flamebait topics that draw a lot of upvotes like “platform X is cancer” or “Y is doing some horrible thing” often get pinned to the top of r/programming for an entire day, since the number of people who don’t want to see that is drowned out by the number of people who upvote outrageous stories. If you read the comments for one of the "X is cancer" posts on r/programming, the top comment will almost inevitably be that the post has no content, that the author of the post is a troll who never posts anything with content, and that we'd be better off with less flamebait by the author at the top of r/programming. But the people who will upvote outrage porn outnumber the people who will downvote it, so that kind of stuff dominates aggregators that use raw votes for ranking. Having flamebait drop off the front page quickly is significant, but it doesn’t seem sufficient to explain why there are so many more well-informed comments on HN than on other forums with roughly similar traffic.
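
To make the contrast concrete, here's a toy model in Python. The weights are invented and neither HN nor reddit publishes its real formula, so treat this purely as an illustration of the two ranking philosophies described above, not as either site's algorithm:

    # Scheme A: rank by raw votes, roughly the situation described for r/programming.
    def raw_vote_score(upvotes, downvotes, comments, flags):
        return upvotes - downvotes

    # Scheme B: penalize heavy comment activity and weight flags much more
    # heavily than downvotes (weights are made up for illustration).
    def penalized_score(upvotes, downvotes, comments, flags,
                        comment_penalty=0.5, flag_weight=10):
        return upvotes - downvotes - comment_penalty * comments - flag_weight * flags

    flamebait = dict(upvotes=900, downvotes=400, comments=700, flags=30)
    solid_post = dict(upvotes=300, downvotes=20, comments=80, flags=0)

    for score in (raw_vote_score, penalized_score):
        print(score.__name__, score(**flamebait), score(**solid_post))
    # Raw votes rank the flamebait first (500 vs. 280); the comment-penalized,
    # flag-weighted score buries it (-150 vs. 240).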

Maybe the answer is that people come to HN for the same reason people come to Silicon Valley -- despite all the downsides, there’s a relatively large concentration of experts there across a wide variety of CS-related disciplines. If that’s true, and it’s a combination of path dependence and network effects, that’s pretty depressing, since that’s not replicable.

If you liked this curated list of comments, you'll probably also like this list of books and this list of blogs.

This is part of an experiment where I write up thoughts quickly, without proofing or editing. Apologies if this is less clear than a normal post. This is probably going to be the last post like this, for now, since, by quickly writing up a post whenever I have something that can be written up quickly, I'm building up a backlog of post ideas that require re-reading the literature in an area or running experiments.

P.S. Please suggest other good comments! By their nature, HN comments are much less discoverable than stories, so there are a lot of great comments that I haven't seen.


  1. if you’re one of those people, you’ve probably already thought of this, but maybe consider, at the margin, blogging more and commenting on HN less? As a result of writing this post, I looked through my old HN comments and noticed that I wrote this comment three years ago, which is another way of stating the second half of this post I wrote recently. Comparing the two, I think the HN comment is substantially better written. But, like most HN comments, it got some traffic while the story was still current and is now buried, and AFAICT, nothing really happened as a result of the comment. The blog post, despite being “worse”, has gotten some people to contact me personally, and I’ve had some good discussions about that and other topics as a result. Additionally, people occasionally contact me about older posts I’ve written; I continue to get interesting stuff in my inbox as a result of having written posts years ago. Writing your comment up as a blog post will almost certainly provide more value to you, and if it gets posted to HN, it will probably provide no less value to HN.

    Steve Yegge has a pretty good list of reasons why you should blog that I won’t recapitulate here. And if you’re writing substantial comments on HN, you’re already doing basically everything you’d need to do to write a blog except that you’re putting the text into a little box on HN instead of into a static site generator or some hosted blogging service. BTW, I’m not just saying this for your benefit: my selfish reason for writing this appeal is that I really want to read the Nathan Kurz blog on low-level optimizations, the Jonathan Tang blog on what it’s like to work at startups vs. big companies, etc.


Sun, 23 Oct 2016 00:00:00 +0000


Programming book list

There are a lot of “12 CS books every programmer must read” lists floating around out there. That's nonsense. The field is too broad for almost any topic to be required reading for all programmers, and even if a topic is that important, people's learning preferences differ too much for any book on that topic to be the best book on the topic for all people.

This is a list of topics and books where I've read the book, am familiar enough with the topic to say what you might get out of learning more about the topic, and have read other books and can say why you'd want to read one book over another.

Algorithms / Data Structures / Complexity

Why should you care? Well, there's the pragmatic argument: even if you never use this stuff in your job, most of the best paying companies will quiz you on this stuff in interviews. On the non-bullshit side of things, I find algorithms to be useful in the same way I find math to be useful. The probability of any particular algorithm being useful for any particular problem is low, but having a general picture of what kinds of problems are solved problems, what kinds of problems are intractable, and when approximations will be effective, is often useful.

McDowell; Cracking the Coding Interview

Some problems and solutions, with explanations, matching the level of questions you see in entry-level interviews at Google, Facebook, Microsoft, etc. I usually recommend this book to people who want to pass interviews but not really learn about algorithms. It has just enough to get by, but doesn't really teach you the why behind anything. If you want to actually learn about algorithms and data structures, see below.

Dasgupta, Papadimitriou, and Vazirani; Algorithms

Everything about this book seems perfect to me. It breaks up algorithms into classes (e.g., divide and conquer or greedy), and teaches you how to recognize what kind of algorithm should be used to solve a particular problem. It has a good selection of topics for an intro book, it's the right length to read over a few weekends, and it has exercises that are appropriate for an intro book. Additionally, it has sub-questions in the middle of chapters to make you reflect on non-obvious ideas to make sure you don't miss anything.

I know some folks don't like it because it's relatively math-y/proof focused. If that's you, you'll probably prefer Skiena.

Skiena; The Algorithm Design Manual

The longer, more comprehensive, more practical, less math-y version of Dasgupta. It's similar in that it attempts to teach you how to identify problems, use the correct algorithm, and give a clear explanation of the algorithm. The book is well motivated with “war stories” that show the impact of algorithms in real world programming.

CLRS; Introduction to Algorithms

This book somehow manages to make it into half of these “N books all programmers must read” lists despite being so comprehensive and rigorous that almost no practitioners actually read the entire thing. It's great as a textbook for an algorithms class, where you get a selection of topics. As a class textbook, it's a nice bonus that it has exercises that are hard enough that they can be used for graduate level classes (about half the exercises from my grad level algorithms class were pulled from CLRS, and the other half were from Kleinberg & Tardos), but this is wildly impractical as a standalone introduction for most people.

Just for example, there's an entire chapter on Van Emde Boas trees. They're really neat -- it's a little surprising that a balanced-tree-like structure with O(lg lg n) insert, delete, as well as find, successor, and predecessor is possible, but a first introduction to algorithms shouldn't include Van Emde Boas trees.
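
That said, if you're curious how the O(lg lg n) bound falls out, here's a minimal Python sketch (insert and successor only, universe restricted to a power of two, no delete, nothing optimized) -- enough to see the square-root-the-universe recursion, and also enough to see why it doesn't belong in a first course:

    class VEB:
        """Minimal van Emde Boas tree over the universe {0, ..., u - 1}, u a power of two."""
        def __init__(self, u):
            self.u = u
            self.min = None  # the minimum is stored "lazily" and never recursed into
            self.max = None
            if u > 2:
                k = u.bit_length() - 1              # u == 2**k
                self.lo_bits = k // 2
                self.cluster_size = 1 << self.lo_bits
                num_clusters = 1 << (k - self.lo_bits)
                self.summary = VEB(num_clusters)    # tracks which clusters are non-empty
                self.clusters = [VEB(self.cluster_size) for _ in range(num_clusters)]

        def _high(self, x): return x >> self.lo_bits
        def _low(self, x):  return x & (self.cluster_size - 1)
        def _index(self, h, l): return (h << self.lo_bits) | l

        def insert(self, x):
            if self.min is None:
                self.min = self.max = x
                return
            if x < self.min:
                x, self.min = self.min, x           # new minimum displaces the old one downward
            if self.u > 2:
                h, l = self._high(x), self._low(x)
                if self.clusters[h].min is None:
                    self.summary.insert(h)          # first element in this cluster
                self.clusters[h].insert(l)
            if x > self.max:
                self.max = x

        def successor(self, x):
            if self.u == 2:
                return 1 if x == 0 and self.max == 1 else None
            if self.min is not None and x < self.min:
                return self.min
            h, l = self._high(x), self._low(x)
            # Only one of the two recursive calls below runs, and each level square-roots
            # the universe size, which is where O(lg lg u) comes from.
            if self.clusters[h].max is not None and l < self.clusters[h].max:
                return self._index(h, self.clusters[h].successor(l))
            nxt = self.summary.successor(h)
            return None if nxt is None else self._index(nxt, self.clusters[nxt].min)

    t = VEB(16)
    for x in (2, 3, 7, 14):
        t.insert(x)
    print(t.successor(3), t.successor(7), t.successor(14))  # 7 14 None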

Kleinberg & Tardos; Algorithm Design

Same comments as for CLRS -- it's widely recommended as an introductory book even though it doesn't make sense as an introductory book. Personally, I found the exposition in Kleinberg to be much easier to follow than in CLRS, but plenty of people find the opposite.

Demaine; Advanced Data Structures

This is a set of lectures and notes and not a book, but if you want a coherent (but not intractably comprehensive) set of material on data structures that you're unlikely to see in most undergraduate courses, this is great. The notes aren't designed to be standalone, so you'll want to watch the videos if you haven't already seen this material.

Okasaki; Purely Functional Data Structures

Fun to work through, but, unlike the other algorithms and data structures books, I've yet to be able to apply anything from this book to a problem domain where performance really matters.

For a couple years after I read this, when someone would tell me that it's not that hard to reason about the performance of purely functional lazy data structures, I'd ask them about part of a proof that stumped me in this book. I'm not talking about some obscure super hard exercise, either. I'm talking about something that's in the main body of the text that the author apparently considered too obvious to explain. No one could explain it. Reasoning about this kind of thing is harder than people often claim.
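
For a flavor of what "purely functional data structure" means in practice, here's the eager two-list FIFO queue in Python (my sketch, not Okasaki's lazy banker's queue -- the lazy version is exactly where the hard amortized reasoning above comes in):

    # A persistent FIFO queue as a pair of immutable tuples (front, back).
    # Every operation returns a new queue and leaves old versions usable.
    EMPTY = ((), ())

    def push(q, x):
        front, back = q
        return (front, (x,) + back)

    def pop(q):
        front, back = q
        if not front:
            front, back = tuple(reversed(back)), ()  # move the back list over when front empties
        if not front:
            raise IndexError("pop from empty queue")
        return front[0], (front[1:], back)

    q = EMPTY
    for i in range(3):
        q = push(q, i)
    x, q = pop(q)
    print(x)  # 0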

Dominus; Higher Order Perl

A gentle introduction to functional programming that happens to use Perl. You could probably work through this book just as easily in Python or Ruby.

If you keep up with what's trendy, this book might seem a bit dated today, but only because so many of the ideas have become mainstream. If you're wondering why you should care about this "functional programming" thing people keep talking about, and some of the slogans you hear don't speak to you or are even off-putting (types are propositions, it's great because it's math, etc.), give this book a chance.
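
As an example of the style, here's my Python rendering of the kind of higher-order function the book builds up (a memoizer: a function that takes a function and returns a cached version of it); this isn't code from the book, just the flavor of it:

    import functools

    def memoize(f):
        """Return a version of f that caches results by argument tuple."""
        cache = {}

        @functools.wraps(f)
        def wrapper(*args):
            if args not in cache:
                cache[args] = f(*args)
            return cache[args]
        return wrapper

    @memoize
    def fib(n):
        return n if n < 2 else fib(n - 1) + fib(n - 2)

    print(fib(80))  # fast, because each subproblem is computed exactly once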

Levitin; Algorithms

I ordered this off amazon after seeing these two blurbs: “Other learning-enhancement features include chapter summaries, hints to the exercises, and a detailed solution manual.” and “Student learning is further supported by exercise hints and chapter summaries.” One of these blurbs is even printed on the book itself, but after getting the book, the only self-study resources I could find were some Yahoo Answers posts asking where you could find hints or solutions.

I ended up picking up Dasgupta instead, which was available off an author's website for free.

Mitzenmacher & Upfal; Probability and Computing: Randomized Algorithms and Probabilistic Analysis

I've probably gotten more mileage out of this than out of any other algorithms book. A lot of randomized algorithms are trivial to port to other applications and can simplify things a lot.

The text has enough of an intro to probability that you don't need to have any probability background. Also, the material on tail bounds (e.g., Chernoff bounds) is useful for a lot of CS theory proofs and isn't covered in the intro probability texts I've seen.
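
As one example of the kind of mileage you can get out of tail bounds, here's a toy Python sketch (my example; it uses Hoeffding's inequality, a close cousin of the Chernoff bounds the book covers) for deciding how many random samples you need to estimate a fraction to within a given error:

    import math
    import random

    def sample_size(eps, delta):
        """Samples needed so the empirical fraction is within eps of the truth with
        probability at least 1 - delta, via Hoeffding: P(|est - p| >= eps) <= 2*exp(-2*n*eps**2)."""
        return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

    def estimate_fraction(population, predicate, eps=0.01, delta=1e-6):
        n = sample_size(eps, delta)
        hits = sum(predicate(random.choice(population)) for _ in range(n))
        return hits / n

    ids = list(range(1_000_000))                          # hypothetical data
    print(sample_size(0.01, 1e-6))                        # 72544 samples, independent of population size
    print(estimate_fraction(ids, lambda x: x % 2 == 0))   # close to 0.5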

Sipser; Introduction to the Theory of Computation

Classic intro to theory of computation. Turing machines, etc. Proofs are often given at an intuitive, “proof sketch”, level of detail. A lot of important results (e.g., Rice's Theorem) are pushed into the exercises, so you really have to do the key exercises. Unfortunately, most of the key exercises don't have solutions, so you can't check your work.

For something with a more modern topic selection, maybe see Arora & Barak.

Bernhardt; Computation

Covers a few theory of computation highlights. The explanations are delightful and I've watched some of the videos more than once just to watch Bernhardt explain things. Targeted at a general programmer audience with no background in CS.

Kearns & Vazirani; An Introduction to Computational Learning Theory

Classic, but dated and riddled with errors, with no errata available. When I wanted to learn this material, I ended up cobbling together notes from a couple of courses, one by Klivans and one by Blum.

Operating Systems

Why should you care? Having a bit of knowledge about operating systems can save days or weeks of debugging time. This is a regular theme on Julia Evans's blog, and I've found the same thing to be true in my own experience. I'm hard pressed to think of anyone who builds practical systems and knows a bit about operating systems who hasn't found their operating systems knowledge to be a time saver. However, there's a bias in who reads operating systems books -- it tends to be people who do related work! It's possible you won't get the same thing out of reading these if you do really high-level stuff.

Silberschatz, Galvin, and Gagne; Operating System Concepts

This was what we used at Wisconsin before the comet book became standard. I guess it's ok. It covers concepts at a high level and hits the major points, but it's lacking in technical depth, details on how things work, advanced topics, and clear exposition.

Cox, Kaashoek, and Morris; xv6

This book is great! It explains how you can actually implement things in a real system, and it comes with its own implementation of an OS that you can play with. By design, the authors favor simple implementations over optimized ones, so the algorithms and data structures used are often quite different than what you see in production systems.

This book goes well when paired with a book that talks about how more modern operating systems work, like Love's Linux Kernel Development or Russinovich's Windows Internals.

Arpaci-Dusseau and Arpaci-Dusseau; Operating Systems: Three Easy Pieces

Nice explanation of a variety of OS topics. Goes into much more detail than any other intro OS book I know of. For example, the chapters on file systems describe the details of multiple real filesystems and discuss the major implementation features of ext4. If I have one criticism about the book, it's that it's very *nix focused. Many things that are described are simply how things are done in *nix and not inherent, but the text mostly doesn't say when something is inherent vs. when it's a *nix implementation detail.

Love; Linux Kernel Development

The title can be a bit misleading -- this is basically a book about how the Linux kernel works: how things fit together, what algorithms and data structures are used, etc. I read the 2nd edition, which is now quite dated. The 3rd edition has some updates, but introduced some errors and inconsistencies, and is still dated (it was published in 2010, and covers 2.6.34). Even so, it's a nice introduction into how a relatively modern operating system works.

The other downside of this book is that the author loses all objectivity any time Linux and Windows are compared. Basically every time they're compared, the author says that Linux has clearly and incontrovertibly made the right choice and that Windows is doing something stupid. On balance, I prefer Linux to Windows, but there are a number of areas where Windows is superior, as well as areas where there's parity but Windows was ahead for years. You'll never find out what they are from this book, though.

Russinovich, Solomon, and Ionescu; Windows Internals

The most comprehensive book about how a modern operating system works. It just happens to be about Windows. Coming from a *nix background, I found this interesting to read just to see the differences.

This is definitely not an intro book, and you should have some knowledge of operating systems before reading this. If you're going to buy a physical copy of this book, you might want to wait until the 7th edition is released (early in 2017).

Downey; The Little Book of Semaphores

Takes a topic that's normally one or two sections in an operating systems textbook and turns it into its own 300 page book. The book is a series of exercises, a bit like The Little Schemer, but with more exposition. It starts by explaining what a semaphore is, and then has a series of exercises that builds up higher-level concurrency primitives.

This book was very helpful when I first started to write threading/concurrency code. I subscribe to the Butler Lampson school of concurrency, which is to say that I prefer to have all the concurrency-related code stuffed into a black box that someone else writes. But sometimes you're stuck writing the black box, and if so, this book has a nice introduction to the style of thinking required to write maybe possibly not totally wrong concurrent code.
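
For a sense of what the exercises build toward, here's one of the classic constructions in Python (a one-shot N-thread barrier built from a counter, a mutex, and a semaphore); this is just illustrative -- Python already ships threading.Barrier:

    import threading

    class SimpleBarrier:
        """One-shot barrier: no thread passes wait() until all n threads have arrived."""
        def __init__(self, n):
            self.n = n
            self.count = 0
            self.mutex = threading.Semaphore(1)
            self.turnstile = threading.Semaphore(0)

        def wait(self):
            with self.mutex:
                self.count += 1
                if self.count == self.n:
                    for _ in range(self.n):   # last thread to arrive lets everyone through
                        self.turnstile.release()
            self.turnstile.acquire()

    barrier = SimpleBarrier(3)

    def worker(i):
        print(f"thread {i} before barrier")
        barrier.wait()
        print(f"thread {i} after barrier")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()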

I wish someone would write a book in this style, but both lower level and higher level. I'd love to see exercises like this, but starting with instruction-level primitives for a couple different architectures with different memory models (say, x86 and Alpha) instead of semaphores. If I'm writing grungy low-level threading code today, I'm overwhelmingly likely to be using c++11 threading primitives, so I'd like something that uses those instead of semaphores, which I might have used if I was writing threading code against the Win32 API. But since that book doesn't exist, this seems like the next best thing.

I've heard that Doug Lea's Concurrent Programming in Java is also quite good, but I've only taken a quick look at it.

Computer architecture

Why should you care? The specific facts and trivia you'll learn will be useful when you're doing low-level performance optimizations, but the real value is learning how to reason about tradeoffs between performance and other factors, whether that's power, cost, size, weight, or something else.

In theory, that kind of reasoning should be taught regardless of specialization, but my experience is that comp arch folks are much more likely to “get” that kind of reasoning and do back of the envelope calculations that will save them from throwing away a 2x or 10x (or 100x) factor in performance for no reason. This sounds obvious, but I can think of multiple production systems at large companies that are giving up 10x to 100x in performance which are operating at a scale where even a 2x difference in performance could pay a VP's salary -- all because people didn't think through the performance implications of their design.
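
The back of the envelope math here is not deep; the point is just to do it. A sketch with entirely made-up (but not crazy) numbers -- every figure below is a hypothetical assumption, not data from any real system:

    # How much is a 2x performance difference worth for a service on a big fleet?
    servers = 10_000               # hypothetical fleet size
    cost_per_server_year = 3_000   # hypothetical $/server/year (amortized hardware + power + opex)
    speedup = 2.0                  # hypothetical factor being left on the table

    fleet_cost = servers * cost_per_server_year
    savings = fleet_cost * (1 - 1 / speedup)
    print(f"fleet: ${fleet_cost:,.0f}/yr; a {speedup}x speedup frees up ${savings:,.0f}/yr")
    # fleet: $30,000,000/yr; a 2.0x speedup frees up $15,000,000/yr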

Hennessy & Patterson; Computer Architecture: A Quantitative Approach

This book teaches you how to do systems design with multiple constraints (e.g., performance, TCO, and power) and how to reason about tradeoffs. It happens to mostly do so using microprocessors and supercomputers as examples.

New editions of this book have substantive additions and you really want the latest version. For example, the latest version added, among other things, a chapter on data center design that answers questions like: how much opex/capex is spent on power, power distribution, and cooling vs. support staff and machines? What's the effect of using lower-power machines on tail latency and result quality (Bing search results are used as an example)? What other factors should you consider when designing a data center?

Assumes some background, but that background is presented in the appendices (which are available online for free).

Shen & Lipasti: Modern Processor Design

Presents most of what you need to know to architect a high performance Pentium Pro (1995) era microprocessor. That's no mean feat, considering the complexity involved in such a processor. Additionally, presents some more advanced ideas and bounds on how much parallelism can be extracted from various workloads (and how you might go about doing such a calculation). Has an unusually large section on value prediction, because the authors invented the concept and it was still hot when the first edition was published.

For pure CPU architecture, this is probably the best book available.

Hill, Jouppi, and Sohi; Readings in Computer Architecture

Read for historical reasons and to see how much better we've gotten at explaining things. For example, compare Amdahl's paper on Amdahl's law (two pages, with a single non-obvious graph presented, and no formulas), vs. the presentation in a modern textbook (one paragraph, one formula, and maybe one graph to clarify, although it's usually clear enough that no extra graph is needed).
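
For reference, the one formula a modern treatment needs is Amdahl's law itself: if a fraction p of the execution time benefits from a speedup of s, the overall speedup is

    \text{speedup} = \frac{1}{(1 - p) + p / s}

so, for example, speeding up 50% of the work by 10x only gets you about a 1.8x overall improvement.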

This seems to be worse the further back you go; since comp arch is a relatively young field, nothing here is really hard to understand. If you want to see a dramatic example of how we've gotten better at explaining things, compare Maxwell's original paper on Maxwell's equations to a modern treatment of the same material. Fun if you like history, but a bit of a slog if you're just trying to learn something.

Algorithmic game theory / auction theory / mechanism design

Why should you care? Some of the world's biggest tech companies run on ad revenue, and those ads are sold through auctions. This field explains how and why they work. Additionally, this material is useful any time you're trying to figure out how to design systems that allocate resources effectively.1

In particular, incentive compatible mechanism design (roughly, how to create systems that provide globally optimal outcomes when people behave in their own selfish best interest) should be required reading for anyone who designs internal incentive systems at companies. If you've ever worked at a large company that "gets" this and one that doesn't, you'll see that the company that doesn't get it has giant piles of money that are basically being lit on fire because the people who set up incentives created systems that are hugely wasteful. This field gives you the background to understand what sorts of mechanisms give you what sorts of outcomes; reading case studies gives you a very long (and entertaining) list of mistakes that can cost millions or even billions of dollars.

Krishna; Auction Theory

The last time I looked, this was the only game in town for a comprehensive, modern, introduction to auction theory. Covers the classic second price auction result in the first chapter, and then moves on to cover risk aversion, bidding rings, interdependent values, multiple auctions, asymmetrical information, and other real-world issues.

Relatively dry. Unlikely to be motivating unless you're already interested in the topic. Requires an understanding of basic probability and calculus.
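
As a taste of the second price auction result mentioned above (my toy Python sketch, not the book's treatment): the highest bidder wins but pays the second-highest bid, which is what makes bidding your true value a dominant strategy -- your bid only affects whether you win, not what you pay.

    def second_price_auction(bids):
        """bids: dict of bidder -> bid. Returns (winner, price paid)."""
        ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
        winner = ranked[0][0]
        price = ranked[1][1] if len(ranked) > 1 else 0
        return winner, price

    print(second_price_auction({"alice": 10, "bob": 7, "carol": 3}))  # ('alice', 7)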

Steiglitz; Snipers, Shills, and Sharks: eBay and Human Behavior

Seems designed as an entertaining introduction to auction theory for the layperson. Requires no mathematical background and relegates math to the small print. Covers maybe 1/10th of the material of Krishna, if that. Fun read.

Cramton, Shoham, and Steinberg; Combinatorial Auctions

Discusses things like how FCC spectrum auctions got to be the way they are and how “bugs” in mechanism design can leave hundreds of millions or billions of dollars on the table. This is one of those books where each chapter is by a different author. Despite that, it still manages to be coherent and I didn't mind reading it straight through. It's self-contained enough that you could probably read this without reading Krishna first, but I wouldn't recommend it.

Shoham and Leyton-Brown; Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations

The title is the worst thing about this book. Otherwise, it's a nice introduction to algorithmic game theory. The book covers basic game theory, auction theory, and other classic topics that CS folks might not already know, and then covers the intersection of CS with these topics. Assumes no particular background in the topic.

Nisan, Roughgarden, Tardos, and Vazirani; Algorithmic Game Theory

A survey of various results in algorithmic game theory. Requires a fair amount of background (consider reading Shoham and Leyton-Brown first). For example, chapter five is basically Devanur, Papadimitriou, Saberi, and Vazirani's JACM paper, Market Equilibrium via a Primal-Dual Algorithm for a Convex Program, with a bit more motivation and some related problems thrown in. The exposition is good and the result is interesting (if you're into that kind of thing), but it's not necessarily what you want if you want to read a book straight through and get an introduction to the field.

Misc

Beyer, Jones, Petoff, and Murphy; Site Reliability Engineering

A description of how Google handles operations. Has the typical Google tone, which is off-putting to a lot of folks with a “traditional” ops background, and assumes that many things can only be done with the SRE model when they can, in fact, be done without going full SRE.

For a much longer description, see this 22 page set of notes on Google's SRE book.

Fowler, Beck, Brant, Opdyke, and Roberts; Refactoring

At the time I read it, it was worth the price of admission for the section on code smells alone. But this book has been so successful that the ideas of refactoring and code smells have become mainstream.

Steve Yegge has a great pitch for this book:

When I read this book for the first time, in October 2003, I felt this horrid cold feeling, the way you might feel if you just realized you've been coming to work for 5 years with your pants down around your ankles. I asked around casually the next day: "Yeah, uh, you've read that, um, Refactoring book, of course, right? Ha, ha, I only ask because I read it a very long time ago, not just now, of course." Only 1 person of 20 I surveyed had read it. Thank goodness all of us had our pants down, not just me.

...

If you're a relatively experienced engineer, you'll recognize 80% or more of the techniques in the book as things you've already figured out and started doing out of habit. But it gives them all names and discusses their pros and cons objectively, which I found very useful. And it debunked two or three practices that I had cherished since my earliest days as a programmer. Don't comment your code? Local variables are the root of all evil? Is this guy a madman? Read it and decide for yourself!

Demarco & Lister, Peopleware

This book seemed convincing when I read it in college. It even had all sorts of studies backing up what they said. No deadlines is better than having deadlines. Offices are better than cubicles. Basically all devs I talk to agree with this stuff.

But virtually every successful company is run the opposite way. Even Microsoft is remodeling buildings from individual offices to open plan layouts. Could it be that all of this stuff just doesn't matter that much? If it really is that important, how come companies that are true believers, like Fog Creek, aren't running roughshod over their competitors?

This book agrees with my biases and I'd love for this book to be right, but the meta evidence makes me want to re-read this with a critical eye and look up primary sources.

Drummond; Renegades of the Empire

This book explains how Microsoft's aggressive culture got to be the way it is today. The intro reads:

Microsoft didn't necessarily hire clones of Gates (although there were plenty on the corporate campus) so much as recruit those who shared some of Gates's more notable traits -- arrogance, aggressiveness, and high intelligence.

Gates is infamous for ridiculing someone's idea as “stupid”, or worse, “random”, just to see how he or she defends a position. This hostile managerial technique invariably spread through the chain of command and created a culture of conflict.

Microsoft nurtures a Darwinian order where resources are often plundered and hoarded for power, wealth, and prestige. A manager who leaves on vacation might return to find his turf raided by a rival and his project put under a different command or canceled altogether.

On interviewing at Microsoft:

“What do you like about Microsoft?” “Bill kicks ass”, St. John said. “I like kicking ass. I enjoy the feeling of killing competitors and dominating markets”.

He was unsure how he was doing and thought he had stumbled when he was then asked if he was a "people person". "No, I think most people are idiots", St. John replied.

These answers were exactly what Microsoft was looking for. They resulted in a strong offer and an aggressive courtship.

On developer evangelism at Microsoft:

At one time, Microsoft evangelists were also usually chartered with disrupting competitors by showing up at their conferences, securing positions on and then tangling standards committees, and trying to influence the media.

"We're the group at Microsoft whose job is to fuck Microsoft's competitors"

Read this book if you're considering a job at Microsoft. Although it's been a long time since the events described in this book, you can still see strains of this culture in Microsoft today.

Bilton; Hatching Twitter

An entertaining book about the backstabbing, mismanagement, and random firings that happened in Twitter's early days. When I say random, I mean that there were instances where critical engineers were allegedly fired so that the "decider" could show other important people that current management was still in charge.

I don't know folks who were at Twitter back then, but I know plenty of folks who were at the next generation of startups in their early days and there are a couple of companies where people had eerily similar experiences. Read this book if you're considering a job at a trendy startup.

Galenson; Old Masters and Young Geniuses

This book is about art and how productivity changes with age, but if its thesis is valid, it probably also applies to programming. Galenson applies statistics to determine the "greatness" of art and then uses that to draw conclusions about how the productivity of artists changes as they age. I don't have time to go over the data in detail, so I'll have to remain skeptical of this until I have more free time, but I think it's interesting reading even for a skeptic.

Math

Why should you care? From a pure ROI perspective, I doubt learning math is “worth it” for 99% of jobs out there. AFAICT, I use math more often than most programmers, and I don't use it all that often. But having the right math background sometimes comes in handy and I really enjoy learning math. YMMV.

Bertsekas; Introduction to Probability

Introductory undergrad text that tends towards intuitive explanations over epsilon-delta rigor. For anyone who cares to do more rigorous derivations, there are some exercises at the back of the book that go into more detail.

Has many exercises with available solutions, making this a good text for self-study.

Ross; A First Course in Probability

This is one of those books where they regularly crank out new editions to make students pay for new copies of the book (this is presently priced at a whopping $174 on Amazon)2. This was the standard text when I took probability at Wisconsin, and I literally cannot think of a single person who found it helpful. Avoid.

Brualdi; Introductory Combinatorics

Brualdi is a great lecturer, one of the best I had in undergrad, but this book was full of errors and not particularly clear. There have been two new editions since I used this book, but according to the Amazon reviews the book still has a lot of errors.

For an alternate introductory text, I've heard good things about Camina & Lewis's book, but I haven't read it myself. Also, Lovasz is a great book on combinatorics, but it's not exactly introductory.

Apostol; Calculus

Volume 1 covers what you'd expect in a calculus I + calculus II book. Volume 2 covers linear algebra and multivariable calculus. It covers linear algebra before multivariable calculus, which makes multi-variable calculus a lot easier to understand.

It also makes a lot of sense from a programming standpoint, since a lot of the value I get out of calculus is its applications to approximations, etc., and that's a lot clearer when taught in this sequence.

This book is probably a rough intro if you don't have a professor or TA to help you along. The Springer SUMS series tends to be pretty good for self-study introductions to various areas, but I haven't read their intro calculus book so I can't personally recommend it.

Stewart; Calculus

Another one of those books where they crank out new editions with trivial changes to make money. This was the standard text for non-honors calculus at Wisconsin, and the result was that I ended up teaching a lot of people to do complex integrals using the methods covered in Apostol, which many folks find much more intuitive.

This book takes the approach that, for a type of problem, you should pattern match to one of many possible formulas and then apply the formula. Apostol is more about teaching you a few tricks and some intuition that you can apply to a wide variety of problems. I'm not sure why you'd buy this unless you were required to for some class.

Hardware basics

Why should you care? People often claim that, to be a good programmer, you have to understand every abstraction you use. That's nonsense. Modern computing is too complicated for any human to have a real full-stack understanding of what's going on. In fact, one reason modern computing can accomplish what it does is that it's possible to be productive without having a deep understanding of much of the stack that sits below the level you're operating at.

That being said, if you're curious about what sits below software, here are a few books that will get you started.

Nisan & Schocken; nand2tetris

If you only want to read one single thing, this should probably be it. It's a “101” level intro that goes down to gates and Boolean logic. As implied by the name, it takes you from NAND gates to a working tetris program.
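
The first few chapters have you build every other gate out of NAND. The book uses its own simple hardware description language for this; here's the same idea as a Python sketch:

    def nand(a, b):
        return 0 if (a and b) else 1

    def not_(a):    return nand(a, a)
    def and_(a, b): return not_(nand(a, b))
    def or_(a, b):  return nand(not_(a), not_(b))
    def xor_(a, b): return and_(or_(a, b), nand(a, b))

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "->", and_(a, b), or_(a, b), xor_(a, b))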

Roth; Fundamentals of Logic Design

Much more detail on gates and logic design than you'll see in nand2tetris. The book is full of exercises and appears to be designed to work for self-study. Note that the link above is to the 5th edition. There are newer editions, but they don't seem to be much improved, have a lot of errors in the new material, and are much more expensive.

Weste, Harris, and Bannerjee; CMOS VLSI Design

One level below Boolean gates, you get to VLSI, a historical acronym (very large scale integration) that doesn't really have any meaning today.

Broader and deeper than the alternatives, with clear exposition. Explores the design space (e.g., the section on adders doesn't just mention a few different types in an ad hoc way, it explores all the tradeoffs you can make). Also, it has both problems and solutions, which makes it great for self-study.
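
To make "points in the design space" concrete, the simplest point for adders is the ripple-carry adder: minimal area, but the carry has to propagate through every bit, so delay grows linearly with width -- it's the baseline the fancier adders are measured against. A hypothetical Python sketch, not anything from the book:

    def full_adder(a, b, carry_in):
        s = a ^ b ^ carry_in
        carry_out = (a & b) | (carry_in & (a ^ b))
        return s, carry_out

    def ripple_carry_add(x_bits, y_bits):
        """Add two little-endian bit lists of equal width."""
        out, carry = [], 0
        for a, b in zip(x_bits, y_bits):
            s, carry = full_adder(a, b, carry)
            out.append(s)
        return out, carry

    print(ripple_carry_add([1, 0, 1, 1], [1, 1, 0, 0]))  # 13 + 3 = 16 -> ([0, 0, 0, 0], 1)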

Kang & Leblebici; CMOS Digital Integrated Circuits

This was the standard text at Wisconsin way back in the day. It was hard enough to follow that the TA basically re-explained pretty much everything necessary for the projects and the exams. I find that it's ok as a reference, but it wasn't a great book to learn from.

Compared to this book, Weste et al. spend a lot more effort talking about tradeoffs in design (e.g., when creating a parallel prefix tree adder, what does it really mean to be at some particular point in the design space?).

Pierret; Semiconductor Device Fundamentals

One level below VLSI, you have how transistors actually work.

Really beautiful explanation of solid state devices. The text nails the fundamentals of what you need to know to really understand this stuff (e.g., band diagrams), and then uses those fundamentals along with clear explanations to give you a good mental model of how different types of junctions and devices work.

Streetman & Bannerjee; Solid State Electronic Devices

Covers the same material as Pierret, but seems to substitute mathematical formulas for the intuitive understanding that Pierret goes for.

Ida; Engineering Electromagnetics

One level below transistors, you have electromagnetics.

Two to three times thicker than other intro texts because it has more worked examples and diagrams. Breaks things down into types of problems and subproblems, making things easy to follow. For self-study, a much gentler introduction than Griffiths or Purcell.

Shanley; Pentium Pro and Pentium II System Architecture

Unlike the other books in this section, this book is about practice instead of theory. It's a bit like Windows Internals, in that it goes into the details of a real, working, system. Topics include hardware bus protocols, how I/O actually works (e.g., APIC), etc.

The problem with a practical introduction is that there's been an exponential increase in complexity ever since the 8080. The further back you go, the easier it is to understand the most important moving parts in the system, and the more irrelevant the knowledge. This book seems like an ok compromise in that the bus and I/O protocols had to handle multiprocessors, and many of the elements that are in modern systems were in these systems, just in a simpler form.

Not covered

Of the books that I've liked, I'd say this captures at most 25% of the software books and 5% of the hardware books. On average, the books that have been left off the list are more specialized. This list is also missing many entire topic areas, like PL, practical books on how to learn languages, networking, etc.

The reasons for leaving off topic areas vary; I don't have any PL books listed because I don't read PL books. I don't have any networking books because, although I've read a couple, I don't know enough about the area to really say how useful the books are. The vast majority of hardware books aren't included because they cover material that you wouldn't care about unless you were a specialist (e.g., Skew-Tolerant Circuit Design or Ultrafast Optics). The same goes for areas like math and CS theory, where I left off a number of books that I think are great but have basically zero probability of being useful in my day-to-day programming life, e.g., Extremal Combinatorics. I also didn't include books I didn't read all or most of, unless I stopped because the book was atrocious. This means that I don't list classics I haven't finished like SICP and The Little Schemer, since those books seem fine and I just didn't finish them for one reason or another.

This list also doesn't include many books on history and culture, like Inside Intel or Masters of Doom. I'll probably add more at some point, but I've been trying an experiment where I try to write more like Julia Evans (stream of consciousness, fewer or no drafts). I'd have to go back and re-read the books I read 10+ years ago to write meaningful comments, which doesn't exactly fit with the experiment. On that note, since this list is from memory and I got rid of almost all of my books a couple years ago, I'm probably forgetting a lot of books that I meant to add.

If you liked this, you might also like Thomas Ptacek's Application Security Reading List or this list of programming blogs, which is written in a similar style.


  1. Also, if you play board games, auction theory explains why fixing game imbalance via an auction mechanism is non-trivial and often makes the game worse. [return]
  2. I talked to the author of one of these books. He griped that the used book market destroys revenue from textbooks after a couple years, and that authors don't get much in royalties, so you have to charge a lot of money and keep producing new editions every couple of years to make money. That griping goes double in cases where a new author picks up a classic book that someone else originally wrote, since the original author often has a much larger share of the royalties than the new author, despite doing no work on the later editions. [return]

Sun, 16 Oct 2016 01:06:34 -0700

