Some Best Practices for Web App Authentication

It’s rare that a significant amount of time will go by without me hearing about yet another leak of user credentials from some well-known site. In the interest of incrementally increasing the security of the web as a whole, here’s a checklist to consider when writing your next (or current!) web application.

The Basics

These are things that are effectively mandatory – if you’re not already doing them, you should probably start doing them as soon as you can.

Use SSL (https) for anything involving authentication.

Ideally, use SSL for everything; it’s the simplest way to make sure you’re using it in all the right places. If it’s not feasible to do that, however, you should at least be using it for anything related to authentication. This isn’t just limited to login pages – it also includes any pages that use your session cookie. If you protect the login request with SSL but then make requests to other pages (which send along the session cookie) over regular HTTP, your site can be attacked using what is known as session hijacking. In other words, any page that deals with a logged-in user probably should be served using SSL.
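
As a belt-and-braces measure, you can also mark your session cookie so that browsers will refuse to send it over plain HTTP at all. Here’s a minimal sketch using Python’s standard library (the "session" name and token value are placeholders; whatever framework you use almost certainly exposes these same standard attributes through its own cookie API):

from http.cookies import SimpleCookie

# Mark the session cookie so browsers only send it over HTTPS ("Secure"),
# and so page JavaScript can't read it ("HttpOnly", limiting XSS-based theft).
cookie = SimpleCookie()
cookie["session"] = "opaque-session-token"  # hypothetical token value
cookie["session"]["secure"] = True
cookie["session"]["httponly"] = True

print(cookie.output())
# e.g. Set-Cookie: session=opaque-session-token; HttpOnly; Secure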

Use POST for anything submitting sensitive information (and in general, don’t write sensitive information into logs).

The URL of a request (which is where form variables are submitted in a GET request) is typically written out to server logs; the contents of a POST body are not (unless you go out of your way to write them out somewhere). Storing hashed passwords in your database won’t buy you much if an attacker can just search through your logs and find the same information there.

On the same note, don’t write out any sensitive information to stdout or a log file. Avoid doing so even when debugging, just in case you happen to accidentally leave in such debugging code.

Hash your passwords with a strong, certified, and slow cryptographic one-way hashing scheme.

This means that you should be using something like bcrypt or a similar well-known hash function that has been designed to be slow. Don’t use MD5 or SHA-1, even though you’ll see a lot of existing code using them. Adding “HMAC” doesn’t magically make the hash slower, either – HMAC is not itself a hash function but a construction that wraps one, and the most common variants wrap fast hash functions (e.g. HMAC-MD5), not slow ones.

Why slow, you ask? Because a legitimate user won’t really mind if it takes a full second to process their login request – but someone trying to brute-force a password hash will have a much more difficult time if it takes a full second of CPU core time per password they try, as opposed to a millisecond.

It should hopefully go without saying, but don’t use 2-way encryption for passwords. There is absolutely no reason why you should ever need to recover the plaintext form of a password once it has been set – if a user forgets their password, you should use a separate process to reset it rather than giving it back to them. (See “Don’t use security questions” below for more on this topic.)

Ideally, design your system in such a way that you can easily change what hash algorithm you’re using – a common scheme is storing something like "algo|salt_string|hash" in the password field. That way, if you ever need to change the algorithm, you can simply require anyone with a stored value that starts with the old algorithm to reset their password, and store the new passwords in the same field (just using the new algorithm prefix). Note that bcrypt has this built in (the resulting strings it generates have metadata embedded, including a randomly-selected salt).
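
To make the idea concrete, here’s a minimal sketch of that scheme, with an iteration count added as an extra field. The recommendation above is bcrypt; purely to keep the sketch free of third-party dependencies, this uses the standard library’s PBKDF2 instead – the algo|salt|hash layout is the point, not the particular algorithm:

import hashlib
import hmac
import os

ALGO = "pbkdf2_sha256"
ITERATIONS = 600_000  # tune so a hash takes a noticeable fraction of a second

def hash_password(password: str) -> str:
    salt = os.urandom(16)  # fresh random salt per password (see the next item)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    # "algo|iterations|salt|hash" -- the prefix tells us how to verify later,
    # and lets us detect values stored under an old, retired algorithm.
    return f"{ALGO}|{ITERATIONS}|{salt.hex()}|{digest.hex()}"

def verify_password(stored: str, password: str) -> bool:
    algo, iterations, salt_hex, digest_hex = stored.split("|")
    if algo != ALGO:
        return False  # old algorithm: force a reset (or rehash on next login)
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(),
                                    bytes.fromhex(salt_hex), int(iterations))
    return hmac.compare_digest(candidate.hex(), digest_hex)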

Salt your hashes, using a separate salt for each string you hash.

Note that this is completely separate from the previous point. Using salts doesn’t make your hash function slower, nor does it make it take any longer to crack an individual password. No, the purpose of salting is to make it so that the time it takes to crack a single password is not the same as the time it takes to crack every password in your database. The thing that salts protect you against is what is known as a rainbow table.

The fundamental idea behind rainbow tables is “if we’re going to try a trillion possible passwords, let’s run the hash function on each of them once, and then check all of the hashes from the database against the result – that way we only have to run the hash function 1 trillion times, rather than N trillion times, where N is the number of hashes in the database.” Proper salting makes it so that a given rainbow table would only be applicable to a single user’s hash, so using one would be no more efficient than brute-forcing each password individually.

Store your sensitive data separate from your regular data.

This right here is an extremely low-hanging-fruit kind of thing. The vast majority of web applications I see have a “users” table in their database with fields like “id, username, password, name, email, join_date, favorite_color, …” and so on. This seems reasonable to most people at first glance; it’s all of the information specifically about a single user.

The problem is that it’s impossible to be perfect forever when it comes to security, and one day, a developer might write something like "SELECT * FROM users WHERE id = 5" and then dump the results directly into JSON to provide data to the AJAX call they just implemented on their new-and-improved profile page. Sure, they’re only using the join_date and favorite_color fields it returns, but anyone who happens to inspect the AJAX response now sees the password hash for the user nestled away there.

This can be really hard to catch in code review because there’s nothing in the new code that explicitly talks about sensitive data – the sensitive data just happened to be swept up along with the rest and dumped out into the (relatively speaking) public eye. If the sensitive data were in a separate table, it’d still be relatively easy to access along with other user data ("SELECT users.*, sensitive.* FROM users JOIN sensitive ON users.id = sensitive.id") but such accesses would be much more explicit and obvious to those writing and reviewing the code.

One might argue that the answer here is “never use SELECT *” but the simple response is that a lot of the time, developers aren’t even writing SQL – perhaps they dumped an ORM object to JSON, which faithfully dumped all of the fields. Putting sensitive data in a separate table is a simple solution to the problem that works no matter how you’re interfacing with the database. Of course, if you’re really concerned with user security you can go a step further and put sensitive data like PII in a completely different database to allow you to better protect it and audit its usage – but that’s a separate topic.

Give your users the freedom to use whatever passwords they want, above minimum security thresholds.

It is reasonable (and a good idea) to require a minimum password length. If you don’t, some fraction of your users will inevitably choose “123” or similar. Likewise, it’s completely reasonable to disallow passwords that are nothing but a dictionary word plus a few characters. These are both the kinds of restrictions that encourage better passwords because they rule out the kinds of passwords that would be fundamentally insecure.

What’s unreasonable, though, are sites which place strict upper limits on passwords. The most notorious occurrence of this is probably “please choose a 4-digit PIN” – but it pops up in plenty of other, less obvious forms. For instance, some sites place a limit on password length, e.g. “passwords must be 8-20 characters.” The minimum is fine, but why limit passwords to 20 characters at most? Assuming you’re hashing the password anyway (see the first item of this list), it’s not any harder to handle a 100-character password than it is to handle a 20-character one. The 28-character “correct horse battery staple” is far harder to randomly guess (assuming the person doing the guessing doesn’t read xkcd) than a 20-character password of a similar nature, but still very easy to remember.

The same also applies to restricting what characters can be used in passwords. It’s okay to require something other than a letter (in order to increase the number of potential characters most users will use in their passwords above just “the alphabet”), but arbitrarily disallowing things like spaces or non-alphanumeric characters is silly – if you’re hashing the password anyway, it won’t make a difference in how you handle passwords, and it will annoy the users who want to use a password that uses characters you don’t allow.

It’s okay to put sanity restrictions in place – for instance, a 1000-character maximum for a password field is reasonable to prevent someone from bogging down your password-handling algorithm by throwing a gigabyte of data at it (no one is going to type out a thousand characters by hand, and the small fraction of people using password managers probably aren’t going to be using quite that many characters either).

Put a reasonable upper limit on access attempts from a single location in a given time frame, and tell the user about failed access attempts.

In this context, “reasonable” is something on the order of 10-100 per 24-hour period. It’s high enough that no real user trying to remember which password they chose for your site is going to run into it, but low enough that an attacker trying to randomly guess passwords won’t make much progress.

Note that these kinds of limitations should be per location – you don’t want to make it possible for someone to lock someone else out of their account just by making a bunch of bogus login attempts. If a large number of different locations all try to access the same account, raise an alert and deal with the problem on a more specific basis.
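
Here’s a sketch of the per-location bookkeeping as a single-process, in-memory version (a real deployment with multiple app servers would keep these counters in a shared store such as a database or cache, but the logic is the same):

import time
from collections import defaultdict, deque

WINDOW = 24 * 60 * 60  # 24 hours
MAX_FAILURES = 50      # somewhere in the 10-100 range discussed above

# (client IP, username) -> timestamps of recent failed login attempts
failures = defaultdict(deque)

def allow_attempt(ip: str, username: str) -> bool:
    attempts = failures[(ip, username)]
    now = time.time()
    while attempts and now - attempts[0] > WINDOW:
        attempts.popleft()  # drop attempts that have aged out of the window
    return len(attempts) < MAX_FAILURES

def record_failure(ip: str, username: str) -> None:
    failures[(ip, username)].append(time.time())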

Keep the user informed about attempts to access their account. This can be as simple as showing a “there were X failed login attempts since your last successful login” message when the user successfully signs in, but even better is to do this out of band – for instance, send the user an email after the 10th failed login attempt in a row. That way the user is made aware of the attack in a timely manner even if they aren’t frequently logging into your site. It also allows the user to figure out what might have happened if the attacker does eventually manage to access their account.

Use a single failure message regardless of whether the username was invalid or the password was wrong.

Attackers can take advantage of separate responses for “invalid user” versus “wrong password” to check whether they have a valid account or not. By simply trying to log in with a bogus password and seeing whether or not the “invalid user” response comes back, they’re able to verify if that particular account exists. They can then try to figure out the password from other sources (perhaps checking to see if passwords are shared with a similarly-named account on a different, previously-compromised site).

Instead, just return the same response no matter what portion of the login attempt failed. For instance, “invalid username or password” is a simple, straightforward error message that doesn’t leak information about whether an account exists.

Don’t use “security questions” – they’re anything but.

Unlike passwords, which (theoretically) are completely arbitrary, security questions are generally the exact opposite – not arbitrary at all, but instead based on specific, immutable, often publicly-available facts about the user. For password resets, just email the user a link with a single-use, randomly-generated reset token that can be used to change their password to something new. The big-name email providers have been handling the problem of account access for much longer than you have; let them do the hard work of dealing with account recovery in the case where a user’s email is also compromised. Keep your end of things simple.

This also applies to any over-the-phone or other out-of-band account support you provide – if a user forgets their password, email them a reset link. If they lost access to their email account as well, let the email provider handle that case. (In the very rare situation where a customer is completely unable to recover their email account, you can handle that on a case-by-case basis, or simply choose to go the “tough luck” route.)

Don’t reset the user’s password for them. Doing so opens the door for an attacker to cause trouble by resetting a legitimate user’s password. Only change a user’s password if they use a valid, not-already-used reset link. (You can also make reset links expire after a certain period of time.)
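
A sketch of the token handling described above – note that only a hash of the token is stored, so a leaked database table doesn’t hand out working reset links. The db.save_reset/load_reset/delete_reset helpers are hypothetical placeholders for whatever persistence layer you use:

import hashlib
import secrets
import time

RESET_TTL = 60 * 60  # reset links expire after an hour

def create_reset_token(db, user_id):
    token = secrets.token_urlsafe(32)  # unguessable, randomly generated
    db.save_reset(user_id,
                  token_hash=hashlib.sha256(token.encode()).hexdigest(),
                  expires_at=time.time() + RESET_TTL)
    return token  # emailed to the user; only the hash ever hits the database

def redeem_reset_token(db, user_id, token) -> bool:
    record = db.load_reset(user_id)
    if record is None or time.time() > record.expires_at:
        return False
    supplied = hashlib.sha256(token.encode()).hexdigest()
    if not secrets.compare_digest(supplied, record.token_hash):
        return False
    db.delete_reset(user_id)  # single-use: invalidate before changing the password
    return True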

Going the Extra Mile

These things are not necessarily mandatory, but are still good ideas. You should at least consider incorporating them.

Two-factor authentication

Two-factor authentication is based around the idea of needing two different things (factors) to log into an account. Generally, the first is “something you know” (usually a password), and the second is “something you have” (typically either a purpose-built device or, nowadays, a smartphone). The reasoning is that it’s far harder to both figure out your password and obtain access to your smartphone than to acquire just one or the other. Your phone might get stolen, but the thief probably doesn’t know your password. Similarly, someone might guess your password, but they probably don’t have your smartphone. Either way, you’d likely notice if your smartphone went missing.

Before smartphones became so ubiquitous, two-factor authentication was a little arduous – it required dedicated devices which could generate one-time passwords (“OTPs”). Nowadays, however, it’s easy to load a simple app onto a smartphone (such as Google Authenticator; full disclosure – I work for Google, but the Authenticator app is open-source and based on open standards) and use it as an OTP-generating device.

On the server side, implementing two-factor support is actually very simple (for instance, about 20 lines of Python). There are even libraries for it in a number of common languages. You’re basically just storing a second, internally-generated password for each user that is used to verify their OTP codes.
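
For a sense of just how simple, here’s a rough standard-library-only sketch of the TOTP scheme (RFC 6238) that apps like Google Authenticator implement – not production code, but the complete core of the algorithm:

import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32: str, at: float, digits: int = 6, step: int = 30) -> str:
    key = base64.b32decode(secret_b32, casefold=True)  # the stored shared secret
    counter = struct.pack(">Q", int(at // step))       # current 30-second slot
    digest = hmac.new(key, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                         # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

def verify_otp(secret_b32: str, submitted: str, window: int = 1) -> bool:
    # Accept codes from adjacent time steps to tolerate a little clock drift.
    now = time.time()
    return any(hmac.compare_digest(totp(secret_b32, now + i * 30), submitted)
               for i in range(-window, window + 1))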

Email verification for unknown access locations

If you don’t utilize two-factor authentication, consider at least requiring out-of-band verification for access from a location that has never been seen before for a given user. This helps prevent malicious access to an account without significantly interfering with regular usage, especially if the user is already using two-factor authentication for their email account.

Account knowledge test for unknown access locations

This is similar to the previous item in that it involves an additional hurdle when accessing an account from a new location for the first time. In this case, you would ask the user to enter some piece of information about the account that a regular user of the account would easily know, but is not publicly derivable from just the login credentials. (For instance, a game’s website might ask for the name of a character on the specified account.) This helps prevent “drive by” intrusions where the attacker is trying out a bunch of stolen credentials (perhaps from another compromised site) but doesn’t actually know anything else about the account.

Don’t force frequent password changes.

In this context, “frequent” means more often than once a year or so. If you make someone change their password too often, it’s quite possible that they’ll resort to less secure means of remembering it, which is worse than not having changed it in the first place. If you do require occasional password changes, give the user warning when their current password is about to expire – ideally out-of-band (e.g. via email) so that they’re aware of the impending expiration even if they’re not actively logging in on a regular basis.

Git, Dotfiles, and Hardlinks

One handy use for Git is keeping track of your dotfiles – all of those configuration files that live inside your home directory like .screenrc, .gitconfig, .vimrc, et cetera.

A typical first approach often winds up looking something like this:

~ $ git init dotfiles
Initialized empty Git repository in /home/aiiane/dotfiles/.git/
~ $ cd dotfiles
~/dotfiles $ ln -s ~/.vimrc .
~/dotfiles $ git add .vimrc
~/dotfiles $ git commit -m "Track my vim config file"

All seems well until you look at the diff for the commit you just made and see something like this:

diff --git a/.vimrc b/.vimrc
new file mode 120000
index 0000000..6ba8edc
--- /dev/null
+++ b/.vimrc
@@ -0,0 +1 @@
+/home/aiiane/.vimrc
\ No newline at end of file

As it turns out, Git knows about symlinks and thus faithfully records the symlink as just that – a symlink – instead of recording the contents of the symlinked file. Oftentimes this leads to the thought of “well, if Git knows symlinks, then I can probably use a hardlink instead.” And thus the second attempt generally continues something like this:

~/dotfiles $ rm .vimrc
~/dotfiles $ ln ~/.vimrc .
~/dotfiles $ git add .vimrc
~/dotfiles $ git commit -m "Hardlink in my config file"

And this seems at first to work – looking at the diff, you see the contents of your config file being added; you push your commit to GitHub or whatever and your config file shows up properly there…

…until you try to have Git update the file. Perhaps you made a change somewhere else and now want to git pull, or perhaps you made a change locally that you decided you didn’t want and so you use git checkout .vimrc in your repo to change it back to your committed version. At this point you discover that while it seems to change the file in your repo, it doesn’t update the file in your home directory – the hardlink has been broken.

The reason for this is that Git never modifies files in the working tree – instead, it unlinks them and then recreates them from scratch. This inherently breaks any hardlinks that might have been present.

The third attempt is generally what winds up working, once you realize that you can indeed use symlinks – just in the other direction. Rather than symlinking files from outside the repository into it, symlink files inside the repository out of it, into your home directory:

~/dotfiles $ rm ~/.vimrc
~/dotfiles $ cd ~
~ $ ln -s dotfiles/.vimrc .

Since most programs tend to handle symlinks transparently (unlike Git), this lets you use Git to update the actual copy of the file in the Git working tree, and have those changes reflected in the path where your programs expect to find it.

An alternative approach

Astute readers may notice that there is another possibility: why not just make your home directory the Git working tree?

~ $ git init
Initialized empty Git repository in /home/aiiane/.git/
~ $ git add .vimrc
~ $ git commit -m "Track vimrc"

While this does work, it has its own drawbacks. The most significant is probably the large number of files in your home directory that you typically don’t want to track, all of which clutter up git status. For those who take the home-as-worktree path, a logical solution to this problem is to just ignore everything by default:

~ $ echo "*" > .gitignore
~ $ git add -f .gitignore
~ $ git commit -m "Ignore all by default"

Then only things you’ve explicitly added (and you’ll need to use git add -f the first time you add each file) will be tracked. If you do this, however, it’s harder to tell at a glance whether or not a given file is being tracked, and new files that you might want to add won’t stand out. There is a way to check, though:

~ $ git ls-files
.gitignore

The other main potential drawback of using your home directory as a working tree is that it effectively requires you to version the same files on every machine – since Git doesn’t really do partial checkouts gracefully, once a particular path in your home directory is added, it’ll be tracked anywhere that pulls the commit which added it. Most of the time this probably won’t be an issue – generally if you want to track something, you want it the same everywhere – but it’s something to bear in mind if you choose this approach.

Interlude

It’s been a while since my last post, and the majority of that can probably be attributed to the job change I’ve gone through over the past few weeks.

More specifically, my last day at Yelp was July 20th. On July 30th, I started at my new job as a Site Reliability Engineer (SRE) for Google. I’m still working out of San Francisco – the team I’m on is based in Google’s SF office – but I’ve been down in Mountain View for the past week as part of the on-boarding process.

More posts with actual content will be coming soon once things settle down. Stay tuned!

Comment Systems and Time-Ordering at Scale

Time is a tricky concept in software development, especially in large distributed systems. Not only do you have to worry about the intricacies of computer representations for time itself (and there are many), but you also have to deal with the fact that it’s nigh-impossible to perfectly sync system clocks across multiple servers. This, of course, results in some interesting problems.

One example of a large distributed system is Google+. Obviously, there are quite a few servers that power the Google+ webapp. When a user adds a comment to a post, one of those many servers will be handling the AJAX request to add the comment, and it’ll be talking to one of a number of servers in Google’s backing data store.

Of course, each of those servers has a separate system clock. If those system clocks are even a second or two off from one another, you can get situations where two comments are timestamped in opposite order from when they were actually submitted – because the first comment was processed by a server that had a lagging system clock. You can see an example of that in this public comment thread. Notice how the 2nd comment displayed is actually a reply to the 3rd, and the 11th a reply to the 12th.

Now, this could be a fun little blog post just pointing that out and calling it done – “here’s the reality you have to deal with when you scale this large.” As it turns out, however, there’s a way this particular problem could be solved. Let’s take a look.

Defining the problem

First, let’s define the part of the problem we actually care about. In general, we don’t actually care about perfect time ordering – if two people post unrelated comments within a second or two of one another, it’s fine to display them in either order. What we actually care about is that replies to comments are shown after the comment to which they are replying. If comment B is a reply to comment A, we want to show comment B after comment A. That’s all we really care about – beyond that, the existing rough time ordering is fine.

Outlining the solution

So how can we do this without requiring incredibly precise time synchronization? The trick here is that we’re going to send some extra info from the client to the server (and store it as part of the newly created comment). Specifically, at the same time as we send the contents of a new comment to the server, we’ll also send along a second integer value – call it “parent” for now – that we’ll store along with the comment. Why is this important? Because it’s what will help us decide ordering when we go to create the comment list.

What will “parent” contain? Simple: the id of the last existing comment the client already loaded. For instance, if the post was loaded with three comments, which had ids 1, 3, and 4 respectively, then the client would send a “parent” value of 4 when it posted a new comment.

When fetching the comment list, we’ll start out how you’d expect: grab a list of all the relevant comments from the data store.  To generate the ordered comment list, we’ll first take all the comments with an empty “parent” value and add them to our result, ordered by timestamp. Next, we’ll take all the comments with a “parent” value corresponding to comments already in our result, and insert them into the result based on two rules:

  1. A comment must be inserted somewhere after its “parent” comment P.
  2. A comment must be inserted after any comments with an earlier timestamp, except where this would violate #1.

We keep repeating this until we have no more comments left to add to the ordering.

Why does this work? It works because of the simple fact that in order for someone to reply to a comment, they first have to read it – which means that their client has to have loaded the older comment. By sending that knowledge to the server, we can create what’s called a partially ordered set out of the comments. With that partial ordering, we can then generate a final ordering that meets our desired goals. The algorithm outlined above is basically a pared-down adaptation of a vector clock.
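
To make the insertion rules concrete, here’s a rough sketch of the algorithm as described – this is my own reading of the two rules above, not the proof-of-concept script mentioned in the next section, and it resolves ties and unplaceable parents by simply falling back to timestamp order:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Comment:
    id: int
    t: int                        # server-assigned timestamp (possibly skewed)
    parent: Optional[int] = None  # id of the last comment the client had loaded

def order_comments(comments: List[Comment]) -> List[Comment]:
    # Comments posted by clients that had loaded nothing are ordered by timestamp.
    result = sorted((c for c in comments if c.parent is None), key=lambda c: c.t)
    remaining = [c for c in comments if c.parent is not None]
    while remaining:
        placed = {c.id for c in result}
        ready = [c for c in remaining if c.parent in placed]
        if not ready:
            ready = remaining[:]  # parent missing (e.g. deleted): degrade gracefully
        for c in sorted(ready, key=lambda c: c.t):
            index = {x.id: i for i, x in enumerate(result)}
            pos = index.get(c.parent, -1) + 1  # rule 1: strictly after the parent
            # Rule 2: also move past already-placed comments with earlier (or
            # equal) timestamps, stopping at the first later one.
            while pos < len(result) and result[pos].t <= c.t:
                pos += 1
            result.insert(pos, c)
            remaining.remove(c)
    return result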

Proof of concept

I’ve created a small proof of concept as a Python script. Here’s the output (notice, in particular, the ordering of #3 and #4):

#0 (t=0)
#1 (t=2)
#2 (t=4) - reply to #1
#3 (t=6)
#4 (t=5) - reply to #3
#5 (t=8)
#6 (t=16)

The script is just a quick example thrown together in under an hour – I make no guarantees about whether it’s optimal code. Feel free to play around with it.

Closing notes

There is a trade-off involved in this: the algorithm that generates this ordering is worst-case O(n²), compared to just sorting the list based on timestamp in O(n log n). For most scenarios, however, this is acceptable – in the case of Google+, it’s extremely unlikely to be a problem, given that G+ comment threads cap out at something like 500 replies. With such a small n, the difference in time is negligible. The sorting can also be done client-side if desired – no extra server processing is required.

Catch-22: Tech Blogging As a Woman

One of the Hacker News comments on Anna Billstrom’s article regarding female users on StackOverflow caught my eye:

We have had article after article claiming it is obvious women are being oppressed in the tech industry. Every week there is one of these. Many make bigoted claims about male engineers, enforcing stereotypes of male geeks I have never actually seen in industry.

Where are the technical articles written by women? There are plenty of contributions complaining about oppression, while attacking men and claiming absurd stereotypes. Where are the technical contributions?

While there are multiple potential issues that could be raised with regards to this comment, I’m going to focus on the second half – the part asking “where are the technical articles written by women?” Well, let’s use an example that’s close at hand – the blog post I wrote about Git submodules. That blog post wound up on Hacker News and also on Reddit.

For now, let’s set aside whatever opinions you have on the content of the post itself – assuming you can at least agree with me that it’s an example of a technical article. In exchange, I’ll refrain from commenting on the merits of the comments I’m quoting here.

If you look through the comment threads on both sites, you’ll notice something: any instances of gendered language that the commenters use assume the author is male. They refer to “him” and talk about what they think “he” should do or their opinion of “his” thoughts on the matter. Some examples:

Hacker News:

It appears that one of the solutions he recommends, git-subtree, is going to be merged into git soon:

Reddit:

He complains about having to branch in both the parent and the sub project, but I have found that not to be a problem at all, and kind of nice in some situations. He’s blowing it way out of proportion.

Reddit:

On the other hand, the author prefaced all his “this is where submodules break” descriptions with “I forgot to run submodule update”, so I have trouble sympathizing.

Sure, I generally don’t go out of my way to make it obvious that I’m a woman on my blog – you’d have to first click over to the About page, and then click through to either my Google+ profile or my Twitter account. Then again, I don’t know many male developers who go out of their way to make it obvious that they’re male, either. After all, supposedly gender is irrelevant on the internet (Hacker News):

Not to mention that we’re on the fucking internet. There is no gender, race, colour or creed here. Everybody pick a neutral username and, hey, presto! Problem solved.

Sure, I could go out of my way to make it obvious that I’m a woman. I could put my name at the top of my blog or on my About page, or I could mention it in passing in my writing. That’s not something a male author has to do, though. Furthermore, doing so results in harassment and having my writing dismissed/trivialized/tokenized because of my gender. Hence why I don’t (or at least, hadn’t until this post).

The catch-22 here is that if I choose to blend in, then people like the commenters above assume that everything they see was written by men, and use that as an excuse to dismiss the concerns of women in the tech industry – because apparently, we don’t contribute and thus don’t matter. If I choose to not blend in, I’m dismissed as bringing up gender when it’s irrelevant (or worse, harassed by people who’ve decided gender is relevant).

It’s no surprise that when the default assumption is “male author” it winds up seeming like male authors write almost everything. So to the original commenter I quoted, here’s your answer: they’re where all the technical articles you read are: on Hacker News, on Reddit, on whatever other blogs and aggregators you frequent. Just because you don’t see them (or perhaps, don’t realize you see them) doesn’t mean they don’t exist.