Category Archives: Software Development

Documentation Lock-In

Documentation is important. It’s how all the collective knowledge about a piece of software (or many pieces of software) is recorded and organized for use by those who maintain it. Once you get beyond a couple of developers, documentation becomes not just important, but required – without it, it’s not feasible to keep everyone on the same page.

While you could theoretically decide to keep things bare bones and put all of your documentation in plain text files, most developers aren’t masochistic and opt for something more flexible. Tools abound for creating, editing, generating, and browsing documentation – everything from markdown to wikis to static site generators (and plenty of combinations, too).

Given that documentation becomes a necessity so quickly, a given project often winds up choosing some set of these tools fairly early on in its lifespan. Sometimes it’ll be whatever the current fad is, sometimes it’s the system one of the developers used on an earlier project and liked, sometimes it’s just whatever happened to pop up first on Google. In general, though, such tools are typically chosen with only the current needs of the project in mind.

The problematic reality

It’s a common situation that a few years or even just months down the road, the documentation system is no longer really suited to the task at hand. Perhaps it’s missing features that would be extremely useful now that the project has matured, or perhaps folks have discovered that the system they chose is a maintenance nightmare and would rather have something more robust.

The more complex a documentation system is, though, the harder it is to move to something else (while preserving existing knowledge). Even something as simple as links between different documentation pages can present a serious headache: not only do you have to worry about migrating the content of pages from one system to another, you also have to make sure all the links are pointing to the right places. In effect, you wind up writing a compiler+linker combo: a compiler to translate the content of your old documentation system to your new one, and a linker to fix up all the references.

Of course, that’s not even counting any potential mismatches or incompatibilities in the content itself. Perhaps your existing documentation system supported templates in one form, and your new system does too… but in a completely different format. You’ll most likely wind up having to translate all the templates by hand, and then adjust your “compiler” to invoke them in the new format. Alternately, maybe your old system supported JavaScript, and so you have to make a choice between finding a new system that also supports JavaScript (and fixing anything which breaks in the transition), or dumping JavaScript support and either removing or reworking anything that was using it.

All of this makes it a lot more painful to change documentation systems after the fact. It’s not surprising that so many large projects wind up with documentation systems that seem needlessly subpar: it’s probably not because they like them that way, it’s because they’re effectively stuck with them.

Thinking ahead

So what to take away from this? Nothing’s going to make the problem completely go away. As long as useful documentation exists, some of these problems are going to crop up, and getting rid of documentation obviously isn’t the solution. Instead, the best strategy is probably twofold.

First, try to plan for the future as much as possible. If a tool only barely meets your needs now, it almost certainly won’t meet them later on down the road. Note that “meeting your needs” covers more than just the raw numbers – beyond absolute performance, there are also things like minor irritations and maintenance hassles. If the system you choose is “good enough” but already has you nearly fed up, chances are you will get fed up with it later. Avoid compromising on the underlying system, because chances are you’re going to be stuck with it for a while. Documentation often gets even less refactoring love than code does (and code often doesn’t get nearly enough).

Second, since knowing the future is often hard (and the services of those who can do it reliably are often priced in souls/hour), plan for change – especially in the early stages of a project. When things are just starting to take off and get a little more complex, hold off on adopting a giant flashy feature-rich documentation system. Keep things simple instead: rich in information, light on complexity. Avoiding complex features will make migration into another system a lot easier down the line, once you get a better sense of where the project is headed and what its documentation needs will be.

Bonus tip: for web-based documentation, use an editable URL shortener for external links into the documentation. When you change documentation systems, the URLs for pages are likely to change at least a bit, breaking external links. If external sources use the shorturls, though, you can just update where those shorturls point. It’s a lot easier to update a bunch of shorturls in a database you control than it is to try to get everyone else to update their links.

Embracing Vim, Step by Step

I’ve used Vim as my primary editor for years now after having mostly picked it up in college as a convenient editor for working on school servers. Having a lightweight editor that I was pretty much guaranteed to find already installed on any machine I happened to acquire SSH access to was what originally got me into using it; the speed and efficiency benefits were merely a nice side benefit.

As such, my first few years using Vim didn’t involve harnessing much of its power – I essentially used it at first like a version of pico that happened to have some useful commands accessible by pressing Esc. (If you’re a Vim user, I can feel you cringing. I am too.) As the amount of time I spend coding has increased (from starting out as a hobbyist, somewhat time-starved college student to working full-time as a software engineer), however, I’ve slowly begun picking up more and more Vim tricks of the trade.

Along the way, plenty of people have suggested various books or tutorials or other such massive collections of Vim knowledge, but so far my experience has been that very little of that tends to stick. If I try to learn a giant chunk of Vim features all at once, none of them really have the chance to become a habit, ingrained in my usage patterns. As a result (and due to the perfectly reasonable terseness of the average Vim instruction), they tend to just sort of fade away.

Instead, I’ve adopted a slower but steadier pace when it comes to learning new Vim shortcuts. First, I find a handful of related commands that are both useful and relevant to something I’m doing. I then focus on integrating those commands into my daily usage. Once they become something I do out of habit rather than conscious recollection, I move on to another set. I’ve found this approach much more successful in changing my long-term usage habits, which in turn has made my overall editing experience much more enjoyable.

As an example, the most recent set of commands I’ve decided to work on incorporating are related to visual mode:

  • v% – select text from current side of a matching pair to the other side (for instance, from one parens to a matching parens)
  • vib – select all text from the parens block the cursor is currently in
  • vi' or vi" – select the entire quoted string the cursor is currently in (depending on which quotes are used)
  • viB or vi{ – select an entire curly-brace-enclosed block
  • The a variants of these (using a instead of i) include the surrounding markers in the selection, e.g. the quote marks.

These particular commands are nicely demonstrated in this StackOverflow answer, including animated gifs to show the results. I’m almost certain I’ve run across these commands along with many others when perusing larger collections of Vim shortcuts before, but because I wasn’t focusing on them, they were lost in the noise. Time to fix that.

As I move on to other Vim shortcuts, you may find me writing other posts detailing each of them in turn. What you probably won’t see is a giant blog post summarizing every Vim shortcut I’ve ever learned. There are already enough monolithic guides out there.

Some Best Practices for Web App Authentication

It’s rare that a significant amount of time will go by without me hearing about yet another leak of user credentials from some well-known site. In the interest of incrementally increasing the security of the web as a whole, here’s a checklist to consider when writing your next (or current!) web application.

The Basics

These are things that are effectively mandatory – if you’re not already doing them, you should probably start doing them as soon as you can.

Use SSL (https) for anything involving authentication.

Ideally, use SSL for everything; it’s the simplest way to make sure you’re using it in all the right places. If it’s not feasible to do that, however, you should at least be using it for anything related to authentication. This isn’t just limited to login pages – it also includes any pages that use your session cookie. If you protect the login request with SSL but then make requests to other pages (which send along the session cookie) over regular HTTP, your site can be attacked using what is known as session hijacking. In other words, any page that deals with a logged-in user probably should be served using SSL.

Use POST for anything submitting sensitive information (and in general, don’t write sensitive information into logs).

The URL of a request (which is where form variables are submitted in a GET request) is typically written out to server logs; the contents of a POST body are not (unless you go out of your way to write them out somewhere). Storing hashed passwords in your database won’t buy you much if an attacker can just search through your logs and find the same information there.

On the same note, don’t write out any sensitive information to stdout or a log file. Avoid doing so even when debugging, just in case you happen to accidentally leave in such debugging code.

Hash your passwords with a strong, certified, and slow cryptographic one-way hashing scheme.

This means that you should be using something like bcrypt or a similar well-known hash function that has been designed to be slow. Don’t use MD5 or SHA-1, even though you’ll see a lot of existing code using them. Adding “HMAC” doesn’t magically make the hash slower, either – HMAC itself is not a hash function, and the most common implementations of it use fast hash functions (e.g. HMAC-MD5), not slow ones.

Why slow, you ask? Because a legitimate user won’t really mind if it takes a full second to process their login request – but someone trying to brute-force a password hash will have a much more difficult time if it takes a full second of CPU core time per password they try, as opposed to a millisecond.

It should hopefully go without saying, but don’t use 2-way encryption for passwords. There is absolutely no reason why you should ever need to recover the plaintext form of a password once it has been set – if a user forgets their password, you should use a separate process to reset it rather than giving it back to them. (See “Don’t use security questions” below for more on this topic.)

Ideally, design your system in such a way that you can easily change what hash algorithm you’re using – a common scheme is storing something like "algo|salt_string|hash" in the password field. That way, if you ever need to change the algorithm, you can simply require anyone with a stored value that starts with the old algorithm to reset their password, and store the new passwords in the same field (just using the new algorithm prefix). Note that bcrypt has this built in (the resulting strings it generates have metadata embedded, including a randomly-selected salt).
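As a rough sketch of that "algo|salt|hash" idea, here’s what storage and verification might look like using Python’s standard-library PBKDF2 (a well-known slow key-derivation function; bcrypt or scrypt are equally good choices, and the iteration count below is only a ballpark figure, not a recommendation):

```python
import base64
import hashlib
import hmac
import os

def hash_password(password: str, iterations: int = 600_000) -> str:
    # Per-password random salt (see the next section on salting).
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    # "algo|salt|hash" layout, so the algorithm can be swapped out later.
    return "|".join([f"pbkdf2_sha256${iterations}",
                     base64.b64encode(salt).decode(),
                     base64.b64encode(digest).decode()])

def verify_password(password: str, stored: str) -> bool:
    algo, salt_b64, digest_b64 = stored.split("|")
    name, iterations = algo.split("$")
    if name != "pbkdf2_sha256":
        raise ValueError("unknown algorithm - force a password reset")
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(),
                                 base64.b64decode(salt_b64), int(iterations))
    # Constant-time comparison avoids leaking information via timing.
    return hmac.compare_digest(digest, base64.b64decode(digest_b64))
```

Note how verify_password reads the algorithm name and work factor back out of the stored string, so old hashes keep verifying even after you raise the iteration count for new ones.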

Salt your hashes, using a separate salt for each string you hash.

Note that this is completely separate from the previous point. Using salts doesn’t make your hash function slower, nor does it make it take any longer to crack an individual password. No, the purpose of salting is to make it so that the time it takes to crack a single password is not the same as the time it takes to crack every password in your database. The thing that salts protect you against is what is known as a rainbow table.

The fundamental idea behind rainbow tables is “if we’re going to try a trillion possible passwords, let’s run the hash function on each of them once, and then check all of the hashes from the database against the result – that way we only have to run the hash function 1 trillion times, rather than N trillion times where N is the number of hashes in the database.” Proper salting makes it so that a given rainbow table would only be applicable to a single user’s hash, so using one would be no more efficient than brute-forcing each password individually.
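To make that concrete, here’s a toy demonstration (it uses fast SHA-256 purely to keep the example short – see the previous section for why a real system should use a slow hash):

```python
import hashlib

def digest(password: str, salt: str = "") -> str:
    # SHA-256 keeps the example short; a real system would use a slow KDF.
    return hashlib.sha256((salt + password).encode()).hexdigest()

guesses = ["123456", "password", "hunter2"]

# Unsalted: hash each guess ONCE, then look every stored hash up in the table.
table = {digest(g): g for g in guesses}
unsalted_db = {"alice": digest("hunter2"), "bob": digest("password")}
cracked = {user: table[h] for user, h in unsalted_db.items() if h in table}

# Salted: the precomputed table no longer matches anything - the attacker
# has to redo the full pass over the guess list for every single user.
salt, salted_hash = "x9q2", digest("hunter2", "x9q2")
assert salted_hash not in table
```

With no salts, the single precomputed table cracks both users at once; with per-user salts, each stored hash forces a fresh pass over the guess list.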

Store your sensitive data separate from your regular data.

This right here is an extremely low-hanging-fruit kind of thing. The vast majority of web applications I see have a “users” table in their database with fields like “id, username, password, name, email, join_date, favorite_color, …” and so on. This seems reasonable to most people at first glance; it’s all of the information specifically about a single user.

The problem is that it’s impossible to be perfect forever when it comes to security, and one day, a developer might write something like "SELECT * FROM users WHERE id = 5" and then dump the results directly into JSON to provide data to the AJAX call they just implemented on their new-and-improved profile page. Sure, they’re only using the join_date and favorite_color fields it returns, but anyone who happens to inspect the AJAX response now sees the password hash for the user nestled away there.

This can be really hard to catch in code review because there’s nothing in the new code that explicitly talks about sensitive data – the sensitive data just happened to be swept up along with the rest and dumped out into the (relatively speaking) public eye. If the sensitive data were in a separate table, it’d still be relatively easy to access along with other user data ("SELECT users.*, sensitive.* FROM users JOIN sensitive ON users.id = sensitive.user_id") but such accesses would be much more explicit and obvious to those writing and reviewing the code.

One might argue that the answer here is “never use SELECT *” but the simple response is that a lot of the time, developers aren’t even writing SQL – perhaps they dumped an ORM object to JSON, which faithfully dumped all of the fields. Putting sensitive data in a separate table is a simple solution to the problem that works no matter how you’re interfacing with the database. Of course, if you’re really concerned with user security you can go a step farther and put sensitive data like PII in a completely different database to allow you to better protect it and audit its usage – but that’s a separate topic.
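A quick sketch with sqlite3 (the table and column names here are just illustrative) shows the difference:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT,
                        join_date TEXT, favorite_color TEXT);
    CREATE TABLE sensitive (user_id INTEGER REFERENCES users(id),
                            password_hash TEXT);
    INSERT INTO users VALUES (5, 'alice', '2013-01-01', 'teal');
    INSERT INTO sensitive VALUES (5, 'pbkdf2_sha256$600000|...|...');
""")
db.row_factory = sqlite3.Row

# The careless "SELECT *" now sweeps up only harmless profile fields...
row = dict(db.execute("SELECT * FROM users WHERE id = 5").fetchone())
payload = json.dumps(row)

# ...while reaching the sensitive data requires an explicit, reviewable JOIN.
joined = db.execute("""SELECT users.*, sensitive.* FROM users
                       JOIN sensitive ON users.id = sensitive.user_id
                       WHERE users.id = 5""").fetchone()
```

The dumped JSON payload can no longer leak the hash, and any query that does touch the sensitive table stands out in review.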

Give your users the freedom to use whatever passwords they want, above minimum security thresholds.

It is reasonable (and a good idea) to require a minimum password length. If you don’t, some fraction of your users will inevitably choose “123” or similar. Likewise, it’s completely reasonable to disallow passwords that are nothing but a dictionary word plus a few characters. These are both the kinds of restrictions that encourage better passwords because they rule out the kinds of passwords that would be fundamentally insecure.

What’s unreasonable, though, are sites which place strict upper limits on passwords. The most notorious occurrence of this is probably “please choose a 4-digit PIN” – but it pops up in plenty of other less obvious forms. For instance, some sites place a limit on password length, e.g. “passwords must be 8-20 characters.” The minimum is fine, but why limit passwords to 20 characters at max? Assuming you’re hashing the password anyway (see the first item of this list), it’s not any harder to handle a 100 character password than it is to handle a 20 character one. The 28-character “correct horse battery staple” is far harder to randomly guess (assuming the person doing the guessing doesn’t read xkcd) than a 20-character password of a similar nature, but still very easy to remember.

The same also applies to restricting what characters can be used in passwords. It’s okay to require something other than a letter (in order to increase the number of potential characters most users will use in their passwords above just “the alphabet”), but arbitrarily disallowing things like spaces or non-alphanumeric characters is silly – if you’re hashing the password anyway, it won’t make a difference in how you handle passwords, and it will annoy the users who want to use a password that uses characters you don’t allow.

It’s okay to put sanity restrictions in place – for instance, a 1000-character maximum for a password field is reasonable to prevent someone from bogging down your password-handling algorithm by throwing a gigabyte of data at it (no one is going to type out a thousand characters by hand, and the small fraction of people using password managers probably aren’t going to be using quite that many characters either).
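Putting the above together, a minimal validation sketch might look like this (the word list and the strip set are obviously stand-ins for a real dictionary check):

```python
MIN_LEN, MAX_LEN = 8, 1000
COMMON_WORDS = {"password", "letmein", "dragon"}  # stand-in for a real list

def password_acceptable(pw: str) -> bool:
    # Enforce a length floor and a generous sanity ceiling...
    if not (MIN_LEN <= len(pw) <= MAX_LEN):
        return False
    # ...reject "dictionary word plus a few characters"...
    stripped = pw.strip("0123456789!@#$%^&*.").lower()
    # ...but impose no character-set restrictions: spaces, symbols,
    # and long passphrases are all perfectly fine.
    return stripped not in COMMON_WORDS
```

Note what the function does not do: there is no maximum beyond the sanity cap, and no "must not contain spaces" style rule.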

Put a reasonable upper limit on access attempts from a single location in a given time frame, and tell the user about failed access attempts.

In this context, “reasonable” is something on the order of 10-100 per 24 hour period. It’s high enough that no real user trying to remember what password they chose to use for your site is going to run into it, but low enough that an attacker trying to randomly guess passwords won’t make much progress.

Note that these kinds of limitations should be per location – you don’t want to make it possible for someone to lock someone else out of their account just by making a bunch of bogus login attempts. If a large number of different locations all try to access the same account, raise an alert and deal with the problem on a more specific basis.

Keep the user informed about attempts to access their account. This can be as simple as showing a “there were X failed login attempts since your last successful login” message when the user successfully signs in, but even better is to do this out of band – for instance, send the user an email after the 10th failed login attempt in a row. That way the user is made aware of the attack in a timely manner even if they aren’t frequently logging into your site. It also allows the user to figure out what might have happened if the attacker does eventually manage to access their account.
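As an illustration, a per-location sliding-window limiter can be sketched in a few lines (the threshold and the in-memory storage are placeholders – production code would keep this in a shared datastore):

```python
import time
from collections import defaultdict, deque

MAX_ATTEMPTS = 20          # somewhere in the 10-100 range discussed above
WINDOW = 24 * 60 * 60      # 24 hours, in seconds

_failures = defaultdict(deque)   # location (e.g. IP) -> failure timestamps

def allow_login_attempt(location: str, now: float = None) -> bool:
    now = time.time() if now is None else now
    q = _failures[location]
    while q and now - q[0] > WINDOW:
        q.popleft()              # drop failures that have aged out
    return len(q) < MAX_ATTEMPTS

def record_failure(location: str, now: float = None) -> None:
    _failures[location].append(time.time() if now is None else now)
```

Because the deque is keyed by location rather than by account, a flood of bogus attempts from one attacker can’t lock a legitimate user out from their own address.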

Use a single failure message regardless of whether a user is valid or a password was wrong.

Attackers can take advantage of separate responses for “invalid user” versus “wrong password” to check whether they have a valid account or not. By simply trying to log in with a bogus password and seeing whether or not the “invalid user” response comes back, they’re able to verify if that particular account exists. They can then try to figure out the password from other sources (perhaps checking to see if passwords are shared with a similarly-named account on a different, previously-compromised site).

Instead, just return the same response no matter what portion of the login attempt failed. For instance, “invalid username or password” is a simple, straightforward error message that doesn’t leak information about whether an account exists.

Don’t use “security questions” – they’re anything but.

Unlike passwords, which (theoretically) are completely arbitrary, security questions are generally the exact opposite – not arbitrary at all, but instead based on specific, immutable, often publicly-available facts about the user. For password resets, just email the user a link with a single-use, randomly-generated reset token that can be used to change their password to something new. The big-name email providers have been handling the problem of account access for much longer than you have; let them do the hard work of dealing with account recovery in the case where a user’s email is also compromised. Keep your end of things simple.

This also applies to any over-the-phone or other out-of-band account support you provide – if a user forgets their password, email them a reset link. If they lost access to their email account as well, let the email provider handle that case. (In the very rare situation where a customer is completely unable to recover their email account, you can handle that on a case-by-case basis, or simply choose to go the “tough luck” route.)

Don’t reset the user’s password for them. Doing so opens up the door for an attacker causing trouble by resetting a legitimate user’s password. Only change a user’s password if they use a correct, not-already-used reset link. (You can also make reset links expire after a certain period of time.)
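A minimal sketch of such a reset-token flow, with an in-memory dict standing in for a database table:

```python
import secrets
import time

RESET_TTL = 60 * 60   # reset links expire after an hour

# token -> (user_id, issued_at); in practice this is a database table
_pending = {}

def issue_reset_token(user_id: int, now: float = None) -> str:
    token = secrets.token_urlsafe(32)     # unguessable random token
    _pending[token] = (user_id, time.time() if now is None else now)
    return token                          # email this to the user as a link

def redeem_reset_token(token: str, now: float = None):
    entry = _pending.pop(token, None)     # pop() makes the token single-use
    if entry is None:
        return None                       # unknown or already used
    user_id, issued = entry
    now = time.time() if now is None else now
    return user_id if now - issued <= RESET_TTL else None
```

The pop() is what enforces single use: once a token has been redeemed (or has expired), presenting it again gets an attacker nothing.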

Going the Extra Mile

These things are not necessarily mandatory, but are still good ideas. You should at least consider incorporating them.

Two-factor authentication

Two-factor authentication is based around the idea of needing two different things (factors) to log into an account. Generally, the first thing is “something you know” (usually a password), and the second thing is “something you have” (typically either a purpose-built device, or nowadays, a smartphone). The reasoning is that it is far harder to both manage to figure out your password and obtain access to your smartphone, than to just acquire one or the other. Your phone might get stolen, but the thief probably doesn’t know your password. Similarly, someone might guess your password, but they probably don’t have your smartphone. Either way, you’d likely notice if your smartphone went missing.

Before smartphones became so ubiquitous, two-factor authentication was a little arduous – it required dedicated devices which could generate one-time passwords (“OTPs”). Nowadays, however, it’s easy to load a simple app onto a smartphone (such as Google Authenticator; full disclosure – I work for Google, but the Authenticator app is open-source and based on open standards) and use it as an OTP-generating device.

On the server side, implementing two-factor support is actually very simple (for instance, about 20 lines of Python). There are even libraries for it in a number of common languages. You’re basically just storing a second, internally-generated password for each user that is used to verify their OTP codes.
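Those twenty-odd lines are essentially an implementation of the HOTP/TOTP standards (RFC 4226 and RFC 6238). A stdlib-only sketch – not the actual Authenticator code – looks something like this:

```python
import base64
import hashlib
import hmac
import struct
import time

def hotp(secret_b32: str, counter: int, digits: int = 6) -> str:
    key = base64.b32decode(secret_b32, casefold=True)
    mac = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                     # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

def verify_totp(secret_b32: str, code: str, now: float = None,
                step: int = 30, window: int = 1) -> bool:
    counter = int((time.time() if now is None else now) // step)
    # Accept a small window of adjacent time steps to allow for clock drift.
    return any(hmac.compare_digest(hotp(secret_b32, counter + skew), code)
               for skew in range(-window, window + 1))
```

The server just stores the per-user base32 secret alongside the password hash and checks submitted codes with verify_totp.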

Email verification for unknown access locations

If you don’t utilize two-factor authentication, consider at least requiring out-of-band verification for access from a location that has never been seen before for a given user. This helps prevent malicious access to an account without significantly interfering with regular usage, especially if the user is already using two-factor authentication for their email account.

Account knowledge test for unknown access locations

This is similar to the previous item in that it involves an additional hurdle when accessing an account from a new location for the first time. In this case, you would ask the user to enter some piece of information about the account that a regular user of the account would easily know, but is not publicly derivable from just the login credentials. (For instance, a game’s website might ask for the name of a character on the specified account.) This helps prevent “drive by” intrusions where the attacker is trying out a bunch of stolen credentials (perhaps from another compromised site) but doesn’t actually know anything else about the account.

Don’t force frequent password changes.

In this context, “frequent” means more often than once a year or so. If you make someone change their password too often, it’s quite possible that they’ll resort to less secure means of remembering it, which is worse than not having changed it in the first place. If you do require occasional password changes, give the user warning when their current password is about to expire – ideally out-of-band (e.g. via email) so that they’re aware of the impending expiration even if they’re not actively logging in on a regular basis.

Git, Dotfiles, and Hardlinks

One handy use for Git is keeping track of your dotfiles – all of those configuration files that live inside your home directory like .screenrc, .gitconfig, .vimrc, et cetera.

A typical first approach often winds up looking something like this:

~ $ git init dotfiles
Initialized empty Git repository in /home/aiiane/dotfiles/.git/
~ $ cd dotfiles
~/dotfiles $ ln -s ~/.vimrc .
~/dotfiles $ git add .vimrc
~/dotfiles $ git commit -m "Track my vim config file"

All seems well until you look at the diff for the commit you just made and see something like this:

diff --git a/.vimrc b/.vimrc
new file mode 120000
index 0000000..6ba8edc
--- /dev/null
+++ b/.vimrc
@@ -0,0 +1 @@
+/home/aiiane/.vimrc
\ No newline at end of file

As it turns out, Git knows about symlinks and thus faithfully records the symlink as just that – a symlink – instead of recording the contents of the symlinked file. Oftentimes this leads to the thought of “well, if Git knows symlinks, then I can probably use a hardlink instead.” And thus the second attempt generally continues something like this:

~/dotfiles $ rm .vimrc
~/dotfiles $ ln ~/.vimrc .
~/dotfiles $ git add .vimrc
~/dotfiles $ git commit -m "Hardlink in my config file"

And this seems at first to work – looking at the diff, you see the contents of your config file being added; you push your commit to GitHub or whatever and your config file shows up properly there…

…until you try to have Git update the file. Perhaps you made a change somewhere else and now want to git pull, or perhaps you made a change locally that you decided you didn’t want and so you use git checkout .vimrc in your repo to change it back to your committed version. At this point you discover that while it seems to change the file in your repo, it doesn’t update the file in your home directory – the hardlink has been broken.

The reason for this is that Git never modifies files in the working tree – instead, it unlinks them and then recreates them from scratch. This inherently breaks any hardlinks that might have been present.
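You can watch the breakage happen with a few lines of Python that mimic Git’s unlink-and-recreate behavior (file names here are illustrative):

```python
import os
import tempfile

repo = tempfile.mkdtemp()
tracked = os.path.join(repo, "vimrc")        # the copy "in the repository"
homedir = os.path.join(repo, "home.vimrc")   # the hardlinked copy "in ~"

with open(tracked, "w") as f:
    f.write("set nocompatible\n")
os.link(tracked, homedir)
assert os.stat(tracked).st_ino == os.stat(homedir).st_ino  # same inode

# Mimic a checkout: Git unlinks the old file and writes a brand new one,
# rather than modifying the existing file in place.
os.unlink(tracked)
with open(tracked, "w") as f:
    f.write("set nocompatible\nsyntax on\n")

assert os.stat(tracked).st_ino != os.stat(homedir).st_ino  # link broken
```

After the unlink-and-recreate, the two paths point at different inodes, and the "home directory" copy never sees the update.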

The third attempt is generally what winds up working, once you realize that the working solution does actually use symlinks – but rather than symlinking files from outside the repository into it, it goes the other way around: symlinking files inside the repository out of it, into your home directory:

~/dotfiles $ rm ~/.vimrc
~/dotfiles $ cd ~
~ $ ln -s dotfiles/.vimrc .

Since most programs tend to handle symlinks transparently (unlike Git), this lets you use Git to update the actual copy of the file in the Git working tree, and have those changes reflected in the path where your programs expect to find it.

An alternative approach

Astute readers may notice that there is another possibility: why not just make your home directory the Git working tree?

~ $ git init
Initialized empty Git repository in /home/aiiane/.git/
~ $ git add .vimrc
~ $ git commit -m "Track vimrc"

While this does work, it has its own drawbacks. The most significant is probably the large number of things in your home directory that you typically don’t want to track which clutter up git status. For those who take the home-as-worktree path, a logical solution to this problem is to just ignore everything by default:

~ $ echo "*" > .gitignore
~ $ git add -f .gitignore
~ $ git commit -m "Ignore all by default"

Then only things you’ve explicitly added (and you’ll need to use git add -f the first time you add each file) will be tracked. If you do this, however, it’s harder to tell at a glance whether or not a given file is being tracked, and new files that you might want to add won’t stand out. There is a way to check, though:

~ $ git ls-files

The other main potential drawback of using your home directory as a working tree is that it effectively requires you to version the same files on every machine – since Git doesn’t really do partial checkouts gracefully, once a particular path in your home directory is added, it’ll be tracked anywhere that pulls the commit which added it. Most of the time this probably won’t be an issue – generally if you want to track something, you want it the same everywhere – but it’s something to bear in mind if you choose this approach.

Comment Systems and Time-Ordering at Scale

Time is a tricky concept in software development, especially in large distributed systems. Not only do you have to worry about the intricacies of computer representations for time itself (and there are many), but you also have to deal with the fact that it’s nigh-impossible to perfectly sync system clocks across multiple servers. This, of course, results in some interesting problems.

One example of a large distributed system is Google+. Obviously, there are quite a few servers that power the Google+ webapp. When a user adds a comment to a post, one of those many servers will be handling the AJAX request to add the comment, and it’ll be talking to one of a number of servers in Google’s backing data store.

Of course, each of those servers has a separate system clock. If those system clocks are even a second or two off from one another, you can get situations where two comments are timestamped in opposite order from when they were actually submitted – because the first comment was processed by a server that had a lagging system clock. You can see an example of that in this public comment thread. Notice how the 2nd comment displayed is actually a reply to the 3rd, and the 11th a reply to the 12th.

Now, this could be a fun little blog post just pointing that out and calling it done – “here’s the reality you have to deal with when you scale this large.” As it turns out, however, there’s a way this particular problem could be solved. Let’s take a look.

Defining the problem

First, let’s define the part of the problem we actually care about. In general, we don’t actually care about perfect time ordering – if two people post unrelated comments within a second or two of one another, it’s fine to display them in either order. What we actually care about is that replies to comments are shown after the comment to which they are replying. If comment B is a reply to comment A, we want to show comment B after comment A. That’s all we really care about – beyond that, the existing rough time ordering is fine.

Outlining the solution

So how can we do this without requiring incredibly precise time synchronization? The trick here is that we’re going to send some extra info from the client to the server (and store it as part of the newly created comment). Specifically, at the same time as we send the contents of a new comment to the server, we’ll also send along a second integer value – call it “parent” for now – that we’ll store along with the comment. Why is this important? Because it’s what will let us help decide ordering when we go to create the comment list.

What will “parent” contain? Simple: the id of the last existing comment the client already loaded. For instance, if the post was loaded with three comments, which had ids 1, 3, and 4 respectively, then the client would send a “parent” value of 4 when it posted a new comment.

When fetching the comment list, we’ll start out how you’d expect: grab a list of all the relevant comments from the data store.  To generate the ordered comment list, we’ll first take all the comments with an empty “parent” value and add them to our result, ordered by timestamp. Next, we’ll take all the comments with a “parent” value corresponding to comments already in our result, and insert them into the result based on two rules:

  1. A comment must be inserted somewhere after its “parent” comment P.
  2. A comment must be inserted after any comments with an earlier timestamp, except where this would violate #1.

We keep repeating this until we have no more comments left to add to the ordering.

Why does this work? It works because of the simple fact that in order for someone to reply to a comment, they first have to read it – which means that their client has to have loaded the older comment. By sending that knowledge to the server, we can create what’s called a partially ordered set out of the comments. With that partial ordering, we can then generate a final ordering that meets our desired goals. The algorithm outlined above is basically a pared-down adaptation of a vector clock.
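Sketched in Python, the whole procedure might look something like this (a toy model – the Comment fields and the orphan fallback are illustrative, not a production design):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Comment:
    id: int
    t: int                        # server timestamp (possibly skewed)
    parent: Optional[int] = None  # last comment id the client had loaded

def order_comments(comments):
    # Comments with no "parent" knowledge go in first, ordered by timestamp.
    result = sorted([c for c in comments if c.parent is None],
                    key=lambda c: c.t)
    pending = sorted([c for c in comments if c.parent is not None],
                     key=lambda c: c.t)
    while pending:
        ids = [c.id for c in result]
        ready = next((c for c in pending if c.parent in ids), None)
        if ready is None:         # orphaned parents: fall back to timestamps
            result.extend(pending)
            break
        pending.remove(ready)
        i = ids.index(ready.parent) + 1   # rule 1: after its parent...
        while i < len(result) and result[i].t < ready.t:
            i += 1                        # rule 2: ...and after earlier timestamps
        result.insert(i, ready)
    return result
```

Each pass places one comment whose parent is already in the ordering, so replies always land after what they reply to, even when their timestamps say otherwise.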

Proof of concept

I’ve created a small proof of concept as a Python script. Here’s the output (notice, in particular, the ordering of #3 and #4):

#0 (t=0)
#1 (t=2)
#2 (t=4) - reply to #1
#3 (t=6)
#4 (t=5) - reply to #3
#5 (t=8)
#6 (t=16)

The script is just a quick example thrown together in under an hour – I make no guarantees that it’s optimal. Feel free to play around with it.

Closing notes

There is a trade-off involved in this: the algorithm that generates this ordering is worst-case O(n²), compared to just sorting the list based on timestamp in O(n log n). For most scenarios, however, this is acceptable – in the case of Google+, it’s extremely unlikely to be a problem, given that G+ comment threads cap out at something like 500 replies. With such a small n, the difference in time is negligible. The sorting can also be done client-side if desired – no extra server processing is required.