Blog Archives

Be cautious about using Chartio (or at least, don’t follow their directions)

In the past couple of years the number of web startups aiming to help other web startups with various tasks has grown immensely. That’s probably a good thing; more innovation for the web is always welcome, and a great way to drive innovation is to have good tools readily available. On the other hand, any time you have a bunch of newcomers to a space, there are going to be some rough edges as they learn the lessons that older participants learned the hard way. There are also going to be risks taken in the name of the aforementioned innovation, or perhaps in the name of “disruption” (a vaguer goal).

Case in point: Chartio. To quote their site, “Chartio is your data’s interface. It’s simple to set up, easy to use, and provides business intelligence for the world’s most popular data sources.”

The service they offer is definitely a useful one – there are a bunch of other companies that also try to provide it, each in their own way and with various trade-offs. I’ve certainly seen some grotesque systems built for the purpose of providing business analysts with tools to do their job, and I’m willing to bet that Chartio’s internal code is a lot cleaner than much of what I’ve come across in the past.

Where things go off the rails, however, is the frankly astounding compromises Chartio wants its clients to make in order to enable the “simple to set up and easy to use” parts of their pitch. If we click through the landing page to get to the setup instructions (we’ll use their MySQL instructions as an example), we find that Chartio essentially wants to connect directly to your site’s production database.

Wait, what?

If you’re a systems administrator, alarm bells are probably going off in your mind now. Oh, but supposedly, it’s okay – they’ll use an encrypted SSH tunnel so that instead of you opening a hole in your firewall for them, they’ll bypass your firewall for you. Well, unless you don’t have shell access, in which case you will have to open a hole in your firewall for them.

Wait, what?

Sure, that SSH tunnel might be difficult for a third-party attacker to break into, but what about compromises of Chartio’s servers? Whatever Chartio machines are on the other ends of these tunnels are veritable goldmines if a malicious user can compromise them, with active firewall-bypassing connections to a multitude of companies’ database servers. By opening up a tunnel, you’re effectively reducing your network’s defenses to the weaker of your own defenses and Chartio’s. While I’m sure the people behind Chartio are just as dedicated to security as any of us, their entire company is 8 people, and only half of those are even engineers, let alone actively working on security.

It’s okay, though – even if someone did get control of Chartio’s servers and credentials, they specifically have you set it up so that they connect to your database as a read-only user. So an attacker couldn’t delete all your data or anything malicious like that. Except, well, there’s still that matter of being able to read all of your data. At least, if you follow their setup instructions (again using MySQL as an example):

GRANT SELECT, SHOW VIEW
    ON $database_name.*
    TO $user@`rackspace1.chart.io` IDENTIFIED BY '$password';
FLUSH PRIVILEGES;

See the * wildcard on the end of that second line, giving access to every single table in the database? After all, it’d be a hassle to grant access to only the specific tables that it makes sense for business analysts to examine (a GRANT SELECT ON $database_name.some_reporting_table for each one, rather than $database_name.*), and to disallow access to things like your users’ sensitive data. It might also mean that your analysts’ time is wasted asking engineers to add new tables for them to read when they need access to data that’s not on the whitelist.

Of course, that doesn’t even take into account the harm that non-malicious users can do. While I don’t know the exact extent of the queries that a user of Chartio can cause it to run, it’s certainly possible to impact the performance of a database by issuing read-only queries that happen to result in large, inefficient table scans. One might hope that Chartio has built-in protections against this, but given the wide variety of databases they interoperate with, all of which have varying levels of similarity in their query semantics, it seems unlikely that every query coming from Chartio is going to be perfectly optimized for the data it’s running against.

Afterword

Let me make it clear that I don’t hate Chartio or anything – and they’re certainly not unique in making some of the choices I’ve highlighted above. My real goal here is just to make people more aware of the security trade-offs they are making when they use these kinds of methods to enable third-party services. It’s quite possible that the risks I’ve highlighted above are ones that you feel it’s okay to take, and in that case, go for it – as long as you’re respecting your end users’ interests as well. Just try to be cognizant of the risks you’re taking, and not just plug in new things because they’re shiny. Also realize that there may be ways that you can adjust the risks you’re taking – such as not using wildcard grants, as I mentioned above.

I would love to see Chartio develop some alternative methods of data acquisition that don’t involve plugging their servers directly into your database, or at least provide some guides on their site about good data-isolation practices (e.g. restricting access to only the tables that are really relevant to business analysts, and partitioning sensitive user data into separate tables). That would be good both for Chartio’s customers (in that their overall approach to data security would improve) and for Chartio itself (who might garner a little extra goodwill for helping that to happen). I expect that it might take some time before that happens, though, given that Chartio is a startup with limited personnel to devote to all of its endeavors.

Some Best Practices for Web App Authentication

It’s rare that a significant amount of time will go by without me hearing about yet another leak of user credentials from some well-known site. In the interest of incrementally increasing the security of the web as a whole, here’s a checklist to consider when writing your next (or current!) web application.

The Basics

These are things that are effectively mandatory – if you’re not already doing them, you should probably start as soon as you can.

Use SSL (https) for anything involving authentication.

Ideally, use SSL for everything; it’s the simplest way to make sure you’re using it in all the right places. If it’s not feasible to do that, however, you should at least be using it for anything related to authentication. This isn’t just limited to login pages – it also includes any pages that use your session cookie. If you protect the login request with SSL but then make requests to other pages (which send along the session cookie) over regular HTTP, your site can be attacked using what is known as session hijacking. In other words, any page that deals with a logged-in user probably should be served using SSL.
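
As a concrete illustration, here’s a minimal sketch of flipping the relevant cookie flags in a Python web framework (Flask in this case, purely as an example – most frameworks expose equivalent settings):

from flask import Flask

app = Flask(__name__)
app.config.update(
    SESSION_COOKIE_SECURE=True,    # never send the session cookie over plain HTTP
    SESSION_COOKIE_HTTPONLY=True,  # keep it out of reach of page JavaScript, too
)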

Use POST for anything submitting sensitive information (and in general, don’t write sensitive information into logs).

The URL of a request (which is where form variables are submitted in a GET request) is typically written out to server logs; the contents of a POST body are not (unless you go out of your way to write them out somewhere). Storing hashed passwords in your database won’t buy you much if an attacker can just search through your logs and find the same information there.

On the same note, don’t write out any sensitive information to stdout or a log file. Avoid doing so even when debugging, just in case you happen to accidentally leave in such debugging code.
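
One cheap safeguard is to scrub known-sensitive fields before anything is logged. A minimal sketch in Python (the field names here are just illustrative):

SENSITIVE_FIELDS = {'password', 'new_password', 'credit_card'}

def scrub(params):
    """Return a copy of a request's parameters that is safe to log."""
    return {k: '[REDACTED]' if k in SENSITIVE_FIELDS else v
            for k, v in params.items()}

# logger.info('login attempt: %s', scrub(form_params))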

Hash your passwords with a strong, well-vetted, and deliberately slow cryptographic one-way hashing scheme.

This means that you should be using something like bcrypt or a similar well-known hash function that has been designed to be slow. Don’t use MD5 or SHA-1, even though you’ll see a lot of existing code using them. Adding “HMAC” doesn’t magically make the hash slower, either – HMAC itself is not a hash function, and the most common implementations of it use fast hash functions (e.g. HMAC-MD5), not slow ones.

Why slow, you ask? Because a legitimate user won’t really mind if it takes a full second to process their login request – but someone trying to brute-force a password hash will have a much more difficult time if it takes a full second of CPU core time per password they try, as opposed to a millisecond.

It should hopefully go without saying, but don’t use 2-way encryption for passwords. There is absolutely no reason why you should ever need to recover the plaintext form of a password once it has been set – if a user forgets their password, you should use a separate process to reset it rather than giving it back to them. (See “Don’t use security questions” below for more on this topic.)

Ideally, design your system in such a way that you can easily change what hash algorithm you’re using – a common scheme is storing something like "algo|salt_string|hash" in the password field. That way, if you ever need to change the algorithm, you can simply require anyone with a stored value that starts with the old algorithm to reset their password, and store the new passwords in the same field (just using the new algorithm prefix). Note that bcrypt has this built in (the resulting strings it generates have metadata embedded, including a randomly-selected salt).
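
Putting the above together, here’s a minimal sketch using Python’s bcrypt package – note how the algorithm parameters and salt ride along inside the stored hash itself:

import bcrypt

def hash_password(password):
    # The output embeds the algorithm version, cost factor, and a random
    # salt (e.g. b'$2b$12$...'), so one database column holds everything.
    # Caveat: bcrypt silently truncates input beyond 72 bytes.
    return bcrypt.hashpw(password.encode('utf-8'), bcrypt.gensalt(rounds=12))

def check_password(password, stored_hash):
    return bcrypt.checkpw(password.encode('utf-8'), stored_hash)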

Salt your hashes, using a separate salt for each string you hash.

Note that this is completely separate from the previous point. Using salts doesn’t make your hash function slower, nor does it make it take any longer to crack an individual password. No, the purpose of salting is to make it so that the time it takes to crack a single password is not the same as the time it takes to crack every password in your database. The thing that salts protect you against is what is known as a rainbow table.

The fundamental idea behind rainbow tables is “if we’re going to try a trillion possible passwords, let’s run the hash function on each of them once, and then check all of the hashes from the database against the results – that way we only have to run the hash function 1 trillion times, rather than N trillion times, where N is the number of hashes in the database.” Proper salting makes it so that a given rainbow table is only applicable to a single user’s hash, so using one is no more efficient than brute-forcing that hash individually.

Store your sensitive data separate from your regular data.

This right here is an extremely low-hanging-fruit kind of thing. The vast majority of web applications I see have a “users” table in their database with fields like “id, username, password, name, email, join_date, favorite_color, …” and so on. This seems reasonable to most people at first glance; it’s all of the information specifically about a single user.

The problem is that it’s impossible to be perfect forever when it comes to security, and one day, a developer might write something like "SELECT * FROM users WHERE id = 5" and then dump the results directly into JSON to provide data to the AJAX call they just implemented on their new-and-improved profile page. Sure, they’re only using the join_date and favorite_color fields it returns, but anyone who happens to inspect the AJAX response now sees the password hash for the user nestled away there.

This can be really hard to catch in code review because there’s nothing in the new code that explicitly talks about sensitive data – the sensitive data just happened to be swept up along with the rest and dumped out into the (relatively speaking) public eye. If the sensitive data were in a separate table, it’d still be relatively easy to access along with other user data ("SELECT users.*, sensitive.* FROM users JOIN sensitive ON users.id = sensitive.id") but such accesses would be much more explicit and obvious to those writing and reviewing the code.

One might argue that the answer here is “never use SELECT *” but the simple response is that a lot of the time, developers aren’t even writing SQL – perhaps they dumped an ORM object to JSON, which faithfully dumped all of the fields. Putting sensitive data in a separate table is a simple solution to the problem that works no matter how you’re interfacing with the database. Of course, if you’re really concerned with user security you can go a step further and put sensitive data like PII in a completely different database to allow you to better protect it and audit its usage – but that’s a separate topic.
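
Whichever storage layout you choose, an explicit serialization whitelist is cheap insurance against this failure mode. A minimal sketch (the field names are hypothetical):

PUBLIC_USER_FIELDS = ('id', 'username', 'join_date', 'favorite_color')

def user_to_public_dict(user_row):
    # New columns never leak into responses unless someone deliberately
    # adds them here, so the access is explicit and easy to review.
    return {field: user_row[field] for field in PUBLIC_USER_FIELDS}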

Give your users the freedom to use whatever passwords they want, above minimum security thresholds.

It is reasonable (and a good idea) to require a minimum password length. If you don’t, some fraction of your users will inevitably choose “123” or similar. Likewise, it’s completely reasonable to disallow passwords that are nothing but a dictionary word plus a few characters. Both of these restrictions encourage better passwords, because they rule out the ones that would be fundamentally insecure.

What’s unreasonable, though, are sites which place strict upper limits on passwords. The most notorious occurrence of this is probably “please choose a 4-digit PIN” – but it pops up in plenty of other, less obvious forms. For instance, some sites place a limit on password length, e.g. “passwords must be 8-20 characters.” The minimum is fine, but why cap passwords at 20 characters? Assuming you’re hashing the password anyway (see the first item of this list), it’s not any harder to handle a 100-character password than it is to handle a 20-character one. The 28-character “correct horse battery staple” is far harder to randomly guess (assuming the person doing the guessing doesn’t read xkcd) than a 20-character password of a similar nature, but still very easy to remember.

The same also applies to restricting what characters can be used in passwords. It’s okay to require something other than a letter (in order to increase the number of potential characters most users will use in their passwords above just “the alphabet”), but arbitrarily disallowing things like spaces or non-alphanumeric characters is silly – if you’re hashing the password anyway, it won’t make a difference in how you handle passwords, and it will annoy the users who want to use a password that uses characters you don’t allow.

It’s okay to put sanity restrictions in place – for instance, a 1000-character maximum for a password field is reasonable to prevent someone from bogging down your password-handling algorithm by throwing a gigabyte of data at it (no one is going to type out a thousand characters by hand, and the small fraction of people using password managers probably aren’t going to be using quite that many characters either).
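
The whole policy boils down to a couple of comparisons. A minimal sketch, with illustrative thresholds:

MIN_LENGTH, MAX_LENGTH = 8, 1000  # the maximum is a sanity cap, not a security cap

COMMON_WORDS = set()  # hypothetical: load a dictionary/common-password list here

def password_acceptable(password):
    if not MIN_LENGTH <= len(password) <= MAX_LENGTH:
        return False
    if password.strip().lower() in COMMON_WORDS:
        # a real check would also catch word-plus-a-few-characters variants
        return False
    return True  # any characters, including spaces, are fine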

Put a reasonable upper limit on access attempts from a single location in a given time frame, and tell the user about failed access attempts.

In this context, “reasonable” is something on the order of 10-100 per 24-hour period. It’s high enough that no real user trying to remember what password they chose to use for your site is going to run into it, but low enough that an attacker trying to randomly guess passwords won’t make much progress.

Note that these kinds of limitations should be per location – you don’t want to make it possible for someone to lock someone else out of their account just by making a bunch of bogus login attempts. If a large number of different locations all try to access the same account, raise an alert and deal with the problem on a more specific basis.

Keep the user informed about attempts to access their account. This can be as simple as showing a “there were X failed login attempts since your last successful login” message when the user successfully signs in, but even better is to do this out of band – for instance, send the user an email after the 10th failed login attempt in a row. That way the user is made aware of the attack in a timely manner even if they aren’t frequently logging into your site. It also allows the user to figure out what might have happened if the attacker does eventually manage to access their account.
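
A minimal sketch of the throttling side, keyed by source address. The thresholds are illustrative, and a production version would keep its counters somewhere like Redis with expiring keys rather than in process memory:

import time
from collections import defaultdict

WINDOW_SECONDS = 24 * 60 * 60
MAX_FAILURES = 50

recent_failures = defaultdict(list)  # source IP -> failed-attempt timestamps

def attempt_allowed(ip):
    now = time.time()
    recent_failures[ip] = [t for t in recent_failures[ip]
                           if now - t < WINDOW_SECONDS]
    return len(recent_failures[ip]) < MAX_FAILURES

def record_failure(ip, username):
    recent_failures[ip].append(time.time())
    # hypothetical hook: email the user after, say, the 10th failure in a row
    # maybe_notify_user(username)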

Use a single failure message regardless of whether the username was invalid or the password was wrong.

Attackers can take advantage of separate responses for “invalid user” versus “wrong password” to check whether they have a valid account or not. By simply trying to log in with a bogus password and seeing whether or not the “invalid user” response comes back, they’re able to verify if that particular account exists. They can then try to figure out the password from other sources (perhaps checking to see if passwords are shared with a similarly-named account on a different, previously-compromised site).

Instead, just return the same response no matter what portion of the login attempt failed. For instance, “invalid username or password” is a simple, straightforward error message that doesn’t leak information about whether an account exists.
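
In code, that means a single error path no matter which check failed. A minimal sketch, reusing check_password from the hashing example above (find_user and start_session are hypothetical application helpers):

class AuthError(Exception):
    pass

DUMMY_HASH = hash_password('placeholder')  # computed once at startup

def login(username, password):
    user = find_user(username)
    if user is None:
        # Run the hash check anyway, so that response timing doesn't
        # reveal whether the account exists, either.
        check_password(password, DUMMY_HASH)
        raise AuthError('invalid username or password')
    if not check_password(password, user.password_hash):
        raise AuthError('invalid username or password')
    return start_session(user)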

Don’t use “security questions” – they’re anything but.

Unlike passwords, which (theoretically) are completely arbitrary, security questions are generally the exact opposite – not arbitrary at all, but instead based on specific, immutable, often publicly-available facts about the user. For password resets, just email the user a link with a single-use, randomly-generated reset token that can be used to change their password to something new. The big-name email providers have been handling the problem of account access for much longer than you have; let them do the hard work of dealing with account recovery in the case where a user’s email is also compromised. Keep your end of things simple.

This also applies to any over-the-phone or other out-of-band account support you provide – if a user forgets their password, email them a reset link. If they lost access to their email account as well, let the email provider handle that case. (In the very rare situation where a customer is completely unable to recover their email account, you can handle that on a case-by-case basis, or simply choose to go the “tough luck” route.)

Don’t reset the user’s password for them. Doing so opens the door for an attacker to cause trouble by resetting a legitimate user’s password. Only change a user’s password if they use a valid, not-yet-used reset link. (You can also make reset links expire after a certain period of time.)
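
A minimal sketch of single-use, expiring reset tokens (the storage helpers are hypothetical). Note that only a hash of the token is stored, for the same reason you only store hashes of passwords:

import secrets
import time

RESET_TTL_SECONDS = 60 * 60  # links expire after an hour

def create_reset_token(user_id):
    token = secrets.token_urlsafe(32)  # unguessable random string
    save_reset_record(user_id, hash_password(token),
                      expires=time.time() + RESET_TTL_SECONDS)
    return token  # goes into the emailed link; never stored in plaintext

def redeem_reset_token(user_id, token):
    record = load_reset_record(user_id)
    if record is None or time.time() > record.expires:
        return False
    if not check_password(token, record.token_hash):
        return False
    delete_reset_record(user_id)  # single use: invalidate before proceeding
    return True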

Going the Extra Mile

These things are not necessarily mandatory, but are still good ideas. You should at least consider incorporating them.

Two-factor authentication

Two-factor authentication is based around the idea of needing two different things (factors) to log into an account. Generally, the first thing is “something you know” (usually a password), and the second thing is “something you have” (typically either a purpose-built device, or nowadays, a smartphone). The reasoning is that it is far harder to both manage to figure out your password and obtain access to your smartphone, than to just acquire one or the other. Your phone might get stolen, but the thief probably doesn’t know your password. Similarly, someone might guess your password, but they probably don’t have your smartphone. Either way, you’d likely notice if your smartphone went missing.

Before smartphones became so ubiquitous, two-factor authentication was a little arduous – it required dedicated devices which could generate one-time passwords (“OTPs”). Nowadays, however, it’s easy to load a simple app onto a smartphone (such as Google Authenticator; full disclosure – I work for Google, but the Authenticator app is open-source and based on open standards) and use it as an OTP-generating device.

On the server side, implementing two-factor support is actually very simple (for instance, about 20 lines of Python). There are even libraries for it in a number of common languages. You’re basically just storing a second, internally-generated password for each user that is used to verify their OTP codes.
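
Here’s roughly what those 20 lines can look like – a sketch of TOTP verification (RFC 6238, the open standard that Google Authenticator implements) using only the standard library:

import base64
import hashlib
import hmac
import struct
import time

def hotp(key, counter, digits=6):
    # HMAC-SHA1 over the big-endian counter, then dynamic truncation (RFC 4226)
    digest = hmac.new(key, struct.pack('>Q', counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F
    code = struct.unpack('>I', digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return '{:0{}d}'.format(code % (10 ** digits), digits)

def verify_totp(secret_b32, submitted, step=30, skew=1):
    # Accept codes from adjacent time steps to tolerate a little clock drift.
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int(time.time()) // step
    return any(hmac.compare_digest(hotp(key, counter + off), submitted)
               for off in range(-skew, skew + 1))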

Email verification for unknown access locations

If you don’t utilize two-factor authentication, consider at least requiring out-of-band verification for access from a location that has never been seen before for a given user. This helps prevent malicious access to an account without significantly interfering with regular usage, especially if the user is already using two-factor authentication for their email account.

Account knowledge test for unknown access locations

This is similar to the previous item in that it involves an additional hurdle when accessing an account from a new location for the first time. In this case, you would ask the user to enter some piece of information about the account that a regular user of the account would easily know, but is not publicly derivable from just the login credentials. (For instance, a game’s website might ask for the name of a character on the specified account.) This helps prevent “drive by” intrusions where the attacker is trying out a bunch of stolen credentials (perhaps from another compromised site) but doesn’t actually know anything else about the account.

Don’t force frequent password changes.

In this context, “frequent” means more often than once a year or so. If you make someone change their password too often, it’s quite possible that they’ll resort to less secure means of remembering it, which is worse than not having changed it in the first place. If you do require occasional password changes, give the user warning when their current password is about to expire – ideally out-of-band (e.g. via email) so that they’re aware of the impending expiration even if they’re not actively logging in on a regular basis.


Yet Another Way You Shouldn’t Implement Password Recovery

A couple of weeks ago I had the dubious pleasure of calling a company to tell them that they had a security vulnerability in their site that you could drive a truck through. I’m not going to name names, but given that the issue is now patched, I think it’s a good lesson on things to avoid when designing your own application.

I’ll show you the relevant source code first and let you see if you can figure it out.

// forgot.php ("forgot password" functionality)

if(!empty($_POST['username'])) {
    ob_start();
    // send the request
    CURLHandler::Post(LOGIN_APP_URL . 'resettoken', array('username' => $_POST['username']));
    $result = ob_get_contents();
    ob_end_clean();

    $result = json_decode($result);
    if ($result->success == true) {
        $resetUrl = SECURE_SERVER_URL . 'resetpass.php?un=' . base64_encode($_POST['username']) . '&token=' . $result->token;
        $resetUrl = '<a title="Password Recovery" href="' . $resetUrl . '">' . $resetUrl . '</a>';
        sendTemplateEmail($_POST['username'], 'recovery', array('url' => $resetUrl));
        $msg= '<p>Login information will be sent if the email address ' . $_POST['username'] . ' is registered.</p>';
    } else {
        $msg = '<p>Sorry, unable to send password reset information. Try again or contact an administrator.</p>';
    }
}

There are actually a couple of things wrong here.

The most problematic flaw (and the one that actually caused the security vulnerability) is that LOGIN_APP_URL happens to point to a server that is accessible from the public internet. This means that anyone who wanted to could bypass this script and make a request directly to its resettoken endpoint, supplying any username they liked, and get back a valid password reset token for that username! Then all they had to do was manually construct the reset-password URL with the username and the acquired token, and they could take over the account.

There’s another, more subtle issue here, though: why is an HTTP request being made that returns a password reset token in the first place? What should be happening is that the same code which generates the reset token also sends the email, avoiding any need to expose the token generator via an external interface at all.
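
For contrast, here’s a minimal sketch of that shape in Python (the helpers are hypothetical; token creation is as sketched in the previous post). The token is generated, stored, and emailed inside one trusted boundary, and never returned to any caller:

def request_password_reset(username):
    user = find_user(username)
    if user is not None:
        token = create_reset_token(user.id)  # generate and store in one place
        send_reset_email(user.email, token)
    # Identical response either way: the endpoint reveals neither the token
    # nor whether the account exists.
    return 'Password reset information will be sent if the address is registered.'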

(There’s also the slight “wtf” of using PHP’s output buffer to get the results of a cURL call, rather than using CURLOPT_RETURNTRANSFER, but that’s not a security issue.)