Category Archives: Software Development

Civil vs. Nice

I’m not going to write a giant spiel on the current LKML kerfuffle. Instead, I’m just going to contrast a few (made-up) examples, none of which are “nice” per se.

The weak and potentially ineffective:

This code doesn’t seem very good. I’d prefer you didn’t submit these kinds of patches.

The needlessly abusive:

Are you an idiot? This code makes me think you are. We don’t need people like you submitting code like this.

The strong and civil:

This code is terribly written and does not at all meet our standards. We will not accept this patch, and if you keep submitting patches of this quality, we’ll be forced to stop wasting time considering them at all.

Just something to think about.

Be cautious about using Chartio (or at least, don’t follow their directions)

In the past couple of years, the number of web startups aiming to help other web startups with various tasks has grown immensely. That’s probably a good thing; more innovation for the web is always welcome, and a great way to drive innovation is to have good tools readily available. On the other hand, any time you have a bunch of newcomers to a space, there are going to be some rough edges as they learn lessons that older participants learned the hard way. There are also going to be some risks taken in the name of the aforementioned innovation, or perhaps in the name of “disruption” (a vaguer goal).

Case in point: Chartio. To quote their site, “Chartio is your data’s interface. It’s simple to set up, easy to use, and provides business intelligence for the world’s most popular data sources.”

The service they offer is definitely a useful one – there are a bunch of other companies that also try to provide it, each in their own way and with various trade-offs. I’ve certainly seen some grotesque systems built for the purpose of providing business analysts with tools to do their job, and I’m willing to bet that Chartio’s internal code is a lot cleaner than much of what I’ve come across in the past.

Where things go off the rails, however, is in the frankly astounding compromises Chartio wants its clients to make in order to enable the “simple to set up and easy to use” parts of their pitch. If we click through the landing page to get to the setup instructions (we’ll use their MySQL instructions as an example), we find that Chartio essentially wants to connect directly to your site’s production database.

Wait, what?

If you’re a systems administrator, alarm bells are probably going off in your mind now. Oh, but supposedly, it’s okay – they’ll use an encrypted SSH tunnel so that instead of you opening a hole in your firewall for them, they’ll bypass your firewall for you. Well, unless you don’t have shell access, in which case you will have to open a hole in your firewall for them.

Wait, what?

Sure, that SSH tunnel might be difficult for a third-party attacker to break into, but what about compromises of Chartio’s servers? Whatever Chartio machines sit on the other ends of these tunnels are veritable goldmines if a malicious user can compromise them, with active firewall-bypassing connections to a multitude of companies’ database servers. By opening up a tunnel, you’re effectively reducing your network’s defenses to the weaker of your own defenses or Chartio’s. While I’m sure the people behind Chartio are just as dedicated to security as any of us, their entire company is 8 people, and only half of those are even engineers, let alone actively working on security.

It’s okay, though – even if someone did get control of Chartio’s servers and credentials, they specifically have you set it up so that they connect to your database as a read-only user. So an attacker couldn’t delete all your data or anything malicious like that. Except, well, there’s still that matter of being able to read all of your data. At least, if you follow their setup instructions (again using MySQL as an example):

GRANT SELECT, SHOW VIEW
    ON $database_name.*
    TO $user@`rackspace1.chart.io` IDENTIFIED BY '$password';
FLUSH PRIVILEGES;

See the * wildcard on the end of that second line, giving access to every single table in the database? After all, it’d be a hassle to grant access to only specific tables that it would make sense for business analysts to examine, and disallow access to things like your users’ sensitive data. It might also mean that your analysts’ time is wasted asking engineers to add new tables for them to read when they need access to data that’s not on the whitelist.
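
For contrast, here’s a rough sketch of what a more restrictive setup could look like, mirroring Chartio’s own MySQL example but granting read access only to specific tables. The table names below are hypothetical, and the IDENTIFIED BY syntax just follows their sample:

-- Hypothetical table-level grants: only analyst-relevant tables are exposed.
GRANT SELECT, SHOW VIEW
    ON $database_name.orders
    TO $user@`rackspace1.chart.io` IDENTIFIED BY '$password';
GRANT SELECT, SHOW VIEW
    ON $database_name.page_views
    TO $user@`rackspace1.chart.io` IDENTIFIED BY '$password';
FLUSH PRIVILEGES;

Tables holding sensitive user data simply never appear in a grant, so even a full compromise of the read-only credentials exposes only the whitelisted reporting tables.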

Of course, that doesn’t even take into account the harm that non-malicious users can do. While I don’t know the exact extent of the queries that a user of Chartio can cause it to run, it’s certainly possible to impact the performance of a database by issuing read-only queries that happen to result in large, inefficient table scans. One might hope that Chartio has built-in protections against this, but given the wide variety of databases they interoperate with, all of which differ in their query semantics, it seems unlikely that every query coming from Chartio is going to be perfectly optimized for the data it’s running against.
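
As a contrived example (the table and column names are made up), even a single read-only query can force a full scan of a large table if it defeats the indexes:

-- Hypothetical query: the LOWER() call around the column and the
-- leading-wildcard LIKE both prevent MySQL from using an index on
-- customer_email, so this reads every row in the table.
SELECT COUNT(*)
    FROM $database_name.orders
    WHERE LOWER(customer_email) LIKE '%@example.com';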

Afterword

Let me make it clear that I don’t hate Chartio or anything – and they’re certainly not unique in making some of the choices I’ve highlighted above. My real goal here is just to make people more aware of the security trade-offs they are making when they use these kinds of methods to enable third-party services. It’s quite possible that the risks I’ve highlighted above are ones that you feel it’s okay to take, and in that case, go for it – as long as you’re respecting your end users’ interests as well. Just try to be cognizant of the risks you’re taking, and not just plug in new things because they’re shiny. Also realize that there may be ways that you can adjust the risks you’re taking – such as not using wildcard grants, as I mentioned above.

I would love to see Chartio develop some alternative methods of data acquisition that don’t involve plugging their servers directly into your database, or at least publish some guides on their site about good data isolation practices (e.g. restricting access to only the tables that are really relevant to business analysts, and partitioning other sensitive user data into separate tables). That would be good both for Chartio’s customers (in that their overall approach to data security would improve) and for Chartio (who might garner a little extra goodwill for helping that to happen). I expect it might take some time before that happens, though, given that Chartio is a startup with limited personnel to devote to all of its endeavors.

Small Projects, Big Projects

The most valuable skill for the average software engineer doesn’t get taught in the typical CS curriculum. It’s also almost impossible to teach it to yourself. Nearly every major programming outfit spends a significant amount of time trying to instill it in their employees, and they still often wind up failing to do so. Many programmers don’t realize they need this skill, and look down on those who have it and put it to good use.

You’re probably wondering what I’m talking about. The skill that I’m referring to is the ability to write code in such a way as to make it maintainable and amenable to large-scale collaboration.

I’m going to refer to code written with this skill in mind as “big-project code,” because it’s the kind of code that you need to write to make the most valuable contributions to a project of any significant size (as measured in people, not lines of code). In contrast, code that isn’t written with this skill in mind (or knowingly takes a pass on it) is “small-project code.”

Small-project code is not inherently bad code. There are tons of incredibly useful and clever programs out there written as small-project code. I’ve written many myself. No, the key distinction here is how easy it is for others to collaborate with you on your project.

There are many aspects of code that can vary between small-project and big-project. One example is its mental state requirement: how much of the code do you have to hold in your head to reason about its behavior? A program written in a small-project style will often require you to be familiar with a significant portion of its code to reason about what a particular handful of lines will do. Big-project code, on the other hand, will attempt to minimize the mental state involved in figuring out what a particular function does. Consistent naming, more verbose comments/docstrings, and clear object interfaces are some of the factors that play into big-project style.
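
As a contrived sketch (the function and all of its names are invented for illustration, in Python), compare a small-project helper with a big-project rendering of the same logic:

# Small-project style: perfectly functional, but a reader needs outside
# context to know what x and t are supposed to be.
def calc(x, t):
    return sum(p * q for p, q in x) * (1 + t)

# Big-project style: the name, docstring, and explicit parameter names let a
# newcomer reason about this function without holding the rest of the
# codebase in their head.
def total_order_price(line_items, tax_rate):
    """Return the total price of an order, including tax.

    Args:
        line_items: iterable of (unit_price, quantity) pairs.
        tax_rate: fractional tax rate, e.g. 0.08 for an 8% sales tax.
    """
    subtotal = sum(unit_price * quantity for unit_price, quantity in line_items)
    return subtotal * (1 + tax_rate)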

Another aspect is what you might call walk-in tolerance: how simple is it for someone who has never worked with your code to start making useful changes to it? The lower the mental state requirement, the higher the walk-in tolerance, but other factors come into play as well. Unit tests help boost developer confidence (even for seasoned contributors) by making it easier to notice if new code breaks existing functionality. Style guides prevent bickering over personal preferences and lead to more consistently readable code. Good logging makes it easier to diagnose and debug problems.

The big-project mentality can also be applied at a higher level than the code itself. For example, the concept of a service-oriented architecture is inherently a big-project ideology. It uses well-defined APIs to create a modular environment in which a given developer can focus on a particular service and not have to put much thought into the specific implementations of other services. In doing so, it trades some up-front development time (good APIs take effort to design and implement) in return for a long-term payout (as the overall project grows, the amount of developer time saved not keeping mental state on other services increases drastically).
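
As a rough sketch of that idea (the service and its methods are hypothetical), the well-defined interface is the part other developers need to hold in their heads; the implementation behind it is free to change:

from abc import ABC, abstractmethod

class PaymentService(ABC):
    """The contract other teams code against; whatever sits behind it
    (in-process code, a REST service, gRPC, ...) can change freely."""

    @abstractmethod
    def charge(self, account_id: str, amount_cents: int) -> str:
        """Charge an account and return a transaction id."""

class InMemoryPaymentService(PaymentService):
    """Toy implementation, handy for local development and tests."""

    def charge(self, account_id: str, amount_cents: int) -> str:
        return f"txn-{account_id}-{amount_cents}"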

For the typical hobbyist project or college CS classwork, small-project code is fine – even encouraged. After all, the thing you’re hacking on probably isn’t going to wind up with tens, hundreds, or even thousands of developers contributing to it. Big-project code is a long-term investment, and it only makes sense to make that investment in projects where it will pay off.

For the average software engineer, however, the day job isn’t a hobbyist project. It tends to involve a significant number of developers working together to produce a product that is hopefully greater than the sum of its parts. This is what big-project code is about. It’s about making not only your own job, but the job of others, easier. It’s about spending time greasing the gears so that when it’s time to get things done, you’re not losing energy to friction.

You don’t tend to truly appreciate big-project code (or even get a good grasp on what it is) until you’ve actually spent time working on such a project. Small-project code often feels more “fun” because you’re not spending as much time and effort on long-term investments. It’s easy to shy away from ever working on big-project code because the barrier to entry seems so high – but it can be worth it. In the long run, collaboration often leads to results you never would have imagined on your own.

So my advice to you is this:

  • If you’re already a software engineer, try to put some extra effort into figuring out what parts of your work are worth investing in for the future, and how you can approach them in more of a big-project way.
  • If you’re a student, but looking to eventually get a job as a software engineer, consider trying to get some experience with big-project code. One great way to do this is to contribute to an open-source project with a significant developer community. Not only will it give you experience with big-project code, it also stands out on a résumé. Not sure how to get started? Check out OpenHatch.
  • If you’re a CS teacher, consider trying to find a way to work a large collaborative project into your curriculum. An example might be an overarching project for a class with modular services designed and developed by smaller teams of students. Most college CS programs today focus heavily on the theoretical side of computer science and give little in the way of practical programming experience.

Piping Into Vim (and I/O Tidbits)

The Vim-related portion of this post is actually very small. In fact, it can be summed up in a single sentence: you can take the output of a pipe and place it into Vim for editing, without needing to use an intermediate file, by piping it to vim - (yes, the dash is significant). For example, grep foo bar.txt | vim - will grab all of the lines from bar.txt which contain “foo” and place them into a Vim buffer for editing.

If you look in the Vim manpage, you’ll see this functionality described as follows:

The file to edit is read from stdin. Commands are read from stderr, which should be a tty.

At first glance, this might seem odd – given the way many people think about stdin, stdout, and stderr, it wouldn’t make much sense to “read” from something associated with “error output”.

The reality, however, is that all of these are merely pointers to devices. The devices themselves may support reading, writing, or both. In fact, for a normal interactive shell session, all three of the “standard” I/O pointers point to the same device: the terminal interface you’re using.

Note that this is a different concept of ‘device’ from that of hardware devices (e.g. keyboard, mouse, screen) – as far as command line I/O is concerned, “devices” are just things that can have data read from and/or written to them. For instance, an SSH session creates a virtual terminal device on the machine to which you connect, that proxies its input and output via your connection to the machine.

Once you understand this, the concept of “reading from stderr” makes a lot more sense – you’re not actually “reading from your output” but rather “reading from the same device that error output goes to by default”. Since stderr is typically output to your terminal, even if you redirect stdout (say, to a file), getting your input from stderr is a convenient way of accepting keyboard input from the terminal when stdin is being used by something else – in this case, from the pipe that’s filling the buffer with content for you to edit.
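
The same trick works outside of Vim. Here’s a minimal Python sketch (the script name is made up, and it assumes stderr is still attached to an interactive terminal) that consumes piped data on stdin while prompting for keyboard input via file descriptor 2:

import os
import sys

# stdin is occupied by the pipe, so drain the piped data from it first.
piped_data = sys.stdin.read()

# fd 2 (stderr) normally still refers to the terminal, which is opened
# read/write, so a duplicate of it can be read for keyboard input.
# (Assumption: stderr has not been redirected away from the terminal.)
with os.fdopen(os.dup(2), "r") as tty:
    print("Got %d bytes on stdin. Keep them? [y/n] " % len(piped_data),
          end="", file=sys.stderr, flush=True)
    answer = tty.readline().strip()

print("You answered: %r" % answer, file=sys.stderr)

Running something like grep foo bar.txt | python3 prompt.py pipes the matches in on stdin, yet the [y/n] prompt still reads from the keyboard – exactly the arrangement vim - relies on.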

What I Want From “Social”

Facebook. Google+. Tumblr. Pinterest. Any of the other myriad social content services now thoroughly entrenched in the web – they’re all going after the same kind of interaction, if in subtly (or even obviously) different ways. Even with so many options, however, I have yet to find one that really embodies what I’m looking to get out of a social content experience.

(Disclosure and disclaimer: I work for Google, though not for any of the Social product teams. The opinions in this post are completely my own, do not reflect those of my employer, and should not be taken as indicative of current or future development efforts.)

Ease and versatility of content creation

Text composition and formatting

Blogging platforms (e.g. WordPress, Tumblr, etc.) probably do this best… in that they do it at all. Social content sites can generally be broken down into two fairly distinct categories: blogging platforms, and platforms with little to no formatting options. At best, non-blogging platforms tend to support bold, italics, auto-linked URLs, and maybe a handful of other styles (e.g. Google+’s strikethrough).

My ideal social content platform would support a broader set of formatting options while maintaining a consistent style. A reasonable set would be the one represented by attribute-less HTML tags – for instance, superscript and subscript, headings, and monospace/preformatted text. This limits the potential for eyesores (I’m specifically not looking for unrestricted colored-text support, for example) while still allowing the use of well-established formatting standards to present content in a more organized and elegant manner.

An ideal content platform would also make these formatting tools accessible to any level of user. While I as a technical user might prefer writing my posts in Markdown, someone else might be much more comfortable with a WYSIWYG or WYMIWYG creation process.

Multimedia sharing

An ideal platform also needs to support other media beyond text – photo galleries, shared videos, and so on. These should be presentable in a consistent and clean format that doesn’t need extra formatting effort on the part of the content creator. Modern social networks tend to handle this fairly well for embedded content, while most blogging platforms handle this poorly.

Clear and powerful access control

Google+ circles are pretty much the reference implementation I have in mind here – it’s the first service I’ve used to any significant extent that I felt actually got access control right. From the start it makes clear what you’re sharing and with whom, with the concept of circles baked into every aspect of the site.

I stopped using Facebook months ago, but when I left, my perception was that while they were trying to get some of the same concept implemented via friend lists, it was very much an uphill battle due to how much of the site was implemented around the symmetric friending model.

Many other content sharing platforms have very little access control at all – Twitter basically only has 2 options for sharing (all tweets are followers-only, or all tweets are public), and others have none (e.g. Tumblr).

Interest-based curation tools

This is an area in which Pinterest shines. By giving users the ability to create separate personal boards for each topic, and then subscribe to other users’ topic-specific boards, Pinterest allows a very fine-grained control over not only whose posts you see, but also what they’re about. As it turns out, just because you like someone’s posts about one topic doesn’t necessarily mean you’ll want to see everything they’re interested in (case in point: politics).

Note that this is different from, say, Facebook groups (or a classic example, LiveJournal communities). With such groups, access control and interests are conflated – to see any posts from the group, you must see all posts from the group, and to share something with the group, you must share it with everyone in the group. As a result, it’s generally necessary to have a group moderator if any modicum of privacy is desired.

The Pinterest model avoids the need for group moderators by keeping the topic separate from the set of users a post is shared with (though in Pinterest’s case, this is because they have basically no access control). Keeping topics separate from access control allows for the asymmetric access control model of circles without sacrificing the ability to customize topic consumption.

Intuitive discussion and moderation capabilities

Discussion format

Content creation and distribution is only half of the social media platform equation – the other half is dialogue. Forums are devoted entirely to encouraging dialogue, with features like the ability to easily refer to past parts of a discussion (via quotes or links to individual posts) and the ability to display both threaded and sequential views of a conversation. Many other social platforms, however, have neglected such functionality, instead adopting the stripped-down “comment stream” model of dialogue. While plenty of dialogue can go on within a flat comment stream, it’s primitive at best compared to the interaction capabilities of more modern forum software. In a complex conversation with multiple lines of discussion, it can be difficult to tell who is referring to what without more explicit indication.

Moderation

There’s also the matter of moderation, an area with which forum administrators are generally quite familiar. As the number of people conversing in a single place grows, so does the chance that one or more users will behave in a way not conducive to the kind of conversation the content creator is interested in promoting. The tools provided by many social networks to moderate ensuing discussion are lackluster at best – typically nothing more than the ability to remove a comment or globally block a user (if that).

There’s also the possibility of allowing other commenters to participate in the moderation process, whether through a community voting system à la Reddit or by pointing things out to a designated moderator (such as the post owner). If there’s an ability to report a problematic user on a social platform, it usually reports them to the owners of the platform rather than the creator of the post, which makes it useless for behavior that isn’t against the platform’s TOS but is against the wishes of the original poster. Providing more ways for commenters to point out troublemakers to content creators would lessen the burden on those creators to police discussions.

So to sum it all up…

You could roughly say that what I’d really like to see is a social content sharing platform with the circle model from Google+, the topic-subscription functionality from Pinterest, the content creation tools of WordPress and Markdown, the discussion and moderation power of a forum, and the multimedia sharing capabilities of Facebook/Google+.

Put that way, it almost sounds simple… until you realize that each of these is something that an entire site has devoted itself to doing well. Doing all of them in a single site would be at the very least challenging, and doing them well in a single site would be a truly herculean effort. Doing all of it, doing it well, and gaining enough market share to hit the critical mass of users necessary to give a social platform real staying power… well, let’s just say I’m not holding my breath.

That said – if you manage to pull it off, make sure to let me know.