Category Archives: Software Development
I’m not going to write a giant spiel on the current LKML kerfuffle. Instead, I’m just going to contrast a few (made-up) examples, none of which are “nice” per se.
The weak and potentially ineffective:
This code doesn’t seem very good. I’d prefer you didn’t submit these kinds of patches.
The needlessly abusive:
Are you an idiot? This code makes me think you are. We don’t need people like you submitting code like this.
The strong and civil:
This code is terribly written and does not at all meet our standards. We will not accept this patch, and if you keep submitting patches of this quality, we’ll be forced to stop wasting time considering them at all.
Just something to think about.
In the past couple of years the number of web startups aiming to help other web startups with various tasks has grown immensely. That’s probably a good thing; more innovation for the web is always welcome, and a great way to drive innovation is to have good tools readily available. On the other hand, any time you have a bunch of newcomers to a space, there’s going to be some rough edges involved as they learn some of the lessons that older participants learned the hard way. There’s also going to be some risks taken in the name of the aforementioned innovation, or perhaps in the name of “disruption” (a more vague goal).
Case in point: Chartio. To quote their site, “Chartio is your data’s interface. It’s simple to set up, easy to use, and provides business intelligence for the world’s most popular data sources.”
The service they offer is definitely a useful one – there’s a bunch of other companies that also try to provide it, each in their own way, and with various trade-offs. I’ve certainly seen some grotesque systems built for the purpose of providing business analysts with tools to do their job, and I’m willing to bet that Chartio’s internal code is a lot cleaner than a lot of that which I’ve come across in the past.
Where things go off the rails, however, is in the frankly astounding compromises Chartio wants its clients to make in order to enable the “simple to set up and easy to use” parts of their pitch. If we click through the landing page to get to the setup instructions (we’ll use their MySQL instructions as an example), we find that Chartio essentially wants to connect directly to your site’s production database.
If you’re a systems administrator, alarm bells are probably going off in your mind now. Oh, but supposedly, it’s okay – they’ll use an encrypted SSH tunnel so that instead of you opening a hole in your firewall for them, they’ll bypass your firewall for you. Well, unless you don’t have shell access, in which case you will have to open a hole in your firewall for them.
Sure, that SSH tunnel might be difficult for a third-party attacker to break into, but what about compromises of Chartio’s servers? Whatever Chartio machines are on the other ends of these tunnels are veritable goldmines if a malicious user can compromise them, with active firewall-bypassing connections to a multitude of companies’ database servers. By opening up a tunnel, you’re effectively reducing your network’s defenses to the weaker of your existing defenses and Chartio’s. While I’m sure the people behind Chartio are just as dedicated to security as any of us, their entire company is 8 people, and only half of those are even engineers, let alone actively working on security.
It’s okay, though – even if someone did get control of Chartio’s servers and credentials, they specifically have you set it up so that they connect to your database as a read-only user. So an attacker couldn’t delete all your data or anything malicious like that. Except, well, there’s still that matter of being able to read all of your data. At least, if you follow their setup instructions (again using MySQL as an example):
GRANT SELECT, SHOW VIEW
ON $database_name.*
TO $user@`rackspace1.chart.io` IDENTIFIED BY '$password';
FLUSH PRIVILEGES;
See the * wildcard on the end of that second line, giving access to every single table in the database? After all, it’d be a hassle to grant access to only specific tables that it would make sense for business analysts to examine, and disallow access to things like your users’ sensitive data. It might also mean that your analysts’ time is wasted asking engineers to add new tables for them to read when they need access to data that’s not on the whitelist.
Of course, that doesn’t even take into account the harm that non-malicious users can do. While I don’t know the exact extent of the queries that a user of Chartio can cause it to run, it’s certainly possible to impact the performance of a database by issuing read-only queries that happen to result in large, inefficient scans of tables. One might hope that Chartio has built-in protections against this, but given the wide variety of databases they inter-operate with, all of which have varying levels of similarity in their query semantics, it seems unlikely that every query coming from Chartio is going to be perfectly optimized for the data it’s running against.
Let me make it clear that I don’t hate Chartio or anything – and they’re certainly not unique in making some of the choices I’ve highlighted above. My real goal here is just to make people more aware of the security trade-offs they are making when they use these kinds of methods to enable third-party services. It’s quite possible that the risks I’ve highlighted above are ones that you feel it’s okay to take, and in that case, go for it – as long as you’re respecting your end users’ interests as well. Just try to be cognizant of the risks you’re taking, and not just plug in new things because they’re shiny. Also realize that there may be ways that you can adjust the risks you’re taking – such as not using wildcard grants, as I mentioned above.
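To make the point about wildcard grants concrete, here is a minimal sketch of the alternative: one read-only GRANT per whitelisted table. This is an illustration only, not Chartio's documented procedure. The database, table, and user names are hypothetical, and it assumes the MySQL account has already been created.

```python
import re

# Conservative pattern for names we are willing to interpolate into SQL.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
_HOSTNAME = re.compile(r"^[A-Za-z0-9.\-]+$")


def build_table_grants(database, tables, user, host):
    """Return one read-only GRANT statement per whitelisted table.

    Assumes the MySQL account user@host already exists. Every name passed
    in is a placeholder chosen by the caller; wildcards are rejected.
    """
    for name in [database, user, *tables]:
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    if not _HOSTNAME.match(host):
        raise ValueError(f"unsafe host: {host!r}")
    return [
        f"GRANT SELECT ON `{database}`.`{table}` TO '{user}'@'{host}';"
        for table in tables
    ]


if __name__ == "__main__":
    # Hypothetical whitelist: tables an analyst needs, nothing sensitive.
    for statement in build_table_grants(
        "shop", ["orders", "page_views"], "chartio_ro", "rackspace1.chart.io"
    ):
        print(statement)
```

The cost is exactly the one described earlier: when analysts need a new table, someone has to add it to the whitelist. The benefit is that anything not on the list stays unreadable even if the third party's credentials leak.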
I would love to see Chartio develop some alternative methods of data acquisition that didn’t involve plugging their servers directly into your database, or at least have some guides on their site about good data isolation practices (e.g. restricting access to only tables that are really relevant to business analysts, and partitioning other sensitive user data into separate tables). That would be good for both Chartio’s customers (in that their overall approach to data security would improve) and also for Chartio (who might garner a little extra goodwill for helping that to happen). I expect that it might take some time before that happens, though, given that Chartio is a startup and has limited personnel resources to devote to all of their endeavors.
The most valuable skill for the average software engineer doesn’t get taught in the typical CS curriculum. It’s also almost impossible to teach it to yourself. Nearly every major programming outfit spends a significant amount of time trying to instill it in their employees, and they still often wind up failing to do so. Many programmers don’t realize they need this skill, and look down on those who have it and put it to good use.
You’re probably wondering what I’m talking about. The skill that I’m referring to is the ability to write code in such a way as to make it maintainable and amenable to large-scale collaboration.
I’m going to refer to code written with this skill in mind as “big-project code,” because it’s the kind of code that you need to write to make the most valuable contributions to a project of any significant size (as measured in people, not lines of code). In contrast, code that isn’t written with this skill in mind (or knowingly takes a pass on it) is “small-project code.”
Small-project code is not inherently bad code. There are tons of incredibly useful and clever programs out there written as small-project code. I’ve written many myself. No, the key distinction here is how easy it is for others to collaborate with you on your project.
There are many aspects of code that can vary between small-project and big-project. One example is its mental state requirement: how much of the code do you have to hold in your head to be able to reason about its behavior? A program written in a small-project style will often need you to be familiar with a significant portion of its code to reason about what a particular handful of lines will do. Big-project code, on the other hand, will attempt to minimize the necessary mental state involved in figuring out what a particular function’s code does. Consistent naming, more verbose comments/docstrings, and clear object interfaces are some of the factors that play into big-project style.
Another aspect is what you might call walk-in tolerance: how simple is it for someone who has never worked with your code to start making useful changes to it? The lower the mental state requirement, the higher the walk-in tolerance, but other factors come into play as well. Unit tests help boost developer confidence (even for seasoned contributors) by making it easier to notice if new code breaks existing functionality. Style guides prevent bickering over personal preferences and lead to more consistently readable code. Good logging makes it easier to diagnose and debug problems.
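As a rough illustration of the difference, here is the same small pricing rule written twice. Both versions are invented for this example: the function names, the constants, and the discount rule itself are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


# Small-project style: correct, but the reader has to already know what
# "p" and "q" are and why 10 and 0.9 matter.
def calc(p, q):
    return p * q * (0.9 if q >= 10 else 1)


# Big-project style: the same pricing rule plus input validation, with
# names, a docstring, and logging that lower the mental state requirement.
BULK_THRESHOLD = 10
BULK_DISCOUNT = 0.10


def order_total(unit_price: float, quantity: int) -> float:
    """Return the price of an order, applying the bulk discount.

    Orders of BULK_THRESHOLD items or more get BULK_DISCOUNT off.
    """
    if quantity < 0:
        raise ValueError("quantity must not be negative")
    total = unit_price * quantity
    if quantity >= BULK_THRESHOLD:
        logger.debug("bulk discount applied to %d items", quantity)
        total *= 1 - BULK_DISCOUNT
    return total


def test_order_total():
    """A unit test records the expected behavior for later contributors."""
    assert order_total(2.0, 5) == 10.0
    assert abs(order_total(2.0, 10) - 18.0) < 1e-9
```

The second version is longer and took more effort to write, which is the trade-off: that effort only pays off if other people will eventually need to read and change the code.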
The big-project mentality can also be applied at a higher level than the code itself. For example, the concept of a service-oriented architecture is inherently a big-project ideology. It uses well-defined APIs to create a modular environment in which a given developer can focus on a particular service and not have to put much thought into the specific implementations of other services. In doing so, it trades some up-front development time (good APIs take effort to design and implement) in return for a long-term payout (as the overall project grows, the amount of developer time saved not keeping mental state on other services increases drastically).
For the typical hobbyist project or college CS classwork, small-project code is fine – even encouraged. After all, the thing you’re hacking on probably isn’t going to wind up with tens, hundreds, or even thousands of developers contributing to it. Big-project code is a long-term investment, and it only makes sense to make that investment in projects where it will pay off.
For the average software engineer, however, their day job isn’t a hobbyist project. It tends to involve a project with a significant number of developers working together to produce a product that is hopefully greater than the sum of its parts. This is what big-project code is about. It’s about making not only your own job, but the job of others, easier. It’s about spending the time greasing the gears so that when it’s time to get things done, you’re not losing some of your energy to grind.
You generally don’t tend to truly appreciate it (or even get a good grasp on what it is) until you’ve actually spent time working on such a project. Small-project code often feels more “fun” because you’re not spending as much time and effort on long-term investments. It’s easy to shy away from ever working on big-project code because the barrier to entry seems so high – but it can be worth it. In the long run, collaboration often leads to results you never even imagined on your own.
So my advice to you is this:
- If you’re already a software engineer, try to put some extra effort into figuring out what parts of your work are worth investing in for the future, and how you can approach them in more of a big-project way.
- If you’re a student, but looking to eventually get a job as a software engineer, consider trying to get some experience with big-project code. One great way to do this is to contribute to an open-source project with a significant developer community. Not only will it give you experience with big-project code, it also stands out on a résumé. Not sure how to get started? Check out OpenHatch.
- If you’re a CS teacher, consider trying to find a way to work a large collaborative project into your curriculum. An example might be an overarching project for a class with modular services designed and developed by smaller teams of students. Most college CS programs today focus heavily on the theoretical side of computer science and give little in the way of practical programming experience.
The Vim-related portion of this post is actually very small. In fact, it can be summed up in a single sentence: you can take the output of a pipe and place it into Vim for editing, without needing to use an intermediate file, by piping it to vim - (yes, the dash is significant). For example, grep foo bar.txt | vim - will grab all of the lines from bar.txt which contain “foo” and place them into a Vim buffer for editing.
If you look in the Vim manpage, you’ll see this functionality described as such:
The file to edit is read from stdin. Commands are read from stderr, which should be a tty.
At first glance, this might seem odd – the way many people think about stdin, stdout, and stderr, it wouldn’t make much sense to “read” from something associated with “error output”.
The reality, however, is that all of these are merely pointers to devices. The devices themselves may support reading, writing, or both. In fact, for your normal interactive shell session, all three of the “standard” I/O pointers point to the same device. That device represents the terminal interface you’re using.
Note that this is a different concept of ‘device’ from that of hardware devices (e.g. keyboard, mouse, screen) – as far as command line I/O is concerned, “devices” are just things that can have data read from and/or written to them. For instance, an SSH session creates a virtual terminal device on the machine to which you connect, that proxies its input and output via your connection to the machine.
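One way to see this for yourself is to ask the operating system which device each standard stream is attached to. The following is a small sketch for Unix-like systems (os.ttyname is not available elsewhere); the function name is made up for this example.

```python
import os


def standard_stream_devices():
    """Map each standard stream to its terminal device path, or None.

    A value of None means the stream is not attached to a terminal
    (for example, it is a pipe or a regular file).
    """
    devices = {}
    for name, fd in (("stdin", 0), ("stdout", 1), ("stderr", 2)):
        # os.isatty() returns False for pipes, files, and closed descriptors.
        devices[name] = os.ttyname(fd) if os.isatty(fd) else None
    return devices


if __name__ == "__main__":
    for name, device in standard_stream_devices().items():
        print(f"{name}: {device or 'not a terminal'}")
```

Run from an interactive shell with no redirection, all three lines should typically name the same device. Piping something into the script changes only the stdin entry, and redirecting its output to a file changes only the stdout entry.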
Once you understand this, the concept of “reading from stderr” makes a lot more sense – you’re not actually “reading from your output” but rather “reading from the same device that error output goes to by default”. Since stderr is typically output to your terminal, even if you redirect stdout (say, to a file), getting your input from stderr is a convenient way of accepting keyboard input from the terminal when stdin is being used by something else – in this case, from the pipe that’s filling the buffer with content for you to edit.
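The fallback described above can be sketched in a few lines. This is not Vim's actual implementation, only an illustration of the idea, and the function name is hypothetical. Reading from the chosen descriptor works only if it was opened for both reading and writing, which is typically the case for a terminal session.

```python
import os

STDIN_FD, STDERR_FD = 0, 2


def pick_input_fd(isatty=os.isatty):
    """Choose a descriptor to read keyboard input from, or None.

    Prefers stdin. Falls back to stderr when stdin is a pipe or file
    but stderr is still attached to the terminal.
    """
    if isatty(STDIN_FD):
        return STDIN_FD
    if isatty(STDERR_FD):
        return STDERR_FD
    return None


if __name__ == "__main__":
    fd = pick_input_fd()
    if fd is None:
        print("no terminal available for interactive input")
    else:
        print(f"would read keyboard input from descriptor {fd}")
```

The isatty parameter exists only so the choice can be exercised without a real terminal; in normal use the default is what you want.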