Monthly Archives: July 2012

Comment Systems and Time-Ordering at Scale

Time is a tricky concept in software development, especially in large distributed systems. Not only do you have to worry about the intricacies of computer representations for time itself (and there are many), but you also have to deal with the fact that it’s nigh-impossible to perfectly sync system clocks across multiple servers. This, of course, results in some interesting problems.

One example of a large distributed system is Google+. Obviously, there are quite a few servers that power the Google+ webapp. When a user adds a comment to a post, one of those many servers will be handling the AJAX request to add the comment, and it’ll be talking to one of a number of servers in Google’s backing data store.

Of course, each of those servers has a separate system clock. If those system clocks are even a second or two off from one another, you can get situations where two comments are timestamped in the opposite order from when they were actually submitted – because the second comment was processed by a server whose clock was lagging (or the first by a server whose clock was running ahead). You can see an example of that in this public comment thread. Notice how the 2nd comment displayed is actually a reply to the 3rd, and the 11th a reply to the 12th.
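To make the skew concrete, here’s a minimal sketch. The two servers, their clock offsets, and the `stamp` helper are all made up for illustration; nothing here reflects how Google+ actually assigns timestamps.

```python
def stamp(true_time, clock_offset):
    """Timestamp a server would record: real time plus that server's clock error."""
    return true_time + clock_offset


# Hypothetical clock errors, in seconds, relative to real time.
SERVER_A_OFFSET = 1.0    # runs one second ahead
SERVER_B_OFFSET = -1.0   # lags one second behind

# The original comment is submitted first and handled by server A.
comment = {"id": "A", "ts": stamp(100.0, SERVER_A_OFFSET)}
# The reply is submitted one second later and handled by server B.
reply = {"id": "B", "ts": stamp(101.0, SERVER_B_OFFSET)}

# Sorting purely by recorded timestamp puts the reply ahead of the comment it answers.
by_timestamp = sorted([comment, reply], key=lambda c: c["ts"])
print([c["id"] for c in by_timestamp])
```

A two-second disagreement between the servers is enough to flip the pair, even though the reply was submitted a full second after the original comment.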

Now, this could be a fun little blog post just pointing that out and calling it done – “here’s the reality you have to deal with when you scale this large.” As it turns out, however, there’s a way this particular problem could be solved. Let’s take a look.

Defining the problem

First, let’s define the part of the problem we care about. In general, we don’t need perfect time ordering – if two people post unrelated comments within a second or two of one another, it’s fine to display them in either order. What matters is that a reply is shown after the comment it replies to: if comment B is a reply to comment A, we want to show comment B after comment A. Beyond that, the existing rough time ordering is fine.

Outlining the solution

So how can we do this without requiring incredibly precise time synchronization? The trick is to send some extra info from the client to the server and store it as part of the newly created comment. Specifically, when we send the contents of a new comment to the server, we’ll also send along an integer value – call it “parent” for now – that gets stored with the comment. Why is this important? Because it’s what will let us decide ordering when we go to create the comment list.

What will “parent” contain? Simple: the id of the last existing comment the client already loaded. For instance, if the post was loaded with three comments, which had ids 1, 3, and 4 respectively, then the client would send a “parent” value of 4 when it posted a new comment. If the post was loaded with no comments at all, “parent” is simply left empty.
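As a rough sketch of the client side (written in Python for consistency with the proof of concept, though a real client would be JavaScript), the payload could be built like this. The function name, the field names, and the choice of `None` for “no comments loaded yet” are my assumptions, not an actual Google+ API.

```python
def build_comment_payload(text, loaded_comment_ids):
    """Build the data sent to the server for a new comment.

    loaded_comment_ids: ids of the comments the client has loaded, in display order.
    """
    # "parent" is the last comment the client already loaded.
    # Assumption: None stands in for an empty "parent" when nothing was loaded.
    parent = loaded_comment_ids[-1] if loaded_comment_ids else None
    return {"text": text, "parent": parent}


print(build_comment_payload("Nice post!", [1, 3, 4]))
```

For the example above, where the post was loaded with comments 1, 3, and 4, the payload carries a “parent” of 4.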

When fetching the comment list, we’ll start out how you’d expect: grab a list of all the relevant comments from the data store. To generate the ordered comment list, we’ll first take all the comments with an empty “parent” value and add them to our result, ordered by timestamp. Next, we’ll take all the comments with a “parent” value corresponding to comments already in our result, and insert them into the result based on two rules:

  1. A comment must be inserted somewhere after its “parent” comment P.
  2. A comment must be inserted after any comments with an earlier timestamp, except where this would violate #1.

We keep repeating this until we have no more comments left to add to the ordering.
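Here’s one way the steps above could be sketched in Python. This is not the original proof-of-concept script: the `Comment` class and `order_comments` function are names I’ve made up for illustration, and the fallback for comments whose “parent” can’t be found (say, because it was deleted) is my own assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Comment:
    id: int
    timestamp: int
    parent: Optional[int] = None  # id of the last comment the client had loaded


def order_comments(comments):
    # Start with the comments that have an empty "parent", ordered by timestamp.
    result = sorted((c for c in comments if c.parent is None),
                    key=lambda c: c.timestamp)
    pending = sorted((c for c in comments if c.parent is not None),
                     key=lambda c: c.timestamp)

    while pending:
        placed_ids = {c.id for c in result}
        ready = [c for c in pending if c.parent in placed_ids]
        if not ready:
            # Assumption: a parent that never shows up (e.g. a deleted comment)
            # means we fall back to plain timestamp order for what is left.
            result.extend(pending)
            break
        for c in ready:
            # Rule 1: the comment goes somewhere after its parent.
            parent_idx = next(i for i, r in enumerate(result) if r.id == c.parent)
            # Rule 2: the comment goes after every comment with an earlier timestamp.
            last_earlier = max(
                (i for i, r in enumerate(result) if r.timestamp < c.timestamp),
                default=-1,
            )
            result.insert(max(parent_idx, last_earlier) + 1, c)
        pending = [c for c in pending if c.parent not in placed_ids]
    return result


comments = [
    Comment(0, 0), Comment(1, 2), Comment(2, 4, parent=1), Comment(3, 6),
    Comment(4, 5, parent=3), Comment(5, 8), Comment(6, 16),
]
print([c.id for c in order_comments(comments)])  # [0, 1, 2, 3, 4, 5, 6]
```

Comment 4 carries an earlier timestamp than comment 3, but because its “parent” is 3 it still lands after it. Each insertion scans the result list, which is where the quadratic worst case comes from.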

Why does this work? It works because of the simple fact that in order for someone to reply to a comment, they first have to read it – which means that their client has to have loaded the older comment. By sending that knowledge to the server, we can create what’s called a partially ordered set out of the comments. With that partial ordering, we can then generate a final ordering that meets our desired goals. The algorithm outlined above is basically a pared-down adaptation of a vector clock.
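The constraint that the partial order imposes is easy to check mechanically. Below is a small, hypothetical checker (the name `respects_reply_order` and the tuple layout are mine) that confirms every comment appears after its “parent” in a proposed display order.

```python
def respects_reply_order(ordered):
    """ordered: list of (comment_id, parent_id_or_None) tuples in display order."""
    seen = set()
    for comment_id, parent in ordered:
        # A comment whose parent has not been displayed yet breaks the ordering.
        # Note: a parent missing from the list entirely also counts as a failure.
        if parent is not None and parent not in seen:
            return False
        seen.add(comment_id)
    return True


# Plain timestamp order for a thread where comment 4 replies to comment 3
# but was stamped earlier than it.
timestamp_order = [(0, None), (1, None), (2, 1), (4, 3), (3, None), (5, None), (6, None)]
# The same thread with comment 4 moved after its parent.
parent_aware_order = [(0, None), (1, None), (2, 1), (3, None), (4, 3), (5, None), (6, None)]

print(respects_reply_order(timestamp_order))     # False
print(respects_reply_order(parent_aware_order))  # True
```

Any ordering that passes this check satisfies the goal defined earlier; the timestamp rule only decides among the orderings that pass.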

Proof of concept

I’ve created a small proof of concept as a Python script. Here’s the output (notice, in particular, the ordering of #3 and #4):

#0 (t=0)
#1 (t=2)
#2 (t=4) - reply to #1
#3 (t=6)
#4 (t=5) - reply to #3
#5 (t=8)
#6 (t=16)

The script is just a quick example thrown together in under an hour – I make no guarantees that the code is optimal. Feel free to play around with it.

Closing notes

There is a trade-off involved in this: the algorithm that generates this ordering is worst-case O(n²), compared to just sorting the list based on timestamp in O(n log n). For most scenarios, however, this is acceptable – in the case of Google+, it’s extremely unlikely to be a problem, given that G+ comment threads cap out at something like 500 replies. With such a small n, the difference in time is negligible. The sorting can also be done client-side if desired – no extra server processing is required.

Catch-22: Tech Blogging As a Woman

One of the Hacker News comments on Anna Billstrom’s article regarding female users on StackOverflow caught my eye:

We have had article after article claiming it is obvious women are being oppressed in the tech industry. Every week there is one of these. Many make bigoted claims about male engineers, enforcing stereotypes of male geeks I have never actually seen in industry.

Where are the technical articles written by women? There are plenty of contributions complaining about oppression, while attacking men and claiming absurd stereotypes. Where are the technical contributions?

While there are multiple potential issues that could be raised with regard to this comment, I’m going to focus on the second half – the part asking “where are the technical articles written by women?” Well, let’s use an example that’s close at hand – the blog post I wrote about Git submodules. That blog post wound up on Hacker News and also on Reddit.

For now, let’s set aside whatever opinions you have on the content of the post itself – assuming you can at least agree with me that it’s an example of a technical article. In exchange, I’ll refrain from commenting on the merits of the comments I’m quoting here.

If you look through the comment threads on both sites, you’ll notice something: any instances of gendered language that the commenters use assume the author is male. They refer to “him” and talk about what they think “he” should do or their opinion of “his” thoughts on the matter. Some examples:

Hacker News:

It appears that one of the solutions he recommends, git-subtree, is going to be merged into git soon:


He complains about having to branch in both the parent and the sub project, but I have found that not to be a problem at all, and kind of nice in some situations. He’s blowing it way out of proportion.


On the other hand, the author prefaced all his “this is where submodules break” descriptions with “I forgot to run submodule update”, so I have trouble sympathizing.

Sure, I generally don’t go out of my way to make it obvious that I’m a woman on my blog – you’d have to first click over to the About page, and then click through to either my Google+ profile or my Twitter account. Then again, I don’t know many male developers who go out of their way to make it obvious that they’re male, either. After all, supposedly gender is irrelevant on the internet (Hacker News):

Not to mention that we’re on the fucking internet. There is no gender, race, colour or creed here. Everybody pick a neutral username and, hey, presto! Problem solved.

Sure, I could go out of my way to make it obvious that I’m a woman. I could put my name at the top of my blog or on my About page, or I could mention it in passing in my writing. That’s not something a male author has to do, though. Furthermore, doing so results in harassment and having my writing dismissed/trivialized/tokenized because of my gender. Hence why I don’t (or at least, hadn’t until this post).

The catch-22 here is that if I choose to blend in, then people like the commenters above assume that everything they see was written by men, and use that as an excuse to dismiss the concerns of women in the tech industry – because apparently, we don’t contribute and thus don’t matter. If I choose to not blend in, I’m dismissed as bringing up gender when it’s irrelevant (or worse, harassed by people who’ve decided gender is relevant).

It’s no surprise that when the default assumption is “male author” it winds up seeming like male authors write almost everything. So to the original commenter I quoted, here’s your answer: they’re in the same places as every other technical article you read – on Hacker News, on Reddit, on whatever other blogs and aggregators you frequent. Just because you don’t see them (or perhaps, don’t realize you see them) doesn’t mean they don’t exist.